How do we know our classification is better than humans?
Ricardo Marino (Google Paris)
Large language models have taken the world by storm over the past year. Many are now used for classification tasks, and their performance is often compared to that of non-expert crowdsourced human annotators. But when the model and the annotators disagree, how can we tell which one is actually wrong? And how can we do this without relying on expert annotators, who are expensive and scarce?