Human tests for machine models: What lies “Beyond the Imitation Game”?

Journal of Linguistic Anthropology

Abstract

Journal of Linguistic Anthropology, Volume 36, Issue 1, May 2026.

Benchmarking large language models (LLMs) is a key practice for evaluating their capabilities and risks. This paper considers the development of “BIG Bench,” a crowdsourced benchmark designed to test LLMs “Beyond the Imitation Game.” Drawing on linguistic anthropological and ethnographic analysis of the project's GitHub repository, we examine how contributors developed tasks based on their lay understandings of language, cognition, and intelligence. By tracing how contributors make implicit judgments about what constitutes a meaningful test of intelligence, we show how widespread language ideologies shape the evaluation of LLMs and the imaginaries that guide their development.