To risk stating the obvious: LLM benchmarks are bullshit not because LLM vendors game them, but because the benchmarks are proxies for general human ability and intelligence. Even if they weren’t gamed, they would in no way be proxies for the general abilities of statistical models.

Miles Metcalfe @mmetcalfe