
Debates over AI benchmarking have reached Pokémon

by admin

Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral claiming that Google's latest Gemini model had surpassed Anthropic's Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer's livestream; Claude was stuck at Mount Moon as of late February.

But what the post didn't mention is that Gemini had an advantage.

As users on Reddit noted, the developer who maintains the Gemini stream built a custom minimap that helps the model identify "tiles" in the game, such as cuttable trees. This reduces Gemini's need to analyze screenshots before making gameplay decisions.
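To make the idea concrete, a minimal sketch of what such a minimap helper could look like: rather than having the model parse a raw screenshot, the harness pre-labels each tile on the visible grid and hands the model a compact text map. The tile IDs, labels, and function names here are purely illustrative assumptions, not the actual implementation used in the stream.

```python
# Illustrative sketch (not the real harness): pre-label game tiles so the
# model receives a compact text minimap instead of raw pixels.

TILE_LABELS = {
    0: ".",   # walkable ground
    1: "#",   # wall / impassable
    2: "T",   # cuttable tree
    3: "W",   # water
}

def render_minimap(tile_grid):
    """Convert a 2-D grid of tile IDs into a text minimap, one row per line."""
    return "\n".join(
        "".join(TILE_LABELS.get(tile, "?") for tile in row)
        for row in tile_grid
    )

grid = [
    [1, 1, 1, 1],
    [1, 0, 2, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
print(render_minimap(grid))
```

A text map like this is far cheaper for a model to consume than a screenshot, which is exactly why it amounts to an advantage: part of the perception problem is solved outside the model.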

Now, Pokémon is a semi-serious benchmark at best; few would argue it's a very informative test of a model's capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model's coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a "custom scaffold" that Anthropic developed.

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model performs significantly worse on the same evaluation.

Since AI benchmarks, including Pokémon, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters further. That is, it seems unlikely that comparing models will get any easier as new ones are released.
