One of Meta's new flagship AI models launched on Saturday, Maverick, ranks second on LM Arena, a test in which human evaluators compare model outputs and choose which they prefer. But it appears the version of Maverick that Meta deployed to LM Arena differs from the version that is widely available to developers.
As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an “experimental chat version.” A chart on the official Llama website, meanwhile, discloses that Meta's LM Arena testing was performed using “Llama 4 Maverick optimized for conversationality.”
As we have written before, LM Arena, for a number of reasons, has never been the most reliable measure of an AI model's performance. But AI companies generally have not customized or otherwise tuned their models specifically to score better on LM Arena, or at least have not admitted to doing so.
The problem with tailoring a model to a benchmark, withholding that version, and then releasing a “vanilla” variant of the same model is that it makes it hard for developers to predict exactly how well the model will perform in particular contexts. It is also misleading. Ideally, benchmarks, inadequate as they are, provide a snapshot of a single model's strengths and weaknesses across a range of tasks.
Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version seems to use lots of emojis and give incredibly long-winded answers.
Okay, Llama 4 is definitely a little cooked lol, what is this yap city? pic.twitter.com/y3GVHBVZ65

– Nathan Lambert (@natolambert) April 6, 2025
For some reason, the Llama 4 model in Arena uses a lot more emojis

on together.ai, it seems better: pic.twitter.com/f74odx4ztt

– Tech Dev Notes (@techdevnotes) April 6, 2025
We have reached out to Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.