Earlier this week, Meta landed in hot water for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on a crowdsourced benchmark, LM Arena. The incident prompted the maintainers of LM Arena to apologize, change their policies, and score the unmodified, vanilla Maverick.
It turns out the vanilla model isn't very competitive.
The unmodified Maverick, “Llama-4-Maverick-17B-128E-Instruct,” was ranked below models including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro as of Friday. Many of those models are months old.
The non-experimental version of Llama 4 has been added to LMArena after they got caught cheating, but you probably didn’t see it because you have to scroll down to 32nd place, which is where it ranks. pic.twitter.com/a0Bxkdx4lx
– ρ: ɡeσn (@pigeon__s) April 11, 2025
Why the lower performance? Meta’s experimental Maverick, Llama-4-Maverick-03-26-Experimental, was “optimized for conversation,” the company explained in a chart published last Saturday. Those optimizations evidently played well on LM Arena, which has human raters compare the outputs of models and choose which they prefer.
As we’ve written before, LM Arena has never been the most reliable measure of an AI model’s performance, for a number of reasons. Even so, tailoring a model to a benchmark, besides being misleading, makes it challenging for developers to predict exactly how well the model will perform in different contexts.
In a statement, a Meta spokesperson told TechCrunch that Meta experiments with “all types of custom variants.”
“‘Llama-4-Maverick-03-26-Experimental’ is a chat-optimized version we experimented with that also performs well on LMArena,” the spokesperson said. “We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We’re excited to see what they will build and look forward to their continued feedback.”