Meta's latest open-source language model, Llama 4 Maverick, has ranked poorly on a widely used AI benchmark after the company was criticised for initially using a heavily modified, unreleased version to boost its results.
LM Arena, the platform where the performance was measured, has since updated its rules and retested the standard, publicly released version of the model.
The plain Maverick model, officially named 'Llama-4-Maverick-17B-128E-Instruct', placed behind older competitors such as OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro.
Meta admitted that the stronger-performing variant used earlier had been 'optimised for conversationality', which likely gave it an unfair advantage in LM Arena's human-rated comparisons.
Although LM Arena's reliability as a performance gauge has itself been questioned, the episode has nonetheless raised concerns over transparency and benchmarking practices in the AI industry.
Meta has since released its open-source model to developers, encouraging them to customise it for real-world use and provide feedback.