Arthur AI tested top AI models from Meta, OpenAI, Cohere and Anthropic on their accuracy and reliability.

The models, known as large language models (LLMs), generate humanlike text in response to prompts, based on patterns in their training data.

The researchers found that some models make up facts, or hallucinate, significantly more than others, which can spread misinformation and create risk for users and businesses.

Cohere’s AI hallucinated the most, followed by Meta’s Llama 2. Both also gave confidently wrong answers without hedging or warning phrases.

OpenAI’s GPT-4 performed the best of all the models tested and hallucinated less than its predecessor, GPT-3.5. It also scored highest on the math questions.

Anthropic’s Claude 2 was second best in accuracy and best at knowing its own limits, answering only the questions it had training data to support.

The researchers suggested that users and businesses should test the AI models on their exact workloads and understand how they perform for their specific goals.

According to its authors, the report is the first to take a comprehensive look at rates of hallucination, rather than just providing a single number that places models on an LLM leaderboard.

The report also showed a roughly 50% relative increase in hedging in GPT-4 compared with GPT-3.5, which can make the model more frustrating to use.