Evaluating LLM Hallucinations with BluelightAI Cobalt

This article explores how BluelightAI Cobalt helps identify patterns in large language model errors, using the TruthfulQA dataset as a testing ground.

Core Problem

Large language models frequently produce outputs disconnected from reality, a phenomenon called hallucination that limits their deployment in sensitive applications. Understanding which inputs trigger errors is essential for production deployment decisions.

Methodology

The authors tested Google's Gemma 2-2B model on TruthfulQA, a benchmark of multiple-choice questions based on common misconceptions. The model achieved 66.7% accuracy against a 50% random-guessing baseline, leaving a substantial share of questions answered incorrectly.
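The post does not include the evaluation harness itself, but the setup it describes can be sketched generically. The item format (`question`, `choices`, `answer_idx`) and the `predict` stub below are assumptions for illustration; in the real experiment, `predict` would wrap a call to Gemma 2-2B.

```python
# Hedged sketch of multiple-choice evaluation in the style of TruthfulQA.
# `predict` is a stand-in for the real model (e.g. Gemma 2-2B via the
# transformers library); here it trivially picks the first choice.

def predict(question: str, choices: list[str]) -> int:
    """Stub model: always returns index 0. A real model call goes here."""
    return 0

def evaluate(items: list[dict]) -> float:
    """Return the fraction of items where the predicted index matches
    the labeled correct answer index."""
    correct = sum(
        1 for item in items
        if predict(item["question"], item["choices"]) == item["answer_idx"]
    )
    return correct / len(items)

# Toy items in the assumed schema: question text, candidate answers,
# and the index of the correct answer.
items = [
    {"question": "Q1", "choices": ["right", "wrong"], "answer_idx": 0},
    {"question": "Q2", "choices": ["wrong", "right"], "answer_idx": 1},
]
print(evaluate(items))  # 0.5 with the always-first-choice stub
```

With a real model plugged into `predict`, the same loop yields the aggregate accuracy figure reported in the post.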

Key Finding

Rather than manually reviewing errors, Cobalt automatically identifies failure groups: clusters of similar inputs where error rates spike significantly. The tool discovered five distinct failure patterns where accuracy ranged from 22% to 44%:

  • Questions confusing well-known figures sharing the same first name
  • Geography-focused questions
  • Country-specific laws and practices
  • Pseudoscience and misattributed quotes
  • Fact-versus-opinion distinctions
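The post does not show Cobalt's API, and Cobalt derives its groups automatically from the data; purely as an illustration of the underlying idea, the sketch below takes evaluation records with a pre-assigned cluster label and flags clusters whose error rate spikes above a threshold. All names (`cluster`, `correct`, `failure_groups`) are hypothetical.

```python
from collections import defaultdict

def failure_groups(records: list[dict], min_size: int = 2,
                   rate_threshold: float = 0.5) -> dict:
    """Group records by cluster id and flag clusters whose error rate
    exceeds `rate_threshold`. This mimics, in a toy way, the kind of
    failure-group surfacing the post attributes to Cobalt."""
    groups = defaultdict(list)
    for r in records:
        groups[r["cluster"]].append(r["correct"])
    flagged = {}
    for cid, outcomes in groups.items():
        error_rate = 1 - sum(outcomes) / len(outcomes)
        if len(outcomes) >= min_size and error_rate > rate_threshold:
            flagged[cid] = round(error_rate, 2)
    return flagged

# Toy records: a geography-themed cluster with a high error rate,
# and a second cluster the model mostly gets right.
records = [
    {"cluster": "geography", "correct": False},
    {"cluster": "geography", "correct": False},
    {"cluster": "geography", "correct": True},
    {"cluster": "other", "correct": True},
    {"cluster": "other", "correct": True},
]
print(failure_groups(records))  # {'geography': 0.67}
```

The key design point, which matches the post's framing, is that ranking clusters by error rate turns an undifferentiated pile of wrong answers into a short list of named failure patterns worth inspecting.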

Exploring with the Cobalt UI

Conclusion

As the authors conclude, "Cobalt accelerates this process greatly, letting you get on to the work of improving the model and mitigating hallucinations," enabling prospective identification of hallucination risks rather than post-hoc analysis.