This technical analysis explores how topological data analysis (TDA) combined with sparse autoencoders (SAEs) can reveal deeper insights into language model performance on the TruthfulQA benchmark.
Key Methods
The researchers analyzed Gemma 2-2B using SAE features from four layers (8, 12, 16, and 20), collecting activations across nearly 40,000 tokens, for a total of over 15 million feature activations. They used the Cobalt software to construct TDA graphs and organize features into interpretable groups.
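The collection step can be sketched as follows. This is a toy illustration, not the authors' pipeline: random arrays stand in for Gemma 2-2B residual-stream activations, and the sizes (`n_tokens`, `n_features`, the 64-dim residual) are invented for the example. It shows the standard SAE encoder form (ReLU of an affine map) and how "feature activations" are tallied as nonzero entries across the four layers.

```python
import numpy as np

# Toy stand-in for collecting SAE feature activations across four layers.
# Random data replaces real model activations; dimensions are illustrative only.
rng = np.random.default_rng(0)
layers = [8, 12, 16, 20]
n_tokens, n_features = 1000, 512  # toy sizes, not the paper's ~40,000 tokens

def sae_encode(resid, W_enc, b_enc):
    """Standard SAE encoder: ReLU(resid @ W_enc + b_enc) yields a sparse feature vector."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

total_activations = 0
for layer in layers:
    resid = rng.normal(size=(n_tokens, 64))         # stand-in residual stream
    W_enc = rng.normal(size=(64, n_features)) * 0.1  # stand-in encoder weights
    b_enc = -0.5 * np.ones(n_features)               # negative bias encourages sparsity
    feats = sae_encode(resid, W_enc, b_enc)
    total_activations += np.count_nonzero(feats)     # nonzero entries = "firing" features

print(total_activations)
```

In the real analysis the encoder weights come from pretrained SAEs and the residuals from running TruthfulQA prompts through the model; the tallying logic is the same.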
Main Findings
The analysis uncovered systematic limitations in the TruthfulQA dataset itself:
fg/1 — Name Confusion: Repetitive question templates, predominantly mentioning Donald, Hillary, and Elon.
fg/2 — Geography: Templates built around dependent participles and superlatives, with geographic bias toward particular subjects.
fg/3 — Laws/Practices: Heavy US focus; questions emphasizing comparatives between countries.
fg/4 — Pseudoscience: Frequent dream-related questions and misattributed quotes.
fg/5 — Fact vs. Opinion: Blunt stereotype-eliciting patterns with "no comment" responses.
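The entity concentration behind fg/1 can be checked with a simple mention count. A minimal sketch, using invented sample questions rather than actual TruthfulQA items; a real pass would run named-entity recognition over the full dataset:

```python
from collections import Counter

# Hypothetical check for entity diversity: count how often each named
# entity appears across a question set. Questions below are invented.
questions = [
    "What did Donald claim about taxes?",
    "Did Donald really say that?",
    "What is Hillary known for denying?",
    "Has Elon ever admitted this?",
]
entities = ["Donald", "Hillary", "Elon"]

mentions = Counter()
for q in questions:
    for ent in entities:
        mentions[ent] += q.count(ent)

# A few names dominating the counts indicates low entity diversity.
print(mentions)
```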
Critical Insights
The research reveals that "many categories of question are constructed from a small set of templates," creating evaluation brittleness. The dataset exhibits limited entity diversity, template repetition that reduces its generalization value, and geographic bias that limits cross-cultural applicability.
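Template repetition of this kind can be surfaced without any model at all. A minimal sketch, assuming a crude entity-masking heuristic (mid-sentence capitalized words become a slot) and invented sample questions, not actual TruthfulQA items:

```python
from collections import Counter

# Hypothetical template detector: normalize questions by masking likely
# entity mentions, then count how many questions collapse onto each template.
questions = [
    "What did Donald say about the election?",
    "What did Hillary say about the election?",
    "What did Elon say about the election?",
    "What is the largest country in Europe?",
]

def to_template(q):
    # Crude heuristic: treat any mid-sentence capitalized word as an entity slot.
    words = q.split()
    return " ".join(
        "[ENT]" if i > 0 and w[0].isupper() else w
        for i, w in enumerate(words)
    )

counts = Counter(to_template(q) for q in questions)
# Three of the four questions collapse onto one template,
# which is the kind of repetition the analysis flags.
print(counts.most_common(1))
```

A high ratio of questions to distinct templates is exactly the brittleness signal described above: models can score well by pattern-matching the template rather than knowing the facts.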
The analysis demonstrates the importance of evaluating evaluation datasets themselves, not just model performance on those datasets.