This post explains how researchers use Cross-Layer Transcoders (CLTs) to trace computational circuits within the Qwen3 language model, revealing how it arrives at specific predictions.
## What Are Circuits?
Circuits are "collections of components in the model through which we can causally trace the model's logic as it produces an output." Effective circuits should be understandable, tell a coherent computational story, and allow for intervention testing.
## Methodology
The methodology employs attribution graphs — networks linking CLT features where edges represent causal influences between them. The researchers used the open-source circuit-tracer library alongside newly released Qwen3 transcoders.
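To make the attribution-graph idea concrete, here is a minimal sketch in plain Python. This is not the circuit-tracer API; the node names, layers, and weights are all invented for illustration. The structure it shows — nodes as CLT features, directed weighted edges as causal influences — is the essence of an attribution graph.

```python
from collections import defaultdict

# Toy attribution graph: nodes are (layer, feature) pairs, directed edges
# carry an attribution weight (how much the source feature causally
# contributes to the target feature's activation). All names and numbers
# are hypothetical, not outputs of circuit-tracer.

class AttributionGraph:
    def __init__(self):
        self.edges = defaultdict(dict)  # source -> {target: weight}

    def add_edge(self, source, target, weight):
        self.edges[source][target] = weight

    def influences(self, source, threshold=0.0):
        """Targets this feature influences, above a weight threshold."""
        return {t: w for t, w in self.edges[source].items() if abs(w) > threshold}

graph = AttributionGraph()
# Hypothetical path: a "president" detector and a "2010" detector both feed
# a feature that promotes political names, which in turn boosts the " Barack" logit.
graph.add_edge(("L3", "president"), ("L9", "political_name"), 0.8)
graph.add_edge(("L5", "year_2010"), ("L9", "political_name"), 0.4)
graph.add_edge(("L9", "political_name"), ("logit", " Barack"), 0.9)

print(graph.influences(("L9", "political_name")))
```

Tracing a circuit then amounts to following high-weight edges backward from an output logit to the input-level features that drive it.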
## Case Study: Presidential Name Prediction
The analysis examines how Qwen3-1.7B completes the prompt "Fact: The president of the United States in 2010 was" by predicting " Barack" with 74% confidence.
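The 74% figure is the softmax probability the model assigns to the token " Barack" among all possible next tokens. As a quick illustration of how such a confidence arises from raw logits (the logit values below are made up, not real Qwen3 outputs):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a few candidate next tokens.
candidates = [" Barack", " George", " Donald", " the"]
logits = [5.0, 3.0, 2.5, 1.0]
probs = softmax(logits)
confidence = dict(zip(candidates, probs))
print(f"P(' Barack') = {confidence[' Barack']:.2f}")
```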
Four feature groups emerged:
- Features detecting "president"
- Features recognizing the year "2010"
- Features identifying the token " was" and promoting names of political figures
- Features representing events around 2010
By selectively disabling (ablating) each feature group, the researchers showed that interventions produced predictable changes in the model's output, validating their interpretation of the circuit.
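The intervention test can be mimicked in miniature: treat each feature group as contributing linearly to the " Barack" logit, zero one group out, and check that the logit drops by the expected amount. All feature names, activations, and weights below are invented for illustration; real ablations act on CLT feature activations inside the model.

```python
# Toy intervention: feature groups contribute linearly to a " Barack" logit.
# Ablating a group (zeroing its activation) should lower that logit predictably.
feature_activations = {
    "president": 1.0,
    "year_2010": 1.0,
    "say_political_name": 1.0,
    "events_2010": 0.5,
}
weights_to_barack = {  # hypothetical attribution weights
    "president": 1.2,
    "year_2010": 0.9,
    "say_political_name": 1.5,
    "events_2010": 0.4,
}

def barack_logit(acts):
    return sum(acts[f] * weights_to_barack[f] for f in acts)

baseline = barack_logit(feature_activations)

# Ablate the year-detection group by setting its activation to zero.
ablated = dict(feature_activations, year_2010=0.0)
drop = baseline - barack_logit(ablated)
print(f"baseline={baseline:.1f}, drop from ablating year_2010={drop:.1f}")
```

If the observed drop matches the edge weights in the attribution graph, that is evidence the circuit interpretation is causally faithful rather than merely correlational.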
## Conclusion
The work demonstrates feasibility of circuit-based model interpretation while acknowledging that "building attribution graphs still requires significant manual effort." The team is developing tools to automate this process.