Graph Modeling for Mechanistic Interpretability

Turning Complex AI Models into Searchable Graphs

Extracting human-understandable information from large, complex, and noisy text or image data is one of the fundamental challenges facing artificial intelligence. Modern AI models, especially large language models (LLMs), can deliver impressive outputs, but their decision-making processes often remain hidden, creating a “black box” that makes it difficult to know why models fail or succeed.

BluelightAI’s flagship interpretability platform, Cobalt, addresses this AI black box problem directly by using Topological Data Analysis (“TDA”) as its foundational technology, a critical differentiator that existing evaluation platforms do not offer. The fundamental idea behind TDA is that for many kinds of data, traditional algebraic tools are not flexible enough to represent the data as effectively as we would like.

To give a sense of this, remember that Galileo and Newton used simple algebra to formulate the laws of physics. Physics is an area with a very solid theory, which permits the results of data analysis to be expressed in simple algebraic formulae. These methods still apply to many simple data analytic problems, such as predicting sales from advertising spend by linear regression. However, most analytic problems do not have a simple solution in the form of equations, because data comes in different “shapes”, not all of which are linear or even algebraic.

For example, as we found in a breast cancer study published in PNAS, gene expression profiles of breast cancer form a “Y” shape, in which the three segments, or “flares”, correspond to three different situations: (1) mild disease or near-normal tissue, (2) very severe disease, such as triple-negative cancer with poor prognosis, and (3) disease with excellent prognosis, where the patients whose profiles lie near the tip of the corresponding flare all survived in the study that produced the data. This kind of analysis is called “disease stratification”, and it is vitally important in the development of therapies.

TDA constructs graph models of the data that represent its shape. It can handle all data shapes, which means that it can produce simple representations of data that would require many variables to describe algebraically. The TDA toolkit applies to any data type with a notion of similarity between data points, and this covers almost any kind of data. The graph model is a new kind of model for data, and it is the backbone of our technology. Graph modeling is very flexible, in that it can be adapted to different data types, problem tasks, and AI modeling tasks. For example, there are TDA-based techniques for anomaly detection, time series analysis (periodic and not), clustering, and locating local maxima and minima of quantities of interest. Building topological representations of data has the added benefit of making the data readily searchable.
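
To make the idea of a graph model concrete, here is a minimal sketch of a Mapper-style construction, one well-known TDA technique for turning a point cloud into a graph. It is an illustration under our own assumptions, not Cobalt's implementation: the function names (`connected_components`, `mapper_graph`), the parameters (`n_intervals`, `overlap`, `eps`), and the toy “Y”-shaped data are all hypothetical. The sketch covers the range of a lens function with overlapping intervals, clusters the points in each interval, and connects clusters that share points.

```python
import math
from itertools import combinations


def connected_components(indices, points, eps):
    """Single-linkage clustering: link points at distance <= eps, return the components."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining if math.dist(points[i], points[j]) <= eps}
            remaining -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper_graph(points, lens, n_intervals=4, overlap=0.25, eps=1.0):
    """Return (nodes, edges): nodes are sets of point indices, edges join nodes that share points."""
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    pad = width * overlap  # how far each interval extends into its neighbors
    nodes = []
    for k in range(n_intervals):
        a, b = lo + k * width - pad, lo + (k + 1) * width + pad
        members = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(connected_components(members, points, eps))
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2) if nodes[u] & nodes[v]}
    return nodes, edges


# Toy "Y"-shaped point cloud (hypothetical data): a stem that splits into two arms.
stem = [(0.5 * i, 0.0) for i in range(11)]
upper = [(5 + 0.5 * i, 0.5 * i) for i in range(1, 11)]
lower = [(5 + 0.5 * i, -0.5 * i) for i in range(1, 11)]
points = stem + upper + lower

nodes, edges = mapper_graph(points, lens=lambda p: p[0])
degree = [sum(n in e for e in edges) for n in range(len(nodes))]
print(len(nodes), "nodes,", len(edges), "edges, degrees:", sorted(degree))
```

On this toy input the graph has five nodes and four edges, with one node of degree three where the stem splits into the two flares. The 31 points are summarized by a small graph that preserves the “Y” shape, which is the sense in which a graph model represents the shape of the data.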

As an example, one can study prompts in a generative AI setting, where one has “thumbs up” or “thumbs down” responses to images created from a prompt. BluelightAI’s Cobalt software can identify significant groups with high concentrations of thumbs down responses using the kind of graph modeling described above. In a synthetic example we constructed, there were three major groups of failures. The first consisted of “negative constraint failures”, in which the prompt included a specification not to include something. For example, one might request a picture of a kitchen without a dishwasher, and the AI model would take that as a request to include a dishwasher. Or, to be more extreme, requesting a kitchen that is 100% void of elephants would produce an image of a kitchen with an elephant. The second group consisted of errors in comparison accuracy, where a request had been made to produce a lamp and a candle, with the lamp dimmer, and the reverse (i.e. a bright lamp and a dim candle) was produced. Finally, the third group consisted of failures to create realistic crowd scenes. A request for “thousands of excited fans at an outdoor concert” resulted in an image in which the same person appeared in many places.
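
As a rough illustration of what “groups with high concentrations of thumbs down responses” can mean on a graph model, the sketch below flags graph nodes whose thumbs-down rate is well above the overall rate and merges flagged nodes that are adjacent. This is a simplified stand-in, not Cobalt's algorithm or API: the function `failure_groups`, its `min_size` and `lift` parameters, and the toy graph and labels are all assumptions made for the example.

```python
def failure_groups(nodes, edges, thumbs_down, min_size=3, lift=2.0):
    """Flag graph nodes with an unusually high thumbs-down rate and merge adjacent ones.

    nodes: dict mapping node id -> list of prompt indices
    edges: iterable of (node id, node id) pairs
    thumbs_down: list of 0/1 labels, one per prompt (1 = thumbs down)
    """
    baseline = sum(thumbs_down) / len(thumbs_down)
    rate = {n: sum(thumbs_down[i] for i in m) / len(m) for n, m in nodes.items()}
    flagged = {n for n, m in nodes.items()
               if len(m) >= min_size and rate[n] >= lift * baseline}

    # Merge flagged nodes that are joined by an edge into one failure group.
    groups, unvisited = [], set(flagged)
    while unvisited:
        start = unvisited.pop()
        group, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for a, b in edges:
                v = b if a == u else a if b == u else None
                if v in unvisited:
                    unvisited.remove(v)
                    group.add(v)
                    stack.append(v)
        groups.append(sorted(group))
    return baseline, rate, sorted(groups)


# Hypothetical toy input: 20 prompts, 6 of them rated thumbs down.
thumbs_down = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
nodes = {
    "A": [0, 1, 2, 3, 4],
    "B": [3, 4, 5, 6],
    "C": [6, 7, 8, 9, 10, 11, 12, 13],
    "D": [13, 14, 15, 16, 17, 18, 19],
}
edges = [("A", "B"), ("B", "C"), ("C", "D")]

baseline, rate, groups = failure_groups(nodes, edges, thumbs_down)
print("baseline:", baseline, "failure groups:", groups)
```

Here nodes A and B have thumbs-down rates of 0.8 and 0.75 against a baseline of 0.3, so they are flagged and merged into a single failure group, while C and D are not. A production system would likely replace the fixed `lift` threshold with a statistical test and would describe each group in terms of the prompts it contains.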

BluelightAI is building a conversational interface for Cobalt, where one can ask questions about the groups and the data. The responses to the queries show that Cobalt is a tool for mechanistic interpretability, in that it identifies the layers and features that are most relevant to a particular failure group.

Full disclosure: we are working with synthetically generated data, but we want to indicate how the interface works in practice.

The same graph modeling method is used to identify high-level groups of sparse autoencoder features, which characterize the types of prompts that are driving the failures. In their recent “On the Biology of a Large Language Model” paper, Anthropic highlights the importance of these feature groups (which they call supernodes), but laments that finding them requires significant manual analysis: “…we often work around this issue in an ad-hoc way by manually grouping together features with related meanings into ‘supernodes’ of an attribution graph. While this technique has proven quite helpful, the manual step is labor-intensive, subjective, and likely loses information.” Performing the grouping task automatically is a key ingredient in making sparse autoencoders and follow-on technologies useful, and these technologies are in turn critical to making mechanistic interpretability work. Our TDA technology performs this grouping automatically at multiple resolution scales and gives the user access to the underlying graph model representation, which makes the set of features readily searchable and explorable.
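
To illustrate what grouping features at multiple resolution scales can look like, here is a small sketch that links features whose vectors have high cosine similarity and reads off the connected groups at two thresholds. It is a simplified stand-in for the TDA-based grouping described above, not the method Cobalt uses: the function `supernodes`, the feature names, and the three-dimensional toy vectors (standing in for something like SAE feature directions) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def supernodes(features, threshold):
    """Group features by linking pairs with cosine similarity >= threshold (union-find)."""
    names = list(features)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(features[a], features[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical feature vectors; real SAE features live in far higher dimensions.
features = {
    "negation_not": (1.0, 0.1, 0.0),
    "negation_without": (0.9, 0.2, 0.0),
    "exclusion_no": (0.7, 0.7, 0.0),
    "crowd_many_people": (0.0, 0.1, 1.0),
    "crowd_audience": (0.1, 0.0, 0.9),
}

fine = supernodes(features, threshold=0.95)   # high resolution: tight groups
coarse = supernodes(features, threshold=0.8)  # low resolution: broader groups
print("fine:  ", fine)
print("coarse:", coarse)
```

At the fine scale the two negation features and the two crowd features each form a group and `exclusion_no` stands alone. At the coarse scale `exclusion_no` joins the negation group, leaving two supernodes. Sweeping the threshold gives a nested family of groupings, which is one simple way to think about resolution scale.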

To add some perspective to the discussion, we should observe that algebraic models, at least those with relatively few variables, are quite interpretable, but not always powerful. Neural networks and deep learning were invented to obtain additional power. They can be viewed as *very* large algebraic models, and because of their size they are difficult to interpret. The goal of mechanistic interpretability is to use techniques such as sparse autoencoders (SAEs) and cross-layer transcoders (CLTs) to begin to make them more interpretable. We believe that TDA is the next step in the evolution of deep learning, one that will allow us to have the best of both the algebraic and deep learning worlds.

Here’s how Cobalt, powered by TDA, transforms this challenge into a commercial and technical advantage:

Explaining Model Failures: For AI models whose outputs can be scored for correctness or user satisfaction, Cobalt autonomously discovers meaningful groupings of systematic failures without human intervention. In demonstrative use cases, it not only exposed distinct failure patterns but also suggested targeted interventions, from prompt engineering to new training data synthesis.
Mechanistic Interpretability at Scale: Cobalt’s integration with advanced techniques like sparse autoencoders (SAEs) and cross-layer transcoders (CLTs) allows it to map not only data, but the internal mechanisms of the AI models (LLMs). TDA is used to cluster and relate these interpretive features (“supernodes”), making it possible to automatically surface, search, and audit the inner reasons behind model behavior, something previously attempted only manually and at small scale by industry peers.
From Black Box to Auditable Map: Because of TDA, Cobalt visualizes and outputs the underlying knowledge graphs to users, creating an auditable, interpretable map of both datasets and the model’s internal logic. This map empowers users to compare models, probe feature activations, and diagnose specific strengths, blind spots, or biases.
Industry-Specific Adaptation: Cobalt’s TDA-powered engine identifies not just performance metrics, but why models behave as they do in production environments, highlighting underrepresented features and failure patterns specific to each industry’s evolving needs. Furthermore, the inherent flexibility of TDA methods enables efficient adaptation to new domains and use cases.

BluelightAI’s Cobalt, with its native, end-to-end implementation of Topological Data Analysis, is not just another dashboard for model evaluation. It is a breakthrough in mechanistic interpretability, enabling an unprecedented level of transparency, adaptability, and actionable oversight over AI models, and addressing the black box problem head-on. Cobalt’s TDA-based approach is the key ingredient that unlocks this new frontier and stands out as a critical differentiator for practitioners seeking scalable AI infrastructure.

Ready to try Cobalt? Go to BluelightAI.com/Cobalt