The Cobalt Workspace

The Workspace is the central object in Cobalt for analyzing your data. It encapsulates a CobaltDataset and optional dataset splits, and it provides methods for building TDA graphs, finding failure groups, detecting drift, and clustering.

A Workspace is created by passing in a CobaltDataset and an optional DatasetSplit:

workspace = cobalt.Workspace(dataset, split)

When a Workspace is created, it automatically builds a TDA graph from the dataset’s default embedding. You can disable this by passing auto_graph=False.

TDA Graphs

The core of Cobalt’s analysis is the TDA (Topological Data Analysis) graph. This graph represents your data as a network of nodes and edges, where each node corresponds to a group of data points and edges connect groups that are similar in embedding space. The graph is multiscale, meaning you can view it at different levels of coarseness.

In addition to the default graph built when the Workspace is initialized, you can create further graphs using Workspace.new_graph():

workspace.new_graph(name="test_graph", subset="test", embedding=0)

You can also specify the distance metric and control the initial view parameters:

workspace.new_graph(
    name="custom_graph",
    subset=my_subset,
    embedding="fc2",
    metric="cosine",
    init_max_nodes=500,
    init_max_degree=15.0
)

All created graphs are accessible via the Workspace.graphs property, which returns a dictionary mapping graph names to MultiResolutionGraph objects.
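
As a small sketch of how graph creation and lookup fit together, the helper below creates a graph and then retrieves it by name. The helper name create_and_get_graph is hypothetical. It relies only on what is described above: new_graph() accepting name, subset, and embedding, and Workspace.graphs behaving like a dictionary keyed by graph name.

```python
def create_and_get_graph(workspace, name, subset, embedding):
    """Create a graph and return it by looking it up in workspace.graphs."""
    workspace.new_graph(name=name, subset=subset, embedding=embedding)
    # The graph is fetched from the graphs dictionary rather than from the
    # return value of new_graph(), whose return type is not described here.
    return workspace.graphs[name]
```

For example, create_and_get_graph(workspace, "test_graph", "test", 0) mirrors the first new_graph() call shown earlier.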

You can extract groups from specific levels of a graph using Workspace.get_graph_level() and Workspace.get_graph_levels():

groups = workspace.get_graph_level(graph="test_graph", level=5)

Saving and Retrieving Groups

Groups of data points can be saved to the Workspace either through the UI or programmatically using Workspace.add_group():

workspace.add_group("my_group", subset)

Saved groups can be retrieved using Workspace.get_groups(), which returns a GroupCollection object:

groups = workspace.get_groups()

The Workspace.saved_groups property also provides access to saved groups. Note that this does not include groups discovered by algorithms like Workspace.find_failure_groups(); it contains only groups saved manually in the UI or with Workspace.add_group().

You can also import and export groups as DataFrames:

workspace.import_groups_from_dataframe(df)
df = workspace.export_groups_as_dataframe()

Group Algorithms

Cobalt provides three main algorithms for discovering meaningful groups in your data: failure group detection, drift detection, and clustering. Each algorithm stores its results in the Workspace and can optionally display them in the UI.

All three algorithms accept the following common parameters:

run_name: A name under which to store the results. If not provided, it will be chosen automatically.
visible: Whether to show the results of this analysis in the UI. Defaults to True.
generate_group_descriptions: Whether to generate statistical and textual descriptions of returned groups. Defaults to True, but consider setting it to False for large datasets with many columns, as this process can be time-consuming.
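
The common parameters can be combined in a single call. The sketch below wraps Workspace.find_failure_groups() so that results are stored under an explicit name, hidden from the UI, and returned without group descriptions. The wrapper name run_failure_analysis_quietly is hypothetical; the three keyword arguments are the ones listed above.

```python
def run_failure_analysis_quietly(workspace, run_name):
    """Run failure group detection without UI display or group descriptions."""
    return workspace.find_failure_groups(
        run_name=run_name,                  # store results under this name
        visible=False,                      # do not show the run in the UI
        generate_group_descriptions=False,  # skip descriptions to save time
    )
```

The same three keyword arguments could be passed to find_drifted_groups() or find_clusters() in the same way.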

Failure Groups

Workspace.find_failure_groups() identifies groups of data points where a model performs poorly. It uses topological methods to find connected regions in the embedding space with high error rates.

Basic Usage

workspace.find_failure_groups()

By default, this runs on the entire dataset. You can restrict the analysis to a subset of the data using the subset parameter. For example, to find failure groups only on the test set:

workspace.find_failure_groups(subset=split["test"])

Or on a random sample:

workspace.find_failure_groups(subset=dataset.sample(5000))

Parameters

method: Algorithm to use for finding failure groups. Currently only "superlevel" is supported.
subset: The subset of the data on which to perform the analysis. If not provided, the entire dataset is used.
model: Index or name of the model for which failure groups should be found.
embedding: The embedding to use for the analysis. If not provided, the default dataset embedding is used.
failure_metric: The performance metric to use. Can be a string (name of a model performance metric) or a Pandas Series with one value per data point.
min_size: The minimum size for a returned failure group. Smaller groups will be discarded.
max_size: The maximum size for a returned failure group. Larger groups will be split into smaller groups by applying a clustering algorithm.
min_failures: The minimum number of failures for a returned failure group. Defaults to 3.
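
To make the min_size and min_failures thresholds concrete, here is a plain-Python sketch of the filtering they describe. It is not Cobalt's implementation: the function name, the representation of a candidate group as a list of booleans (True meaning a failure), and the min_size default used here are assumptions made for illustration. Splitting of groups larger than max_size is omitted.

```python
def filter_candidate_groups(groups, min_size=1, min_failures=3):
    """Keep candidate groups that satisfy both thresholds.

    groups: list of lists of booleans, one flag per data point (True = failure).
    min_size: minimum number of points in a group (illustrative default).
    min_failures: minimum number of failures in a group.
    """
    kept = []
    for flags in groups:
        # sum() counts the True flags, i.e. the failures in this group.
        if len(flags) >= min_size and sum(flags) >= min_failures:
            kept.append(flags)
    return kept
```

Under these assumptions, a group with many points but only two failures is discarded at the default min_failures of 3, as is a group that has enough failures but fewer points than min_size.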

Results

The method returns a GroupResultsCollection object containing the discovered failure groups and the parameters used by the algorithm. Results are also stored in workspace.failure_groups under the specified run_name.

Drifted Groups

Workspace.find_drifted_groups() identifies groups of data points in a comparison dataset that are underrepresented in a reference dataset. This is useful for detecting distribution shift between training and test data, or between training data and production data.

Basic Usage

workspace.find_drifted_groups(
    reference_group="train",
    comparison_group="test"
)

This finds regions of the embedding space where test data points are overrepresented relative to training data points.

Parameters

reference_group: The reference subset of the data, e.g. the training set. Can be a string name from the dataset split or a CobaltDataSubset.
comparison_group: The subset of the data that may have regions not well represented in the reference set. This may be a test dataset or production data.
embedding: The embedding to use for the analysis.
relative_prevalence_threshold: How much more common points from the comparison group need to be in a group relative to the overall average for it to be considered drifted. Defaults to 2. Under the default, the interpretation is roughly that for any returned group, points from the comparison subset are at least twice as common as they would be in a random sample.
p_value_threshold: The p-value threshold for a significance test checking that the prevalence of points from the comparison group in a candidate group is at least as high as required by relative_prevalence_threshold. Defaults to 0.05.
min_size: The minimum number of data points that need to be in the drifted region. Defaults to 5.
model: Index or name of the model whose error metric will be shown with the returned groups.
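
The interplay of relative_prevalence_threshold, p_value_threshold, and min_size can be illustrated with a small stand-alone calculation. This is a sketch under stated assumptions, not Cobalt's actual test: it assumes relative prevalence is the share of comparison points within the group divided by their share overall, and it uses a one-sided exact binomial test, which may differ from what the library does. All function names are hypothetical.

```python
from math import comb


def relative_prevalence(n_comp_in_group, group_size, n_comp_total, n_total):
    """Share of comparison points in the group divided by their overall share."""
    # Cross-multiplied form of (n_comp_in_group / group_size) / (n_comp_total / n_total).
    return (n_comp_in_group * n_total) / (group_size * n_comp_total)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), used as a one-sided p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def looks_drifted(n_comp_in_group, group_size, n_comp_total, n_total,
                  relative_prevalence_threshold=2.0, p_value_threshold=0.05,
                  min_size=5):
    """Apply the size, prevalence, and significance checks to one group."""
    if group_size < min_size:
        return False
    ratio = relative_prevalence(n_comp_in_group, group_size, n_comp_total, n_total)
    if ratio < relative_prevalence_threshold:
        return False
    # Assumed null hypothesis: the group's comparison share equals the required
    # share (threshold times the overall share, capped at 1).
    required_share = min(1.0, relative_prevalence_threshold * n_comp_total / n_total)
    p_value = binomial_upper_tail(n_comp_in_group, group_size, required_share)
    return p_value <= p_value_threshold
```

With 200 comparison points among 1000 total, a group of 50 points containing 30 comparison points has relative prevalence 3 and passes all three checks under these assumptions. A group of 10 points containing 5 comparison points has relative prevalence 2.5 but is too small to reach significance, so it is rejected.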

Results

The method returns a GroupResultsCollection object containing the discovered drifted groups and the parameters used by the algorithm. Results are also stored in workspace.drifted_groups.

Clustering

Workspace.find_clusters() identifies natural clusters in the dataset using the TDA graph structure. Unlike failure group detection, clustering does not require model performance information.

Basic Usage

workspace.find_clusters()

Parameters

method: Algorithm to use for finding clusters. Currently only "modularity" is supported.
subset: The subset of the data on which to perform the analysis. If not provided, the entire dataset is used.
graph: A graph to use for the clustering. If not provided, a new graph will be created based on the specified embedding. If a graph is provided, it must be built on the subset specified by the subset parameter.
embedding: The embedding to use to create a graph if none is provided.
min_group_size: The minimum size for a returned cluster. If a value between 0 and 1 is provided, it will be interpreted as a fraction of the subset size.
max_group_size: The maximum size for a returned cluster. If a value between 0 and 1 is provided, it will be interpreted as a fraction of the subset size.
max_n_groups: The maximum number of clusters to return. Defaults to 10000.
min_n_groups: The minimum number of clusters to return. Defaults to 1.
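
The dual interpretation of min_group_size and max_group_size, as either an absolute count or a fraction of the subset, can be sketched as follows. This is an illustration rather than Cobalt's code: the function name is hypothetical, and rounding to the nearest integer as well as the treatment of the boundary values 0 and 1 as absolute counts are assumptions.

```python
def resolve_group_size(value, subset_size):
    """Convert a size parameter to an absolute number of data points.

    Values strictly between 0 and 1 are treated as a fraction of subset_size.
    Any other value is treated as an absolute count.
    """
    if 0 < value < 1:
        # Assumption: fractional sizes are rounded to the nearest integer.
        return round(value * subset_size)
    return int(value)
```

For a subset of 2000 points, a value of 0.25 resolves to 500 points under these assumptions, while a value of 50 is used as-is.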

Results

The method returns a GroupResultsCollection object containing the discovered clusters and the parameters used by the algorithm. Results are also stored in workspace.clustering_results.