
API Reference

Workspace

class cobalt.Workspace(dataset: CobaltDataset, split: DatasetSplit | SplitDescriptor | None = None, auto_graph: bool = True, run_server: bool | None = None)

Encapsulates analysis done with a dataset and models.

Attributes:

ui

Type: UI — A user interface that can be used to interact with the data, models, and other analysis.

run_auto_group_analysis

Type: bool — Whether to automatically run a group analysis of the data and models when the UI is opened, if no analysis has yet been run.

Methods:

add_column()

add_column(key: str, data, is_categorical: bool | Literal['auto'] = 'auto')

Add or replace a column in the dataset.

Parameters:

  • key — Name of the column to add.
  • data — ArrayLike of values to store in the column. Must have length equal to the length of the dataset.
  • is_categorical — Whether the column values should be treated as categorical. If “auto” (the default), will autodetect.

add_evaluation_metric_values()

add_evaluation_metric_values(name: str, metric_values: ArrayLike, model: int | str | ModelMetadata = 0, lower_values_are_better: bool = True)

Add values for a custom evaluation metric.

Parameters:

  • name — A name for this evaluation metric. This will be used to name a column in the dataset where these values will be stored, as well as to name the metric itself.
  • metric_values — An arraylike with one value for each data point in the dataset.
  • model — The name or index of the model in self.dataset that this metric evaluates.
  • lower_values_are_better — If True, Cobalt will interpret lower values of this metric as positive; otherwise, it will interpret higher values as positive.

add_graph()

add_graph(name: str, graph: MultiResolutionGraph, subset: CobaltDataSubset, init_max_nodes: int = 500, init_max_degree: float = 15.0)

Add a graph to self.graphs.

Parameters:

  • name — A name for the graph.
  • graph — The graph to add.
  • subset — The subset of self.dataset this graph is constructed from.
  • init_max_nodes — The maximum number of nodes to show in the initial view of this graph.
  • init_max_degree — The maximum average node degree for the initial view of this graph.

add_group()

add_group(name: str, group: CobaltDataSubset)

Add a group to the collection of saved groups.

analyze()

static analyze(subset: CobaltDataSubset) -> Tuple[DataFrame, DataFrame]

Compute numerical and categorical statistics for the given subset.

Returns: A tuple (numerical_statistics, categorical_statistics) giving summary statistics for numerical and categorical features in the dataset.
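
As a sketch of the returned tuple's shape, assuming the usual pandas summary conventions (the exact statistics cobalt reports may differ), plain pandas produces the same split between numerical and categorical features:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, 35, 58, 41],              # numerical feature
    "country": ["US", "DE", "US", "FR"],  # categorical feature
})

# Split the summary the way analyze() splits its return value:
numerical_statistics = df.select_dtypes("number").describe()
categorical_statistics = df.select_dtypes(exclude="number").describe()
```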

auto_analysis()

auto_analysis(ref: str | CobaltDataSubset, cmp: str | CobaltDataSubset, model: int | str | ModelMetadata = 0, embedding: int | str | Embedding = 0, failure_metric: str | Series | None = None, min_size: int = 3, min_failures: int = 3, config: Dict[str, Dict] | None = None, run_name: str | None = None, manual: bool = True, visible: bool = True)

Returns an analysis of errors and warnings in the data and model.

Parameters:

  • ref — The subset of the data on which to do the reference analysis. Users should typically pass in the training dataset.
  • cmp — The subset of the data on which to do the comparison analysis. Users may pass in a test dataset, or a production dataset.
  • model — The index or name of the model object you want to consider.
  • embedding — The embedding to use to create the analysis graph. If not specified, the dataset’s default embedding is used.
  • failure_metric — The failure metric used to find error patterns.
  • min_size — The minimum size of a returned group.
  • min_failures — The minimum number of failures in a failure group, for a classification task.
  • config — A dictionary containing further configuration parameters that will be passed to the underlying algorithm.
  • run_name — A name under which to store the results. If one is not provided, it will be chosen automatically.
  • manual — Used internally to signal whether the clustering analysis was created by the user.
  • visible — Whether to show the results of this analysis in the UI.

Returns: A dictionary with keys “summaries” and “groups”. Under “summaries” is a tuple of two DataFrames. Under “groups” is a tuple of two lists of CobaltDataSubsets.

clustering_results

Type: Dict[str, GroupResultsCollection] — Results from all previous runs of the clustering algorithm.

drifted_groups

Type: Dict[str, GroupResultsCollection] — The collection of all drifted group analysis results.

export_groups_as_dataframe()

export_groups_as_dataframe() -> DataFrame

Exports saved groups as a DataFrame. The columns of the resulting DataFrame are named after the saved groups, and the column for each group contains a boolean mask indicating which data points in the dataset belong to that group.
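
A hypothetical sketch of the exported format (group names and memberships invented for illustration): one boolean column per saved group, over a five-point dataset.

```python
import pandas as pd

# The shape of the frame export_groups_as_dataframe() produces:
groups_df = pd.DataFrame({
    "high_error": [True, False, True, False, False],
    "drifted":    [False, False, True, True, False],
})

# Each column is a mask over the dataset; recover a group's row indices:
high_error_indices = groups_df.index[groups_df["high_error"]].tolist()
```

The same frame format is what import_groups_from_dataframe() consumes.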

failure_groups

Type: Dict[str, GroupResultsCollection] — The collection of all failure group analysis results.

feature_compare()

feature_compare(group_1: str | CobaltDataSubset, group_2: str | CobaltDataSubset | Literal['all', 'rest', 'neighbors'], numerical_features: List[str] | None = None, categorical_features: List[str] | None = None, numerical_test: Literal['t-test', 'perm'] = 't-test', categorical_test: Literal['G-test'] = 'G-test', include_nan: bool = False, neighbor_graph: str | MultiResolutionGraph | None = None)

Compare the distributions of features between two subsets.

find_clusters()

find_clusters(method: Literal['modularity'] = 'modularity', subset: CobaltDataSubset | str | None = None, graph: MultiResolutionGraph | None = None, embedding: int | str | Embedding = 0, min_group_size: int | float = 1, max_group_size: int | float = inf, max_n_groups: int = 10000, min_n_groups: int = 1, config: Dict[str, Any] | None = None, run_name: str | None = None, manual: bool = True, visible: bool = True, generate_group_descriptions: bool = True) -> GroupResultsCollection

Run an analysis to find natural clusters in the dataset. Saves the results in self.clustering_results under run_name.

Parameters:

  • method — Algorithm to use for finding clusters. Currently only “modularity” is supported.
  • subset — The subset of the data on which to perform the analysis. If none is provided, will use the entire dataset.
  • graph — A graph to use for the clustering. If none is provided, will create a new graph based on the specified embedding.
  • embedding — The embedding to use to create a graph if none is provided.
  • min_group_size — The minimum size for a returned cluster. Values between 0 and 1 are interpreted as fractions.
  • max_group_size — The maximum size for a returned cluster. Values between 0 and 1 are interpreted as fractions.
  • max_n_groups — The maximum number of clusters to return.
  • min_n_groups — The minimum number of clusters to return.
  • config — Further configuration parameters for the underlying algorithm.
  • run_name — A name under which to store the results.
  • manual — Used internally to signal whether the clustering analysis was created by the user.
  • visible — Whether to show the results of this analysis in the UI.
  • generate_group_descriptions — Whether to generate statistical and textual descriptions of returned clusters.

Returns: A GroupResultsCollection object containing the discovered clusters.

find_drifted_groups()

find_drifted_groups(reference_group: str | CobaltDataSubset, comparison_group: str | CobaltDataSubset, embedding: int = 0, relative_prevalence_threshold: float = 2, p_value_threshold: float = 0.05, min_size: int = 5, run_name: str | None = None, config: Dict[str, Any] | None = None, manual: bool = True, visible: bool = True, generate_group_descriptions: bool = True, model: int | str | ModelMetadata = 0) -> GroupResultsCollection

Return groups in the comparison group that are underrepresented in the reference group.

Parameters:

  • reference_group — The reference subset of the data, e.g. the training set.
  • comparison_group — The subset of the data that may have regions not well represented in the reference set.
  • embedding — The embedding to use for the analysis.
  • relative_prevalence_threshold — How much more common points from comparison_group need to be in a group relative to the overall average for it to be considered drifted.
  • p_value_threshold — Used in a significance test for prevalence.
  • min_size — The minimum number of data points in a drifted region.
  • run_name — A name under which to store the results.
  • config — Further configuration parameters for the underlying algorithm.
  • manual — Used internally to signal whether the analysis was created by the user.
  • visible — Whether to show the results of this analysis in the UI.
  • generate_group_descriptions — Whether to generate statistical and textual descriptions of returned groups.
  • model — Index or name of the model whose error metric will be shown with the returned groups.

Returns: A GroupResultsCollection object containing the discovered drifted groups.

find_failure_groups()

find_failure_groups(method: Literal['superlevel'] = 'superlevel', subset: CobaltDataSubset | str | None = None, model: int | str | ModelMetadata = 0, embedding: int | str | Embedding = 0, failure_metric: str | Series | None = None, min_size: int = 1, max_size: int | float = inf, min_failures: int = 3, config: Dict[str, Dict] | None = None, run_name: str | None = None, manual: bool = True, visible: bool = True, generate_group_descriptions: bool = True) -> GroupResultsCollection

Run an analysis to find failure groups in the dataset. Saves the results in self.failure_groups under run_name.

Parameters:

  • method — Algorithm to use for finding failure groups. Currently only “superlevel” is supported.
  • subset — The subset of the data on which to perform the analysis.
  • model — Index or name of the model for which failure groups should be found.
  • embedding — The embedding to use for the analysis.
  • failure_metric — The performance metric to use.
  • min_size — The minimum size for a returned failure group.
  • max_size — The maximum size for a returned failure group.
  • min_failures — The minimum number of failures for a returned failure group (classification tasks only).
  • config — Further configuration parameters for the underlying algorithm.
  • run_name — A name under which to store the results.
  • manual — Used internally to signal whether the analysis was created by the user.
  • visible — Whether to show the results of this analysis in the UI.
  • generate_group_descriptions — Whether to generate descriptions of returned groups.

Returns: A GroupResultsCollection object containing the discovered failure groups.

from_arrays()

static from_arrays(model_inputs: List | ndarray | DataFrame, model_predictions: ndarray, ground_truth: ndarray | None, task: str = 'classification', embedding: ndarray | None = None, embeddings: List[ndarray] | None = None, embedding_metric: str | None = None, embedding_metrics: List[str] | None = None, split: DatasetSplit | SplitDescriptor | None = None)

Returns a Workspace object constructed from user-defined arrays.

Parameters:

  • model_inputs — The data evaluated by the model.
  • model_predictions — The model’s predictions corresponding to model_inputs.
  • ground_truth — Ground truths for model_inputs.
  • task — The model task. Currently only “classification” is supported.
  • embedding — Embedding array to include.
  • embeddings — List of embedding arrays to use.
  • embedding_metric — Embedding metric corresponding to embedding.
  • embedding_metrics — List of metrics corresponding to embeddings.
  • split — An optional dataset split.
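
The arrays must agree along their first dimension, one entry per data point. A sketch of consistent inputs (shapes and names here are illustrative, not required by the API):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_features, embed_dim = 100, 4, 8

model_inputs = rng.normal(size=(n_points, n_features))   # what the model consumed
model_predictions = rng.integers(0, 3, size=n_points)    # one prediction per point
ground_truth = rng.integers(0, 3, size=n_points)         # one label per point
embedding = rng.normal(size=(n_points, embed_dim))       # one embedding row per point

# Every array shares the same leading dimension:
lengths = {len(model_inputs), len(model_predictions), len(ground_truth), len(embedding)}
```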

get_graph_level()

get_graph_level(graph: str | MultiResolutionGraph, level: int, name: str | None = None) -> GroupCollection

Create a GroupCollection from a specified level of a graph. This method is experimental and its interface may change.

Parameters:

  • graph — Name of the graph to use, or the graph object itself.
  • level — The level of the graph to use for the groups.
  • name — An optional name for the GroupCollection.

get_graph_levels()

get_graph_levels(graph: str | MultiResolutionGraph, min_level: int, max_level: int, name_prefix: str | None = None) -> Dict[int, GroupCollection]

Create GroupCollections for a range of levels of a graph. This method is experimental and its interface may change.

Parameters:

  • graph — Name of the graph to use, or the graph object itself.
  • min_level — The lowest level of the graph to use.
  • max_level — The highest level of the graph to use.
  • name_prefix — If provided, the GroupCollection for level i will be named “{name_prefix}_{i}”.

get_group_neighbors()

get_group_neighbors(group: CobaltDataSubset | str, graph: MultiResolutionGraph | str, size_ratio: float = 1.0) -> CobaltDataSubset

Find a set of data points that are neighbors of a group. Returns a set of data points well connected to the given group in the graph, excluding points from the original group. This method is experimental.

Parameters:

  • group — A CobaltDataSubset or name of a saved group to find the neighbors of.
  • graph — A MultiResolutionGraph or name of a graph in which to find the neighbors.
  • size_ratio — Approximate relative size of the group of neighbors.

get_groups()

get_groups() -> GroupCollection

Get a GroupCollection object with the currently saved groups.

graphs

Type: Dict[str, MultiResolutionGraph] — The graphs that have been created and saved.

import_groups_from_dataframe()

import_groups_from_dataframe(df: DataFrame)

Imports groups from a DataFrame with one column for each group. The name of each column will be used as the group name, and the entries will be interpreted as boolean values indicating membership.

new_graph()

new_graph(name: str | None = None, subset: CobaltDataSubset | str | None = None, embedding: int | str | Embedding = 0, metric: str | None = None, init_max_nodes: int = 500, init_max_degree: float = 15.0, **kwargs) -> MultiResolutionGraph

Create a new graph from a specified subset. The resulting graph will be returned and added to the Workspace.

Parameters:

  • name — The name to give the graph in self.graphs. If None, a name is chosen automatically.
  • subset — The subset of the dataset to include in the graph.
  • embedding — The embedding to use to generate the graph.
  • metric — The distance metric to use when constructing the graph.
  • init_max_nodes — The maximum number of nodes to show in the initial view.
  • init_max_degree — The maximum average node degree for the initial view.
  • **kwargs — Additional keyword parameters interpreted as parameters to construct a GraphSpec object.

saved_groups

Type: GroupCollection — An object that represents the currently saved groups. Does not include groups selected by algorithms like find_failure_groups(), only groups saved manually in the UI or with Workspace.add_group().

view_table()

view_table(subset: List[int] | CobaltDataSubset | None = None, display_columns: List[str] | None = None, max_rows: int | None = None)

Returns a visualization of the dataset table.


UI

class cobalt.UI(workspace: Workspace, dataset: CobaltDataset, table_image_size: Tuple[int, int] = (80, 80))

An interactive UI visualizing the data in a Workspace.

Parameters:

  • workspace — The Workspace object that this UI will visualize.
  • dataset — The CobaltDataset being analyzed.
  • table_image_size — For datasets with images, the (height, width) size in pixels for the data table.

build()

build()

Construct the UI. This normally happens automatically when the UI object appears as an output in a notebook cell.

get_current_graph()

get_current_graph() -> MultiResolutionGraph

Return the currently shown graph.

get_current_graph_source_data()

get_current_graph_source_data() -> CobaltDataSubset

Return the current dataset being displayed in the current graph. Note that if sub-sampling is enabled, this may not be the entire dataset.

get_graph_and_clusters()

get_graph_and_clusters() -> Tuple[Graph, List[CobaltDataSubset]]

Return the current graph and the datapoints that belong to each node.

Returns: A tuple (Graph, List[CobaltDataSubset]): the current graph as a networkx Graph, and a list of the data points that each node represents.

get_graph_selection()

get_graph_selection() -> CobaltDataSubset

Return the current subset selected in the graph.


CobaltDataset

class cobalt.CobaltDataset(dataset: DataFrame, metadata: DatasetMetadata | None = None, models: List[ModelMetadata] | None = None, embeddings: List[Embedding] | None = None, name: str | None = None, arrays: Dict[str, ndarray] | None = None)

Foundational object for a Cobalt analysis. Encapsulates all necessary information regarding the data, metadata, and model outputs associated with an analysis.

name

Type: str | None — Optional string for dataset name.

add_array()

add_array(key: str, array: ndarray | csr_array)

Add a new array to the dataset. Will raise an error if an array with the given name already exists.

add_embedding()

add_embedding(embedding: Embedding)

Add an Embedding object.

add_embedding_array()

add_embedding_array(embedding: ndarray | Any, metric: str = 'euclidean', name: str | None = None)

Add an embedding to the dataset.

Parameters:

  • embedding — An array or arraylike object containing the embedding values. Should be two-dimensional and have the same number of rows as the dataset.
  • metric — The preferred distance metric to use with this embedding. Defaults to “euclidean”; “cosine” is another useful option.
  • name — An optional name for the embedding.
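
The embedding must be two-dimensional, with one row per data point. A small shape sketch; for the “cosine” metric it is common (though not required) to work with unit-norm rows:

```python
import numpy as np

n_rows = 10  # must equal the length of the dataset
embedding = np.random.default_rng(1).normal(size=(n_rows, 16))

# Row-normalize if the embedding is meant to be compared with "cosine":
unit = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
```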

add_media_column()

add_media_column(paths: List[str], local_root_path: str | None = None, column_name: str | None = None)

Add a media column to the dataset.

Parameters:

  • paths — A list containing the paths to the media file for each data point.
  • local_root_path — A root path for all the paths.
  • column_name — The name for the column that should store the media file paths.

add_model()

add_model(input_columns: str | List[str] | None = None, target_column: str | List[str] | None = None, prediction_column: str | List[str] | None = None, task: str | ModelTask = 'custom', performance_columns: List[str | dict] | None = None, name: str | None = None)

Add a new model.

Parameters:

  • input_columns — The column(s) in the dataset that the model takes as input.
  • target_column — The column(s) with the target values for the model outputs.
  • prediction_column — The column(s) with the model’s outputs.
  • task — The task the model performs (“custom”, “regression”, or “classification”).
  • performance_columns — Columns containing pointwise model performance metrics.
  • name — An optional name for the model.

add_text_column_embedding()

add_text_column_embedding(source_column: str, embedding_model: str = 'all-MiniLM-L6-v2', embedding_name: str | None = None, device: str | None = None)

Create text embeddings from a column of the dataset using a sentence_transformers model.

Parameters:

  • source_column — The column containing the text to embed.
  • embedding_model — The name of the sentence_transformers model to use.
  • embedding_name — The name to save the embedding with.
  • device — The torch device to run the embedding model on.

array_names

Type: List[str] — Names of the arrays stored in this dataset.

as_subset()

as_subset()

Returns all rows of this CobaltDataset as a CobaltDataSubset.

compute_model_performance_metrics()

compute_model_performance_metrics()

Compute the performance metrics for each model in the dataset. Adds columns to the dataset storing the computed metrics.

create_rich_media_table()

create_rich_media_table(break_newlines: bool = True, highlight_terms: Dict[str, List[str]] | None = None, run_server: bool | None = False) -> DataFrame

Returns the data table as a DataFrame, with image columns rendered as HTML.

df

Type: DataFrame — Returns a pd.DataFrame of the underlying data for this dataset.

embedding_arrays

Type: List[ndarray] — A list of the raw arrays for each embedding. Deprecated: use CobaltDataset.embedding_metadata and Embedding.get() instead.

embedding_metadata

Type: List[Embedding] — The Embedding objects associated with this dataset.

embedding_names()

embedding_names() -> List[str]

Return the available embedding names.

filter()

filter(condition: str) -> CobaltDataSubset

Returns subset where condition evaluates to True in the DataFrame.

Parameters:

  • condition — String predicate evaluated using pd.eval.

Returns: Selected Subset of type CobaltDataSubset.
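
Since the condition string is evaluated with pd.eval, it follows pandas expression syntax; DataFrame.query accepts the same predicates, so the selection semantics can be sketched without cobalt:

```python
import pandas as pd

df = pd.DataFrame({"loss": [0.1, 0.9, 0.4], "split": ["train", "test", "test"]})

# The kind of predicate filter() would receive:
subset = df.query("loss > 0.3 and split == 'test'")
```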

from_dict()

classmethod from_dict(data) -> CobaltDataset

Instantiate a CobaltDataset from a dictionary representation.

from_json()

classmethod from_json(serialized_data: str) -> CobaltDataset

Deserialize a JSON string into a dataset.

get_array()

get_array(key: str) -> ndarray

Get an array from the dataset.

get_embedding()

get_embedding(index: int | str = 0) -> ndarray

Return the embedding associated with this CobaltDataset.

get_image_columns()

get_image_columns() -> List[str]

Return the names of the image columns.

get_model_performance_data()

get_model_performance_data(metric: str, model_index: int | str) -> ndarray

Returns the computed values of the given performance metric.

get_summary_statistics()

get_summary_statistics(categorical_max_unique_count: int = 10) -> Tuple[DataFrame, DataFrame]

Returns summary statistics for each feature in the dataset.

load()

classmethod load(file_path: str) -> CobaltDataset

Load a saved dataset from a .json file.

mask()

mask(m: ArrayLike) -> CobaltDataSubset

Return a CobaltDataSubset consisting of rows at indices where m is nonzero.
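
“Nonzero” here follows the usual NumPy convention: any truthy value selects its row. A minimal sketch of the index selection:

```python
import numpy as np

m = np.array([0, 1, 0, 2, 0.5])    # any ArrayLike mask; nonzero entries select rows
selected_rows = np.flatnonzero(m)  # the row indices mask() would keep
```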

metadata

Type: DatasetMetadata — A DatasetMetadata object containing the metadata for this dataset.

models

Type: ModelMetadataCollection — The models associated with this dataset. Each ModelMetadata object represents potential outcome, prediction, and error columns.

overall_model_performance_score()

overall_model_performance_score(metric: str, model_index: int | str) -> float

Computes the mean model performance score.

overall_model_performance_scores()

overall_model_performance_scores(model_index: int | str) -> Dict[str, float]

Computes performance score for each available metric.

sample()

sample(max_samples: int, random_state: int | None = None) -> CobaltDataSubset

Return a CobaltDataSubset containing up to max_samples sampled rows without replacement.

Parameters:

  • max_samples — The maximum number of samples to pull.
  • random_state — An optional integer seed for random sampling.

save()

save(file_path: str | PathLike) -> str

Write this dataset to a .json file. Returns the path written to.

select_col()

select_col(col: str) -> Series

Return the values for column col of this dataset.

set_column()

set_column(key: str, data, is_categorical: bool | Literal['auto'] = 'auto')

Add or replace a column in the dataset.

Parameters:

  • key — Name of the column to add.
  • data — ArrayLike of values. Must have length equal to the length of the dataset.
  • is_categorical — Whether the column values should be treated as categorical.

set_column_text_type()

set_column_text_type(column: str, input_type: TextDataType)

Set the type for a text column in the dataset. Options include “long_text” (keyword analysis, no coloring) and “short_text” (no keyword analysis, allows categorical coloring).

subset()

subset(indices: ArrayLike) -> CobaltDataSubset

Returns a CobaltDataSubset consisting of rows indexed by indices.

time_range()

time_range(start_time: Timestamp, end_time: Timestamp) -> CobaltDataSubset

Return a CobaltDataSubset within a time range [start_time, end_time).

Parameters:

  • start_time — A pd.Timestamp marking the start of the time window.
  • end_time — A pd.Timestamp marking the end of the time window.
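
Note the interval is half-open: start_time is included, end_time is excluded. A plain-pandas sketch of the boundary behavior:

```python
import pandas as pd

timestamps = pd.to_datetime(
    ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
)
start, end = pd.Timestamp("2024-01-02"), pd.Timestamp("2024-01-04")

# [start, end): the end timestamp itself is excluded.
in_range = (timestamps >= start) & (timestamps < end)
```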

to_dict()

to_dict() -> dict

Save all information in this dataset to a dict.

to_json()

to_json() -> str

Serialize this dataset to a JSON string.


CobaltDataSubset

class cobalt.CobaltDataSubset(source: CobaltDataset, indices: ndarray | List[int])

Represents a subset of a CobaltDataset. Should generally be constructed by calling subset() or similar methods on a CobaltDataset or CobaltDataSubset.

source_dataset

Type: CobaltDataset — The CobaltDataset of which this is a subset.

indices

Type: ndarray — np.ndarray of integer row indices defining the subset.

as_mask()

as_mask() -> ndarray[bool]

Returns mask of self on self.source_dataset.

as_mask_on()

as_mask_on(base_subset: CobaltDataSubset) -> ndarray[bool]

Returns mask of self on another subset. Raises ValueError if self is not a subset of base_subset.
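
Since subsets are defined by integer row indices into the source dataset, the mask a subset produces relative to a base subset can be sketched with NumPy (the index values here are illustrative):

```python
import numpy as np

base_indices = np.array([2, 5, 7, 9])  # base_subset's rows in the source dataset
sub_indices = np.array([5, 9])         # self's rows; a subset of base_indices

# Mask of self relative to base_subset's ordering:
mask = np.isin(base_indices, sub_indices)
```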

complement()

complement() -> CobaltDataSubset

Returns the complement of this set in its source dataset.

concatenate()

concatenate(dataset: CobaltDataSubset) -> CobaltDataSubset

Add another data subset to this one. Does not check for overlaps. Raises ValueError if subsets have different parent datasets.

create_rich_media_table()

create_rich_media_table(break_newlines: bool = True, highlight_terms: Dict[str, List[str]] | None = None, run_server: bool | None = False) -> DataFrame

Returns the data table as a DataFrame, with image columns rendered as HTML.

df

Type: DataFrame — Returns a pd.DataFrame of the data represented by this data subset.

difference()

difference(dataset: CobaltDataSubset) -> CobaltDataSubset

Returns the subset of self that is not contained in dataset. Raises ValueError if subsets have different parent datasets.

embedding_arrays

Type: List[ndarray] — A list of the raw arrays for each embedding. Deprecated: use CobaltDataset.embedding_metadata and Embedding.get() instead.

embedding_names()

embedding_names() -> List[str]

Return the available embedding names.

filter()

filter(condition: str) -> CobaltDataSubset

Returns subset where condition evaluates to True in the DataFrame.

Parameters:

  • condition — String predicate evaluated using pd.eval.

get_classifier()

get_classifier(model_type: Literal['svm', 'knn', 'rf'] | Callable = 'knn', embedding_index: int = 0, global_set: CobaltDataSubset | None = None, params: Dict | None = None)

Build a Classifier to distinguish this subset from the rest of the data. This is an experimental method.

Parameters:

  • model_type — A string representing the type of model to be trained.
  • embedding_index — Which embedding to use as inputs.
  • global_set — The ambient dataset to distinguish from. Defaults to self.source_dataset.
  • params — Keyword arguments passed to the classifier constructor.

get_embedding()

get_embedding(index: int | str = 0) -> ndarray

Return the embedding associated with this CobaltDataset.

get_image_columns()

get_image_columns() -> List[str]

Return the names of the image columns.

get_model_performance_data()

get_model_performance_data(metric: str, model_index: int | str) -> ndarray

Returns the computed values of the given performance metric.

get_model_performance_metrics()

get_model_performance_metrics() -> dict

Retrieve and aggregate performance metrics for each model in the subset.

Returns: A dictionary structured as {model_name: {metric_name: metric_value}}.

get_summary_statistics()

get_summary_statistics(categorical_max_unique_count: int = 10) -> Tuple[DataFrame, DataFrame]

Returns summary statistics for each feature in the dataset.

intersect()

intersect(dataset: CobaltDataSubset) -> CobaltDataSubset

Returns the intersection of self with dataset. Raises ValueError if subsets have different parent datasets.

intersection_size()

intersection_size(dataset: CobaltDataSubset) -> int

Returns the size of the intersection of self with dataset. More efficient than len(self.intersect(dataset)).
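
Because subsets are index sets over the same source dataset, the count is just the size of the index intersection; a NumPy sketch of the equivalence:

```python
import numpy as np

a = np.array([1, 4, 6, 9])  # indices of one subset
b = np.array([4, 9, 12])    # indices of another subset of the same dataset

# intersection_size() counts shared indices without materializing a new subset:
size = np.intersect1d(a, b).size
```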

mask()

mask(m: ArrayLike) -> CobaltDataSubset

Return a CobaltDataSubset consisting of rows at indices where m is nonzero.

metadata

Type: DatasetMetadata — A DatasetMetadata object containing the metadata for this dataset.

models

Type: ModelMetadataCollection — The models associated with this dataset.

overall_model_performance_score()

overall_model_performance_score(metric: str, model_index: int | str) -> float

Computes the mean model performance score.

overall_model_performance_scores()

overall_model_performance_scores(model_index: int | str) -> Dict[str, float]

Computes performance score for each available metric.

sample()

sample(max_samples: int, random_state: int | None = None) -> CobaltDataSubset

Return a CobaltDataSubset containing up to max_samples sampled rows without replacement.

Parameters:

  • max_samples — The maximum number of samples to pull.
  • random_state — An optional integer seed for random sampling.

select_col()

select_col(col: str) -> Series

Return the pd.Series for column col of this data subset.

subset()

subset(indices: ArrayLike) -> CobaltDataSubset

Returns a subset obtained via indexing into self.df. Tracks the dependency on self.source_dataset.

to_dataset()

to_dataset() -> CobaltDataset

Converts this subset to a standalone CobaltDataset.


ModelMetadata

class cobalt.ModelMetadata(outcome_columns: List[str], prediction_columns: List[str], task: ModelTask, input_columns: List[str] | None = None, error_columns: List[str] | None = None, evaluation_metrics: Sequence[EvaluationMetric | Dict] | None = None, name: str | None = None)

Information about a model and its relationship to a dataset. Stores information about the model’s inputs, outputs, ground truth data, and provides access to model performance metrics.

name

Type: str | None — An optional name for the model.

task

Type: ModelTask — The task performed by the model (“classification”, “regression”, or “custom”).

input_columns

Type: List[str] — A list of column(s) containing the input data for the model.

prediction_columns

Type: List[str] — A list of column(s) containing the outputs produced by the model.

outcome_columns

Type: List[str] — A list of column(s) containing the target outputs for the model.

add_metric_column()

add_metric_column(metric_name: str, column: str, lower_values_are_better: bool = True)

Add a column from the dataset as a performance metric for this model.

Parameters:

  • metric_name — The name for the metric. Use the same name across models for comparison.
  • column — The name of the column in the dataset that contains the metric values.
  • lower_values_are_better — Whether lower or higher values indicate better performance.

get_confusion_matrix()

get_confusion_matrix(dataset: DatasetBase, normalize_mode: bool | Literal['all', 'index', 'columns'] = 'index', selected_classes: List[str] | None = None) -> pd.DataFrame | None

Calculate the confusion matrix for the model if applicable.

Parameters:

  • dataset — The dataset containing the outcomes and predictions.
  • normalize_mode — Specifies the normalization mode for the confusion matrix.
  • selected_classes — Specifies the classes to include, with all others aggregated as “other”.

Returns: Confusion matrix as a DataFrame, or None if not applicable.
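
The normalize_mode values mirror pandas’ crosstab normalization. A sketch with made-up labels showing the default “index” mode, where each row is divided by its row total:

```python
import pandas as pd

truth = pd.Series(["cat", "cat", "dog", "dog", "dog"], name="truth")
pred = pd.Series(["cat", "dog", "dog", "dog", "cat"], name="pred")

# normalize="index": each row of the confusion matrix sums to 1.
cm = pd.crosstab(truth, pred, normalize="index")
```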

get_statistic_metrics()

get_statistic_metrics(dataset: DatasetBase, selected_classes: List[str] | None = None) -> pd.DataFrame

Return a DataFrame containing recall, precision, F1 score, and accuracy for each class.

Parameters:

  • dataset — The dataset to compute the confusion matrix.
  • selected_classes — List of classes to include. If None, all classes are calculated.

outcome_column

Type: str | None — Returns the first outcome column if len(outcome_columns) > 0, else None.

performance_metrics()

performance_metrics() -> Dict[str, EvaluationMetric]

Return the relevant performance metrics for this model. The returned objects have a calculate() method (pointwise metrics) and an overall_score() method (overall performance for a group).

prediction_column

Type: str | None — Returns the first prediction column if len(prediction_columns) > 0, else None.


DatasetMetadata

class cobalt.DatasetMetadata(media_columns: List[MediaInformationColumn] | None = None, timestamp_columns: List[str] | None = None, hidable_columns: List[str] | None = None, default_columns: List[str] | None = None, other_metadata_columns: List[str] | None = None, default_topic_column: str | None = None)

Encapsulates various metadata about a CobaltDataset.

media_columns

Type: List[MediaInformationColumn] | None — Optional list of MediaInformationColumns.

timestamp_columns

Type: List[str] | None — Optional list of timestamp column name strings.

hidable_columns

Type: List[str] | None — Optional list of hidable column name strings.

default_columns

Type: List[str] | None — Optional list containing the names of columns to display by default in an interactive data table.

other_metadata_columns

Type: List[str] | None — Optional list of column name strings.

data_types

Type: Dict — Dict mapping column names to DatasetColumnMetadata objects.

default_topic_column

Type: str | None — Default column to use for topic analysis. Will be None if len(self.long_text_columns) == 0.

long_text_columns

Type: List[str] — Columns containing large amounts of text data. Candidates for topic or keyword analysis.

timestamp_column()

timestamp_column(index=0) -> str

Return the (string) name of the timestamp column at the given index.


MediaInformationColumn

class cobalt.MediaInformationColumn(column_name: str, file_type: str, host_directory: str, is_remote=False)

Represent a column containing information about media files.

column_name

Type: str — Column name in dataframe.

file_type

Type: str — A string indicating the file type, e.g. its extension.

host_directory

Type: str — Path or URL where the file is located.

is_remote

Type: bool — Whether the file is remote.

autoname_media_visualization_column()

autoname_media_visualization_column() -> dict

Automatically generate a name for the media visualization column.


Embedding

class cobalt.Embedding(name=None)

Encapsulates metadata about a dataset embedding. Base class (ABC).

admissible_distance_metrics

Type: List[str] — Distance metrics that are reasonable to use with this embedding.

default_distance_metric

Type: str — Default distance metric to use with this embedding. (abstract)

dimension

Type: int — The dimension of the embedding. (abstract)

distance_metrics

Type: List[str] — Suggested distance metrics for use with this embedding.

get()

get(dataset: DatasetBase) -> np.ndarray

Get the values of this embedding for a dataset. (abstract)

get_available_distance_metrics()

get_available_distance_metrics() -> List[str]

Return the list of distance metrics that could be used. (abstract)


ArrayEmbedding

class cobalt.ArrayEmbedding(array_name: str, dimension: int, metric: str, name: str | None = None)

An embedding stored in an array associated with a Dataset. Inherits from Embedding.

array_name

Type: str — The name of the array in the dataset storing the embedding values.

admissible_distance_metrics

Type: List[str] — Distance metrics that are reasonable to use with this embedding.

default_distance_metric

Type: str — Default distance metric to use with this embedding.

dimension

Type: int — The dimension of the embedding.

distance_metrics

Type: List[str] — Suggested distance metrics for use with this embedding.

get()

get(dataset: DatasetBase) -> np.ndarray

Return an np.ndarray of the embedding rows for the data points in the given data(sub)set.

Parameters:

  • dataset — Data(sub)set for which to get the embedding values.

get_available_distance_metrics()

get_available_distance_metrics() -> List[str]

Return the list of distance metrics that could be used.


ColumnEmbedding

class cobalt.ColumnEmbedding(columns: List[str], metric: str, name=None)

Represents an embedding as a column range. Inherits from Embedding.

columns

Type: List[str] — List of strings naming the columns to include in this embedding.

admissible_distance_metrics

Type: List[str] — Distance metrics that are reasonable to use with this embedding.

default_distance_metric

Type: str — Default distance metric to use with this embedding.

dimension

Type: int — The dimension of the embedding.

distance_metrics

Type: List[str] — Suggested distance metrics for use with this embedding.

get()

get(dataset: DatasetBase) -> np.ndarray

Return an np.ndarray of the embedding rows for the data points in the given data(sub)set. Only the columns listed in the columns attribute are included.

Parameters:

  • dataset — Data(sub)set for which to get the embedding values.

get_available_distance_metrics()

get_available_distance_metrics() -> List[str]

Return the list of distance metrics that could be used.


DatasetSplit

class cobalt.DatasetSplit(dataset: CobaltDataset, split: Sequence[int] | Sequence[CobaltDataSubset | List[int] | ndarray] | Dict[str, CobaltDataSubset | List[int] | ndarray] | None = None, train: CobaltDataSubset | List[int] | ndarray | None = None, test: CobaltDataSubset | List[int] | ndarray | None = None, prod: CobaltDataSubset | List[int] | ndarray | None = None)

A dictionary-like container for user-defined subsets of data. Inherits from dict.

The DatasetSplit can contain any number of named subsets. Special subset names “train”, “test”, and “prod” are given extra meaning by Cobalt for automated analyses.
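Because DatasetSplit inherits from dict, each named subset is an ordinary key/value entry, with the special names also exposed as properties. A rough stand-in for this behavior, using plain index lists in place of CobaltDataSubset objects (a sketch, not the cobalt class itself):

```python
# Stand-in for DatasetSplit's dict-like behavior; index lists replace
# CobaltDataSubset objects here.
class SplitSketch(dict):
    @property
    def train(self):
        return self.get("train")

    @property
    def test(self):
        return self.get("test")

    @property
    def names(self):
        return list(self.keys())

# Any number of named subsets; "train"/"test"/"prod" get special meaning.
split = SplitSketch(train=[0, 1, 2], test=[3, 4], holdout=[5])
```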

clear()

clear() -> None

Remove all items.

comparable_subset_pairs

Type: List[Tuple[Tuple[str, CobaltDataSubset], Tuple[str, CobaltDataSubset]]] — Returns a list of pairs of disjoint subsets in this split, with names.

copy()

copy()

Return a shallow copy.

fromkeys()

classmethod fromkeys(iterable, value=None)

Create a new dictionary with keys from iterable and values set to value.

get()

get(key, default=None)

Return the value for key if key is in the dictionary, else default.

has_multiple_subsets

Type: bool — Whether this split has multiple disjoint subsets that can be compared.

items()

items()

Return a set-like object providing a view on items.

keys()

keys()

Return a set-like object providing a view on keys.

names

Type: List[str] — Names of subsets in this split.

pop()

pop(k[, d]) -> v

Remove specified key and return the corresponding value. If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem()

popitem()

Remove and return a (key, value) pair as a 2-tuple. Pairs are returned in LIFO order. Raises KeyError if the dict is empty.

prod

Type: CobaltDataSubset | None — The production subset, if it exists.

setdefault()

setdefault(key, default=None)

Insert key with a value of default if key is not in the dictionary. Return the value for key.

test

Type: CobaltDataSubset | None — The testing subset, if it exists.

train

Type: CobaltDataSubset | None — The training subset, if it exists.

update()

update([E, ] **F) -> None

Update from mapping/iterable E and F.

values()

values()

Return an object providing a view on values.


ProblemGroup

class cobalt.ProblemGroup(subset: CobaltDataSubset, name: str | None = None, metrics: Dict[str, float] = <factory>, description: str | None = None, display_info: GroupDisplayInfo = <factory>, keywords: Dict[str, GroupKeywords] = <factory>, comparison_stats: Dict[str, GroupComparisonStats] = <factory>, group_type: GroupType = GroupType.any, problem_description: str = '', severity: float = 1.0, primary_metric: str | None = None)

A group representing a problem with a model. Inherits from GroupMetadata and Group.

description

Type: str | None — A short description of the contents of the group.

group_type

Type: GroupType — Describes the semantic meaning of the group in context.

name

Type: str | None — The group’s name. Should be unique within a SubsetCollection.

primary_metric

Type: str | None — The main metric used to evaluate this group.

problem_description

Type: str — A brief description of the problem.

severity

Type: float — A score representing the degree of seriousness of the problem. Used to sort a collection of groups.

subset

Type: CobaltDataSubset — The data points included in this group.

metrics

Type: Dict[str, float] — Relevant numeric metrics for this group.

display_info

Type: GroupDisplayInfo — Information to be displayed in the group explorer in the UI.

keywords

Type: Dict[str, GroupKeywords] — Distinctive keywords found in text columns in the group.

comparison_stats

Type: Dict[str, GroupComparisonStats] — Results of statistical tests comparing this group with others.


SubsetCollection

class cobalt.SubsetCollection(source_dataset: CobaltDataset, indices: Sequence[Sequence[int]], name: str | None = None)

A collection of subsets of a CobaltDataset.

aggregate_col()

aggregate_col(col: str, method: Literal['mean', 'sum', 'mode'] | Callable[[Series], Any] | None = None) -> Sequence[float]

Aggregate the values of a column within each subset using the specified method.

concatenate()

concatenate() -> CobaltDataSubset

Concatenate all subsets in the collection.

get_array()

get_array(key: str) -> Sequence[ndarray]

Retrieve the slice of an array for each subset.

is_pairwise_disjoint()

is_pairwise_disjoint()

Return True if there are no overlaps between subsets, False otherwise.

select_col()

select_col(col: str) -> Sequence[Series]

Retrieve the values of a column on each subset.


GroupCollection

class cobalt.GroupCollection(source_dataset: CobaltDataset, indices: Sequence[Sequence[int]], name: str | None = None, group_type: GroupType = GroupType.any)

A collection of groups from a source CobaltDataset. Inherits from SubsetCollection.

A group consists of a subset of data points together with metadata (name, keywords, model performance metrics, distinctive features). Groups can be accessed by index (e.g. collection[0]) or by name (e.g. collection["group name"]). To access metadata, index into collection.metadata.
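The index-or-name access pattern can be sketched with a hypothetical stand-in class (not cobalt's code; group subsets are plain lists here):

```python
# Sketch of GroupCollection's access-by-index-or-name behavior.
class GroupCollectionSketch:
    def __init__(self, groups):
        # groups: dict mapping group name -> subset
        self._names = list(groups)
        self._subsets = list(groups.values())

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._subsets[key]       # collection[0]
        return self._subsets[self._names.index(key)]  # collection["name"]

collection = GroupCollectionSketch({"high error": [0, 4], "drift": [7, 9]})
first = collection[0]        # by index
drift = collection["drift"]  # by name
```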

aggregate_col()

aggregate_col(col: str, method: Literal['mean', 'sum', 'mode'] | Callable[[Series], Any] | None = None) -> Sequence[float]

Aggregate the values of a column within each subset using the specified method.

compare_models()

compare_models(models: Sequence[ModelMetadata | str], metrics: List[str], select_best_model: bool = True, statistical_test: Literal['t-test', 'wilcoxon'] | None = None) -> DataFrame

Produce a dataframe comparing two or more models on each group. Evaluates each specified metric for each model on each group.

compute_group_keywords()

compute_group_keywords(col: str | Sequence[str] | None = None, use_all_text_columns: bool = True, n_keywords: int = 10, set_descriptions: bool = True, set_names: bool = False, warn_if_no_data: bool = True, **kwargs)

Find distinctive keywords for each group and store them in the group metadata.

Parameters:

  • col — The column or columns containing text from which to extract keywords.
  • n_keywords — The number of keywords to find for each group.
  • set_names — If True, will set each group’s name based on the discovered keywords.

concatenate()

concatenate() -> CobaltDataSubset

Concatenate all subsets in the collection.

evaluate_model()

evaluate_model(model: ModelMetadata | str, metrics: Sequence[str] | None = None) -> DataFrame

Produce a dataframe containing model performance metrics for each group.

Parameters:

  • model — Name of the model to evaluate, or a ModelMetadata object.
  • metrics — Names of the metrics to evaluate. Defaults to all metrics defined for the model.

from_groups()

classmethod from_groups(groups: Sequence[GroupMetadata])

Create a GroupCollection from a list of GroupMetadata objects.

from_subset_collection()

classmethod from_subset_collection(subsets: SubsetCollection)

Promote a SubsetCollection to a GroupCollection. This allows adding metadata to each subset.

get_array()

get_array(key: str) -> Sequence[ndarray]

Retrieve the slice of an array for each subset.

is_pairwise_disjoint()

is_pairwise_disjoint()

Return True if there are no overlaps between subsets, False otherwise.

metadata

Type: GroupMetadataIndexer — An indexer that retrieves a group together with its metadata.

select_col()

select_col(col: str) -> Sequence[Series]

Retrieve the values of a column on each subset.

set_names_from_keywords()

set_names_from_keywords(col: str, n_keywords: int = 3, delimiter: str = ', ', min_match_rate: float = 0.0)

Set names for each group based on already-computed keywords.

Parameters:

  • col — The column whose keywords should be used to create the group names.
  • n_keywords — The number of keywords to use to form each name.
  • delimiter — The character(s) that should separate keywords.
  • min_match_rate — The minimum fraction of data points in the group that should contain a keyword.
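A hypothetical helper showing how these parameters combine into a name: keep keywords whose match rate clears the threshold, take the top n_keywords, and join them with the delimiter. This is an illustration of the documented behavior, not cobalt's implementation, and the keyword data is invented:

```python
def name_from_keywords(keywords, match_rates, n_keywords=3,
                       delimiter=", ", min_match_rate=0.0):
    # Drop keywords matched by too small a fraction of the group.
    eligible = [k for k, rate in zip(keywords, match_rates)
                if rate >= min_match_rate]
    return delimiter.join(eligible[:n_keywords])

name = name_from_keywords(
    ["refund", "delay", "shipping", "support"],
    [0.8, 0.6, 0.2, 0.1],
    min_match_rate=0.5,
)
```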

set_names_sequential()

set_names_sequential(prefix: str | None = None, prefix_source: Literal['group_type', 'collection_name'] = 'group_type', sep: str = ' ')

Set names for each group sequentially with a prefix string.


GroupResultsCollection

class cobalt.GroupResultsCollection(name: str, run_type: RunType, source_data: CobaltDataSubset, group_type: GroupType, algorithm: str, params: dict, groups=None, visible: bool = True, run_id: UUID | None = None)

Contains the results of a group analysis on a dataset. Inherits from GroupCollection.

aggregate_col()

aggregate_col(col: str, method: Literal['mean', 'sum', 'mode'] | Callable[[Series], Any] | None = None) -> Sequence[float]

Aggregate the values of a column within each subset using the specified method.

compare_models()

compare_models(models: Sequence[ModelMetadata | str], metrics: List[str], select_best_model: bool = True, statistical_test: Literal['t-test', 'wilcoxon'] | None = None) -> DataFrame

Produce a dataframe comparing two or more models on each group.

compute_group_keywords()

compute_group_keywords(col: str | Sequence[str] | None = None, use_all_text_columns: bool = True, n_keywords: int = 10, set_descriptions: bool = True, set_names: bool = False, warn_if_no_data: bool = True, **kwargs)

Find distinctive keywords for each group and store them in the group metadata.

concatenate()

concatenate() -> CobaltDataSubset

Concatenate all subsets in the collection.

evaluate_model()

evaluate_model(model: ModelMetadata | str, metrics: Sequence[str] | None = None) -> DataFrame

Produce a dataframe containing model performance metrics for each group.

from_groups()

classmethod from_groups(groups: Sequence[GroupMetadata])

Create a GroupCollection from a list of GroupMetadata objects.

from_subset_collection()

classmethod from_subset_collection(subsets: SubsetCollection)

Promote a SubsetCollection to a GroupCollection.

get_array()

get_array(key: str) -> Sequence[ndarray]

Retrieve the slice of an array for each subset.

groups

Type: List[ProblemGroup] — The groups in this collection as ProblemGroup objects.

is_pairwise_disjoint()

is_pairwise_disjoint()

Return True if there are no overlaps between subsets, False otherwise.

metadata

Type: GroupMetadataIndexer — An indexer that retrieves a group together with its metadata.

raw_groups

Type: List[CobaltDataSubset] — The groups as a list of CobaltDataSubset objects. Omits the descriptive metadata.

select_col()

select_col(col: str) -> Sequence[Series]

Retrieve the values of a column on each subset.

set_names_from_keywords()

set_names_from_keywords(col: str, n_keywords: int = 3, delimiter: str = ', ', min_match_rate: float = 0.0)

Set names for each group based on already-computed keywords.

set_names_sequential()

set_names_sequential(prefix: str | None = None, prefix_source: Literal['group_type', 'collection_name'] = 'group_type', sep: str = ' ')

Set names for each group sequentially with a prefix string.

summary()

summary(model: ModelMetadata | None = None, production_subset: CobaltDataSubset | None = None) -> DataFrame

Create a tabular summary of the groups in this collection.

Parameters:

  • model — A ModelMetadata object whose performance metrics will be computed for the groups.
  • production_subset — If provided, will calculate the fraction of data points in each group that fall in this subset.

name

Type: str — A name for the collection of results (the “run name”).

source_data

Type: CobaltDataSubset — The data(sub)set used for the analysis.

group_type

Type: GroupType — What each group in the collection represents (e.g. failure group or cluster).

algorithm

Type: str — The algorithm used to produce the groups.

params

Type: dict — Parameters passed to the group-finding algorithm.

run_type

Type: RunType — Whether the algorithm was run manually by the user or automatically by Cobalt.

visible

Type: bool — Whether the groups should be displayed in the UI.

run_id

Type: UUID — A unique ID for this collection of groups.


MultiResolutionGraph

class cobalt.MultiResolutionGraph(*args, **kwargs)

A graph with multiple resolution scales (Protocol class). There are n_levels different graphs arranged hierarchically. Each node of the graph at level i represents a subset of data points and is a subset of some node at each level j > i.

levels

Type: List[AbstractGraph] — A list of graphs representing the dataset at multiple resolution scales. (abstract)


AbstractGraph

class cobalt.AbstractGraph(*args, **kwargs)

A graph whose nodes represent disjoint subsets of a source dataset (Protocol class).

edge_list

Type: List[Tuple[int, int]] — List of tuples (i, j) representing edges i->j. Edges are undirected; only the direction with i < j is included. (abstract)

edge_mtx

Type: ndarray — A list of edges in numpy array form of shape (n_edges, 2). (abstract)

edge_weights

Type: ndarray — Nonnegative weights for each edge. (abstract)

edges

Type: List[Dict] — A list of dictionaries representing data for each edge. Contains at least “source”, “target”, and “weight” keys. (abstract)

n_edges

Type: int — Number of edges in the graph. (abstract)

nodes

Type: Sequence[Collection] — A list where entry i contains the data point ids represented in node i. (abstract)


GraphSpec

class cobalt.GraphSpec(X: ndarray, metric: str, M: int | None = None, K: int | None = None, min_nbrs: int | None = None, L_coarseness: int = 20, L_connectivity: int = 20, filters: Sequence[FilterSpec] = ())

A set of parameters for creating a graph.

K

Type: int | None — The number of mutual nearest neighbors to keep for each data point.

L_coarseness

Type: int — The number of neighbors to keep for each data point when clustering data points into graph nodes. Default: 20.

L_connectivity

Type: int — The number of neighbors to keep for each data point when connecting nodes in the graph. Default: 20.

M

Type: int | None — The number of nearest neighbors to compute for each data point.

filters

Type: Sequence[FilterSpec] — A (possibly empty) list of FilterSpec objects describing filters to apply to the graph.

min_nbrs

Type: int | None — The minimum number of neighbors to keep for each data point.

X

Type: ndarray — The source data. Should be an array of shape (n_points, n_dims).

metric

Type: str — The name of the distance metric to use to create the graph.
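The relationship between M and K can be illustrated with the mutual-nearest-neighbor idea: compute each point's M nearest neighbors, then keep only pairs where each point appears in the other's neighbor list. The sketch below shows that idea on 1-D points; it is not cobalt's graph-construction algorithm:

```python
points = [0.0, 0.1, 0.2, 5.0]
M = 2  # neighbors computed per point

def nearest(i, m):
    # Indices of the m points closest to point i (excluding i itself).
    others = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: abs(points[i] - points[j]),
    )
    return set(others[:m])

neighbors = {i: nearest(i, M) for i in range(len(points))}

# Keep only mutual pairs: i is in j's list and j is in i's list.
mutual_edges = {
    (i, j)
    for i in range(len(points))
    for j in neighbors[i]
    if i < j and i in neighbors[j]
}
```

The isolated point at 5.0 lists neighbors, but none of them list it back, so it contributes no mutual edges.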


FilterSpec

class cobalt.FilterSpec(f_vals: ndarray, n_bins: int = 10, bin_method: Literal['rng', 'uni'] = 'rng', pruning_method: Literal['bin', 'pct'] = 'bin', pruning_threshold: int | float = 1)

A set of parameters for a filter on a graph. Separates the dataset into n_bins bins based on the values of f_vals for each data point. Data points within each bin are clustered to form nodes, and are linked together if they are in nearby bins.

bin_method

Type: Literal['rng', 'uni'] — Either “rng” (equal width bins) or “uni” (equal count bins). Default: “rng”.

n_bins

Type: int — The number of bins to separate the dataset into. Default: 10.

pruning_method

Type: Literal['bin', 'pct'] — Either “bin” (edges between nearby bins only) or “pct” (edges within percentile threshold). Default: “bin”.

pruning_threshold

Type: int | float — The maximum distance (in bins for “bin”, or as a percentile for “pct”) at which two nodes can still be connected. Default: 1.

f_vals

Type: ndarray — An array of values, one for each data point.
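The two bin_method options correspond to equal-width and equal-count binning. A pure-Python sketch of the difference (illustrative only, not cobalt's code):

```python
def rng_bins(vals, n_bins):
    # "rng": equal-width bins over the value range.
    lo, hi = min(vals), max(vals)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in vals]

def uni_bins(vals, n_bins):
    # "uni": equal-count (quantile) bins by rank.
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    bins = [0] * len(vals)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(vals)
    return bins

vals = [0.0, 0.1, 0.2, 0.3, 10.0]
# With 2 bins, "rng" isolates the outlier at 10.0 in the top bin,
# while "uni" splits the points into equally sized groups.
```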


load_tabular_dataset()

cobalt.load_tabular_dataset(df: DataFrame, embeddings: DataFrame | ndarray | List[str] | Literal['numeric_cols', 'rf'] | None = None, rf_source_columns: List[str] | None = None, metadata_df: DataFrame | None = None, timestamp_col: str | None = None, outcome_col: str | None = None, prediction_col: str | None = None, other_metadata: List[str] | None = None, hidden_cols: List[str] | None = None, baseline_column: str | None = None, baseline_end_time: Timestamp | None = None, baseline_indices: List[int] | ndarray | None = None, split_column: str | None = None, embedding_metric: str = 'euclidean', task: Literal['classification', 'regression'] | None = None, model_name: str | None = None) -> Tuple[CobaltDataset, DatasetSplit]

Loads tabular data from a pandas DataFrame into a CobaltDataset.

Note: This function is deprecated. Users should transition to constructing a CobaltDataset directly from a DataFrame.

Parameters:

  • df — A pandas.DataFrame containing the tabular source data.
  • embeddings — Specifies which data to use as embedding columns. May be a DataFrame, ndarray, List[str], “numeric_cols”, or “rf”.
  • rf_source_columns — Columns to use in the random forest embedding.
  • metadata_df — Optional DataFrame containing additional metadata columns.
  • timestamp_col — String name of the timestamp column.
  • outcome_col — String name of the outcome variable column.
  • prediction_col — String name of the model predictions column.
  • other_metadata — Optional list of other metadata column names.
  • hidden_cols — Optional list of columns that will not be displayed in TableViews.
  • baseline_column — Optional indicator column marking baseline set membership.
  • baseline_end_time — Optional pd.Timestamp; datapoints with timestamps <= this value are baseline.
  • baseline_indices — Optional list of row indices for baseline set.
  • split_column — Name of a categorical column containing split labels.
  • embedding_metric — Distance metric for the embedding. Default: “euclidean”.
  • task — Model task type (“regression” or “classification”).
  • model_name — A string name for the model being analyzed.

Returns: A (CobaltDataset, DatasetSplit) tuple.


get_tabular_embeddings()

cobalt.get_tabular_embeddings(df: DataFrame, model_name: Literal['rf'] | None = None, outcome: str | None = None) -> Tuple[ndarray, str, str]

Create an embedding array based on the given DataFrame and embedding method. Currently supports generating embeddings via a random forest model.

Parameters:

  • df — pandas.DataFrame containing the data.
  • model_name — Name of the embedding model to use; currently only “rf” (random forest) is supported.
  • outcome — String name of the desired outcome column.

Returns: A tuple (embedding_array, metric, name).


settings

class cobalt.settings

Settings that affect global behavior.

graph_decay_node_repulsion

Type: bool — Whether to decay repulsive forces between nodes beyond a certain distance. Default: True. Must be set before graph creation.

graph_layout_singletons_separately

Type: bool — Whether to lay out singleton nodes separately from all other components. Default: False. Must be set before graph creation.

graph_prevent_node_overlaps

Type: bool — Whether to prevent nodes from overlapping. Default: True. Must be set before graph creation.

graph_use_rich_node_labels

Type: bool — Whether to use the rich hover label format for graph nodes by default. Default: False. Must be set before graph creation.

table_max_base64_total_size

Type: int — The maximum amount of image data to base64 encode in the table data payload. Default: 20000000.
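Settings are assigned as attributes on the module-level settings object; the graph-related settings must be set before any graph is created. A configuration fragment (assuming cobalt is importable):

```python
import cobalt

# Graph layout settings: set these before creating any graphs.
cobalt.settings.graph_prevent_node_overlaps = False
cobalt.settings.graph_layout_singletons_separately = True

# Limit inline base64 image data in table payloads to ~10 MB.
cobalt.settings.table_max_base64_total_size = 10_000_000
```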


check_license()

cobalt.check_license()

Check the configured license key and print the result.

setup_api_client()

cobalt.setup_api_client()

Set up the API client by updating or adding the API key to the JSON config file.

get_api_client()

cobalt.get_api_client(api_name: str = 'openai')

Get the API client by loading the API key from the JSON config or environment variables.

setup_license()

cobalt.setup_license()

Prompts for a license key and sets it in the configuration file. The license key will be saved in ~/.config/cobalt/cobalt.json.

register_license()

cobalt.register_license(force: bool = False)

Registers this installation of Cobalt for noncommercial or trial usage. Requests your name and email address and configures a license key.


Lab Functionality

The lab submodule contains preliminary and experimental functionality. APIs in this module are subject to change without warning.

describe_groups_multiresolution()

cobalt.lab.describe_groups_multiresolution(ds: CobaltDataset, text_column_name: str, n_gram_range: str | Tuple, aggregation_columns: List[str] | None = None, min_level: int = 0, max_level: int | None = None, max_keywords: int = 3, aggregation_method: Literal['all', 'mean'] | List[Callable] = 'mean', return_intermediates: bool = False) -> Tuple[DataFrame, Workspace, Dict[int, Dict[int, str]]]

Returns a summary of groups in a set of texts. Builds a multiresolution graph from the embeddings provided in the input dataset and computes keyword descriptions of text contained in each node.

Parameters:

  • ds — Dataset containing an embedding of the text data.
  • text_column_name — Column containing text data for keyword analysis.
  • n_gram_range — Whether to analyze keywords with unigrams, bigrams, or a combination.
  • aggregation_columns — Columns in ds to aggregate.
  • min_level — Minimum graph level to output cluster labels for.
  • max_level — Maximum graph level to output cluster labels for.
  • max_keywords — Maximum number of keywords to find for each cluster.
  • aggregation_method — Method(s) to aggregate columns by.
  • return_intermediates — Whether to return intermediate results.

Returns: A tuple of (DataFrame per level with labels, Workspace object, raw labels per level per node).