amulety.utils

Main module.

Functions

`batch_loader`(data, batch_size)	This function generates batches from the provided data.
`check_dependencies`()	Check if optional embedding dependencies are installed and provide installation instructions.
`concatenate_heavylight`(data, sequence_col, ...)	Concatenates heavy and light chain per cell using AMULETY's unified H/L interface.
`get_cdr3_sequence_column`(airr, ...)	Get the best CDR3 sequence column for TCR data.
`insert_space_every_other_except_cls`(input_string)	This function inserts a space after every character in the input string, except for the '[CLS]' token.
`process_airr`(airr_df, chain_mode[, ...])	Processes AIRR-seq data and returns a pandas DataFrame containing sequences to embed.
`process_h_plus_l`(data, sequence_col, cell_id_col)	Processes both heavy and light chains separately for H+L, H, or L formats.

Classes

ConditionalFormatter([fmt, datefmt, style, ...])

class ConditionalFormatter(fmt=None, datefmt=None, style='%', validate=True, *, defaults=None)[source]

Bases: Formatter

format(record)[source]

Format the specified record as text.

The record’s attribute dictionary is used as the operand to a string formatting operation which yields the returned string. Before formatting the dictionary, a couple of preparatory steps are carried out. The message attribute of the record is computed using LogRecord.getMessage(). If the formatting string uses the time (as determined by a call to usesTime()), formatTime() is called to format the event time. If there is exception information, it is formatted using formatException() and appended to the message.

batch_loader(data: Iterable, batch_size: int)[source]

This function generates batches from the provided data.

Parameters:

data (Iterable) – The data to be batched.
batch_size (int) – The size of each batch.

Yields:

tuple – A tuple containing the start index, end index, and the batch of data.
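The batching contract described above can be sketched as follows. This is a hypothetical re-implementation for illustration, not AMULETY's own code: the name `batch_loader_sketch` is invented here, and the sketch assumes that `data` supports `len()` and slicing and that the end index is exclusive.

```python
from typing import Iterator, Sequence, Tuple


def batch_loader_sketch(
    data: Sequence, batch_size: int
) -> Iterator[Tuple[int, int, Sequence]]:
    """Yield (start, end, batch) for consecutive slices of `data`.

    Assumption: `end` is exclusive, so batch == data[start:end].
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(data), batch_size):
        # The last batch may be shorter than batch_size.
        end = min(start + batch_size, len(data))
        yield start, end, data[start:end]


sequences = ["EVQL", "QVQL", "DIQM", "EIVL", "QSAL"]
batches = list(batch_loader_sketch(sequences, batch_size=2))
for start, end, batch in batches:
    print(start, end, batch)
```

The start and end indices let a caller write each batch of results back into a preallocated array at the matching positions.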

check_dependencies()[source]

Check if optional embedding dependencies are installed and provide installation instructions.

This function checks all model types (BCR, TCR, and protein language models) for missing dependencies.

Returns:

List of tuples (model_name, installation_command) for missing dependencies.

Return type:

list
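One plausible way to produce a list of this shape is to probe for each module with `importlib.util.find_spec`. The sketch below is an assumption about the general technique, not AMULETY's implementation: the helper `find_missing_dependencies`, the model names, and the installation commands are all placeholders invented for the example.

```python
from importlib.util import find_spec
from typing import Dict, List, Tuple


def find_missing_dependencies(
    requirements: Dict[str, Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return (model_name, installation_command) for each unimportable module.

    `requirements` maps a model name to (module_name, installation_command).
    Hypothetical helper for illustration only.
    """
    missing = []
    for model_name, (module_name, install_cmd) in requirements.items():
        # find_spec returns None when a top-level module cannot be located.
        if find_spec(module_name) is None:
            missing.append((model_name, install_cmd))
    return missing


# Placeholder entries: "json" ships with Python, the second module is made up.
example_requirements = {
    "always-present": ("json", "(nothing to install)"),
    "made-up-model": (
        "made_up_embedding_backend_xyz",
        "pip install made-up-embedding-backend",
    ),
}
missing = find_missing_dependencies(example_requirements)
for model_name, install_cmd in missing:
    print(f"{model_name} is unavailable; install with: {install_cmd}")
```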

concatenate_heavylight(data: DataFrame, sequence_col: str, cell_id_col: str, duplicate_col: str = 'duplicate_count', order: str = 'HL', mode: str = 'concat')[source]

Concatenates heavy and light chain per cell using AMULETY’s unified H/L interface.

Concatenates sequences as: Heavy<cls><cls>Light (HL order) or Light<cls><cls>Heavy (LH order) for both BCR (IGH + IGL/IGK) and TCR (TRB/TRD + TRA/TRG) data. See embed_airr() documentation for chain mappings.

If a cell contains multiple chains of the same type, the chain with the highest value in duplicate_col is selected.

Parameters:

data (pandas.DataFrame) – Input data containing heavy and light chain information. Must include columns: cell_id_col, “chain”, duplicate_col, sequence_col.
sequence_col (str) – The name of the column containing the amino acid sequences to embed.
cell_id_col (str) – The name of the column containing the single-cell barcode.
duplicate_col (str) – The name of the numeric column used to select the best chain when multiple chains of the same type exist per cell. Default: “duplicate_count”.
order (str) – Chain concatenation order, either “HL” (Heavy-Light) or “LH” (Light-Heavy). Default: “HL”.
mode (str) – Mode to use when combining sequences. The default, “concat”, concatenates the sequences; “tab” tabulates the sequences alone, and “tab_locus_gene” tabulates them together with the locus and segment.

Returns:

DataFrame with concatenated heavy and light chains per cell. Format: HEAVY<cls><cls>LIGHT for each cell (in the default “HL” order).

Return type:

pandas.DataFrame

Raises:

ValueError – If required columns are missing or duplicate_col is not numeric.
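The selection and concatenation logic described above can be illustrated with a dependency-free sketch over plain dictionaries instead of a DataFrame. It is not the library's code: the function name is invented, and it assumes that the “chain” column holds the values "H" and "L" and that cells missing either chain are dropped, neither of which is stated in this reference.

```python
from typing import Dict, Iterable

SEPARATOR = "<cls><cls>"


def concat_heavy_light_sketch(
    rows: Iterable[Dict], order: str = "HL"
) -> Dict[str, str]:
    """Map each cell_id to its joined heavy/light sequence."""
    if order not in ("HL", "LH"):
        raise ValueError('order must be "HL" or "LH"')
    best: Dict[str, Dict[str, Dict]] = {}
    for row in rows:
        per_cell = best.setdefault(row["cell_id"], {})
        current = per_cell.get(row["chain"])
        # Keep the chain with the highest duplicate_count per cell and type.
        if current is None or row["duplicate_count"] > current["duplicate_count"]:
            per_cell[row["chain"]] = row
    result = {}
    for cell_id, chains in best.items():
        if "H" not in chains or "L" not in chains:
            continue  # assumption: unpaired cells are skipped
        heavy = chains["H"]["sequence_vdj_aa"]
        light = chains["L"]["sequence_vdj_aa"]
        first, second = (heavy, light) if order == "HL" else (light, heavy)
        result[cell_id] = first + SEPARATOR + second
    return result


rows = [
    {"cell_id": "cell1", "chain": "H", "duplicate_count": 5, "sequence_vdj_aa": "EVQLVE"},
    {"cell_id": "cell1", "chain": "H", "duplicate_count": 12, "sequence_vdj_aa": "QVQLQQ"},
    {"cell_id": "cell1", "chain": "L", "duplicate_count": 7, "sequence_vdj_aa": "DIQMTQ"},
    {"cell_id": "cell2", "chain": "H", "duplicate_count": 3, "sequence_vdj_aa": "EVQLLE"},
]
hl = concat_heavy_light_sketch(rows, order="HL")
lh = concat_heavy_light_sketch(rows, order="LH")
print(hl)
print(lh)
```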

get_cdr3_sequence_column(airr: DataFrame, default_sequence_col: str)[source]

Get the best CDR3 sequence column for TCR data.

Parameters:

airr (pd.DataFrame) – AIRR DataFrame
default_sequence_col (str) – Default sequence column name

Returns:

The best CDR3 sequence column name

Return type:

str

insert_space_every_other_except_cls(input_string: str)[source]

This function inserts a space after every character in the input string, except for the ‘[CLS]’ token.

Parameters:

input_string (str) – The input string where spaces are to be inserted.

Returns:

The modified string with spaces inserted.

Return type:

str
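A minimal sketch of this kind of tokenizer-friendly spacing is shown below. It is a hypothetical stand-in rather than the function's actual body: it assumes tokens are separated by single spaces with no trailing space, which may differ from the exact whitespace the library emits.

```python
import re


def space_residues_keep_cls(input_string: str) -> str:
    """Separate every character with a space, keeping '[CLS]' as one token."""
    tokens = []
    # The capturing group makes re.split keep each '[CLS]' in the result.
    for piece in re.split(r"(\[CLS\])", input_string):
        if piece == "[CLS]":
            tokens.append(piece)
        else:
            tokens.extend(piece)  # one token per character
    return " ".join(tokens)


spaced = space_residues_keep_cls("EVQ[CLS][CLS]DIQ")
print(spaced)
```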

process_airr(airr_df: DataFrame, chain_mode: str, sequence_col: str = 'sequence_vdj_aa', cell_id_col: str = 'cell_id', duplicate_col: str = 'duplicate_count', receptor_type: str = 'all', mode: str = 'concat')[source]

Processes AIRR-seq data and returns a pandas DataFrame containing sequences to embed.

Uses AMULETY’s unified H/L/HL interface for both BCR and TCR data. See embed_airr() function documentation for detailed chain parameter explanations.

Parameters:

airr_df (pandas.DataFrame) – Input AIRR rearrangement table as a pandas DataFrame.
chain_mode (str) – The input chain, one of [“H”, “L”, “HL”, “LH”, “H+L”].
sequence_col (str) – The name of the column containing the amino acid sequences to embed.
cell_id_col (str) – The name of the column containing the single-cell barcode.
receptor_type (str) – The receptor type to validate, one of [“BCR”, “TCR”, “all”]:
- “BCR”: validates that only BCR chains (IGH, IGL, IGK) are present
- “TCR”: validates that only TCR chains (TRA, TRB, TRG, TRD) are present
- “all”: allows both BCR and TCR chains in the same file
duplicate_col (str) – The name of the numeric column used to select the best chain when multiple chains of the same type exist per cell. Default: “duplicate_count”.
mode (str) – Mode to use when combining sequences. The default, “concat”, concatenates the sequences when an HL chain mode is provided; “tab” tabulates the sequences alone, and “tab_locus_gene” tabulates them together with the locus and segment.

Returns:

DataFrame with formatted sequences.

Return type:

pandas.DataFrame

Raises:

ValueError – If chain_mode is not one of [“H”, “L”, “HL”, “LH”, “H+L”] or receptor_type validation fails.
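The two validation rules that lead to a ValueError can be sketched as below. The function name and error messages are invented for the example, and the sketch assumes validation is done on the set of locus values found in the table; the library's actual checks may be organized differently.

```python
from typing import Iterable

VALID_CHAIN_MODES = ("H", "L", "HL", "LH", "H+L")
BCR_LOCI = {"IGH", "IGL", "IGK"}
TCR_LOCI = {"TRA", "TRB", "TRG", "TRD"}
ALLOWED_LOCI = {"BCR": BCR_LOCI, "TCR": TCR_LOCI, "all": BCR_LOCI | TCR_LOCI}


def validate_inputs_sketch(
    loci: Iterable[str], chain_mode: str, receptor_type: str = "all"
) -> None:
    """Raise ValueError for an unknown chain_mode or unexpected loci."""
    if chain_mode not in VALID_CHAIN_MODES:
        raise ValueError(f"chain_mode must be one of {VALID_CHAIN_MODES}")
    if receptor_type not in ALLOWED_LOCI:
        raise ValueError('receptor_type must be "BCR", "TCR", or "all"')
    # Any locus outside the allowed set for this receptor type is an error.
    unexpected = sorted(set(loci) - ALLOWED_LOCI[receptor_type])
    if unexpected:
        raise ValueError(f"unexpected loci for {receptor_type}: {unexpected}")


validate_inputs_sketch(["IGH", "IGK"], chain_mode="HL", receptor_type="BCR")
try:
    validate_inputs_sketch(["IGH", "TRB"], chain_mode="HL", receptor_type="BCR")
    mixed_rejected = False
except ValueError as err:
    mixed_rejected = True
    print(err)
```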

process_h_plus_l(data: DataFrame, sequence_col: str, cell_id_col: str, duplicate_col: str = 'duplicate_count', mode: str = 'tab')[source]

Processes both heavy and light chains separately for H+L, H, or L formats.

Returns a DataFrame with heavy and/or light chain sequences for each cell, keeping them as separate entries rather than concatenating them. Supports different output modes including tab_locus_gene format.

If a cell contains multiple chains of the same type, the chain with the highest value in duplicate_col is selected.

Parameters:

data (pandas.DataFrame) – Input data containing chain information.
sequence_col (str) – The name of the column containing the amino acid sequences.
cell_id_col (str) – The name of the column containing the single-cell barcode.
duplicate_col (str) – The name of the numeric column used to select the best chain.
mode (str) – Output mode - “tab” for simple tabular format, “tab_locus_gene” for extended format with V/J gene information.

Returns:

DataFrame with processed chain sequences in the specified format.

Return type:

pandas.DataFrame