amulety.amulety

Console script for amulety

Functions

check_deps([log_file, verbose])

Check if optional embedding dependencies and tools are installed.

check_igblast_available()

Check if IgBlast is available in the system.

common_options([log_file, verbose])

AMULETY: Adaptive imMUne receptor Language model Embedding tool for TCR and antibodY

embed(input_airr, chain, model, ...)

Embeds sequences from an AIRR rearrangement file using the specified model.

embed_airr(airr, chain, model[, ...])

Embeds sequences from an AIRR DataFrame using the specified model.

main()

Main entry point for the AMULETY CLI application.

setup_logging([log_file, verbose])

Configure logging for the application.

translate_airr(airr, tmpdir, reference_dir)

Translates nucleotide sequences to amino acid sequences using IgBlast.

translate_igblast(input_file_path, ...)

Translates nucleotide sequences to amino acid sequences using IgBlast.

check_deps(log_file: str = None, verbose: bool = False)

Check if optional embedding dependencies and tools are installed.
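A check of this kind can be approximated with the standard library alone. The sketch below is not AMULETY's implementation: module_available and report_optional are hypothetical helpers, and the package names passed in are illustrative guesses rather than the tool's actual list of optional dependencies.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def report_optional(names):
    """Map each top-level module name to whether it can be located."""
    return {name: module_available(name) for name in names}


# Example names only; AMULETY's real optional dependencies may differ.
print(report_optional(["torch", "transformers"]))
```

Using find_spec avoids importing heavy packages just to learn whether they are installed.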

check_igblast_available()

Check if IgBlast is available in the system.
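One plausible way to perform such a check is to look the executable up on PATH. This is a minimal sketch, not the function's actual body: tool_on_path is a hypothetical helper, and the assumption that the lookup targets the igblastn program is mine.

```python
import shutil


def tool_on_path(executable):
    """Return True if `executable` resolves to a program on the current PATH."""
    return shutil.which(executable) is not None


# "igblastn" is the nucleotide search program shipped with IgBlast;
# that AMULETY looks for this exact name is an assumption.
print("IgBlast found:", tool_on_path("igblastn"))
```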

common_options(log_file: str = None, verbose: bool = False)

AMULETY: Adaptive imMUne receptor Language model Embedding tool for TCR and antibodY

Global logging options can be specified before any command.

embed(input_airr: str, chain: str, model: str, output_file_path: str, cache_dir: str = '/tmp/amulety-cache', sequence_col: str = 'sequence_vdj_aa', cell_id_col: str = 'cell_id', batch_size: int = 50, model_path: str = None, embedding_dimension: int = None, max_length: int = None, duplicate_col: str = 'duplicate_count', installation_path: str = None, residue_level: bool = False, log_file: str = None, verbose: bool = False)

Embeds sequences from an AIRR rearrangement file using the specified model. It returns the embeddings in the specified output format along with the filtered input AIRR data.

Example usage:

amulety embed --input-airr airr_rearrangement.tsv --chain HL --model antiberta2 --output-file-path out.tsv
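When the command is driven from a script, building the argument list explicitly avoids shell-quoting mistakes. The sketch below uses only the four options shown in the example, written with double hyphens; build_embed_command is a hypothetical helper, not part of AMULETY.

```python
import shlex


def build_embed_command(input_airr, chain, model, output_file_path):
    """Assemble the argv list for `amulety embed` from the documented options."""
    return [
        "amulety", "embed",
        "--input-airr", input_airr,
        "--chain", chain,
        "--model", model,
        "--output-file-path", output_file_path,
    ]


cmd = build_embed_command("airr_rearrangement.tsv", "HL", "antiberta2", "out.tsv")
print(shlex.join(cmd))
```

With AMULETY installed, the list can be passed to subprocess.run(cmd, check=True).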

embed_airr(airr: DataFrame, chain: str, model: str, sequence_col: str = 'sequence_vdj_aa', cell_id_col: str = 'cell_id', cache_dir: str = '/tmp/amulety', batch_size: int = 50, embedding_dimension: int = None, max_length: int = None, model_path: str = None, output_type: str = 'pickle', duplicate_col: str = 'duplicate_count', installation_path: str = None, residue_level: bool = False)

Embeds sequences from an AIRR DataFrame using the specified model.

Parameters:
  • airr (pd.DataFrame) – Input AIRR rearrangement table as a pandas DataFrame.

  • chain (str) – The input chain, one of ["H", "L", "HL", "LH", "H+L"].
    For BCR: H = heavy, L = light, HL = heavy-light pairs, LH = light-heavy pairs, H+L = both chains separately.
    For TCR: H = beta/delta, L = alpha/gamma, HL = beta-alpha/delta-gamma pairs, LH = alpha-beta/gamma-delta pairs, H+L = both chains separately.

  • model (str) – The embedding model to use.
    BCR models: ["ablang", "antiberta2", "antiberty", "balm-paired"].
    TCR models: ["tcr-bert", "tcrt5"].
    Immune models (BCR and TCR): ["immune2vec"].
    Protein models: ["esm2", "prott5", "custom"].
    Use "custom" for fine-tuned models (requires model_path, embedding_dimension, and max_length).

  • sequence_col (str) – The name of the column containing the amino acid sequences to embed.

  • cell_id_col (str) – The name of the column containing the single-cell barcode.

  • cache_dir (Optional[str]) – Directory in which the pre-trained model weights are cached.

  • batch_size (int) – The batch size of sequences to embed.

  • embedding_dimension (int) – The embedding dimension for custom models.

  • max_length (int) – The maximum sequence length for custom models.

  • model_path (str) – The path to the custom model.

  • output_type (str) – The type of output to return: "df" for a pandas DataFrame, "pickle" for a serialized torch object, or "anndata" for an anndata object.

  • duplicate_col (str) – The name of the numeric column used to select the best chain when multiple chains of the same type exist per cell. Default: “duplicate_count”.

  • installation_path (str) – Custom path to Immune2Vec installation directory (optional).

  • residue_level (bool) – If True, returns residue-level embeddings of dimension sequence length x embedding dimension (L x D) instead of sequence-level (1 x D).

Returns:

  • embeddings (df/pickle/anndata) – The embeddings as a pandas DataFrame (if output_type="df"), a serialized torch object (if output_type="pickle"), or an anndata object (if output_type="anndata").

  • metadata (pd.DataFrame) – The filtered input AIRR DataFrame with the metadata.

Return type:

tuple
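Because model loading can be slow, it may help to validate the arguments before calling embed_airr. The sketch below encodes only the chain and model values listed above; check_embed_args and embed_paired are hypothetical helpers, and the import path is assumed from this page's module name, amulety.amulety.

```python
VALID_CHAINS = {"H", "L", "HL", "LH", "H+L"}
VALID_MODELS = {
    "ablang", "antiberta2", "antiberty", "balm-paired",  # BCR models
    "tcr-bert", "tcrt5",                                  # TCR models
    "immune2vec",                                         # BCR and TCR
    "esm2", "prott5", "custom",                           # protein models
}


def check_embed_args(chain, model, model_path=None,
                     embedding_dimension=None, max_length=None):
    """Raise ValueError if the arguments contradict the documented options."""
    if chain not in VALID_CHAINS:
        raise ValueError(f"unknown chain: {chain!r}")
    if model not in VALID_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if model == "custom":
        required = {
            "model_path": model_path,
            "embedding_dimension": embedding_dimension,
            "max_length": max_length,
        }
        missing = [name for name, value in required.items() if value is None]
        if missing:
            raise ValueError(f"custom model requires: {', '.join(missing)}")


def embed_paired(airr):
    """Embed heavy-light pairs; needs AMULETY installed (imported lazily)."""
    from amulety.amulety import embed_airr  # assumed import path

    check_embed_args("HL", "antiberta2")
    # Per the Returns section, the result is an (embeddings, metadata) tuple.
    embeddings, metadata = embed_airr(
        airr, chain="HL", model="antiberta2", output_type="df"
    )
    return embeddings, metadata
```

The lazy import keeps the validation usable in environments where AMULETY itself is not installed.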

main()

Main entry point for the AMULETY CLI application.

setup_logging(log_file: str = None, verbose: bool = False)

Configure logging for the application.

Parameters:
  • log_file (str, optional) – Path to log file. If None, logs to stdout.

  • verbose (bool) – If True, enables verbose logging (DEBUG level).
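The behaviour described by these two parameters can be illustrated with the standard logging module. This is a stand-in sketch, not AMULETY's code: configure_logging, the logger name, the message format, and the INFO level used when verbose is False are all assumptions.

```python
import logging
import sys


def configure_logging(log_file=None, verbose=False):
    """Log to `log_file` if given, else to stdout; DEBUG level when verbose."""
    if log_file:
        handler = logging.FileHandler(log_file)
    else:
        handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger("amulety_sketch")  # hypothetical logger name
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    # INFO as the non-verbose default is an assumption.
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    return logger


log = configure_logging(verbose=True)
log.debug("verbose logging enabled")
```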

translate_airr(airr: DataFrame, tmpdir: str, reference_dir: str, reference_prefix: str = 'imgt_', reference_species: str = 'human', keep_regions: bool = False, sequence_col: str = 'sequence', nproc: int = 1)

Translates nucleotide sequences to amino acid sequences using IgBlast.

Requires IgBlast to be installed and available in PATH. Install with: conda install -c bioconda igblast

Parameters:
  • airr (pd.DataFrame) – Input AIRR rearrangement table as a pandas DataFrame.

  • tmpdir (str) – Temporary directory for intermediate files.

  • reference_dir (str) – Path to the directory containing the pre-built IgBlast germline reference data.

  • reference_prefix (str) – The prefix for the igblast germline reference files (default: “imgt_”).

  • reference_species (str) – The species for the igblast germline reference (default: “human”).

  • keep_regions (bool) – If True, keeps the region translations in the output AIRR data. If False, removes them.

  • sequence_col (str) – The name of the column containing the nucleotide sequences to translate.

  • nproc (int) – Number of processors to use for IgBlast.

Returns:

AIRR DataFrame with added amino acid translation columns.

Return type:

airr (pd.DataFrame)
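Since tmpdir only holds intermediate files, one convenient pattern is to create it on demand and remove it afterwards. The wrapper below is a hypothetical helper, not part of AMULETY; it takes the translation function as an argument so the pattern can be tried without IgBlast installed.

```python
import tempfile


def translate_with_tempdir(airr, reference_dir, translate_fn, **kwargs):
    """Call translate_fn(airr, tmpdir, reference_dir, **kwargs) in a scratch
    directory that is deleted once the call returns."""
    with tempfile.TemporaryDirectory() as tmpdir:
        return translate_fn(airr, tmpdir, reference_dir, **kwargs)
```

With AMULETY and IgBlast installed, usage would presumably look like translate_with_tempdir(df, "/path/to/igblast/references", translate_airr, nproc=4), where translate_airr is imported from amulety.amulety.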

translate_igblast(input_file_path: str, reference_dir: str, output_dir: str = '.', reference_prefix: str = 'imgt_', reference_species: str = 'human', keep_regions: bool = False, sequence_col: str = 'sequence', nproc: int = 1, log_file: str = None, verbose: bool = False)

Translates nucleotide sequences to amino acid sequences using IgBlast.

This function takes an AIRR file in TSV format containing nucleotide sequences and translates them into amino acid sequences using IgBlast, a tool for analyzing BCR and TCR sequences. It performs the following steps:

  1. Reads the input TSV file containing nucleotide sequences.

  2. Writes the nucleotide sequences into a FASTA file, required as input for IgBlast.

  3. Runs IgBlast on the FASTA file to perform sequence alignment and translation.

  4. Reads the IgBlast output, which includes the translated amino acid sequences.

  5. Removes gaps introduced by IgBlast from the sequence alignment.

  6. Saves the translated data into a new TSV file in the specified output directory.

Example usage:

amulety translate-igblast --input-file input.tsv --output-dir ./output --reference-dir /path/to/igblast/references
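Steps 2 and 5 of the list above (writing a FASTA file and removing alignment gaps) can be sketched with the standard library. This is an illustration under stated assumptions, not AMULETY's code: tsv_to_fasta and remove_gaps are hypothetical helpers, the sequence_id column name follows the AIRR rearrangement convention, and treating "-" and "." as the gap characters is an assumption.

```python
import csv
import io


def tsv_to_fasta(tsv_text, sequence_col="sequence", id_col="sequence_id"):
    """Convert AIRR-style TSV text into FASTA text, one record per row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = [f">{row[id_col]}\n{row[sequence_col]}\n" for row in reader]
    return "".join(records)


def remove_gaps(aligned):
    """Drop gap characters from an aligned sequence (assumed: '-' and '.')."""
    return aligned.replace("-", "").replace(".", "")


example_tsv = "sequence_id\tsequence\nseq1\tATGC\nseq2\tGGCA\n"
print(tsv_to_fasta(example_tsv))
print(remove_gaps("QVQ--LV.Q"))
```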