amulety.amulety
Console script for amulety
Functions
check_deps | Check if optional embedding dependencies and tools are installed.
| Check if IgBlast is available in the system.
common_options | AMULETY: Adaptive imMUne receptor Language model Embedding tool for TCR and antibodY
embed | Embeds sequences from an AIRR rearrangement file using the specified model.
embed_airr | Embeds sequences from an AIRR DataFrame using the specified model.
| Main entry point for the AMULETY CLI application.
setup_logging | Configure logging for the application.
translate_airr | Translates nucleotide sequences to amino acid sequences using IgBlast.
translate_igblast | Translates nucleotide sequences to amino acid sequences using IgBlast.
- check_deps(log_file: str = None, verbose: bool = False)
Check if optional embedding dependencies and tools are installed.
- common_options(log_file: str = None, verbose: bool = False)
AMULETY: Adaptive imMUne receptor Language model Embedding tool for TCR and antibodY
Global logging options can be specified before any command.
- embed(input_airr: str, chain: str, model: str, output_file_path: str, cache_dir: str = '/tmp/amulety-cache', sequence_col: str = 'sequence_vdj_aa', cell_id_col: str = 'cell_id', batch_size: int = 50, model_path: str = None, embedding_dimension: int = None, max_length: int = None, duplicate_col: str = 'duplicate_count', installation_path: str = None, residue_level: bool = False, log_file: str = None, verbose: bool = False)
Embeds sequences from an AIRR rearrangement file using the specified model. It returns the embeddings in the specified output format along with the filtered input AIRR data.
Example usage:
amulety embed --input-airr airr_rearrangement.tsv --chain HL --model antiberta2 --output-file-path out.tsv
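When the CLI is driven from a Python script, the example call can be assembled as an argument list. The following is a minimal sketch, not part of AMULETY: `build_embed_command` is a hypothetical helper, and the `--log-file` and `--verbose` flag spellings are assumed from the parameter names in the signature. The global logging options are placed before the command name, as noted for `common_options`.

```python
def build_embed_command(input_airr, chain, model, output_file_path,
                        log_file=None, verbose=False):
    """Hypothetical helper: assemble the argv list for `amulety embed`."""
    cmd = ["amulety"]
    # Global logging options go before the command name.
    # Their flag spellings are assumed from the parameter names.
    if log_file is not None:
        cmd += ["--log-file", log_file]
    if verbose:
        cmd.append("--verbose")
    cmd += [
        "embed",
        "--input-airr", input_airr,
        "--chain", chain,
        "--model", model,
        "--output-file-path", output_file_path,
    ]
    return cmd


cmd = build_embed_command(
    "airr_rearrangement.tsv", "HL", "antiberta2", "out.tsv", verbose=True
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the `amulety` executable is installed and the input file exists.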
- embed_airr(airr: DataFrame, chain: str, model: str, sequence_col: str = 'sequence_vdj_aa', cell_id_col: str = 'cell_id', cache_dir: str = '/tmp/amulety', batch_size: int = 50, embedding_dimension: int = None, max_length: int = None, model_path: str = None, output_type: str = 'pickle', duplicate_col: str = 'duplicate_count', installation_path: str = None, residue_level: bool = False)
Embeds sequences from an AIRR DataFrame using the specified model.
- Parameters:
airr (pd.DataFrame) – Input AIRR rearrangement table as a pandas DataFrame.
chain (str) – The input chain, one of [“H”, “L”, “HL”, “LH”, “H+L”]. For BCR: H=Heavy, L=Light, HL=Heavy-Light pairs, LH=Light-Heavy pairs, H+L=Both chains separately. For TCR: H=Beta/Delta, L=Alpha/Gamma, HL=Beta-Alpha/Delta-Gamma pairs, LH=Alpha-Beta/Gamma-Delta pairs, H+L=Both chains separately.
model (str) – The embedding model to use. BCR models: [“ablang”, “antiberta2”, “antiberty”, “balm-paired”]. TCR models: [“tcr-bert”, “tcrt5”]. Immune models (BCR & TCR): [“immune2vec”]. Protein models: [“esm2”, “prott5”, “custom”]. Use “custom” for fine-tuned models (requires model_path, embedding_dimension, max_length).
sequence_col (str) – The name of the column containing the amino acid sequences to embed.
cell_id_col (str) – The name of the column containing the single-cell barcode.
cache_dir (Optional[str]) – Cache directory for storing the pre-trained model weights.
batch_size (int) – The batch size of sequences to embed.
embedding_dimension (int) – The embedding dimension for custom models.
max_length (int) – The maximum sequence length for custom models.
model_path (str) – The path to the custom model.
output_type (str) – The type of output to return. Can be “df” for a pandas DataFrame, “pickle” for a serialized torch object, or “anndata” for an anndata object.
duplicate_col (str) – The name of the numeric column used to select the best chain when multiple chains of the same type exist per cell. Default: “duplicate_count”.
installation_path (str) – Custom path to Immune2Vec installation directory (optional).
residue_level (bool) – If True, returns residue-level embeddings of dimension sequence length x embedding dimension (L x D) instead of sequence-level (1 x D).
- Returns:
embeddings (df/pickle/anndata): The embeddings as a pandas DataFrame (if output_type=“df”), a serialized torch object (if output_type=“pickle”), or an anndata object (if output_type=“anndata”).
metadata (pd.DataFrame): The filtered input AIRR DataFrame with the metadata.
- Return type:
Tuple (tuple)
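The accepted `chain` and `model` values, and the extra arguments needed for “custom”, can be checked before a long embedding run starts. The sketch below is a hypothetical pre-flight check written against the values listed in the parameter descriptions; `check_embed_args` is not an AMULETY function, and the library's own validation may behave differently.

```python
# Values copied from the parameter descriptions of embed_airr.
VALID_CHAINS = {"H", "L", "HL", "LH", "H+L"}
BCR_MODELS = {"ablang", "antiberta2", "antiberty", "balm-paired"}
TCR_MODELS = {"tcr-bert", "tcrt5"}
IMMUNE_MODELS = {"immune2vec"}
PROTEIN_MODELS = {"esm2", "prott5", "custom"}


def check_embed_args(chain, model, model_path=None,
                     embedding_dimension=None, max_length=None):
    """Hypothetical check: return a list of problems (empty if none)."""
    problems = []
    if chain not in VALID_CHAINS:
        problems.append(f"unknown chain {chain!r}")
    known_models = BCR_MODELS | TCR_MODELS | IMMUNE_MODELS | PROTEIN_MODELS
    if model not in known_models:
        problems.append(f"unknown model {model!r}")
    if model == "custom":
        # "custom" requires all three of these arguments.
        required = {
            "model_path": model_path,
            "embedding_dimension": embedding_dimension,
            "max_length": max_length,
        }
        missing = [name for name, value in required.items() if value is None]
        if missing:
            problems.append("custom model requires: " + ", ".join(missing))
    return problems


print(check_embed_args("HL", "antiberta2"))
print(check_embed_args("X", "custom"))
```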
- setup_logging(log_file: str = None, verbose: bool = False)
Configure logging for the application.
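The reference does not describe what `setup_logging` does with its two arguments. As an illustration only, the sketch below shows one common way to wire a `log_file` and a `verbose` flag with the standard `logging` module. The level mapping (DEBUG when verbose, INFO otherwise), the file handler, and the names `configure_logging_sketch` and `amulety_example` are all assumptions, not AMULETY's implementation.

```python
import logging


def configure_logging_sketch(log_file=None, verbose=False):
    """Illustrative stand-in; AMULETY's own setup_logging may differ."""
    # Assumed mapping: verbose switches the level from INFO to DEBUG.
    level = logging.DEBUG if verbose else logging.INFO
    handlers = [logging.StreamHandler()]
    if log_file is not None:
        # Assumed behaviour: also write messages to the given file.
        handlers.append(logging.FileHandler(log_file))
    # force=True replaces handlers from earlier calls (Python 3.8+).
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=handlers,
        force=True,
    )
    return logging.getLogger("amulety_example")


logger = configure_logging_sketch(verbose=True)
logger.debug("debug messages are visible when verbose=True")
```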
- translate_airr(airr: DataFrame, tmpdir: str, reference_dir: str, reference_prefix: str = 'imgt_', reference_species: str = 'human', keep_regions: bool = False, sequence_col: str = 'sequence', nproc: int = 1)
Translates nucleotide sequences to amino acid sequences using IgBlast.
Requires IgBlast to be installed and available in PATH. Install with: conda install -c bioconda igblast
- Parameters:
airr (pd.DataFrame) – Input AIRR rearrangement table as a pandas DataFrame.
tmpdir (str) – Temporary directory for intermediate files.
reference_dir (str) – The directory containing the pre-built IgBlast germline reference data.
reference_prefix (str) – The prefix for the IgBlast germline reference files (default: “imgt_”).
reference_species (str) – The species for the IgBlast germline reference (default: “human”).
keep_regions (bool) – If True, keeps the region translations in the output AIRR file. If False, removes them.
sequence_col (str) – The name of the column containing the nucleotide sequences to translate.
nproc (int) – Number of processors to use for IgBlast.
- Returns:
AIRR DataFrame with added amino acid translation columns.
- Return type:
airr (pd.DataFrame)
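Because `translate_airr` needs IgBlast on PATH, a script can check for the executable before calling it. This is a minimal sketch using only the standard library. `find_igblast` is a hypothetical helper, and probing for `igblastn` (the nucleotide executable shipped with IgBlast) is an assumption about which binary is relevant here.

```python
import shutil


def find_igblast(executable="igblastn"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_igblast()
if path is None:
    print("IgBlast not found; install with: conda install -c bioconda igblast")
else:
    print(f"IgBlast found at {path}")
```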
- translate_igblast(input_file_path: str, reference_dir: str, output_dir: str = '.', reference_prefix: str = 'imgt_', reference_species: str = 'human', keep_regions: bool = False, sequence_col: str = 'sequence', nproc: int = 1, log_file: str = None, verbose: bool = False)
Translates nucleotide sequences to amino acid sequences using IgBlast.
This function takes an AIRR file in TSV format containing nucleotide sequences and translates them into amino acid sequences using IgBlast, a tool for analyzing BCR and TCR sequences. It performs the following steps:
1. Reads the input TSV file containing nucleotide sequences.
2. Writes the nucleotide sequences into a FASTA file, required as input for IgBlast.
3. Runs IgBlast on the FASTA file to perform sequence alignment and translation.
4. Reads the IgBlast output, which includes the translated amino acid sequences.
5. Removes gaps introduced by IgBlast from the sequence alignment.
6. Saves the translated data into a new TSV file in the specified output directory.
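Two of the steps above, writing the FASTA input and removing alignment gaps, can be sketched with the standard library alone. This is an illustration under stated assumptions, not AMULETY's code: `rows_to_fasta` and `strip_gaps` are hypothetical helpers, the record identifier is assumed to come from the AIRR `sequence_id` column, and `-` and `.` are assumed to be the gap characters. The IgBlast call itself is omitted.

```python
import csv
import io


def rows_to_fasta(rows, id_col="sequence_id", sequence_col="sequence"):
    """Format AIRR rows as FASTA text, one record per row."""
    return "".join(f">{row[id_col]}\n{row[sequence_col]}\n" for row in rows)


def strip_gaps(aligned, gap_chars="-."):
    """Remove alignment gap characters from a translated sequence."""
    return "".join(ch for ch in aligned if ch not in gap_chars)


# A tiny in-memory TSV standing in for the input AIRR file.
tsv_text = "sequence_id\tsequence\nseq1\tGAGGTGCAGCTG\nseq2\tCAGGTGCAGCTG\n"
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

fasta = rows_to_fasta(rows)
print(fasta)
print(strip_gaps("EVQL--VESGG.G"))  # prints EVQLVESGGG
```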
Example usage:
amulety translate-igblast --input-file input.tsv --output-dir ./output --reference-dir /path/to/igblast/references