amulety.amulety
Console script for amulety
Functions
check_deps | Check if optional embedding dependencies and tools are installed.
Check if IgBlast is available in the system.
common_options | AMULETY: Adaptive imMUne receptor Language model Embedding tool for TCR and antibodY
embed | Embeds sequences from an AIRR rearrangement file using the specified model.
embed_airr | Embeds sequences from an AIRR DataFrame using the specified model.
Main entry point for the AMULETY CLI application.
setup_logging | Configure logging for the application.
translate_airr | Translates nucleotide sequences to amino acid sequences using IgBlast.
translate_igblast | Translates nucleotide sequences to amino acid sequences using IgBlast.
- check_deps(log_file: str = None, verbose: bool = False)
Check if optional embedding dependencies and tools are installed.
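A dependency check of this kind can be sketched with the standard library alone. The snippet below is an illustration, not AMULETY's implementation: the helper name `find_missing_packages` and the package list are hypothetical, chosen only to show the pattern.

```python
import importlib.util


def find_missing_packages(package_names):
    """Return the top-level package names that cannot be found for import."""
    missing = []
    for name in package_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Hypothetical list for illustration; not AMULETY's actual dependency list.
optional_packages = ["torch", "transformers"]
missing = find_missing_packages(optional_packages)
if missing:
    print("Missing optional packages:", ", ".join(missing))
else:
    print("All listed optional packages are importable.")
```

Using `find_spec` avoids importing heavy packages just to test for their presence.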
- common_options(log_file: str = None, verbose: bool = False)
AMULETY: Adaptive imMUne receptor Language model Embedding tool for TCR and antibodY
Global logging options can be specified before any command.
- embed(input_airr: str, chain: str, model: str, output_file_path: str, cache_dir: str = '/tmp/amulety-cache', sequence_col: str = 'sequence_vdj_aa', cell_id_col: str = 'cell_id', batch_size: int = 50, model_path: str = None, embedding_dimension: int = None, max_length: int = None, duplicate_col: str = 'duplicate_count', installation_path: str = None, residue_level: bool = False, log_file: str = None, verbose: bool = False)
Embeds sequences from an AIRR rearrangement file using the specified model. It returns the embeddings in the specified output format along with the filtered input AIRR data.
Example usage:
amulety embed --input-airr airr_rearrangement.tsv --chain HL --model antiberta2 --output-file-path out.tsv
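When the command is driven from a Python script, the same invocation can be assembled as an argument list. This is a minimal sketch: `build_embed_command` and `run_if_available` are hypothetical helpers, and only the four options shown in the example usage are assumed.

```python
import shutil
import subprocess


def build_embed_command(input_airr, chain, model, output_file_path):
    """Assemble the argument list for the `amulety embed` example above."""
    return [
        "amulety", "embed",
        "--input-airr", input_airr,
        "--chain", chain,
        "--model", model,
        "--output-file-path", output_file_path,
    ]


def run_if_available(cmd):
    """Run cmd only when its executable is on PATH; return None otherwise."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True)


cmd = build_embed_command("airr_rearrangement.tsv", "HL", "antiberta2", "out.tsv")
print(" ".join(cmd))
```

Calling `run_if_available(cmd)` would execute the command if `amulety` is installed and the input file exists. Passing a list rather than a single string avoids shell quoting problems with file paths.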
- embed_airr(airr: DataFrame, chain: str, model: str, sequence_col: str = 'sequence_vdj_aa', cell_id_col: str = 'cell_id', cache_dir: str = '/tmp/amulety', batch_size: int = 50, embedding_dimension: int = None, max_length: int = None, model_path: str = None, output_type: str = 'pickle', duplicate_col: str = 'duplicate_count', installation_path: str = None, residue_level: bool = False)
Embeds sequences from an AIRR DataFrame using the specified model.
- Parameters:
airr (pd.DataFrame) – Input AIRR rearrangement table as a pandas DataFrame.
chain (str) – The input chain, one of [“H”, “L”, “HL”, “LH”, “H+L”]. For BCR: H=Heavy, L=Light, HL=Heavy-Light pairs, LH=Light-Heavy pairs, H+L=Both chains separately. For TCR: H=Beta/Delta, L=Alpha/Gamma, HL=Beta-Alpha/Delta-Gamma pairs, LH=Alpha-Beta/Gamma-Delta pairs, H+L=Both chains separately.
model (str) – The embedding model to use. BCR models: [“ablang”, “antiberta2”, “antiberty”, “balm-paired”]. TCR models: [“tcr-bert”, “tcrt5”]. Immune models (BCR & TCR): [“immune2vec”]. Protein models: [“esm2”, “prott5”, “custom”]. Use “custom” for fine-tuned models (requires model_path, embedding_dimension, and max_length).
sequence_col (str) – The name of the column containing the amino acid sequences to embed.
cell_id_col (str) – The name of the column containing the single-cell barcode.
cache_dir (Optional[str]) – Cache directory for storing the pre-trained model weights.
batch_size (int) – The batch size of sequences to embed.
embedding_dimension (int) – The embedding dimension for custom models.
max_length (int) – The maximum sequence length for custom models.
model_path (str) – The path to the custom model.
output_type (str) – The type of output to return. Can be “df” for a pandas DataFrame or “pickle” for a serialized torch object.
duplicate_col (str) – The name of the numeric column used to select the best chain when multiple chains of the same type exist per cell. Default: “duplicate_count”.
installation_path (str) – Custom path to Immune2Vec installation directory (optional).
residue_level (bool) – If True, returns residue-level embeddings of dimension sequence length x embedding dimension (L x D) instead of sequence-level (1 x D).
- Returns:
embeddings (df/pickle/anndata): The embeddings as a pandas DataFrame (if output_type="df"), a serialized torch object (if output_type="pickle"), or an anndata object (if output_type="anndata").
metadata (pd.DataFrame): The filtered input AIRR DataFrame with the metadata.
- Return type:
Tuple (tuple)
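The chain labels and the role of duplicate_col can be illustrated with plain Python. This is a sketch under stated assumptions, not AMULETY's code: it assumes the chain type is read from the AIRR `locus` field, that the row with the highest duplicate_col value is the "best" one, and that ties keep the first row seen. The helper `select_best_chains` is hypothetical.

```python
# Assumed mapping from AIRR `locus` values to the H/L labels described above.
LOCUS_TO_CHAIN = {
    "IGH": "H", "IGK": "L", "IGL": "L",   # BCR: heavy / light
    "TRB": "H", "TRD": "H",               # TCR: beta / delta
    "TRA": "L", "TRG": "L",               # TCR: alpha / gamma
}


def select_best_chains(rows, cell_id_col="cell_id", duplicate_col="duplicate_count"):
    """Keep, per (cell, chain label), the row with the highest duplicate_col value."""
    best = {}
    for row in rows:
        chain = LOCUS_TO_CHAIN.get(row["locus"])
        if chain is None:
            continue  # skip rows with an unrecognized locus
        key = (row[cell_id_col], chain)
        # Strict comparison: on a tie, the first row seen is kept.
        if key not in best or row[duplicate_col] > best[key][duplicate_col]:
            best[key] = row
    return best


rows = [
    {"cell_id": "c1", "locus": "IGH", "duplicate_count": 5, "sequence_vdj_aa": "EVQLV"},
    {"cell_id": "c1", "locus": "IGH", "duplicate_count": 12, "sequence_vdj_aa": "QVQLQ"},
    {"cell_id": "c1", "locus": "IGK", "duplicate_count": 7, "sequence_vdj_aa": "DIQMT"},
    {"cell_id": "c2", "locus": "TRB", "duplicate_count": 3, "sequence_vdj_aa": "GAVVS"},
]
best = select_best_chains(rows)
for (cell_id, chain), row in sorted(best.items()):
    print(cell_id, chain, row["sequence_vdj_aa"], row["duplicate_count"])
```

Here cell c1 has two heavy chains, and only the one with duplicate_count 12 is retained.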
- setup_logging(log_file: str = None, verbose: bool = False)
Configure logging for the application.
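A function with this signature could plausibly be built on the standard `logging` module. The sketch below is an assumption about the general pattern, not the actual implementation: `configure_logging`, the format string, and the mapping of `verbose` to the DEBUG level are all illustrative choices.

```python
import logging


def configure_logging(log_file=None, verbose=False):
    """Send log records to the console and, optionally, to a file."""
    # Assumption: verbose switches the level from INFO to DEBUG.
    level = logging.DEBUG if verbose else logging.INFO
    handlers = [logging.StreamHandler()]
    if log_file is not None:
        handlers.append(logging.FileHandler(log_file))
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=handlers,
        force=True,  # replace handlers from any earlier configuration
    )
    return logging.getLogger("amulety_example")


logger = configure_logging(verbose=True)
logger.debug("Debug messages are shown when verbose is True.")
```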
- translate_airr(airr: DataFrame, tmpdir: str, reference_dir: str, keep_regions: bool = False, sequence_col: str = 'sequence', nproc: int = 1)
Translates nucleotide sequences to amino acid sequences using IgBlast.
Requires IgBlast to be installed and available in PATH. Install with: conda install -c bioconda igblast
- Parameters:
airr (pd.DataFrame) – Input AIRR rearrangement table as a pandas DataFrame.
tmpdir (str) – Temporary directory for intermediate files.
reference_dir (str) – Path to the directory containing the IgBlast reference files.
keep_regions (bool) – If True, keeps the region translations in the output airr file. If False, it removes them.
sequence_col (str) – The name of the column containing the nucleotide sequences to translate.
nproc (int) – Number of processors to use for IgBlast.
- Returns:
AIRR DataFrame with added amino acid translation columns.
- Return type:
airr (pd.DataFrame)
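Because translation depends on IgBlast being available in PATH, it can help to verify this before calling the function. A minimal sketch follows, assuming the executable of interest is `igblastn` (the nucleotide IgBlast program). The helper `find_executable` is hypothetical and is not part of AMULETY.

```python
import shutil


def find_executable(name="igblastn"):
    """Return the full path of an executable on PATH, or None if it is absent."""
    return shutil.which(name)


path = find_executable()
if path is None:
    print("igblastn not found on PATH; install with: conda install -c bioconda igblast")
else:
    print("Found igblastn at", path)
```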
- translate_igblast(input_file_path: str, output_dir: str, reference_dir: str, keep_regions: bool = False, sequence_col: str = 'sequence', nproc: int = 1, log_file: str = None, verbose: bool = False)
Translates nucleotide sequences to amino acid sequences using IgBlast.
This function takes an AIRR file in TSV format containing nucleotide sequences and translates them into amino acid sequences using IgBlast, a tool for analyzing BCR and TCR sequences. It performs the following steps:
1. Reads the input TSV file containing nucleotide sequences.
2. Writes the nucleotide sequences into a FASTA file, required as input for IgBlast.
3. Runs IgBlast on the FASTA file to perform sequence alignment and translation.
4. Reads the IgBlast output, which includes the translated amino acid sequences.
5. Removes gaps introduced by IgBlast from the sequence alignment.
6. Saves the translated data into a new TSV file in the specified output directory.
Example usage:
amulety translate-igblast --input-file input.tsv --output-dir ./output --reference-dir /path/to/igblast/references
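The file-handling steps listed above (reading the TSV, writing FASTA, and removing alignment gaps) can be sketched with the standard library. This is an illustration under assumptions rather than AMULETY's code: the helpers `tsv_to_fasta` and `remove_gaps` are hypothetical, record names are assumed to come from the AIRR `sequence_id` column, and `-` and `.` are assumed to be the gap characters. Running IgBlast itself is not shown.

```python
import csv
import io


def tsv_to_fasta(tsv_text, id_col="sequence_id", sequence_col="sequence"):
    """Convert AIRR-style TSV text into FASTA text, one record per row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        records.append(f">{row[id_col]}\n{row[sequence_col]}\n")
    return "".join(records)


def remove_gaps(aligned_sequence, gap_chars="-."):
    """Drop alignment gap characters from a sequence string."""
    return "".join(ch for ch in aligned_sequence if ch not in gap_chars)


tsv_text = "sequence_id\tsequence\nseq1\tGAGGTGCAG\nseq2\tCAGGTGCAG\n"
fasta = tsv_to_fasta(tsv_text)
print(fasta, end="")
print(remove_gaps("EVQ--LV.ES"))
```

In this toy input, the two rows become two FASTA records, and the gapped string `EVQ--LV.ES` is reduced to `EVQLVES`.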