amulety
- embed_airr(airr: DataFrame, chain: str, model: str, sequence_col: str = 'sequence_vdj_aa', cell_id_col: str = 'cell_id', cache_dir: str = '/tmp/amulety', batch_size: int = 50, embedding_dimension: int = None, max_length: int = None, model_path: str = None, output_type: str = 'pickle', duplicate_col: str = 'duplicate_count', installation_path: str = None, residue_level: bool = False)
Embeds sequences from an AIRR DataFrame using the specified model.
- Parameters:
airr (pd.DataFrame) – Input AIRR rearrangement table as a pandas DataFrame.
chain (str) – The input chain, one of [“H”, “L”, “HL”, “LH”, “H+L”]. For BCR: H=Heavy, L=Light, HL=Heavy-Light pairs, LH=Light-Heavy pairs, H+L=both chains separately. For TCR: H=Beta/Delta, L=Alpha/Gamma, HL=Beta-Alpha/Delta-Gamma pairs, LH=Alpha-Beta/Gamma-Delta pairs, H+L=both chains separately.
model (str) – The embedding model to use. BCR models: [“ablang”, “antiberta2”, “antiberty”, “balm-paired”]. TCR models: [“tcr-bert”, “tcrt5”]. Immune models (BCR & TCR): [“immune2vec”]. Protein models: [“esm2”, “prott5”, “custom”]. Use “custom” for fine-tuned models (requires model_path, embedding_dimension, and max_length).
sequence_col (str) – The name of the column containing the amino acid sequences to embed.
cell_id_col (str) – The name of the column containing the single-cell barcode.
cache_dir (Optional[str]) – Directory in which the pre-trained model weights are cached.
batch_size (int) – The number of sequences to embed per batch.
embedding_dimension (int) – The embedding dimension for custom models.
max_length (int) – The maximum sequence length for custom models.
model_path (str) – The path to the custom model.
output_type (str) – The type of output to return. Can be “df” for a pandas DataFrame or “pickle” for a serialized torch object; the Returns section also lists “anndata” for an AnnData object.
duplicate_col (str) – The name of the numeric column used to select the best chain when multiple chains of the same type exist per cell. Default: “duplicate_count”.
installation_path (str) – Custom path to Immune2Vec installation directory (optional).
residue_level (bool) – If True, returns residue-level embeddings of dimension sequence length x embedding dimension (L x D) instead of sequence-level (1 x D).
- Returns:
embeddings (df/pickle/anndata): The embeddings as a pandas DataFrame (if output_type=“df”), a serialized torch object (if output_type=“pickle”), or an AnnData object (if output_type=“anndata”).
metadata (pd.DataFrame): The filtered input AIRR DataFrame with the metadata.
- Return type:
Tuple (tuple)
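The following is a minimal sketch of how a call to embed_airr might be prepared, not an official usage example. The helper names build_embed_kwargs and run_embedding are hypothetical, and the import path `from amulety import embed_airr` is an assumption that may differ in your installation. The allowed chain and model values are copied from the parameter descriptions above.

```python
# Allowed values, copied from the documented parameter descriptions.
VALID_CHAINS = {"H", "L", "HL", "LH", "H+L"}
VALID_MODELS = {
    "ablang", "antiberta2", "antiberty", "balm-paired",  # BCR models
    "tcr-bert", "tcrt5",                                 # TCR models
    "immune2vec",                                        # immune (BCR & TCR)
    "esm2", "prott5", "custom",                          # protein models
}
# Arguments the documentation says are required when model="custom".
CUSTOM_REQUIRED = ("model_path", "embedding_dimension", "max_length")


def build_embed_kwargs(chain, model, **extra):
    """Hypothetical helper: check arguments before the (slow) embedding call."""
    if chain not in VALID_CHAINS:
        raise ValueError(f"chain must be one of {sorted(VALID_CHAINS)}, got {chain!r}")
    if model not in VALID_MODELS:
        raise ValueError(f"model must be one of {sorted(VALID_MODELS)}, got {model!r}")
    if model == "custom":
        missing = [name for name in CUSTOM_REQUIRED if extra.get(name) is None]
        if missing:
            raise ValueError(f"model='custom' also requires: {', '.join(missing)}")
    return {"chain": chain, "model": model, **extra}


def run_embedding(airr, kwargs):
    """Hypothetical wrapper: call embed_airr and unpack the documented tuple."""
    # Assumption: embed_airr is importable from the package top level.
    from amulety import embed_airr

    embeddings, metadata = embed_airr(airr, **kwargs)
    return embeddings, metadata


# Heavy chains only, returned as a pandas DataFrame instead of a pickle.
kwargs = build_embed_kwargs("H", "antiberty", output_type="df", batch_size=2)
```

Calling run_embedding(airr, kwargs) with a real AIRR DataFrame would then return the embeddings together with the filtered metadata. Validating up front is a design convenience only: it surfaces a mistyped chain or model name before any model weights are loaded.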
- translate_airr(airr: DataFrame, tmpdir: str, reference_dir: str, reference_prefix: str = 'imgt_', reference_species: str = 'human', keep_regions: bool = False, sequence_col: str = 'sequence', nproc: int = 1)
Translates nucleotide sequences to amino acid sequences using IgBlast.
Requires IgBlast to be installed and available in PATH. Install with: conda install -c bioconda igblast
- Parameters:
airr (pd.DataFrame) – Input AIRR rearrangement table as a pandas DataFrame.
tmpdir (str) – Temporary directory for intermediate files.
reference_dir (str) – The directory containing the pre-built IgBlast germline reference data.
reference_prefix (str) – The prefix for the IgBlast germline reference files (default: “imgt_”).
reference_species (str) – The species for the IgBlast germline reference (default: “human”).
keep_regions (bool) – If True, keeps the region translations in the output AIRR table. If False, removes them.
sequence_col (str) – The name of the column containing the nucleotide sequences to translate.
nproc (int) – Number of processors to use for IgBlast.
- Returns:
AIRR DataFrame with added amino acid translation columns.
- Return type:
airr (pd.DataFrame)
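Because translate_airr depends on an external IgBlast installation, it can help to check for it before starting. The sketch below is illustrative only: the helper names are hypothetical, the import path `from amulety import translate_airr` is an assumption, and checking for the `igblastn` executable assumes that this is the IgBlast binary the function invokes.

```python
import shutil
import tempfile


def igblast_available(executable="igblastn"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


def run_translation(airr, reference_dir):
    """Hypothetical wrapper around translate_airr with a PATH check."""
    if not igblast_available():
        raise RuntimeError(
            "igblastn not found on PATH; install IgBlast first, "
            "e.g. conda install -c bioconda igblast"
        )
    # Assumption: translate_airr is importable from the package top level.
    from amulety import translate_airr

    # A throwaway directory holds the intermediate files and is removed afterwards.
    with tempfile.TemporaryDirectory() as tmpdir:
        return translate_airr(
            airr,
            tmpdir=tmpdir,
            reference_dir=reference_dir,
            reference_species="human",
            keep_regions=False,
            sequence_col="sequence",
            nproc=1,
        )
```

Here run_translation(airr, reference_dir) would return the AIRR DataFrame with the added amino acid translation columns, which could then be passed to embed_airr.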
Modules
- Console script for amulety.
- BCR embedding functions using various models.
- Protein sequence embedding functions using various models.
- TCR embedding functions using various models.
- Main module.