CLI reference
Translate sequence to amino acids
amulety translate-igblast
Usage: amulety translate-igblast [OPTIONS]
Translates nucleotide sequences to amino acid sequences using IgBlast.
This function takes an AIRR file in TSV format containing nucleotide sequences
and translates them into amino acid sequences using IgBlast, a tool for analyzing
BCR and TCR sequences. It performs the following steps:
1. Reads the input TSV file containing nucleotide sequences.
2. Writes the nucleotide sequences into a FASTA file, required as input for IgBlast.
3. Runs IgBlast on the FASTA file to perform sequence alignment and translation.
4. Reads the IgBlast output, which includes the translated amino acid sequences.
5. Removes gaps introduced by IgBlast from the sequence alignment.
6. Saves the translated data into a new TSV file in the specified output directory.
Example usage:
amulety translate-igblast --input-file input.tsv --output-dir ./output \
--reference-dir /path/to/igblast/references
╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ * --input-file -i TEXT The path to the input data file. The data │
│ file should be in TSV format following the │
│ AIRR specifications. │
│ [required] │
│ * --reference-dir -r TEXT The directory to the pre-built igblast │
│ germline references. │
│ [required] │
│ --output-dir -o TEXT The directory where the translated TSV file │
│ will be saved. │
│ [default: .] │
│ --reference-prefix -p TEXT The prefix for the igblast germline reference │
│ files (default: 'imgt_'). │
│ [default: imgt_] │
│ --reference-species -s TEXT The species for the igblast germline │
│ reference (default: 'human'). │
│ [default: human] │
│ --keep-regions -k If True, keeps the region translations in the │
│ output airr file. If False, it removes them. │
│ --sequence-col -s TEXT The name of the column containing the │
│ nucleotide sequences to translate. │
│ [default: sequence] │
│ --nproc -n INTEGER Number of processors to use for IgBlast. │
│ [default: 1] │
│ --log-file -l TEXT Path to log file. If not provided, logs will │
│ be printed to stdout. │
│ --verbose -v Enable verbose logging (DEBUG level). │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
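The file-handling steps listed for translate-igblast (reading the AIRR TSV, writing a FASTA file, and removing alignment gaps) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not AMULETY's actual implementation: the function names `airr_to_fasta` and `remove_gaps` are hypothetical, the identifier column is assumed to be the AIRR `sequence_id` field, and gaps are assumed to be written as `-` or `.`. The IgBlast call itself (steps 3 and 4) is omitted.

```python
import csv
import io


def airr_to_fasta(tsv_text, sequence_col="sequence", id_col="sequence_id"):
    """Steps 1-2: read AIRR TSV text and return the records as FASTA text.

    `id_col` is an assumption; `sequence_col` mirrors the --sequence-col default.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        # One FASTA record per row: header line, then the nucleotide sequence.
        records.append(f">{row[id_col]}\n{row[sequence_col]}\n")
    return "".join(records)


def remove_gaps(aligned, gap_chars="-."):
    """Step 5: drop gap characters from an aligned, translated sequence.

    The gap alphabet is an assumption made for this sketch.
    """
    return "".join(ch for ch in aligned if ch not in gap_chars)


example_tsv = "sequence_id\tsequence\nseq1\tGAGGTGCAG\nseq2\tCAGGTGCAG\n"
print(airr_to_fasta(example_tsv))
print(remove_gaps("EVQ--LVES"))
```

In a real run the FASTA text would be written to a temporary file and passed to IgBlast together with the germline references from --reference-dir.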
Embed sequences with different models
amulety embed
Usage: amulety embed [OPTIONS]
Embeds sequences from an AIRR rearrangement file using the specified model. It returns
the embeddings in the specified output format along with the filtered input AIRR data.
Example usage:
amulety embed --input-airr airr_rearrangement.tsv --chain HL --model antiberta2 \
--output-file-path out.tsv
╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ * --input-airr -i TEXT The path to the input data file. The data │
│ file should be in AIRR format. │
│ [required] │
│ * --chain -c TEXT Input chain to embed. For BCR: H=Heavy, │
│ L=Light, HL=Heavy-Light pairs, │
│ LH=Light-Heavy pairs, H+L=Both chains │
│ separately. For TCR: H=Beta/Delta, │
│ L=Alpha/Gamma, HL=Beta-Alpha/Delta-Gamma │
│ pairs, LH=Alpha-Beta/Gamma-Delta pairs, │
│ H+L=Both chains separately. │
│ [required] │
│ * --model -m TEXT The embedding model to use. BCR: ['ablang', │
│ 'antiberta2', 'antiberty', 'balm-paired']. │
│ TCR: ['tcr-bert', 'tcrt5']. Immune (BCR & │
│ TCR): ['immune2vec']. Protein: ['esm2', │
│ 'prott5', 'custom']. Use 'custom' for │
│ fine-tuned models with --model-path, │
│ --embedding-dimension, and --max-length │
│ parameters. │
│ [required] │
│ * --output-file-path -o TEXT The path where the generated embeddings │
│ will be saved. The file extension should be │
│ .csv or .tsv for a dataframe, .pt for a │
│ pickled torch object, or .h5ad for an │
│ anndata object. │
│ [required] │
│ --cache-dir -d TEXT Cache dir for storing the pre-trained model │
│ weights. │
│ [default: /tmp/amulety-cache] │
│ --sequence-col -s TEXT The name of the column containing the amino │
│ acid sequences to embed. │
│ [default: sequence_vdj_aa] │
│ --cell-id-col -u TEXT The name of the column containing the │
│ single-cell barcode. │
│ [default: cell_id] │
│ --batch-size -b INTEGER The batch size of sequences to embed. │
│ [default: 50] │
│ --model-path -p TEXT Path to custom model (HuggingFace model │
│ name or local path). Required for 'custom' │
│ model. │
│ --embedding-dimension -e INTEGER Embedding dimension for custom model. │
│ Required for 'custom' model. │
│ --max-length -x INTEGER Maximum sequence length for custom model. │
│ Required for 'custom' model. │
│ --duplicate-col -z TEXT The name of the numeric column used to │
│ select the best chain when multiple chains │
│ of the same type exist per cell. Default: │
│ 'duplicate_count'. Custom columns must be │
│ numeric and user-defined. │
│ [default: duplicate_count] │
│ --installation-path -j TEXT Custom path to model installation │
│ directory. Currently applies to │
│ 'immune2vec' model. │
│ --residue-level -r If True, returns residue-level embeddings │
│ of dimension sequence length x embedding │
│ dimension (L x D) instead of sequence-level │
│ (1 x D). │
│ --log-file -l TEXT Path to log file. If not provided, logs │
│ will be printed to stdout. │
│ --verbose -v Enable verbose logging (DEBUG level). │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
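The --output-file-path help text ties the output format to the file extension. As a rough sketch of that mapping, assuming a hypothetical helper named `output_kind` (it is not part of AMULETY's API, and the labels it returns are made up for illustration), the rule can be written as:

```python
from pathlib import Path

# Extension-to-format mapping taken from the --output-file-path help text.
# The string labels on the right are hypothetical names used only here.
OUTPUT_KINDS = {
    ".csv": "dataframe",
    ".tsv": "dataframe",
    ".pt": "torch",
    ".h5ad": "anndata",
}


def output_kind(output_file_path):
    """Return the output format implied by the file extension."""
    suffix = Path(output_file_path).suffix
    if suffix not in OUTPUT_KINDS:
        supported = ", ".join(sorted(OUTPUT_KINDS))
        raise ValueError(f"Unsupported extension '{suffix}'; expected one of: {supported}")
    return OUTPUT_KINDS[suffix]


print(output_kind("out.tsv"))
print(output_kind("embeddings.pt"))
```

Checking the extension before embedding starts is a cheap way to avoid discovering an unsupported output path only after a long model run.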
Options
Chain type requirements
Different models have specific input chain requirements based on how they were trained in the original publications. AMULETY supports the following chain types:
H: Heavy chains (BCR) or Beta/Delta chains (TCR) - individual chain embedding
L: Light chains (BCR) or Alpha/Gamma chains (TCR) - individual chain embedding
HL: Paired chains - concatenated Heavy-Light (BCR) or Beta-Alpha/Delta-Gamma (TCR) sequences
LH: Reverse paired chains - concatenated Light-Heavy (BCR) or Alpha-Beta/Gamma-Delta (TCR) sequences
H+L: Both chains separately - processes H and L chains individually without pairing
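To make the five chain types concrete, the sketch below shows which model inputs each option would produce for one cell with a single heavy and a single light chain. It is an illustration only: `build_inputs` is a hypothetical helper, and the `separator` argument is an assumption, since the text above says the sequences are concatenated but does not state what, if anything, is placed between them.

```python
def build_inputs(heavy, light, chain, separator=""):
    """Return the list of sequences that would be embedded for one cell.

    `heavy` is the H chain (BCR heavy or TCR beta/delta) and `light` the
    L chain (BCR light or TCR alpha/gamma). `separator` is hypothetical.
    """
    if chain == "H":
        return [heavy]
    if chain == "L":
        return [light]
    if chain == "HL":
        # Paired input: heavy first, then light.
        return [heavy + separator + light]
    if chain == "LH":
        # Reverse paired input: light first, then heavy.
        return [light + separator + heavy]
    if chain == "H+L":
        # Both chains, embedded individually without pairing.
        return [heavy, light]
    raise ValueError(f"Unknown chain type: {chain}")


heavy_aa = "EVQLVESGG"
light_aa = "DIQMTQSPS"
for chain_type in ("H", "L", "HL", "LH", "H+L"):
    print(chain_type, build_inputs(heavy_aa, light_aa, chain_type))
```

Note that HL and LH yield one input per cell, whereas H+L yields two, so the number of resulting embeddings differs between these modes.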
Custom light chain selection
When using paired chains (--chain HL), AMULETY automatically selects the best light chain when multiple light chains exist for the same cell. By default, it uses the duplicate_count column, but you can specify a custom numeric column using the --duplicate-col option. The column must contain numeric values (integers or floats), and AMULETY selects the chain with the highest value.
Example usage:
# Default behavior: use duplicate_count
amulety embed --input-airr input.tsv --chain HL --model antiberta2 --output-file-path embeddings.pt
# Custom selection: use a quality score column
amulety embed --input-airr input.tsv --chain HL --model antiberta2 --duplicate-col quality_score --output-file-path embeddings.pt
# Custom selection: use UMI count
amulety embed --input-airr input.tsv --chain HL --model antiberta2 --duplicate-col umi_count --output-file-path embeddings.pt
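The selection rule described above (keep, for each cell, the light chain with the highest value in the chosen numeric column) can be sketched as follows. This is a minimal illustration, not AMULETY's code: `select_best_chain` is a hypothetical name, rows are plain dictionaries rather than a dataframe, and keeping the first row on a tie is an assumption because the text does not say how ties are resolved.

```python
def select_best_chain(rows, duplicate_col="duplicate_count", cell_id_col="cell_id"):
    """Keep one row per cell: the one with the highest value in `duplicate_col`.

    Defaults mirror the --duplicate-col and --cell-id-col defaults.
    Ties keep the first row seen (an assumption made for this sketch).
    """
    best = {}
    for row in rows:
        cell = row[cell_id_col]
        score = float(row[duplicate_col])  # column must be numeric
        if cell not in best or score > float(best[cell][duplicate_col]):
            best[cell] = row
    return list(best.values())


light_chains = [
    {"cell_id": "c1", "sequence_id": "l1", "duplicate_count": 3, "umi_count": 9},
    {"cell_id": "c1", "sequence_id": "l2", "duplicate_count": 7, "umi_count": 2},
    {"cell_id": "c2", "sequence_id": "l3", "duplicate_count": 1, "umi_count": 4},
]

by_duplicates = select_best_chain(light_chains)
by_umis = select_best_chain(light_chains, duplicate_col="umi_count")
print([row["sequence_id"] for row in by_duplicates])
print([row["sequence_id"] for row in by_umis])
```

With this toy data, switching the column changes which light chain is kept for cell c1, which is the effect --duplicate-col has on the pairing.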