amulety.tcr_embeddings

TCR embedding functions using various models.

Functions

check_tcr_dependencies()

Check if optional TCR embedding dependencies are installed and provide installation instructions.

tcr_bert(sequences[, cache_dir, batch_size, ...])

Embeds T-Cell Receptor (TCR) sequences using the TCR-BERT model.

tcrt5(sequences[, cache_dir, batch_size, ...])

Embeds T-Cell Receptor (TCR) sequences using the TCRT5 model.

check_tcr_dependencies()

Check if optional TCR embedding dependencies are installed and provide installation instructions.
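A dependency check of this kind can be sketched with the standard library alone. The snippet below is an illustration, not the implementation of check_tcr_dependencies(): the helper names and the package names in OPTIONAL_PACKAGES are assumptions made for the example.

```python
from importlib.util import find_spec

# Assumed package names, for illustration only; the real list may differ.
OPTIONAL_PACKAGES = ("torch", "transformers")


def missing_packages(names=OPTIONAL_PACKAGES):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


def report_missing(names=OPTIONAL_PACKAGES):
    """Build a human-readable message that includes a pip install hint."""
    missing = missing_packages(names)
    if not missing:
        return "All optional dependencies are installed."
    return (
        "Missing: " + ", ".join(missing)
        + ". Try: pip install " + " ".join(missing)
    )


print(report_missing())
```

find_spec locates a package without importing it, so the check stays cheap even when the packages are large.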

tcr_bert(sequences, cache_dir: str | None = None, batch_size: int = 32, residue_level: bool = False)

Embeds T-Cell Receptor (TCR) sequences using the TCR-BERT model.

Parameters:
  • sequences – Input TCR sequences (pd.Series for single chain or pd.DataFrame for H+L mode)

  • cache_dir – Directory to cache model files

  • batch_size – Number of sequences to process in each batch

Note:

TCR-BERT was pre-trained on 88,403 human TRA/TRB sequences from VDJdb and PIRD. This is the non-fine-tuned version, trained on human TCR data only. Maximum sequence length: 64.
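Because the note gives a maximum length of 64 and does not say how longer inputs are handled, it may be useful to separate them before embedding. The following is a minimal sketch under that assumption. split_by_length and embed_with_tcr_bert are hypothetical helpers, not part of amulety; only the tcr_bert call and its sequences and batch_size parameters come from the signature above.

```python
TCR_BERT_MAX_LEN = 64  # limit stated in the note above


def split_by_length(sequences, max_len=TCR_BERT_MAX_LEN):
    """Partition sequences into (within_limit, too_long) lists."""
    within, too_long = [], []
    for seq in sequences:
        (within if len(seq) <= max_len else too_long).append(seq)
    return within, too_long


def embed_with_tcr_bert(sequences, batch_size=32):
    """Embed only the sequences within the limit.

    Requires amulety and pandas to be installed; imported lazily so the
    length check above can be used without them.
    """
    import pandas as pd
    from amulety.tcr_embeddings import tcr_bert

    within, too_long = split_by_length(sequences)
    if too_long:
        print(f"Skipping {len(too_long)} sequence(s) longer than {TCR_BERT_MAX_LEN}")
    return tcr_bert(pd.Series(within), batch_size=batch_size)


# The length check alone runs without any optional dependency.
ok, long_ones = split_by_length(["CASSLGQGAEAFF", "A" * 65])
```

Skipping over-long sequences is one possible policy; reporting or trimming them are alternatives, depending on the analysis.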

tcrt5(sequences, cache_dir: str | None = None, batch_size: int = 32, residue_level: bool = False)

Embeds T-Cell Receptor (TCR) sequences using the TCRT5 model.

Parameters:
  • sequences – Input TCR sequences (pd.Series for single chain or pd.DataFrame for H+L mode)

  • cache_dir – Directory to cache model files

  • batch_size – Number of sequences to process in each batch
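To make the effect of batch_size concrete, the sketch below shows one common way to split a list of sequences into consecutive batches. It is an illustration only; iter_batches is a hypothetical helper, and the batching inside tcrt5 may be implemented differently.

```python
def iter_batches(items, batch_size=32):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 70 sequences with the default batch size give two full batches and one partial batch.
sizes = [len(batch) for batch in iter_batches(["CASSLGQGAEAFF"] * 70, batch_size=32)]
print(sizes)  # [32, 32, 6]
```

A larger batch_size generally means fewer model calls and higher peak memory use; a smaller one trades speed for a lower memory footprint.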

Note:

TCRT5 was pre-trained on masked span reconstruction using ~14M CDR3 β sequences from TCRdb and ~780k peptide-pseudosequence pairs from IEDB. The model supports beta chains only (the H chains for TCR). Maximum sequence length: 20 amino acids. Embedding dimension: 256.

Reference: https://huggingface.co/dkarthikeyan1/tcrt5_pre_tcrdb
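Given the two constraints in the note above (beta chains only, at most 20 amino acids), input records can be pre-selected before calling tcrt5. This is a minimal sketch: select_for_tcrt5 is a hypothetical helper, and the record layout, the key names "locus" and "cdr3_aa", and the "TRB" label are assumptions for the example, not amulety's input schema.

```python
TCRT5_MAX_LEN = 20  # maximum length stated in the note above


def select_for_tcrt5(records, locus_key="locus", seq_key="cdr3_aa",
                     max_len=TCRT5_MAX_LEN):
    """Keep beta-chain sequences that fit within the length limit.

    `records` is an iterable of dicts; the key names are assumptions.
    """
    return [
        rec[seq_key]
        for rec in records
        if rec[locus_key] == "TRB" and len(rec[seq_key]) <= max_len
    ]


records = [
    {"locus": "TRB", "cdr3_aa": "CASSLGQGAEAFF"},         # beta chain, 13 aa: kept
    {"locus": "TRA", "cdr3_aa": "CAVRDSNYQLIW"},          # alpha chain: dropped
    {"locus": "TRB", "cdr3_aa": "C" + "A" * 25 + "F"},    # beta chain, 27 aa: dropped
]
beta_cdr3 = select_for_tcrt5(records)
```

The resulting list could then be wrapped in a pd.Series and passed as the sequences argument.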