amulety.protein_embeddings

Protein sequence embedding functions using various models.

Functions

`custommodel`(sequences, model_path, ...[, ...])	Embeds sequences using a custom model specified by the user.
`esm2`(sequences[, cache_dir, batch_size, ...])	Embeds sequences using the ESM2 model.
`immune2vec`(sequences[, cache_dir, ...])	Embeds sequences using the Immune2Vec model.
`prott5`(sequences[, cache_dir, batch_size, ...])	Embeds BCR or TCR sequences using the ProtT5-XL protein language model (Rostlab/prot_t5_xl_uniref50).

custommodel(sequences, model_path: str, embedding_dimension: int, max_seq_length: int, cache_dir: str | None = '/tmp/amulety', residue_level: bool = False, batch_size: int | None = 50)

Embeds sequences using a custom model specified by the user. The maximum length of the sequences to be embedded is also set by the user (max_seq_length).

esm2(sequences, cache_dir: str | None = None, batch_size: int = 50, model_name: str = 'facebook/esm2_t33_650M_UR50D', residue_level: bool = False)

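Embeds sequences using the ESM2 model. The maximum length of the sequences to be embedded is 512. The embedding dimension is 1280 for the default model (facebook/esm2_t33_650M_UR50D); other ESM2 checkpoints have different embedding dimensions.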

Parameters:

sequences – Input protein sequences (pd.Series for single chain or pd.DataFrame for H+L mode)
cache_dir – Directory to cache model files
batch_size – Number of sequences to process in each batch
model_name – HuggingFace model name or path to fine-tuned model
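
The 512 limit and the batch_size parameter suggest a simple pre-flight check before calling esm2. The sketch below is not part of amulety: flag_overlong and make_batches are hypothetical helpers, the limit is assumed to count amino acids, and how esm2 itself treats over-length input (truncation, skipping, or an error) is not stated in this reference.

```python
from typing import Iterator, List, Tuple

MAX_LEN = 512  # limit documented for esm2; assumed here to count amino acids


def flag_overlong(sequences: List[str], max_len: int = MAX_LEN) -> Tuple[List[str], List[str]]:
    """Split sequences into those within the limit and those exceeding it."""
    ok = [s for s in sequences if len(s) <= max_len]
    too_long = [s for s in sequences if len(s) > max_len]
    return ok, too_long


def make_batches(sequences: List[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size sequences."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]


# Illustrative input: two short heavy-chain fragments and one over-length dummy.
seqs = ["EVQLVESGGGLVQPGGSLRLSCAAS", "A" * 600, "QVQLQESGPGLVKPSETLSLTCTVS"]
ok, too_long = flag_overlong(seqs)
batches = list(make_batches(ok, batch_size=2))
```

Checking lengths up front makes it explicit which sequences would fall outside the documented limit, instead of relying on whatever the embedding call does with them.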

immune2vec(sequences, cache_dir: str | None = None, batch_size: int = 50, n_dim: int = 100, n_gram: int = 3, pretrained_model_path: str | None = None, data_fraction: float = 1.0, window: int = 25, min_count: int = 1, workers: int = 3, random_seed: int = 42, installation_path: str | None = None)

Embeds sequences using the Immune2Vec model.

Immune2Vec is a Word2Vec-based embedding method specifically designed for immune receptor sequences (both BCR and TCR). It uses n-gram decomposition of amino acid sequences to learn vector representations.
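
To make the n-gram decomposition concrete, the sketch below shows two common ways to turn an amino acid sequence into "words" for a Word2Vec-style model: a sliding window of overlapping n-grams, and non-overlapping n-grams read from each starting offset. Both helpers are hypothetical, and which scheme Immune2Vec actually uses is not stated in this reference.

```python
from typing import List


def overlapping_ngrams(sequence: str, n: int = 3) -> List[str]:
    """Return every length-n window, sliding one residue at a time."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def reading_frames(sequence: str, n: int = 3) -> List[List[str]]:
    """Return n lists of non-overlapping n-grams, one per starting offset."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    frames = []
    for offset in range(n):
        # Step by n so that n-grams within one frame do not overlap.
        frame = [sequence[i:i + n] for i in range(offset, len(sequence) - n + 1, n)]
        frames.append(frame)
    return frames


fragment = "CARDYW"  # short illustrative fragment, not real data
print(overlapping_ngrams(fragment))  # ['CAR', 'ARD', 'RDY', 'DYW']
print(reading_frames(fragment))      # [['CAR', 'DYW'], ['ARD'], ['RDY']]
```

With n_gram = 3 (the default listed below), each 3-mer plays the role of a word and each sequence the role of a sentence.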

Parameters:

sequences – Input protein sequences (pd.Series for single chain or pd.DataFrame for H+L mode)
cache_dir – Directory to cache model files
batch_size – Number of sequences to process (not used for Immune2Vec but kept for consistency)
n_dim – Embedding dimension (default: 100)
n_gram – N-gram size for sequence decomposition (default: 3)
pretrained_model_path – Path to a pre-trained Immune2Vec model (optional)
data_fraction – Fraction of data to use for training (default: 1.0)
window – Context window size for Word2Vec (default: 25)
min_count – Minimum count for words to be included (default: 1)
workers – Number of worker threads (default: 3)
random_seed – Random seed for reproducibility (default: 42)
installation_path – Custom path to Immune2Vec installation directory (required if not in PYTHONPATH)

Returns:

Embeddings of shape (n_sequences, n_dim)

Return type:

torch.Tensor

prott5(sequences, cache_dir: str | None = None, batch_size: int = 32, residue_level: bool = False)

Embeds BCR or TCR sequences using the ProtT5-XL protein language model (Rostlab/prot_t5_xl_uniref50). The maximum sequence length to embed is 1024 amino acids, and the generated embeddings have a dimension of 1024.

Parameters:

sequences – Input protein sequences (pd.Series for single chain or pd.DataFrame for H+L mode)
cache_dir – Directory to cache model files
batch_size – Number of sequences to process in each batch
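
For context on what a ProtT5 tokenizer expects, the Rostlab model card describes a preprocessing step in which rare residues (U, Z, O, B) are mapped to X and residues are separated by spaces. The sketch below illustrates that step together with a length check against the 1024 amino acid limit. The helper names are hypothetical, and whether prott5 applies this preprocessing internally is not stated in this reference, so treat it as an illustration and not as a required step.

```python
import re
from typing import List

MAX_RESIDUES = 1024  # limit documented for prott5


def prepare_for_prott5(sequence: str) -> str:
    """Uppercase, map rare residues (U, Z, O, B) to X, and space-separate residues."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.strip().upper())
    return " ".join(cleaned)


def within_limit(sequences: List[str], max_residues: int = MAX_RESIDUES) -> List[bool]:
    """Report, per sequence, whether it fits within the residue limit."""
    return [len(s.strip()) <= max_residues for s in sequences]


print(prepare_for_prott5("CASSUB"))          # C A S S X X
print(within_limit(["CASSLG", "A" * 2000]))  # [True, False]
```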