amulety.protein_embeddings
Protein sequence embedding functions using various models.
Functions
| custommodel | Embeds sequences using a custom model specified by the user. |
| esm2 | Embeds sequences using the ESM2 model. |
| immune2vec | Embeds sequences using the Immune2Vec model. |
| prott5 | Embeds BCR or TCR sequences using the ProtT5-XL protein language model (Rostlab/prot_t5_xl_uniref50). |
- custommodel(sequences, model_path: str, embedding_dimension: int, max_seq_length: int, cache_dir: str | None = '/tmp/amulety', residue_level: bool = False, batch_size: int | None = 50)
Embeds sequences using a custom model specified by the user. The maximum sequence length (max_seq_length) and the embedding dimension (embedding_dimension) are supplied by the user.
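Because the length limit is user-supplied, it can help to check the input against it before embedding. The following is a minimal sketch in plain Python; `split_by_length` is a hypothetical helper, not part of amulety, and how `custommodel` itself treats over-long sequences is not stated here.

```python
def split_by_length(sequences, max_seq_length):
    """Partition sequences into those within the limit and those exceeding it.

    Hypothetical helper for illustration only; not part of amulety.
    """
    within, too_long = [], []
    for seq in sequences:
        if len(seq) <= max_seq_length:
            within.append(seq)
        else:
            too_long.append(seq)
    return within, too_long


# Example sequences with lengths 13, 6 and 14.
seqs = ["CASSLGQGAEAFF", "CARDYW", "CASSPGTGGNEQFF"]
within, too_long = split_by_length(seqs, max_seq_length=13)
print(within)    # ['CASSLGQGAEAFF', 'CARDYW']
print(too_long)  # ['CASSPGTGGNEQFF']
```

A pre-check like this only makes the limit visible; whether to drop, trim, or keep the long sequences is a decision for the caller.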
- esm2(sequences, cache_dir: str | None = None, batch_size: int = 50, model_name: str = 'facebook/esm2_t33_650M_UR50D', residue_level: bool = False)
Embeds sequences using the ESM2 model. The maximum length of the sequences to be embedded is 512. The embedding dimension is 1280 for the default model (facebook/esm2_t33_650M_UR50D).
- Parameters:
sequences – Input protein sequences (pd.Series for single chain or pd.DataFrame for H+L mode)
cache_dir – Directory to cache model files
batch_size – Number of sequences to process in each batch
model_name – HuggingFace model name or path to fine-tuned model
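The effect of batch_size can be pictured as slicing the input into consecutive chunks. This is a rough sketch of that idea in plain Python, assuming simple fixed-size batching; `iter_batches` is a hypothetical name and the library's own batching logic may differ.

```python
def iter_batches(sequences, batch_size=50):
    """Yield consecutive slices of at most batch_size sequences.

    Hypothetical helper for illustration only; not part of amulety.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]


# 120 placeholder sequences split with the default batch size of 50.
seqs = [f"SEQ{i}" for i in range(120)]
batch_lengths = [len(batch) for batch in iter_batches(seqs, batch_size=50)]
print(batch_lengths)  # [50, 50, 20]
```

Smaller batches lower peak memory use at the cost of more forward passes, which is the usual reason to tune this parameter.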
- immune2vec(sequences, cache_dir: str | None = None, batch_size: int = 50, n_dim: int = 100, n_gram: int = 3, pretrained_model_path: str | None = None, data_fraction: float = 1.0, window: int = 25, min_count: int = 1, workers: int = 3, random_seed: int = 42, installation_path: str | None = None)
Embeds sequences using the Immune2Vec model.
Immune2Vec is a Word2Vec-based embedding method specifically designed for immune receptor sequences (both BCR and TCR). It uses n-gram decomposition of amino acid sequences to learn vector representations.
- Parameters:
sequences – Input protein sequences (pd.Series for single chain or pd.DataFrame for H+L mode)
cache_dir – Directory to cache model files
batch_size – Number of sequences to process (not used for Immune2Vec but kept for consistency)
n_dim – Embedding dimension (default: 100)
n_gram – N-gram size for sequence decomposition (default: 3)
pretrained_model_path – Path to a pre-trained Immune2Vec model (optional)
data_fraction – Fraction of data to use for training (default: 1.0)
window – Context window size for Word2Vec (default: 25)
min_count – Minimum count for words to be included (default: 1)
workers – Number of worker threads (default: 3)
random_seed – Random seed for reproducibility (default: 42)
installation_path – Custom path to Immune2Vec installation directory (required if not in PYTHONPATH)
- Returns:
Embeddings of shape (n_sequences, n_dim)
- Return type:
torch.Tensor
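The n-gram decomposition mentioned above can be illustrated with a short sketch. It assumes overlapping n-grams taken with a stride of one residue; `to_ngrams` is a hypothetical helper, and the actual Immune2Vec implementation may split sequences differently (for example into non-overlapping reading frames).

```python
def to_ngrams(sequence, n_gram=3):
    """Split an amino acid sequence into overlapping n-grams ("words").

    Hypothetical helper for illustration only; not the Immune2Vec code.
    Sequences shorter than n_gram yield an empty list.
    """
    return [sequence[i:i + n_gram] for i in range(len(sequence) - n_gram + 1)]


words = to_ngrams("CARDYW", n_gram=3)
print(words)  # ['CAR', 'ARD', 'RDY', 'DYW']
```

A Word2Vec-style model then treats each n-gram as a word and each sequence as a sentence, which is where the window and min_count parameters come into play.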
- prott5(sequences, cache_dir: str | None = None, batch_size: int = 32, residue_level: bool = False)
Embeds BCR or TCR sequences using the ProtT5-XL protein language model (Rostlab/prot_t5_xl_uniref50). The maximum sequence length to embed is 1024 amino acids, and the generated embeddings have a dimension of 1024.
- Parameters:
sequences – Input protein sequences (pd.Series for single chain or pd.DataFrame for H+L mode)
cache_dir – Directory to cache model files
batch_size – Number of sequences to process in each batch
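For context, the Rostlab ProtT5 model card describes a preprocessing step in which the rare amino acids U, Z, O and B are mapped to X and residues are separated by spaces. The sketch below shows that step in isolation; `prepare_for_prott5` is a hypothetical name, and whether `prott5` applies this internally is not stated here, so it is shown to illustrate what the model expects rather than as a required step before calling the function.

```python
import re


def prepare_for_prott5(sequence):
    """Format one sequence the way the ProtT5 tokenizer expects.

    Hypothetical helper for illustration only; not part of amulety.
    """
    # Map rare or ambiguous residues (U, Z, O, B) to the unknown residue X.
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    # Insert a space between residues so each one becomes a separate token.
    return " ".join(cleaned)


prepared = prepare_for_prott5("CARUDYW")
print(prepared)  # C A R X D Y W
```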