amulety.bcr_embeddings
BCR embedding functions using various models.
Functions
ablang | Embeds antibody sequences using the AbLang model.
antiberta2 | Embeds sequences using the antiBERTa2 RoFormer model.
antiberty | Embeds sequences using the AntiBERTy model.
balm_paired | Embeds sequences using the BALM-paired model.
- ablang(sequences, batch_size: int = 50, residue_level: bool = False)[source]
Embeds antibody sequences using the AbLang model.
Note:
AbLang consists of two models: one for heavy chains and one for light chains. Each model has two parts: AbRep, which creates representations, and AbHead, which predicts amino acids. AbLang was trained on antibody sequences from the Observed Antibody Space (OAS) database and has been shown to restore missing residues, a key capability for B-cell receptor repertoire sequencing data. Maximum sequence length: 160 amino acids. Reference: https://github.com/oxpig/AbLang
- Parameters:
sequences – pd.Series for single chain or pd.DataFrame for H+L mode
batch_size – int: Number of sequences to process in each batch.
residue_level – bool: If True, returns residue-level embeddings.
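To make the batch_size parameter concrete, here is a minimal sketch of how a list of sequences could be split into batches of at most 50. The helper `iter_batches` and the placeholder sequences are hypothetical and written for illustration only; this is not the batching code used inside amulety.

```python
from typing import Iterator, List, Sequence


def iter_batches(sequences: Sequence[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` sequences."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sequences), batch_size):
        yield list(sequences[start:start + batch_size])


# 120 placeholder sequences, split with the documented default of 50.
toy_sequences = ["EVQLVESGGGLVQPGG"] * 120
batch_sizes = [len(batch) for batch in iter_batches(toy_sequences, batch_size=50)]
print(batch_sizes)  # [50, 50, 20]
```

A smaller batch_size lowers the number of sequences handled per step, at the cost of more steps.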
- antiberta2(sequences, cache_dir: str | None = None, residue_level: bool = False, batch_size: int = 50)[source]
Embeds sequences using the antiBERTa2 RoFormer model. The maximum length of the sequences to be embedded is 256.
- Parameters:
sequences – pd.Series for single chain or pd.DataFrame for H+L mode
cache_dir – Optional[str]: Directory to cache the model files.
residue_level – bool: If True, returns residue-level embeddings.
batch_size – int: Number of sequences to process in each batch.
- antiberty(sequences, cache_dir: str | None = None, batch_size: int = 50, residue_level: bool = False)[source]
Embeds sequences using the AntiBERTy model. The maximum length of the sequences to be embedded is 510.
- Parameters:
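sequences – pd.Series for single chain or pd.DataFrame for H+L mode
cache_dir – Optional[str]: Directory to cache the model files.
batch_size – int: Number of sequences to process in each batch.
residue_level – bool: If True, returns residue-level embeddings.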
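The signatures on this page do not list batch_size and residue_level in the same order: antiberta2 takes residue_level before batch_size, while antiberty takes batch_size first. The sketch below uses two hypothetical stand-in functions (`antiberta2_stub`, `antiberty_stub`) that copy only the documented parameter order and do no embedding, to show why passing these options by keyword is the safer habit.

```python
from typing import Optional


# Hypothetical stand-ins: same parameter order as the documented signatures,
# but they only echo back how the arguments were interpreted.
def antiberta2_stub(sequences, cache_dir: Optional[str] = None,
                    residue_level: bool = False, batch_size: int = 50):
    return {"residue_level": residue_level, "batch_size": batch_size}


def antiberty_stub(sequences, cache_dir: Optional[str] = None,
                   batch_size: int = 50, residue_level: bool = False):
    return {"residue_level": residue_level, "batch_size": batch_size}


seqs = ["EVQLVESGGGLVQPGG"]  # placeholder sequence

# The same positional call is interpreted differently by each function.
positional_a = antiberta2_stub(seqs, None, True, 16)
positional_b = antiberty_stub(seqs, None, True, 16)

# Keyword arguments carry the same meaning regardless of parameter order.
keyword_a = antiberta2_stub(seqs, residue_level=True, batch_size=16)
keyword_b = antiberty_stub(seqs, residue_level=True, batch_size=16)

print(positional_a == positional_b)  # False
print(keyword_a == keyword_b)        # True
```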
- balm_paired(sequences, cache_dir: str = '/tmp/amulety', residue_level: bool = False, batch_size: int = 50)[source]
Embeds sequences using the BALM-paired model. The maximum length of the sequences to be embedded is 1024. The embedding dimension is 1024.
- Parameters:
sequences – pd.Series for single chain or pd.DataFrame for H+L mode
cache_dir – str: Directory to cache the model files. Defaults to '/tmp/amulety'.
residue_level – bool: If True, returns residue-level embeddings.
batch_size – int: Number of sequences to process in each batch.
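Each function above states a maximum sequence length (160, 256, 510 and 1024). This page does not say what happens to longer input, so it may help to check lengths before embedding. The following is a minimal sketch under stated assumptions: `MAX_LENGTH` and `split_by_length` are hypothetical names, each limit is treated as an inclusive count of characters in the sequence string, and how the limits apply to H+L input is not covered.

```python
from typing import Dict, List, Tuple

# Maximum sequence lengths as stated in the function descriptions on this page.
MAX_LENGTH: Dict[str, int] = {
    "ablang": 160,
    "antiberta2": 256,
    "antiberty": 510,
    "balm_paired": 1024,
}


def split_by_length(sequences: List[str], model: str) -> Tuple[List[str], List[str]]:
    """Return (within_limit, too_long) for the named model.

    Assumption: the limit is inclusive and counts characters in the string.
    """
    limit = MAX_LENGTH[model]
    within = [s for s in sequences if len(s) <= limit]
    too_long = [s for s in sequences if len(s) > limit]
    return within, too_long


# Placeholder sequences of length 120, 161 and 300.
toy = ["A" * 120, "A" * 161, "A" * 300]
ok, long_ones = split_by_length(toy, "ablang")
print(len(ok), len(long_ones))  # 1 2
```

The same list checked against "antiberty" would pass all three sequences, since its stated limit is 510.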