amulety.bcr_embeddings

BCR embedding functions using various models.

Functions

`ablang`(sequences[, batch_size, residue_level])	Embeds antibody sequences using the AbLang model.
`antiberta2`(sequences[, cache_dir, ...])	Embeds sequences using the antiBERTa2 RoFormer model.
`antiberty`(sequences[, cache_dir, ...])	Embeds sequences using the AntiBERTy model.
`balm_paired`(sequences[, cache_dir, ...])	Embeds sequences using the BALM-paired model.

ablang(sequences, batch_size: int = 50, residue_level: bool = False)[source]

Embeds antibody sequences using the AbLang model.

Note:

AbLang consists of two models, one for heavy chains and one for light chains. Each model has two parts: AbRep, which creates representations, and AbHead, which predicts amino acids. The models were trained on antibody sequences from the OAS database and are effective at restoring missing residues, a key capability for B-cell receptor repertoire sequencing data. Maximum sequence length: 160 amino acids. Reference: https://github.com/oxpig/AbLang

Parameters:

sequences – pd.Series for single chain or pd.DataFrame for H+L mode
batch_size – int: Number of sequences to process in each batch.
residue_level – bool: If True, returns residue-level embeddings.
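
Because AbLang accepts at most 160 amino acids, it can help to screen inputs before embedding. The following is a minimal sketch, assuming sequences are plain amino acid strings. `split_by_length` is a hypothetical helper and not part of amulety; how `ablang` itself treats over-length input is not stated here.

```python
ABLANG_MAX_LEN = 160  # maximum sequence length stated in the note above


def split_by_length(sequences, max_len=ABLANG_MAX_LEN):
    """Split sequences into those within max_len and those exceeding it.

    Hypothetical helper for illustration; not part of amulety.
    """
    kept, too_long = [], []
    for seq in sequences:
        (kept if len(seq) <= max_len else too_long).append(seq)
    return kept, too_long


# One short placeholder sequence and one that is a single residue too long.
kept, too_long = split_by_length(["EVQLVESGGGLVQPGG", "A" * 161])
print(len(kept), len(too_long))  # prints: 1 1
```

The `kept` list could then be wrapped in a pd.Series and passed as `sequences`.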

antiberta2(sequences, cache_dir: str | None = None, residue_level: bool = False, batch_size: int = 50)[source]

Embeds sequences using the antiBERTa2 RoFormer model. The maximum length of the sequences to be embedded is 256.

Parameters:

sequences – pd.Series for single chain or pd.DataFrame for H+L mode
cache_dir – Optional[str]: Directory to cache the model files.
residue_level – bool: If True, returns residue-level embeddings.
batch_size – int: Number of sequences to process in each batch.
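
To keep model files in a known location, a thin wrapper can prepare the cache directory before calling the embedding function. This is a sketch under stated assumptions: `embed_with_cache` and `heavy_chain_series` are hypothetical names, and the embedding function is passed in as an argument so the wrapper relies only on the keyword arguments documented above.

```python
import os


def embed_with_cache(sequences, embed_fn, cache_dir, batch_size=50):
    """Call an embedding function with an explicit, pre-created cache directory.

    Hypothetical wrapper; embed_fn is expected to accept the keyword
    arguments documented for antiberta2.
    """
    cache_dir = os.path.expanduser(cache_dir)
    os.makedirs(cache_dir, exist_ok=True)  # no-op if the directory exists
    return embed_fn(
        sequences,
        cache_dir=cache_dir,
        residue_level=False,
        batch_size=batch_size,
    )


# Possible usage, assuming amulety is installed and heavy_chain_series
# is a pd.Series of amino acid strings:
#   from amulety.bcr_embeddings import antiberta2
#   embeddings = embed_with_cache(heavy_chain_series, antiberta2, "~/.cache/amulety")
```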

antiberty(sequences, cache_dir: str | None = None, batch_size: int = 50, residue_level: bool = False)[source]

Embeds sequences using the AntiBERTy model. The maximum length of the sequences to be embedded is 510.

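Parameters:

sequences – pd.Series for single chain or pd.DataFrame for H+L mode
cache_dir – Optional[str]: Directory to cache the model files.
batch_size – int: Number of sequences to process in each batch.
residue_level – bool: If True, returns residue-level embeddings.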
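
The effect of `batch_size` can be illustrated with a small chunking sketch. How the library batches internally is not documented here, so `iter_batches` is a hypothetical helper that only shows what "sequences per batch" means for a given input size.

```python
import math


def iter_batches(sequences, batch_size=50):
    """Yield consecutive slices of at most batch_size sequences."""
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]


seqs = ["EVQLVESGG"] * 120  # placeholder sequences for illustration
batches = list(iter_batches(seqs, batch_size=50))
print([len(b) for b in batches])  # prints: [50, 50, 20]
print(math.ceil(len(seqs) / 50))  # prints: 3
```

A smaller `batch_size` means more batches, each holding fewer sequences at once.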

balm_paired(sequences, cache_dir: str = '/tmp/amulety', residue_level: bool = False, batch_size: int = 50)[source]

Embeds sequences using the BALM-paired model. The maximum length of the sequences to be embedded is 1024. The embedding dimension is 1024.

Parameters:

sequences – pd.Series for single chain or pd.DataFrame for H+L mode
cache_dir – str: Directory to cache the model files. Defaults to '/tmp/amulety'.
residue_level – bool: If True, returns residue-level embeddings.
batch_size – int: Number of sequences to process in each batch.
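
With an embedding dimension of 1024, output size grows quickly, especially when `residue_level` is True. The estimate below is a rough sketch. It assumes float32 values, a sequence-level output of shape (n, 1024), and a residue-level output of shape (n, length, 1024); none of these shapes or dtypes are confirmed by this page.

```python
EMBED_DIM = 1024     # embedding dimension stated above
MAX_LEN = 1024       # maximum sequence length stated above
BYTES_PER_FLOAT = 4  # assumption: float32 values


def sequence_level_bytes(n_sequences):
    """Estimated size of an (n_sequences, EMBED_DIM) float32 array."""
    return n_sequences * EMBED_DIM * BYTES_PER_FLOAT


def residue_level_bytes(n_sequences, seq_len=MAX_LEN):
    """Estimated size of an (n_sequences, seq_len, EMBED_DIM) float32 array.

    The residue-level output shape is an assumption for illustration.
    """
    return n_sequences * seq_len * EMBED_DIM * BYTES_PER_FLOAT


print(sequence_level_bytes(10_000) / 1e6)   # prints: 40.96 (megabytes)
print(residue_level_bytes(100, 256) / 1e6)  # prints: 104.8576 (megabytes)
```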