amulety.bcr_embeddings

BCR embedding functions using various models.

Functions

ablang(sequences[, batch_size, residue_level])

Embeds antibody sequences using the AbLang model.

antiberta2(sequences[, cache_dir, ...])

Embeds sequences using the antiBERTa2 RoFormer model.

antiberty(sequences[, cache_dir, ...])

Embeds sequences using the AntiBERTy model.

balm_paired(sequences[, cache_dir, ...])

Embeds sequences using the BALM-paired model.

ablang(sequences, batch_size: int = 50, residue_level: bool = False)[source]

Embeds antibody sequences using the AbLang model.

Note:

AbLang consists of two models: one for heavy chains and one for light chains. Each model has two parts: AbRep, which creates representations, and AbHead, which predicts amino acids. The models were trained on antibody sequences from the Observed Antibody Space (OAS) database and have been shown to restore missing residues, a key capability for B-cell receptor repertoire sequencing data. Maximum sequence length: 160 amino acids. Reference: https://github.com/oxpig/AbLang

Parameters:
  • sequences – pd.Series for single chain or pd.DataFrame for H+L mode

  • batch_size – int: Number of sequences to process in each batch.

  • residue_level – bool: If True, returns residue-level embeddings.
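To illustrate what batch_size controls, the sketch below splits a list of sequences into consecutive batches of at most batch_size items. The helper iter_batches and the placeholder sequences are hypothetical and are not part of amulety; how the library batches internally is not documented here, so this only shows the general idea.

```python
from typing import Iterator, List, Sequence


def iter_batches(sequences: Sequence[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size sequences.

    Hypothetical helper for illustration only; not amulety's implementation.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sequences), batch_size):
        yield list(sequences[start:start + batch_size])


# 120 placeholder sequences split with the documented default of 50.
toy_sequences = ["EVQLVESGG"] * 120
batch_sizes = [len(batch) for batch in iter_batches(toy_sequences, batch_size=50)]
print(batch_sizes)  # [50, 50, 20]
```

A smaller batch_size generally lowers peak memory use at the cost of more model calls, which is the usual reason to tune it.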

antiberta2(sequences, cache_dir: str | None = None, residue_level: bool = False, batch_size: int = 50)[source]

Embeds sequences using the antiBERTa2 RoFormer model. The maximum length of the sequences to be embedded is 256.

Parameters:
  • sequences – pd.Series for single chain or pd.DataFrame for H+L mode

  • cache_dir – Optional[str]: Directory to cache the model files.

  • residue_level – bool: If True, returns residue-level embeddings.

  • batch_size – int: Number of sequences to process in each batch.
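The residue_level flag distinguishes one vector per residue from one vector per sequence. A common way to reduce residue-level vectors to a single sequence-level vector is mean pooling, sketched below with plain Python lists. Whether amulety uses mean pooling is an assumption, not something this page states; mean_pool and the toy values are hypothetical.

```python
from typing import List


def mean_pool(residue_embeddings: List[List[float]]) -> List[float]:
    """Average per-residue vectors into one sequence-level vector.

    Assumption: mean pooling is used purely as an illustration; the
    library's actual reduction is not documented on this page.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue vector")
    n_residues = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / n_residues for d in range(dim)]


# Three residues with 2-dimensional toy embeddings (residue-level view).
toy_residue_level = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# One 2-dimensional vector for the whole sequence (sequence-level view).
toy_sequence_level = mean_pool(toy_residue_level)
print(toy_sequence_level)  # [3.0, 4.0]
```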

antiberty(sequences, cache_dir: str | None = None, batch_size: int = 50, residue_level: bool = False)[source]

Embeds sequences using the AntiBERTy model. The maximum length of the sequences to be embedded is 510.

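Parameters:
  • sequences – pd.Series for single chain or pd.DataFrame for H+L mode

  • cache_dir – Optional[str]: Directory to cache the model files.

  • batch_size – int: Number of sequences to process in each batch.

  • residue_level – bool: If True, returns residue-level embeddings.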

balm_paired(sequences, cache_dir: str = '/tmp/amulety', residue_level: bool = False, batch_size: int = 50)[source]

Embeds sequences using the BALM-paired model. The maximum length of the sequences to be embedded is 1024. The embedding dimension is 1024.

Parameters:
  • sequences – pd.Series for single chain or pd.DataFrame for H+L mode

  • cache_dir – str: Directory to cache the model files. Defaults to '/tmp/amulety'.

  • residue_level – bool: If True, returns residue-level embeddings.

  • batch_size – int: Number of sequences to process in each batch.
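Because each model documents a different maximum sequence length, it can help to check inputs before embedding. The sketch below collects the limits stated on this page and flags sequences that exceed them. It assumes length is counted in amino acids for every model, which this page states explicitly only for AbLang; MAX_LENGTH, too_long, and models_that_fit are hypothetical names, not part of amulety.

```python
from typing import Dict, List

# Maximum sequence lengths as stated on this reference page.
# Assumption: all limits are counted in amino acids.
MAX_LENGTH: Dict[str, int] = {
    "ablang": 160,
    "antiberta2": 256,
    "antiberty": 510,
    "balm_paired": 1024,
}


def too_long(sequences: List[str], model: str) -> List[int]:
    """Return positions of sequences longer than the documented limit."""
    limit = MAX_LENGTH[model]
    return [i for i, seq in enumerate(sequences) if len(seq) > limit]


def models_that_fit(sequence: str) -> List[str]:
    """Return the models whose documented limit accommodates the sequence."""
    return [name for name, limit in MAX_LENGTH.items() if len(sequence) <= limit]


# Placeholder sequences of length 120, 200, and 600.
toy = ["A" * 120, "A" * 200, "A" * 600]
print(too_long(toy, "ablang"))   # [1, 2]
print(models_that_fit(toy[1]))   # ['antiberta2', 'antiberty', 'balm_paired']
```

This page does not say what each function does with over-length input, so checking up front avoids relying on undocumented behavior.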