{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AMULETY: Using Embeddings for Machine Learning\n", "\n", "This tutorial demonstrates how to use AMULETY for downstream machine learning tasks with B-cell receptor (BCR) data. AMULETY is a Python package for generating embeddings from BCR and T-cell receptor (TCR) sequences using various state-of-the-art models.\n", "\n", "## Overview\n", "\n", "In this tutorial, you will learn to:\n", "1. Install and set up AMULETY\n", "2. Load and prepare BCR data in AIRR format\n", "3. Generate embeddings using different models\n", "4. Perform downstream machine learning tasks\n", "5. Evaluate model performance\n", "\n", "## Installation\n", "\n", "First, install AMULETY using pip:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: amulety in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (0.1.1)\n", "Requirement already satisfied: numpy in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from amulety) (1.26.0)\n", "Requirement already satisfied: pandas in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from amulety) (2.2.3)\n", "Requirement already satisfied: torch in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from amulety) (2.6.0)\n", "Requirement already satisfied: transformers in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from amulety) (4.50.3)\n", "Requirement already satisfied: typer in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from amulety) (0.15.4)\n", "Requirement already satisfied: antiberty in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from amulety) (0.1.3)\n", "Requirement already satisfied: ablang in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from amulety) (0.3.1)\n", "Requirement already satisfied: rjieba in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from amulety) (0.1.13)\n", 
"Requirement already satisfied: packaging>=20.0 in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages 
(from matplotlib) (24.2)\n", "Requirement already satisfied: pillow>=8 in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from matplotlib) (11.1.0)\n", "Requirement already satisfied: pyparsing>=2.3.1 in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from matplotlib) (3.2.0)\n", "Requirement already satisfied: six>=1.5 in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n", "Requirement already satisfied: MarkupSafe>=2.0 in /opt/anaconda3/envs/torchen/lib/python3.10/site-packages (from jinja2->torch) (3.0.2)\n" ] } ], "source": [ "# Install Amulety\n", "!pip install amulety\n", "\n", "# Install additional dependencies for this tutorial\n", "!pip install pandas numpy scikit-learn torch matplotlib seaborn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Required Libraries" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All libraries imported successfully!\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import torch\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.model_selection import StratifiedGroupKFold, GridSearchCV\n", "from sklearn.svm import SVC\n", "from sklearn.metrics import matthews_corrcoef, f1_score, accuracy_score\n", "from sklearn.preprocessing import LabelEncoder\n", "from collections import Counter\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Import Amulety\n", "from amulety import embed_airr\n", "\n", "print(\"All libraries imported successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load BCR Data\n", "\n", "We'll load the BCR data in AIRR (Adaptive Immune Receptor Repertoire) format. This dataset contains antibody sequences with associated metadata including gene usage, isotype, and other annotations." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2025-09-24 13:32:27-- https://zenodo.org/records/17186858/files/ML_bcr_airr_dataset.tsv\n", "Resolving zenodo.org (zenodo.org)... 188.185.43.25, 188.185.45.92, 188.185.48.194, ...\n", "Connecting to zenodo.org (zenodo.org)|188.185.43.25|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 345071483 (329M) [application/octet-stream]\n", "Saving to: 'tutorial/ML_bcr_airr_dataset.tsv'\n", "\n", "ML_bcr_airr_dataset 100%[===================>] 329.08M 6.45MB/s in 54s \n", "\n", "2025-09-24 13:33:21 (6.14 MB/s) - 'tutorial/ML_bcr_airr_dataset.tsv' saved [345071483/345071483]\n", "\n" ] } ], "source": [ "# Download the tutorial dataset\n", "! mkdir -p tutorial\n", "! wget -P tutorial https://zenodo.org/records/17186858/files/ML_bcr_airr_dataset.tsv" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset shape: (1399366, 19)\n", "\n", "Columns: ['sequence_id', 'sequence_vdj_aa', 'locus', 'cell_id', 'chain_type', 'v_call', 'v_call_family', 'j_call_family', 'mu_freq', 'junction_aa_length', 'isotype', 'source', 'subject', 'specificity', 'duplicate_count', 'productive', 'rev_comp', 'stop_codon', 'vj_in_frame']\n", "\n", "First few rows:\n" ] }, { "data": { "text/html": [ "
| | sequence_id | sequence_vdj_aa | locus | cell_id | chain_type | v_call | v_call_family | j_call_family | mu_freq | junction_aa_length | isotype | source | subject | specificity | duplicate_count | productive | rev_comp | stop_codon | vj_in_frame |\n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "
| 0 | 1_heavy | QVQLVESGGGLVKPGGSLRLSCAASGFTFSDYYMSWIRQAPGKGLE... | IGH | cell_1 | H | IGHV3-11 | IGHV3 | IGHJ4 | 0.016447 | 12.0 | NaN | OAS | OAS_King_Subject-BCP3 | unlabeled | 1 | True | False | False | True |\n", "
| 1 | 2_heavy | QVQLVESGGGVVQPGRSLRLSCAASGFTFSSYAMHWVRQAPGKGLE... | IGH | cell_2 | H | IGHV3-30 | IGHV3 | IGHJ4 | 0.026578 | 17.0 | NaN | OAS | OAS_King_Subject-BCP3 | unlabeled | 1 | True | False | False | True |\n", "
| 2 | 3_heavy | EVQLVESGGGLVKPGGSLTLSCAVSGFTFKNAWMSWVRQAPGKGLE... | IGH | cell_3 | H | IGHV3-15 | IGHV3 | IGHJ4 | 0.069536 | 14.0 | NaN | OAS | OAS_King_Subject-BCP3 | unlabeled | 1 | True | False | False | True |\n", "
| 3 | 4_heavy | EVQLVESGGALVKPGGSLRLSCVVSGLTFTDAYMIWVRQAPGKGLE... | IGH | cell_4 | H | IGHV3-15 | IGHV3 | IGHJ6 | 0.075410 | 14.0 | NaN | OAS | OAS_King_Subject-BCP3 | unlabeled | 1 | True | False | False | True |\n", "
| 4 | 5_heavy | QEELVEAGGTVVQPGRSLGLSCAASGFSFSNYLMHWVRQTPGKGLE... | IGH | cell_5 | H | IGHV3-30-3 | IGHV3 | IGHJ5 | 0.076923 | 22.0 | NaN | OAS | OAS_King_Subject-BCP3 | unlabeled | 1 | True | False | False | True |\n", "