Jupyter Notebook

Gene Ontology (GO)#

Pathways represent interconnected molecular networks of signaling cascades that govern critical cellular processes. They provide understandings cellular behavior mechanisms, insights of disease progression and treatment responses. In an R&D organization, managing pathways across different datasets are crucial for gaining insights of potential therapeutic targets and intervention strategies.

In this notebook we manage a pathway registry based on “2023 GO Biological Process” ontology. We’ll walk you through the following steps:

  1. Register pathways and link them to genes.

  2. Perform a pathway enrichment analysis on an interferon-beta treated dataset and track the dataset with LaminDB.

  3. Demonstrate how datasets are queryable by pathways and genes using LaminDB.

Setup#

Warning

Please ensure that you have created or loaded a LaminDB instance before running the remaining part of this notebook!

# A lamindb instance containing bionty schema (skip if you already loaded your own instance)
!lamin init --storage ./enrichr --schema bionty
Hide code cell output
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-11-15 15:52:55 UTC)
✅ saved: Storage(uid='fvRpasFd', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/enrichr', type='local', updated_at=2023-11-15 15:52:55 UTC, created_by_id=1)
💡 loaded instance: testuser1/enrichr
💡 did not register local instance on hub

import lamindb as ln
import lnschema_bionty as lb
from lamin_usecases import datasets as ds
import gseapy as gp
import scanpy as sc
import matplotlib.pyplot as plt

lb.settings.organism = "human"  # globally set organism
💡 lamindb instance: testuser1/enrichr

Fetch GO pathways annotated with human genes using Enrichr#

First we fetch the “GO_Biological_Process_2023” pathways for humans using GSEApy which wraps GSEA and Enrichr.

go_bp = gp.get_library(name="GO_Biological_Process_2023", organism="Human")
print(f"Number of pathways {len(go_bp)}")
2023-11-15 15:52:58,008:INFO - Downloading and generating Enrichr library gene sets...
2023-11-15 15:53:00,383:INFO - 0001 gene_sets have been filtered out when max_size=2000 and min_size=0
Number of pathways 5406

go_bp["ATF6-mediated Unfolded Protein Response (GO:0036500)"]
['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF']

Parse out the ontology_id from keys, convert into the format of {ontology_id: (name, genes)}

def parse_ontology_id_from_keys(key):
    """Parse out the ontology id.

    "ATF6-mediated Unfolded Protein Response (GO:0036500)" -> ("GO:0036500", "ATF6-mediated Unfolded Protein Response")
    """
    id = key.split(" ")[-1].replace("(", "").replace(")", "")
    name = key.replace(f" ({id})", "")
    return (id, name)
go_bp_parsed = {}

for key, genes in go_bp.items():
    id, name = parse_ontology_id_from_keys(key)
    go_bp_parsed[id] = (name, genes)
go_bp_parsed["GO:0036500"]
('ATF6-mediated Unfolded Protein Response',
 ['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF'])

Register pathway ontology in LaminDB#

pathway_bionty = lb.Pathway.bionty()  # equals to bionty.Pathway()
pathway_bionty
Pathway
Organism: all
Source: go, 2023-05-10
#terms: 47514

📖 Pathway.df(): ontology reference table
🔎 Pathway.lookup(): autocompletion of terms
🎯 Pathway.search(): free text search of terms
✅ Pathway.validate(): strictly validate values
🧐 Pathway.inspect(): full inspection of values
👽 Pathway.standardize(): convert to standardized names
🪜 Pathway.diff(): difference between two versions
🔗 Pathway.ontology: Pronto.Ontology object

Next, we register all the pathways and genes in LaminDB to finally link pathways to genes.

Register pathway terms#

To register the pathways we make use of .from_values to directly parse the annotated GO pathway ontology IDs into LaminDB.

pathway_records = lb.Pathway.from_values(go_bp_parsed.keys(), lb.Pathway.ontology_id)
lb.Pathway.from_bionty(ontology_id="GO:0015868")
Pathway(uid='SMqshx3Y', name='purine ribonucleotide transport', ontology_id='GO:0015868', description='The Directed Movement Of A Purine Ribonucleotide, Any Compound Consisting Of A Purine Ribonucleoside (A Purine Organic Base Attached To A Ribose Sugar) Esterified With (Ortho)Phosphate, Into, Out Of Or Within A Cell.', bionty_source_id=40, created_by_id=1)
ln.save(pathway_records, parents=False)  # not recursing through parents

Register gene symbols#

Similarly, we use .from_values for all Pathway associated genes to register them with LaminDB.

all_genes = {g for genes in go_bp.values() for g in genes}
gene_records = lb.Gene.from_values(all_genes, lb.Gene.symbol)
Hide code cell output
❗ ambiguous validation in Bionty for 1082 records: 'GBA1', 'CCHCR1', 'GTF2H4', 'NFATC4', 'ZNF430', 'PNP', 'OTUB2', 'STK19', 'HLA-DPA1', 'HES5', 'CENATAC', 'TRAPPC12', 'CLDN7', 'WNT3', 'NAT8B', 'ADCY4', 'ZNF35', 'KIR2DL4', 'MARF1', 'NFKBIB', ...
did not create Gene records for 37 non-validated symbols: 'AFD1', 'AZF1', 'CCL4L1', 'DGS2', 'DUX3', 'DUX5', 'FOXL3-OT1', 'IGL', 'LOC100653049', 'LOC102723475', 'LOC102723996', 'LOC102724159', 'LOC107984156', 'LOC112268384', 'LOC122319436', 'LOC122513141', 'LOC122539214', 'LOC344967', 'MDRV', 'MTRNR2L1', ...
gene_records[:3]
[Gene(uid='X9YXexM3fVi0', symbol='FAAH2', ensembl_gene_id='ENSG00000165591', ncbi_gene_ids='158584', biotype='protein_coding', description='fatty acid amide hydrolase 2 [Source:HGNC Symbol;Acc:HGNC:26440]', synonyms='RP11-479E16.1|FAAH-2|FLJ31204|AMDD', organism_id=1, bionty_source_id=9, created_by_id=1),
 Gene(uid='4pWg9UUW0X9l', symbol='GBA1', ensembl_gene_id='ENSG00000177628', ncbi_gene_ids='2629', biotype='protein_coding', description='glucosylceramidase beta 1 [Source:HGNC Symbol;Acc:HGNC:4177]', synonyms='GBA|GLUC', organism_id=1, bionty_source_id=9, created_by_id=1),
 Gene(uid='1OtGpkrh8hlI', symbol='GBA1', ensembl_gene_id='ENSG00000262446', ncbi_gene_ids='2629', biotype='protein_coding', description='glucosylceramidase beta 1 [Source:HGNC Symbol;Acc:HGNC:4177]', synonyms='GBA|GLUC', organism_id=1, bionty_source_id=9, created_by_id=1)]
ln.save(gene_records);

A interferon-beta treated dataset#

We will now conduct a pathway enrichment analysis on a small peripheral blood mononuclear cell dataset that is split into control and stimulated groups. The stimulated group was treated with interferon beta.

Let’s load the dataset and look at the cell type annotations.

adata = ds.anndata_seurat_ifnb()
Hide code cell output


/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/anndata/__init__.py:51: FutureWarning: `anndata.read` is deprecated, use `anndata.read_h5ad` instead. `ad.read` will be removed in mid 2024.
  warnings.warn(
adata
AnnData object with n_obs × n_vars = 13999 × 14053
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'stim', 'seurat_annotations'
    var: 'features'
    uns: 'log1p'
adata.obs["seurat_annotations"].value_counts()
CD14 Mono       4362
CD4 Naive T     2504
CD4 Memory T    1762
CD16 Mono       1044
B                978
CD8 T            814
NK               633
T activated      619
DC               472
B Activated      388
Mk               236
pDC              132
Eryth             55
Name: seurat_annotations, dtype: int64

For simplicity, we subset to “B Activated” cells:

adata_ba = adata[adata.obs.seurat_annotations == "B Activated"].copy()
adata_ba
AnnData object with n_obs × n_vars = 388 × 14053
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'stim', 'seurat_annotations'
    var: 'features'
    uns: 'log1p'

Pathway enrichment analysis using Enrichr#

This analysis is based on the GSEApy scRNA-seq Example.

First, we compute differentially expressed genes using a Wilcoxon test between stimulated and control cells.

# compute differentially expressed genes
sc.tl.rank_genes_groups(
    adata_ba,
    groupby="stim",
    use_raw=False,
    method="wilcoxon",
    groups=["STIM"],
    reference="CTRL",
)

rank_genes_groups_df = sc.get.rank_genes_groups_df(adata_ba, "STIM")
... storing 'orig.ident' as categorical
... storing 'seurat_annotations' as categorical
rank_genes_groups_df.head()
names scores logfoldchanges pvals pvals_adj
0 ISG15 16.881584 5.923428 6.147295e-64 6.536230e-60
1 ISG20 16.857113 4.167713 9.302256e-64 6.536230e-60
2 IFIT3 14.587233 31.232290 3.386569e-48 1.586382e-44
3 IFI6 14.128634 6.471180 2.530019e-45 8.888589e-42
4 MX1 13.442097 6.241539 3.425901e-41 9.628837e-38

Next, we filter out up/down-regulated differentially expressed gene sets:

degs_up = rank_genes_groups_df[
    (rank_genes_groups_df["logfoldchanges"] > 0)
    & (rank_genes_groups_df["pvals_adj"] < 0.05)
]
degs_dw = rank_genes_groups_df[
    (rank_genes_groups_df["logfoldchanges"] < 0)
    & (rank_genes_groups_df["pvals_adj"] < 0.05)
]
degs_up.shape, degs_dw.shape
((89, 5), (47, 5))

Run pathway enrichment analysis on DEGs and plot top 10 pathways:

enr_up = gp.enrichr(degs_up.names, gene_sets="GO_Biological_Process_2023").res2d

gp.dotplot(enr_up, figsize=(2, 3), title="Up", cmap=plt.cm.autumn_r);
_images/3ee3cd6b8d317a772209563c9c61abe92953d06b8328a5ac05f931cb58621e25.png
enr_dw = gp.enrichr(degs_dw.names, gene_sets="GO_Biological_Process_2023").res2d

gp.dotplot(enr_dw, figsize=(2, 3), title="Down", cmap=plt.cm.winter_r, size=10);
_images/22b60597ce856231002cb6ee786b53d2e274128ff5d1f3263af2d5fed59af6d5.png

Track datasets containing annotated pathways in LaminDB#

Let’s enable tracking of the current notebook as the transform of this file:

ln.track()
💡 notebook imports: gseapy==1.1.0 lamin_usecases==0.0.1 lamindb==0.61.0 lnschema_bionty==0.35.0 matplotlib==3.8.1 scanpy==1.9.6
💡 saved: Transform(uid='6oxEIEduvo6wz8', name='Gene Ontology (GO)', short_name='enrichr', version='0', type=notebook, updated_at=2023-11-15 15:53:44 UTC, created_by_id=1)
💡 saved: Run(uid='aiazZIEawevaS2DOfDpq', run_at=2023-11-15 15:53:44 UTC, transform_id=1, created_by_id=1)

We further create a File object to track the dataset.

file = ln.File.from_anndata(
    adata_ba, description="seurat_ifnb_activated_Bcells", field=lb.Gene.symbol
)
file.save()
4729 terms (33.70%) are not validated for symbol: AL627309.1, RP11-206L10.2, LINC00115, KLHL17, C1orf159, ACAP3, CPSF3L, GLTPD1, RP4-758J18.2, AL645728.1, RP11-345P4.9, SLC35E2B, SLC35E2, RP5-892K4.1, C1orf86, AL590822.2, MORN1, RP3-395M20.12, RP3-395M20.9, FAM213B, ...
5 terms (100.00%) are not validated for name: orig.ident, nCount_RNA, nFeature_RNA, stim, seurat_annotations
❗    no validated features, skip creating feature set

We further create two feature sets for degs_up and degs_dw which we can later associate with the associated pathways:

degs_up_featureset = ln.FeatureSet.from_values(degs_up.names, lb.Gene.symbol)
Hide code cell output
13 terms (14.60%) are not validated for symbol: EPSTI1, WARS, LAP3, SAMD9L, NMI, TMEM123, CMPK2, H3F3B, PSMA2.1, PHF11, CLEC2D, DDX58, CD48
degs_dw_featureset = ln.FeatureSet.from_values(degs_dw.names, lb.Gene.symbol)
Hide code cell output
3 terms (6.40%) are not validated for symbol: GNB2L1, TMEM66, HLA-DQB1

Link the top 10 pathways to the corresponding differentially expressed genes:

# get ontology ids for the top 10 pathways
enr_up_top10 = [
    pw_id[0] for pw_id in enr_up.head(10).Term.apply(parse_ontology_id_from_keys)
]
enr_dw_top10 = [
    pw_id[0] for pw_id in enr_dw.head(10).Term.apply(parse_ontology_id_from_keys)
]

# get pathway records
enr_up_top10_pathways = lb.Pathway.from_values(enr_up_top10, lb.Pathway.ontology_id)
enr_dw_top10_pathways = lb.Pathway.from_values(enr_dw_top10, lb.Pathway.ontology_id)

Link feature sets to file:

file.features.add_feature_set(degs_up_featureset, slot="up-DEGs")
file.features.add_feature_set(degs_dw_featureset, slot="down-DEGs")

Associate the pathways to the differentially expressed genes:

degs_up_featureset.pathways.set(enr_up_top10_pathways)
degs_dw_featureset.pathways.set(enr_dw_top10_pathways)
degs_up_featureset.pathways.list("name")
['antiviral innate immune response',
 'defense response to symbiont',
 'response to type II interferon',
 'defense response to virus',
 'negative regulation of viral genome replication',
 'interleukin-27-mediated signaling pathway',
 'positive regulation of interferon-beta production',
 'response to interferon-beta',
 'negative regulation of viral process',
 'regulation of viral genome replication']

Querying for pathways#

Querying for pathways is now simple with .filter:

lb.Pathway.filter(name__contains="interferon-beta").df()
uid name ontology_id abbr synonyms description bionty_source_id updated_at created_by_id
id
684 l06ZujxW cellular response to interferon-beta GO:0035458 None cellular response to fibroblast interferon|cel... Any Process That Results In A Change In State ... 40 2023-11-15 15:53:03.552987+00:00 1
2130 uu9GYFx2 negative regulation of interferon-beta production GO:0032688 None down regulation of interferon-beta production|... Any Process That Stops, Prevents, Or Reduces T... 40 2023-11-15 15:53:03.648478+00:00 1
3127 SGYMKD7O positive regulation of interferon-beta production GO:0032728 None positive regulation of IFN-beta production|up-... Any Process That Activates Or Increases The Fr... 40 2023-11-15 15:53:03.712064+00:00 1
4334 GD9xCHBK regulation of interferon-beta production GO:0032648 None regulation of IFN-beta production Any Process That Modulates The Frequency, Rate... 40 2023-11-15 15:53:03.797467+00:00 1
4953 mCgM7JYR response to interferon-beta GO:0035456 None response to fiblaferon|response to fibroblast ... Any Process That Results In A Change In State ... 40 2023-11-15 15:53:03.842406+00:00 1

Query pathways from a gene:

lb.Pathway.filter(genes__symbol="KIR2DL1").df()
uid name ontology_id abbr synonyms description bionty_source_id updated_at created_by_id
id
1346 TSXmNUbN immune response-inhibiting cell surface recept... GO:0002767 None immune response-inhibiting cell surface recept... The Series Of Molecular Signals Initiated By A... 40 2023-11-15 15:53:03.595870+00:00 1

Query files from a pathway:

ln.File.filter(feature_sets__pathways__name__icontains="interferon-beta").first()
File(uid='ZLeBNG4g4Y6DlBaVTS05', suffix='.h5ad', accessor='AnnData', description='seurat_ifnb_activated_Bcells', size=5896640, hash='DsfKFJ906VT9UR4lrxd8Ag', hash_type='md5', visibility=0, key_is_virtual=True, updated_at=2023-11-15 15:53:46 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)

Query featuresets from a pathway to learn from which geneset this pathway was computed:

pathway = lb.Pathway.filter(ontology_id="GO:0035456").one()
pathway
Pathway(uid='mCgM7JYR', name='response to interferon-beta', ontology_id='GO:0035456', synonyms='response to fiblaferon|response to fibroblast interferon|response to interferon beta', description='Any Process That Results In A Change In State Or Activity Of A Cell Or An Organism (In Terms Of Movement, Secretion, Enzyme Production, Gene Expression, Etc.) As A Result Of An Interferon-Beta Stimulus. Interferon-Beta Is A Type I Interferon.', updated_at=2023-11-15 15:53:03 UTC, bionty_source_id=40, created_by_id=1)
degs = ln.FeatureSet.filter(pathways__ontology_id=pathway.ontology_id).one()

Now we can get the list of genes that are differentially expressed and belong to this pathway:

pathway_genes = set(pathway.genes.list("symbol"))
degs_genes = set(degs.genes.list("symbol"))
pathway_genes.intersection(degs_genes)
{'BST2',
 'IFI16',
 'IFITM2',
 'IFITM3',
 'IRF1',
 'OAS1',
 'PLSCR1',
 'STAT1',
 'XAF1'}
# clean up test instance
!lamin delete --force enrichr
!rm -r ./enrichr
Hide code cell output
💡 deleting instance testuser1/enrichr
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--enrichr.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/enrichr