本期內容介紹如何使用R包clusterProfiler來做富集分析,以及用R包enrichplot來可視化富集分析結果。
1使用R包 clusterProfiler 富集分析
安裝R包 clusterProfiler
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("clusterProfiler")
數據準備
讀了上期內容的讀者,應該知道富集分析需要的數據有 1)差異表達的基因集合 2)對應物種的背景文件。這裡使用R中 DOSE 數據包 中的人的genList數據:
1)差異表達的基因集合
library(clusterProfiler)
# 加載 DOSE包中的 geneList數據集 [包含12495個基因及其?表達值(maybe)]
data(geneList, package="DOSE")
# 保留表達值大於2的基因作為富集分析的輸入基因集合
gene <- names(geneList)[abs(geneList) > 2]
# 這裡的gene ID是KEGG中一樣的ENTREZ ID(就是NCBI GeneID)
head(gene)
[1] "4312" "8318" "10874" "55143" "55388" "991"
clusterProfiler 包目前支持的輸入Gene ID類型僅包括kegg, ncbi-geneid, ncbi-proteinid 和uniprot。
其他類型的ID需要做ID轉換,clusterProfiler 包提供了 bitr() 和 bitr_kegg() 函數來轉換 ID 類型。轉換需要基因組註釋文件,在這裡:
https://bioconductor.org/packages/release/BiocViews.html#___OrgDb
安裝ID轉換文件:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
查看支持的ID 類型:
library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
[26] "UNIPROT"
2)人的背景文件
clusterProfiler 包支持 KEGG 數據庫中支持的所有物種,可以理解為包中內置了KEGG支持物種的背景文件。KEGG 支持的生物體的完整列表可通過http://www.genome.jp/kegg/catalog/org_list.html 訪問。
富集分析
KEGG
# 富集分析
kegg_ea <- enrichKEGG(gene = gene, # 差異表達基因集合
organism = 'hsa', # 選擇物種背景文件
pvalueCutoff = 0.05) # 顯著性篩選閾值
# 預覽結果
# head(kegg_ea)
## ID category
# hsa04110 Cellular Processes
# hsa04114 Cellular Processes
# hsa04218 Cellular Processes
# hsa04061 Environmental Information Processing
# hsa03320 Organismal Systems
# hsa04814 Cellular Processes
## subcategory
# Cell growth and death
# Cell growth and death
# Cell growth and death
# Signaling molecules and interaction
# Endocrine system
# Cell motility
## Description
# Cell cycle
# Oocyte meiosis
# Cellular senescence
# Viral protein interaction with cytokine and cytokine receptor
# PPAR signaling pathway
# Motor proteins
## GeneRatio BgRatio pvalue p.adjust
15/106 157/8644 6.156704e-10 1.311378e-07
10/106 131/8644 4.204441e-06 4.477729e-04
10/106 156/8644 1.980684e-05 1.406285e-03
8/106 100/8644 2.869719e-05 1.456955e-03
7/106 75/8644 3.420081e-05 1.456955e-03
10/106 193/8644 1.209583e-04 4.294020e-03
## qvalue Count
1.296148e-07 15
4.425727e-04 10
1.389953e-03 10
1.440034e-03 8
1.440034e-03 7
4.244151e-03 10
其中富集得分等於 GeneRatio/BgRatio,pvalue就是富集顯著性。Count是對應通路中富集到的基因個數,p.adjust是Bonferroni校正p值,Bonferroni 校正是一種嚴格的控制方法,但可能過於保守,qvalue則更側重於控制在拒絕原假設的情況下產生的假陽性比例。
GO
# 挑選前100個基因做GO富集
gene <- names(geneList)[1:100]
go_ea <- enrichGO( gene = gene,
universe = names(geneList), # background
OrgDb = org.Hs.eg.db,
ont = "ALL", # BP", "MF", and "CC
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
head(go_ea)
ONTOLOGY ID Description GeneRatio
GO:0098813 BP GO:0098813 nuclear chromosome segregation 33/196
GO:0007059 BP GO:0007059 chromosome segregation 37/196
GO:0000070 BP GO:0000070 mitotic sister chromatid segregation 27/196
GO:0000819 BP GO:0000819 sister chromatid segregation 29/196
GO:0140014 BP GO:0140014 mitotic nuclear division 31/196
GO:0000280 BP GO:0000280 nuclear division 36/196
BgRatio pvalue p.adjust qvalue
GO:0098813 236/11580 2.779610e-21 6.690617e-18 6.155976e-18
GO:0007059 316/11580 4.445593e-21 6.690617e-18 6.155976e-18
GO:0000070 151/11580 2.096509e-20 2.103497e-17 1.935409e-17
GO:0000819 184/11580 2.919034e-20 2.196573e-17 2.021047e-17
GO:0140014 224/11580 6.841060e-20 4.118318e-17 3.789227e-17
GO:0000280 325/11580 1.054467e-19 5.289911e-17 4.867200e-17
geneID
Count
GO:0098813 33
GO:0007059 37
GO:0000070 27
GO:0000819 29
GO:0140014 31
GO:0000280 36
2使用 enrichplot 對富集結果可視化
kegg
# 安裝 enrichplot
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("enrichplot")
# 導包
library(enrichplot)
# barplot
barplot(kegg_ea, showCategory=20)
# dotplot
dotplot(kegg_ea, showCategory=20)
# upsetplot
upsetplot(kegg_ea)
條形圖
氣泡圖
Upset圖
GO
# barplot
barplot(go_ea,
x = "GeneRatio", color = "p.adjust", # 默認參數
showCategory = 5, # 隻顯示前5
split="ONTOLOGY") # 以ONTOLOGY類型分開
facet_grid(ONTOLOGY~., scale='free') # 以ONTOLOGY類型分開繪圖
# dotplot
dotplot(go_ea, x = "GeneRatio", color = "p.adjust", size = "Count", # 默認參數
showCategory = 5,# 隻顯示前5
split="ONTOLOGY") #以ONTOLOGY類型分開
facet_grid(ONTOLOGY~., scale='free')
條形圖
氣泡圖
以上均采用了繪圖函數的默認參數,如需修改要展示的數據,xy軸標簽,標題,顏色等細節內容參考函數的幫助文檔!
3高級繪圖
展示富集分析結果還可以有:
本期就介紹到這裡啦!