本期內容介紹如何使用R包clusterProfiler來做富集分析，以及用R包enrichplot來可視化富集分析結果。

1使用R包 clusterProfiler 富集分析

安裝R包 clusterProfiler

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("clusterProfiler")

數據準備

讀了上期內容的讀者，應該知道富集分析需要的數據有 1）差異表達的基因集合 2）對應物種的背景文件。這裡使用R中 DOSE 數據包中的人的genList數據：

1）差異表達的基因集合

library(clusterProfiler)
# 加載 DOSE包中的 geneList數據集 [包含12495個基因及其?表達值(maybe)]
data(geneList, package="DOSE")
# 保留表達值大於2的基因作為富集分析的輸入基因集合
gene <- names(geneList)[abs(geneList) > 2]
# 這裡的gene ID是KEGG中一樣的ENTREZ ID(就是NCBI GeneID)
head(gene)
[1] "4312"  "8318"  "10874" "55143" "55388" "991"

clusterProfiler 包目前支持的輸入Gene ID類型僅包括kegg, ncbi-geneid, ncbi-proteinid 和uniprot。

其他類型的ID需要做ID轉換，clusterProfiler 包提供了 bitr() 和 bitr_kegg() 函數來轉換 ID 類型。轉換需要基因組註釋文件，在這裡：

https://bioconductor.org/packages/release/BiocViews.html#___OrgDb

安裝ID轉換文件：

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")

查看支持的ID 類型:

library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"

2）人的背景文件

clusterProfiler 包支持 KEGG 數據庫中支持的所有物種，可以理解為包中內置了KEGG支持物種的背景文件。KEGG 支持的生物體的完整列表可通過http://www.genome.jp/kegg/catalog/org_list.html 訪問。

富集分析

KEGG

# 富集分析
kegg_ea <- enrichKEGG(gene    = gene,  # 差異表達基因集合
                 organism     = 'hsa', # 選擇物種背景文件
                 pvalueCutoff = 0.05)  # 顯著性篩選閾值
# 預覽結果
# head(kegg_ea)
## ID         category
# hsa04110   Cellular Processes
# hsa04114   Cellular Processes
# hsa04218   Cellular Processes
# hsa04061   Environmental Information Processing
# hsa03320   Organismal Systems
# hsa04814   Cellular Processes
## subcategory 
# Cell growth and death
# Cell growth and death
# Cell growth and death
# Signaling molecules and interaction
# Endocrine system
# Cell motility
## Description
# Cell cycle
# Oocyte meiosis
# Cellular senescence
# Viral protein interaction with cytokine and cytokine receptor
# PPAR signaling pathway
# Motor proteins
## GeneRatio  BgRatio    pvalue       p.adjust
   15/106     157/8644   6.156704e-10 1.311378e-07 
   10/106     131/8644   4.204441e-06 4.477729e-04 
   10/106     156/8644   1.980684e-05 1.406285e-03 
   8/106      100/8644   2.869719e-05 1.456955e-03 
   7/106      75/8644    3.420081e-05 1.456955e-03 
   10/106     193/8644   1.209583e-04 4.294020e-03 
## qvalue             Count
   1.296148e-07       15
   4.425727e-04       10
   1.389953e-03       10
   1.440034e-03       8
   1.440034e-03       7
   4.244151e-03       10

其中富集得分等於 GeneRatio/BgRatio，pvalue就是富集顯著性。Count是對應通路中富集到的基因個數，p.adjust是Bonferroni校正p值，Bonferroni 校正是一種嚴格的控制方法，但可能過於保守，qvalue則更側重於控制在拒絕原假設的情況下產生的假陽性比例。

# 挑選前100個基因做GO富集
gene <- names(geneList)[1:100]
go_ea <- enrichGO( gene          = gene,
                   universe      = names(geneList), # background
                   OrgDb         = org.Hs.eg.db,
                   ont           = "ALL", # BP", "MF", and "CC
                   pAdjustMethod = "BH",
                   pvalueCutoff  = 0.01,
                   qvalueCutoff  = 0.05)
head(go_ea)
           ONTOLOGY         ID                          Description GeneRatio
GO:0098813       BP GO:0098813       nuclear chromosome segregation    33/196
GO:0007059       BP GO:0007059               chromosome segregation    37/196
GO:0000070       BP GO:0000070 mitotic sister chromatid segregation    27/196
GO:0000819       BP GO:0000819         sister chromatid segregation    29/196
GO:0140014       BP GO:0140014             mitotic nuclear division    31/196
GO:0000280       BP GO:0000280                     nuclear division    36/196
             BgRatio       pvalue     p.adjust       qvalue
GO:0098813 236/11580 2.779610e-21 6.690617e-18 6.155976e-18
GO:0007059 316/11580 4.445593e-21 6.690617e-18 6.155976e-18
GO:0000070 151/11580 2.096509e-20 2.103497e-17 1.935409e-17
GO:0000819 184/11580 2.919034e-20 2.196573e-17 2.021047e-17
GO:0140014 224/11580 6.841060e-20 4.118318e-17 3.789227e-17
GO:0000280 325/11580 1.054467e-19 5.289911e-17 4.867200e-17
                                                                                                                                                                                                             geneID
           Count
GO:0098813    33
GO:0007059    37
GO:0000070    27
GO:0000819    29
GO:0140014    31
GO:0000280    36

2使用 enrichplot 對富集結果可視化

kegg

# 安裝 enrichplot
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("enrichplot")
# 導包
library(enrichplot)
# barplot
barplot(kegg_ea, showCategory=20) 
# dotplot
dotplot(kegg_ea, showCategory=20)
# upsetplot
upsetplot(kegg_ea)

條形圖

氣泡圖

Upset圖

# barplot
barplot(go_ea, 
        x = "GeneRatio", color = "p.adjust", # 默認參數
        showCategory = 5,   # 隻顯示前5
        split="ONTOLOGY")   # 以ONTOLOGY類型分開
  facet_grid(ONTOLOGY~., scale='free') # 以ONTOLOGY類型分開繪圖
# dotplot
dotplot(go_ea, x = "GeneRatio", color = "p.adjust", size = "Count", # 默認參數
      showCategory = 5,# 隻顯示前5
      split="ONTOLOGY")   #以ONTOLOGY類型分開
facet_grid(ONTOLOGY~., scale='free')

條形圖

氣泡圖

以上均采用了繪圖函數的默認參數，如需修改要展示的數據，xy軸標簽，標題，顏色等細節內容參考函數的幫助文檔！

3高級繪圖

展示富集分析結果還可以有：

本期就介紹到這裡啦！

SCI繪圖-KEGG-GO富集分析

1使用R包 clusterProfiler 富集分析