純幹貨!最詳細的GSEA分析實戰及結果可視化教程!

2024年2月6日 17点热度 0人点赞

01 使用R包fgsea做GSEA分析

1.安裝

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("fgsea")

2.準備數據

輸入數據包括按某種指標(如FC)排序好的數字向量和被檢查的通路列表。

library(fgsea)
set.seed(42)
# Ranked List
> data(exampleRanks)
> head(exampleRanks)
   170942    109711     18124     12775     72148     16010 
-63.33703 -49.74779 -43.63878 -41.51889 -33.26039 -32.77626 
> is.vector(exampleRanks)
[1] TRUE
> length(exampleRanks)
[1] 12000
# checked PATHWAYS
> data(examplePathways)
> head(examplePathways, n=1)
$`1221633_Meiotic_Synapsis`
 [1] "12189"     "13006"     "15077"     "15078"     "15270"     "15512"    
 [7] "16905"     "16906"     "19357"     "20842"     "20843"     "20957"    
[13] "20962"     "21749"     "21750"     "22196"     "23856"     "24061"    
[19] "28113"     "50878"     "56739"     "57321"     "64009"     "66654"    
[25] "69386"     "71846"     "74075"     "77053"     "94244"     "97114"    
[31] "97122"     "97908"     "101185"    "140557"    "223697"    "260423"   
[37] "319148"    "319149"    "319150"    "319151"    "319152"    "319153"   
[43] "319154"    "319155"    "319156"    "319157"    "319158"    "319159"   
[49] "319160"    "319161"    "319565"    "320332"    "320558"    "326619"   
[55] "326620"    "360198"    "497652"    "544973"    "625328"    "667250"   
[61] "100041230" "102641229" "102641751" "102642045"
> is.list(examplePathways)
[1] TRUE
> length(examplePathways)
[1] 1457

使用data函數載入包內示例數據,並查看數據類型和值。自己的數據可以按照這個格式來構建變量。可以看到examplePathways有1457個帶檢查的通路,取前10個用於後續分析,加快示例分析速度。

checkedPathways <-  examplePathways[1:10]

3.GSEA分析

fgseaRes <- fgsea(pathways = checkedPathways, 
stats    = exampleRanks,
eps      = 1e-10, 
#  default 1e-10,lower bound for estimating P-values
# eps值越小,P值估計更準,但會降低運算速度
minSize  = 15, # Minimal size of a gene set to test
maxSize  = 500 # Maximal size of a gene set to test.
)
fgseaRes <- fgseaRes[order(pval), ]
> fgseaRes
                                                 pathway      pval      padj    log2err        ES
1:                              1221633_Meiotic_Synapsis 0.5481611 0.7204117 0.06767604 0.2885754
2: 1445146_Translocation_of_Glut4_to_the_Plasma_Membrane 0.7204117 0.7204117 0.05367696 0.2387284
         NES size                             leadingEdge
1: 0.9429806   27                 15270,12189,71846,19357
2: 0.8419156   39 17918,19341,20336,22628,22627,20619,...

可以看到結果中從輸入的10個通路隻剩下2個,隻是因為那8個通路的大小不滿足 in [minSize, maxSize]。結果中各字段的解釋可以查看函數幫助文檔或查看上期分析原理中的內容。

4.繪圖

plotEnrichment(
  checkedPathways[['1221633_Meiotic_Synapsis']],
  exampleRanks)   labs(title="Meiotic_Synapsis")

富集結果很差,NES<1, P值也很大,基因也未在排序列表的頂端或者底端富集

5.繪制表格圖

plotGseaTable(checkedPathways[c(1,4)], 
              exampleRanks, 
              fgseaRes,
              gseaParam = 0.5)

02 使用R包clusterProfiler做GSEA分析

clusterProfiler內置了KEGG、GO等功能通路的信息,在做此類分析是就不需要輸入功能通路的基因集信息了,常用的方法有三個 1)GSEA 通用分析函數 2) gseGO 3) gseKEGG, 函數詳細參數及其解釋如下:

# GSEA 通用基因集富集分析函數
GSEA(
geneList, # order ranked geneList
exponent = 1, # exponent weight of each step
minGSSize = 10,
maxGSSize = 500,
eps = 1e-10,  # low boundary for calculating the p value
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
gson = NULL,
TERM2GENE, # 通路名和基因名的對照表
TERM2NAME = NA,
verbose = TRUE,
seed = FALSE,
by = "fgsea", # 分析包 ’fgsea’ or ’DOSE’
...
)
# gseGO
gseGO(
geneList,
ont = "BP", # one of "BP", "MF", and "CC"OrgDb,
keyType = "ENTREZID", # keytype of gene
exponent = 1,
minGSSize = 10,
maxGSSize = 500,
eps = 1e-10,
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
verbose = TRUE,
seed = FALSE,
by = "fgsea",
...
)
# gseKEGG
gseKEGG(
geneList,
organism = "hsa",
keyType = "kegg",
exponent = 1,
minGSSize = 10,
maxGSSize = 500,
eps = 1e-10,
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
verbose = TRUE,
use_internal_data = FALSE, # logical, use KEGG.db or latest online KEGG data
seed = FALSE,
by = "fgsea",
...
)

1.安裝

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("clusterProfiler")
# 加載測試數據
BiocManager::install("DOSE")
# 人的id mapping db
BiocManager::install("org.Hs.eg.db")
# 繪圖
BiocManager::install("enrichplot")

2.準備數據

使用DOSE包的自帶數據geneList, geneList的names是ncbi geneID, values是表達值的差異倍數。如果是其他類型的ID,需要轉換成one of "kegg", 'ncbi-geneid', 'ncib-proteinid' and 'uniprot'。怎麼轉換ID可參考這篇文章 SCI繪圖-KEGG/GO富集分析。

> library(clusterProfiler)
> library(DOSE)
> library(enrichplot)
> library(org.Hs.eg.db)
> data(geneList)
> head(geneList)
    4312     8318    10874    55143    55388      991 
4.572613 4.514594 4.418218 4.144075 3.876258 3.677857 

3.KEGG-GSEA分析

gseKEGGRes <- gseKEGG(
        geneList,
        organism = 'hsa',
        keyType = 'ncbi-geneid',
        minGSSize = 10,
        maxGSSize = 500,
        pvalueCutoff = 0.05,
        pAdjustMethod = 'BH'
        )
gseKEGGResDf <- gseKEGGRes@result
head(gseKEGGResDf)

分析結果解讀:

gseKEGGResDf中每一行是一個KEGG通路的GSEA分析結果,其中前幾個參數看了上一期內容應該都知道,講下後面幾個。qvalue就是上期內容中說的FDR,要<0.25; rank就是gene在排序列表中的位置,leading_edge中Tags表示對富集分數有貢獻的基因的百分比,List表示在列表中的何處獲得富集分數, Signal表示富集信號強度;最後一列是富集的核心基因。

4.GO-GSEA分析

gseGORes <- gseGO(
  geneList,
  ont = 'BP',
  OrgDb = org.Hs.eg.db,
  keyType = "ENTREZID",
  minGSSize = 10,
  maxGSSize = 500,
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH"
)
gseGOResDf <- gseGORes@result
head(gseGOResDf)

分析結果和gseKEGG類似。

5.通用GSEA分析

在做通用GSEA分析前,需要準備自己感興趣的功能通路或基因集文件TERM2NAME(一個包含通路名和基因名稱的 2 列數據框),為了方便使用GO的通路來模擬以演示GSEA分析。

library(tidyverse)
gosubset <- gseGORes@geneSets[1:20]
godf <- data.frame(term = names(gosubset), name = I(gosubset), stringsAsFactors = FALSE)
term2name <- godf %>% unnest(name)
head(term2name )

準備成以上格式就可以啦!

腳本分析:

gseRes <- GSEA(
  geneList,
  minGSSize = 10,
  maxGSSize = 500,
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  TERM2GENE = term2name, # 這個很重要,別忘了!
  by = "fgsea"
)
gseResDf <-gseRes@result
head(gseResDf)

結果類型和前面的一樣,TERM2NAME中就取了GO中的20個通路來分析,所以最後保留下來的通路較少。

03 GSEA分析結果可視化

1.得分圖

gseaplot2(
  gseGORes, 
  geneSetID = 1, 
  title = gseGORes$Description[1]
  )

多個通路畫在一起

gseaplot2(gseGORes, geneSetID = 1:3)

修改顏色,顯示P值

gseaplot2(gseGORes,, geneSetID = 1:3,  
          pvalue_table = TRUE,
          color = c("#E495A5", "#86B875", "#7DB0DD")
          )

2.Dotplot

dotplot(gseGORes, showCategory=20) 
  ggtitle("dotplot for GSEA")

3.Ridgeline Plot

ridgeplot(gseKEGGRes, showCategory=20)