R differential expression analysis - switching the intercept coefficient
I am trying to run a differential expression analysis with edgeR on a biological count dataset. My samples are split into cases and controls, and I want to know which genes are up- or down-regulated in the case samples (i.e. patients with the disease) relative to the controls. However, when using edgeR I hit a problem: the results for each gene are reported relative to the control samples rather than the cases. I can reproduce the problem in R with fake data.
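In the fake data the controls have lower count values than the case samples, so we would expect all genes to be up-regulated in the case samples: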
#First create the expression matrix
set.seed(101)
#create data so first 50 (the controls) have lower values than second 50 samples (those with the condition)
exprDat <- cbind(matrix(round((runif(500)/10)*100),ncol=50),
                 matrix(round((1-runif(500)/10)*100),ncol=50))
colnames(exprDat) <- paste0("sample_",1:100)
rownames(exprDat) <- paste0("gene_",1:10)
#Now create the annotation dataset
targets <- data.frame("group_sample"=colnames(exprDat),
                      "case_control"=as.factor(c(rep("Control",50),
                                                 rep("Case",50))))
#create the design matrix comparing case and control
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
                    group = targets[["case_control"]])
#normalise
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
#build linear model
fit <- edgeR::glmFit(y, design = design)
#test the comparison, coef=1 is the intercept
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table
So the logFC suggests that these genes are down-regulated in the control samples compared with the cases:
> pvals
              logFC   logCPM        LR      PValue
gene_1  -0.14418015 16.69933 2.4281485 0.119173587
gene_2  -0.03421562 16.69108 0.1422319 0.706072179
gene_3  -0.12961726 16.69159 1.9632930 0.161161580
gene_4  -0.17710527 16.68963 3.5894597 0.058147147
gene_5  -0.14551401 16.69491 2.4641372 0.116471640
gene_6   0.17585301 16.70497 4.1366713 0.041963611
gene_7  -0.05396444 16.69328 0.3514909 0.553270396
gene_8  -0.15662395 16.69380 2.8394354 0.091976525
gene_9  -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511
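Which comparison coef=2 tests depends on the column order of the design matrix, which in turn follows the order of the factor levels. A minimal check, rebuilding only the targets annotation from above with base R (no edgeR needed), might look like this:

```r
# Rebuild the annotation data frame from the example above
targets <- data.frame(
  group_sample = paste0("sample_", 1:100),
  case_control = as.factor(c(rep("Control", 50), rep("Case", 50)))
)
design <- model.matrix(~case_control, data = targets)

# Levels are sorted alphabetically, so the first (reference) level is "Case"
levels(targets$case_control)
# "Case" "Control"

# Column 2 is therefore the Control-versus-Case effect
colnames(design)
# "(Intercept)" "case_controlControl"
```

Under this design, coef=2 reports Control relative to Case, which matches the direction of the logFC values shown above.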
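At first I thought this wasn't a problem, because I could change the order of the factor levels in targets so that the design matrix would create a case_controlCase column. That would be the opposite comparison, meaning the p-values would be the same but the direction of the logFC would flip: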
#reorder levels in target
levels(targets$case_control) <- sort(levels(targets$case_control),
                                     decreasing=TRUE)
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
                    group = targets[["case_control"]])
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
fit <- edgeR::glmFit(y, design = design)
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table
Strangely, however, the genes are still reported relative to the controls:
> pvals
              logFC   logCPM        LR      PValue
gene_1  -0.14418015 16.69933 2.4281485 0.119173587
gene_2  -0.03421562 16.69108 0.1422319 0.706072179
gene_3  -0.12961726 16.69159 1.9632930 0.161161580
gene_4  -0.17710527 16.68963 3.5894597 0.058147147
gene_5  -0.14551401 16.69491 2.4641372 0.116471640
gene_6   0.17585301 16.70497 4.1366713 0.041963611
gene_7  -0.05396444 16.69328 0.3514909 0.553270396
gene_8  -0.15662395 16.69380 2.8394354 0.091976525
gene_9  -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511
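I don't know why this is still happening, since the design has changed! If anyone has any clue, that would be amazing, as this has been wrecking my head for a while. Alternatively, if anyone has a different way to flip the logFC so that it is relative to the case samples rather than the controls (i.e. making sure the control samples act as the intercept in the GLM), that would be great. Note that I know I could simply swap the sign in the results table, but that is something I really want to avoid; I would much rather understand what is wrong in the code above.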
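Finally, I should say that I don't think my problem is specific to edgeR; it is a general issue with using GLMs for differential analysis. Essentially, I just want to know how to switch the intercept coefficient using a GLM and a design matrix. For transparency, I have also posted this question on Biostars, a community site for bioinformatics analysis.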
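Since the question is framed as a general GLM issue, here is a small sketch using base R glm() on a single simulated gene. The Poisson counts and the variable names are hypothetical and are not part of the edgeR pipeline above. It illustrates that changing the reference level flips the sign of the group coefficient and leaves the Wald test unchanged:

```r
set.seed(101)
# Hypothetical single gene: 50 controls with low counts, 50 cases with higher counts
group  <- factor(c(rep("Control", 50), rep("Case", 50)))
counts <- c(rpois(50, lambda = 5), rpois(50, lambda = 20))

# Default level order is alphabetical, so "Case" is the reference (intercept)
fit_case_ref <- glm(counts ~ group, family = poisson())

# Make "Control" the reference level without touching the sample labels
group_ctrl_ref  <- relevel(group, ref = "Control")
fit_control_ref <- glm(counts ~ group_ctrl_ref, family = poisson())

# Same magnitude, opposite sign (natural-log fold change)
coef(fit_case_ref)["groupControl"]
coef(fit_control_ref)["group_ctrl_refCase"]

# The Wald p-value for the group effect is the same in both fits
summary(fit_case_ref)$coefficients[2, "Pr(>|z|)"]
summary(fit_control_ref)$coefficients[2, "Pr(>|z|)"]
```

The intercept is always the first factor level, so choosing the reference level is what decides which group the coefficient is expressed against.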
Session info:
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] gridExtra_2.3 reshape2_1.4.4 data.table_1.14.0 Hmisc_4.5-0
[5] Formula_1.2-4 survival_3.2-9 lattice_0.20-38 ggrepel_0.9.1
[9] viridis_0.6.0 viridisLite_0.4.0 cowplot_1.1.1 ggplot2_3.3.3
[13] qs_0.24.1 edgeR_3.28.1 limma_3.42.2 purrr_0.3.4
[17] magrittr_2.0.1 dplyr_1.0.6 SingleCellExperiment_1.8.0 SummarizedExperiment_1.16.1
[21] DelayedArray_0.12.3 BiocParallel_1.20.1 matrixStats_0.58.0 Biobase_2.46.0
[25] biomaRt_2.42.1 BSgenome_1.54.0 rtracklayer_1.46.0 Biostrings_2.54.0
[29] XVector_0.26.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 IRanges_2.20.2
[33] S4Vectors_0.24.4 BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] colorspace_2.0-1 ellipsis_0.3.2 htmlTable_2.1.0 base64enc_0.1-3
[5] rstudioapi_0.13 listenv_0.8.0 bit64_4.0.5 AnnotationDbi_1.48.0
[9] fansi_0.4.2 codetools_0.2-16 splines_3.6.0 cachem_1.0.4
[13] knitr_1.33 Rsamtools_2.2.3 cluster_2.0.8 dbplyr_2.1.1
[17] png_0.1-7 sctransform_0.3.2 BiocManager_1.30.12 compiler_3.6.0
[21] httr_1.4.2 backports_1.2.1 assertthat_0.2.1 Matrix_1.2-17
[25] fastmap_1.1.0 cli_2.5.0 htmltools_0.5.1.1 prettyunits_1.1.1
[29] tools_3.6.0 gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.2
[33] rappdirs_0.3.3 Rcpp_1.0.6 vctrs_0.3.7 xfun_0.22
[37] stringr_1.4.0 globals_0.14.0 lifecycle_1.0.0 pacman_0.5.1
[41] XML_3.99-0.3 future_1.21.0 zlibbioc_1.32.0 MASS_7.3-51.4
[45] scales_1.1.1 hms_1.0.0 RColorBrewer_1.1-2 yaml_2.2.1
[49] curl_4.3.1 memoise_2.0.0 rpart_4.1-15 latticeExtra_0.6-29
[53] stringi_1.5.3 RSQLite_2.2.4 checkmate_2.0.0 rlang_0.4.11
[57] pkgconfig_2.0.3 bitops_1.0-7 evaluate_0.14 GenomicAlignments_1.22.1
[61] htmlwidgets_1.5.3 bit_4.0.4 tidyselect_1.1.1 parallelly_1.25.0
[65] plyr_1.8.6 R6_2.5.0 generics_0.1.0 DBI_1.1.1
[69] pillar_1.6.0 foreign_0.8-71 withr_2.4.2 RCurl_1.98-1.3
[73] nnet_7.3-12 tibble_3.1.1 future.apply_1.7.0 crayon_1.4.1
[77] utf8_1.2.1 BiocFileCache_1.10.2 RApiSerialize_0.1.0 rmarkdown_2.7
[81] jpeg_0.1-8.1 progress_1.2.2 locfit_1.5-9.4 grid_3.6.0
[85] blob_1.2.1 infotheo_1.2.0 digest_0.6.27 openssl_1.4.4
[89] RcppParallel_5.0.3 munsell_0.5.0 stringfish_0.15.0 askpass_1.1
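You are renaming the factor levels rather than releveling the factor. Assigning to levels() only relabels the samples, so the design matrix stays numerically the same. To fix this, set the reference level of targets$case_control explicitly, for example with relevel().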
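The difference can be seen on a toy factor. This is a minimal sketch with a hypothetical six-sample factor grp standing in for targets$case_control, using base R only:

```r
# Hypothetical mini example: 3 controls followed by 3 cases
grp <- factor(c(rep("Control", 3), rep("Case", 3)))

# Assigning to levels() relabels the samples: the first three become "Case"
renamed <- grp
levels(renamed) <- sort(levels(renamed), decreasing = TRUE)
as.character(renamed)

# relevel() changes only the reference level; sample labels stay intact
releveled <- relevel(grp, ref = "Control")
as.character(releveled)

design_original  <- model.matrix(~grp)
design_renamed   <- model.matrix(~renamed)
design_releveled <- model.matrix(~releveled)

# Renaming changes the column name but not the numbers in the design
all(unname(design_original) == unname(design_renamed))

# After relevel(), column 2 is the Case-versus-Control indicator
colnames(design_releveled)
design_releveled[, 2]
```

Applied to the question's data, the equivalent call would be targets$case_control <- relevel(targets$case_control, ref = "Control") before rebuilding the design matrix, after which coef=2 should correspond to case_controlCase.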
Of course!! I can't believe I missed this. Thank you so much, this has been bugging me for far too long!