R 微分表达式分析——开关截距系数

R 微分表达式分析——开关截距系数,r,matrix,bioinformatics,glm,rna-seq,R,Matrix,Bioinformatics,Glm,Rna Seq,我试图使用edgeR对一个生物计数数据集进行差异表达分析。我的样本分为病例组和对照组,我想知道病例组样本(即患有该疾病的患者)和对照组中上调或下调的基因。然而,当使用edgeR时,我遇到了一个问题,即当前基因的结果与对照样本相关,而不是与案例相关。我可以用假数据在R中重现问题 假数据在对照组中的计数值低于病例样本,因此我们预计病例样本中的所有基因都会上调: #First create the expression matrix set.seed(101) #create data so firs

我试图使用edgeR对一个生物计数数据集进行差异表达分析。我的样本分为病例组和对照组,我想知道病例组样本(即患有该疾病的患者)和对照组中上调或下调的基因。然而,当使用
edgeR
时,我遇到了一个问题,即当前基因的结果与对照样本相关,而不是与案例相关。我可以用假数据在R中重现问题

假数据在对照组中的计数值低于病例样本,因此我们预计病例样本中的所有基因都会上调:

#First create the expression matrix
set.seed(101)
#create data so first 50 (the controls) have lower values than second 50 samples (those with the condition)
exprDat <- cbind(matrix(round((runif(500)/10)*100),ncol=50),
                    matrix(round((1-runif(500)/10)*100),ncol=50))
colnames(exprDat) <- paste0("sample_",1:100)
rownames(exprDat) <- paste0("gene_",1:10)

#Now create the annotation dataset
targets <- data.frame("group_sample"=colnames(exprDat),
                      "case_control"=as.factor(c(rep("Control",50),
                                                        rep("Case",50))))
#create the design matrix comparing case and control
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
                    group = targets[["case_control"]])
#normalise
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
#build linear model
fit <- edgeR::glmFit(y, design = design)
#test the comparison, coef=1 is the intercept
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table
因此,logFC表明,与病例相比,对照样本中的这些基因表达下调:

> pvals
              logFC   logCPM        LR      PValue
gene_1  -0.14418015 16.69933 2.4281485 0.119173587
gene_2  -0.03421562 16.69108 0.1422319 0.706072179
gene_3  -0.12961726 16.69159 1.9632930 0.161161580
gene_4  -0.17710527 16.68963 3.5894597 0.058147147
gene_5  -0.14551401 16.69491 2.4641372 0.116471640
gene_6   0.17585301 16.70497 4.1366713 0.041963611
gene_7  -0.05396444 16.69328 0.3514909 0.553270396
gene_8  -0.15662395 16.69380 2.8394354 0.091976525
gene_9  -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511
起初,我认为这不是一个问题,因为我可以改变
目标中的因子顺序
,因此设计矩阵将创建一个
case\u controlCase
,这将是相反的比较,这意味着p值将是相同的,但logFC的方向将翻转:

#reorder levels in target
levels(targets$case_control) <- sort(levels(targets$case_control),
                                        decreasing=TRUE)
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
                    group = targets[["case_control"]])
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
fit <- edgeR::glmFit(y, design = design)
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table
然而,奇怪的是,这些基因仍然与对照组相关:

> pvals
              logFC   logCPM        LR      PValue
gene_1  -0.14418015 16.69933 2.4281485 0.119173587
gene_2  -0.03421562 16.69108 0.1422319 0.706072179
gene_3  -0.12961726 16.69159 1.9632930 0.161161580
gene_4  -0.17710527 16.68963 3.5894597 0.058147147
gene_5  -0.14551401 16.69491 2.4641372 0.116471640
gene_6   0.17585301 16.70497 4.1366713 0.041963611
gene_7  -0.05396444 16.69328 0.3514909 0.553270396
gene_8  -0.15662395 16.69380 2.8394354 0.091976525
gene_9  -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511
我不知道为什么这种情况还在发生,因为
设计
已经改变了!如果有人有任何线索,这将是惊人的,因为这已经破坏了我的头一段时间!或者,如果任何人有不同的方式翻转logFC,使其与案例样本而不是控制相关(即确保控制样本作为GLM中的截距),那就太好了。需要注意的是,我知道我可以在结果表中交换符号,但这是我真正想要避免的事情,我更愿意理解上面代码中的错误

最后,我要声明的是,我认为我的问题并不特定于edgeR,而只是使用GLM进行差异分析的一般问题。基本上,我只想知道如何使用GLM和设计矩阵交换截距系数。为清楚起见,我还将此信息发布到Biostars,一个特定的生物分析社区网站:

会话信息:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gridExtra_2.3               reshape2_1.4.4              data.table_1.14.0           Hmisc_4.5-0                
 [5] Formula_1.2-4               survival_3.2-9              lattice_0.20-38             ggrepel_0.9.1              
 [9] viridis_0.6.0               viridisLite_0.4.0           cowplot_1.1.1               ggplot2_3.3.3              
[13] qs_0.24.1                   edgeR_3.28.1                limma_3.42.2                purrr_0.3.4                
[17] magrittr_2.0.1              dplyr_1.0.6                 SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.1
[21] DelayedArray_0.12.3         BiocParallel_1.20.1         matrixStats_0.58.0          Biobase_2.46.0             
[25] biomaRt_2.42.1              BSgenome_1.54.0             rtracklayer_1.46.0          Biostrings_2.54.0          
[29] XVector_0.26.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.1         IRanges_2.20.2             
[33] S4Vectors_0.24.4            BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] colorspace_2.0-1         ellipsis_0.3.2           htmlTable_2.1.0          base64enc_0.1-3         
 [5] rstudioapi_0.13          listenv_0.8.0            bit64_4.0.5              AnnotationDbi_1.48.0    
 [9] fansi_0.4.2              codetools_0.2-16         splines_3.6.0            cachem_1.0.4            
[13] knitr_1.33               Rsamtools_2.2.3          cluster_2.0.8            dbplyr_2.1.1            
[17] png_0.1-7                sctransform_0.3.2        BiocManager_1.30.12      compiler_3.6.0          
[21] httr_1.4.2               backports_1.2.1          assertthat_0.2.1         Matrix_1.2-17           
[25] fastmap_1.1.0            cli_2.5.0                htmltools_0.5.1.1        prettyunits_1.1.1       
[29] tools_3.6.0              gtable_0.3.0             glue_1.4.2               GenomeInfoDbData_1.2.2  
[33] rappdirs_0.3.3           Rcpp_1.0.6               vctrs_0.3.7              xfun_0.22               
[37] stringr_1.4.0            globals_0.14.0           lifecycle_1.0.0          pacman_0.5.1            
[41] XML_3.99-0.3             future_1.21.0            zlibbioc_1.32.0          MASS_7.3-51.4           
[45] scales_1.1.1             hms_1.0.0                RColorBrewer_1.1-2       yaml_2.2.1              
[49] curl_4.3.1               memoise_2.0.0            rpart_4.1-15             latticeExtra_0.6-29     
[53] stringi_1.5.3            RSQLite_2.2.4            checkmate_2.0.0          rlang_0.4.11            
[57] pkgconfig_2.0.3          bitops_1.0-7             evaluate_0.14            GenomicAlignments_1.22.1
[61] htmlwidgets_1.5.3        bit_4.0.4                tidyselect_1.1.1         parallelly_1.25.0       
[65] plyr_1.8.6               R6_2.5.0                 generics_0.1.0           DBI_1.1.1               
[69] pillar_1.6.0             foreign_0.8-71           withr_2.4.2              RCurl_1.98-1.3          
[73] nnet_7.3-12              tibble_3.1.1             future.apply_1.7.0       crayon_1.4.1            
[77] utf8_1.2.1               BiocFileCache_1.10.2     RApiSerialize_0.1.0      rmarkdown_2.7           
[81] jpeg_0.1-8.1             progress_1.2.2           locfit_1.5-9.4           grid_3.6.0              
[85] blob_1.2.1               infotheo_1.2.0           digest_0.6.27            openssl_1.4.4           
[89] RcppParallel_5.0.3       munsell_0.5.0            stringfish_0.15.0        askpass_1.1    

您正在重命名因子级别,而不是重新设置因子级别。要解决此问题,请尝试:


目标当然是$case\u控制!!真不敢相信我错过了这个。非常感谢,这让我烦了太久了!
> pvals
              logFC   logCPM        LR      PValue
gene_1  -0.14418015 16.69933 2.4281485 0.119173587
gene_2  -0.03421562 16.69108 0.1422319 0.706072179
gene_3  -0.12961726 16.69159 1.9632930 0.161161580
gene_4  -0.17710527 16.68963 3.5894597 0.058147147
gene_5  -0.14551401 16.69491 2.4641372 0.116471640
gene_6   0.17585301 16.70497 4.1366713 0.041963611
gene_7  -0.05396444 16.69328 0.3514909 0.553270396
gene_8  -0.15662395 16.69380 2.8394354 0.091976525
gene_9  -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gridExtra_2.3               reshape2_1.4.4              data.table_1.14.0           Hmisc_4.5-0                
 [5] Formula_1.2-4               survival_3.2-9              lattice_0.20-38             ggrepel_0.9.1              
 [9] viridis_0.6.0               viridisLite_0.4.0           cowplot_1.1.1               ggplot2_3.3.3              
[13] qs_0.24.1                   edgeR_3.28.1                limma_3.42.2                purrr_0.3.4                
[17] magrittr_2.0.1              dplyr_1.0.6                 SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.1
[21] DelayedArray_0.12.3         BiocParallel_1.20.1         matrixStats_0.58.0          Biobase_2.46.0             
[25] biomaRt_2.42.1              BSgenome_1.54.0             rtracklayer_1.46.0          Biostrings_2.54.0          
[29] XVector_0.26.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.1         IRanges_2.20.2             
[33] S4Vectors_0.24.4            BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] colorspace_2.0-1         ellipsis_0.3.2           htmlTable_2.1.0          base64enc_0.1-3         
 [5] rstudioapi_0.13          listenv_0.8.0            bit64_4.0.5              AnnotationDbi_1.48.0    
 [9] fansi_0.4.2              codetools_0.2-16         splines_3.6.0            cachem_1.0.4            
[13] knitr_1.33               Rsamtools_2.2.3          cluster_2.0.8            dbplyr_2.1.1            
[17] png_0.1-7                sctransform_0.3.2        BiocManager_1.30.12      compiler_3.6.0          
[21] httr_1.4.2               backports_1.2.1          assertthat_0.2.1         Matrix_1.2-17           
[25] fastmap_1.1.0            cli_2.5.0                htmltools_0.5.1.1        prettyunits_1.1.1       
[29] tools_3.6.0              gtable_0.3.0             glue_1.4.2               GenomeInfoDbData_1.2.2  
[33] rappdirs_0.3.3           Rcpp_1.0.6               vctrs_0.3.7              xfun_0.22               
[37] stringr_1.4.0            globals_0.14.0           lifecycle_1.0.0          pacman_0.5.1            
[41] XML_3.99-0.3             future_1.21.0            zlibbioc_1.32.0          MASS_7.3-51.4           
[45] scales_1.1.1             hms_1.0.0                RColorBrewer_1.1-2       yaml_2.2.1              
[49] curl_4.3.1               memoise_2.0.0            rpart_4.1-15             latticeExtra_0.6-29     
[53] stringi_1.5.3            RSQLite_2.2.4            checkmate_2.0.0          rlang_0.4.11            
[57] pkgconfig_2.0.3          bitops_1.0-7             evaluate_0.14            GenomicAlignments_1.22.1
[61] htmlwidgets_1.5.3        bit_4.0.4                tidyselect_1.1.1         parallelly_1.25.0       
[65] plyr_1.8.6               R6_2.5.0                 generics_0.1.0           DBI_1.1.1               
[69] pillar_1.6.0             foreign_0.8-71           withr_2.4.2              RCurl_1.98-1.3          
[73] nnet_7.3-12              tibble_3.1.1             future.apply_1.7.0       crayon_1.4.1            
[77] utf8_1.2.1               BiocFileCache_1.10.2     RApiSerialize_0.1.0      rmarkdown_2.7           
[81] jpeg_0.1-8.1             progress_1.2.2           locfit_1.5-9.4           grid_3.6.0              
[85] blob_1.2.1               infotheo_1.2.0           digest_0.6.27            openssl_1.4.4           
[89] RcppParallel_5.0.3       munsell_0.5.0            stringfish_0.15.0        askpass_1.1