R: How do I correctly filter a gene expression matrix based on correlation values?


I have preprocessed Affymetrix microarray gene expression data (32830 probesets in rows, 735 RNA samples in columns). Here is what my expression matrix looks like:

> exprs_mat[1:6, 1:4]
             Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04 Tarca_005_P1A05
1_at                6.062215        6.125023        5.875502        6.126131
10_at               3.796484        3.805305        3.450245        3.628411
100_at              5.849338        6.191562        6.550525        6.421877
1000_at             3.567779        3.452524        3.316134        3.432451
10000_at            6.166815        5.678373        6.185059        5.633757
100009613_at        4.443027        4.773199        4.393488        4.623783

I also have the phenodata for this Affymetrix expression data (RNA sample identifiers in rows, sample descriptions in columns).

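Because the sample identifiers are in the rows of the phenodata, I need a way to match the sample IDs in the phenodata with the sample IDs (column names) in the expression matrix exprs_mat.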
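One way to do this matching, sketched under the assumption that the phenodata is a data frame whose row names are the sample IDs (called `pheno` here, a hypothetical name), is to take the intersection of the IDs and index both objects by it. The expression values are taken from the printout above; the GA values and the extra sample are made up for illustration.

```r
# Toy expression matrix: probesets in rows, samples in columns.
# matrix() fills column by column, so each line below is one sample.
exprs_mat <- matrix(
  c(6.062215, 3.796484, 5.849338,
    6.125023, 3.805305, 6.191562,
    5.875502, 3.450245, 6.550525),
  nrow = 3,
  dimnames = list(c("1_at", "10_at", "100_at"),
                  c("Tarca_001_P1A01", "Tarca_003_P1A03", "Tarca_004_P1A04"))
)

# Toy phenodata (made-up GA values): sample IDs as row names,
# in a different order than the matrix, plus one sample with no expression data.
pheno <- data.frame(
  GA = c(21.7, 11.0, 15.3, 30.2),
  row.names = c("Tarca_004_P1A04", "Tarca_001_P1A01",
                "Tarca_003_P1A03", "Tarca_999_X")
)

# Samples present in both objects, in the column order of the matrix
common <- intersect(colnames(exprs_mat), rownames(pheno))

exprs_aligned <- exprs_mat[, common, drop = FALSE]
pheno_aligned <- pheno[common, , drop = FALSE]

# Sanity check: both objects now describe the same samples in the same order
stopifnot(identical(colnames(exprs_aligned), rownames(pheno_aligned)))
```

Indexing by name rather than by position means the result does not depend on how either object happened to be sorted, and samples missing from one side are dropped instead of silently misaligned.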

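Goal:

I want to filter the genes in the expression matrix by measuring the correlation between each gene and the target profile data in the phenodata. Below is my initial attempt, but I am not sure it is correct: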

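Update: my implementation in R:

I want to see how each gene's expression across the samples correlates with the GA value of the corresponding samples in the annotation data. Here is my simple R function for computing this correlation: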

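getPCC <- function(expr_mat, anno_mat, verbose = FALSE) {
  # expr_mat: numeric matrix, probesets in rows, samples in columns
  # anno_mat: data frame with sample IDs as row names and a GA column
  stopifnot(is.matrix(expr_mat))
  stopifnot(is.data.frame(anno_mat), "GA" %in% colnames(anno_mat))
  # Keep only the samples present in both objects, in the same order
  common <- intersect(colnames(expr_mat), rownames(anno_mat))
  stopifnot(length(common) >= 3)
  if (verbose) message(length(common), " samples matched")
  expr_sub <- expr_mat[, common, drop = FALSE]
  ga <- anno_mat[common, "GA"]
  # Pearson correlation of every probeset (row) with GA across samples;
  # cor() works on columns, hence the transpose
  pcc <- stats::cor(t(expr_sub), ga, method = "pearson")[, 1]
  final_df <- data.frame(ID = rownames(expr_sub), cor = unname(pcc),
                         stringsAsFactors = FALSE)
  # Sort by decreasing absolute correlation
  final_df <- final_df[order(abs(final_df$cor), decreasing = TRUE), ]
  rownames(final_df) <- NULL
  return(final_df)
}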

Does something like the following help?
library(tidyverse)

x <- data.frame(stringsAsFactors=FALSE,
     Levels = c("1_at", "10_at", "100_at", "1000_at", "10000_at", "100009613_at"),
     Tarca_001_P1A01 = c(6.062215, 3.796484, 5.849338, 3.567779, 6.166815,
                           4.443027),
     Tarca_003_P1A03 = c(6.125023, 3.805305, 6.191562, 3.452524, 5.678373,
                           4.773199),
     Tarca_004_P1A04 = c(5.875502, 3.450245, 6.550525, 3.316134, 6.185059,
                           4.393488),
     Tarca_005_P1A05 = c(6.126131, 3.628411, 6.421877, 3.432451, 5.633757,
                           4.623783)
     )


y <- data.frame(stringsAsFactors=FALSE,
     gene = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
              "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
     SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
                    "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
     GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1),
     Batch = c(1, 1, 1, 1, 1, 1),
     Set = c("PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA")
     )



x %>% gather(SampleID, value, -Levels) %>% 
  left_join(., y, by = "SampleID") %>% 
  group_by(SampleID) %>% 
  filter(value == max(value)) %>% 
  spread(SampleID, value)

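How do you want to filter? High correlation, low correlation? Also, note that the colnames in expr_mat are not in the same order as Sample_ID in pheno (you may want to match them first).

@PoGibas I just updated my post. Before computing the correlation between the genes in the expression matrix and the phenodata, I need to match the sample IDs in expr_mat and phenodata. Any idea how to correct the approach above? Thank you.

@PoGibas I updated my post with my code. Any thoughts?

@Jerry Can you show me a sample output?

@cephalopod The sample output should have the same format as the expr_mat matrix, and it should contain the filtered genes that have high correlation values.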
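A minimal sketch of one way to produce output in that shape, assuming the matrix columns have already been put in the same sample order as the GA vector. The helper name `filter_by_cor` and the `cutoff` of 0.8 are hypothetical choices for illustration, not something given in the thread, and the toy values are made up.

```r
# Hypothetical helper: keep probesets whose absolute Pearson correlation
# with `trait` is at least `cutoff`.
# Assumes the columns of `expr_mat` are in the same sample order as `trait`.
filter_by_cor <- function(expr_mat, trait, cutoff = 0.8) {
  stopifnot(is.matrix(expr_mat), is.numeric(trait),
            ncol(expr_mat) == length(trait))
  # cor() correlates columns, so transpose to put probesets in columns
  r <- stats::cor(t(expr_mat), trait, method = "pearson")[, 1]
  # Constant rows give NA correlations; treat them as not passing
  keep <- !is.na(r) & abs(r) >= cutoff
  expr_mat[keep, , drop = FALSE]
}

# Toy data (made-up values): 3 probesets x 5 samples
ga <- c(11, 15, 22, 27, 31)
toy <- rbind(
  up_at   = c(1, 2, 3, 4, 5),  # rises with ga
  down_at = c(5, 4, 3, 2, 1),  # falls with ga
  flat_at = c(3, 1, 3, 1, 3)   # essentially unrelated to ga
)
colnames(toy) <- paste0("S", 1:5)

filtered <- filter_by_cor(toy, ga, cutoff = 0.8)
rownames(filtered)  # up_at and down_at pass; flat_at is dropped
```

Using the absolute value keeps both strongly positive and strongly negative genes; to keep only one direction, compare `r` itself against the cutoff instead. The result is a submatrix with the same columns as the input, so it can be used wherever the full expression matrix was.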