R: Processing data in a loop

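Below, I run a loop over the provided sample data that adds the necessary items to geneexptest and then processes it further inside the loop. When building dfs, however, I want the endpoint of each run to be data.frame(geneexptotal, …), as shown. The problem is that the loop appears to stop at geneexptestapp and writes that to dfs on every round. Please let me know how to include the rest of the loop in the output.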

gex <- data.frame("sample" =  c("BIX","HEF","TUR","ZOP","VAG","JUF","FED","MEQ","YIF","HRB","LOP","LIX","COT","DRP","KFC","TUY","DOG","KEX","RAV","UEH"), 
                  "TCGA-F4-6703-01" = runif(20, -1, 1),
                  "TCGA-DM-A28E-01" = runif(20, -1, 1),
                  "TCGA-AY-6197-01" = runif(20, -1, 1),
                  "TCGA-A6-5657-01" = runif(20, -1, 1))
colnames(gex) <- gsub("[.]", "_",colnames(gex))

listx <- c("TCGA_DM_A28E_01","TCGA_A6_5657_01")

mxy <- data.frame("TCGA-AD-6963-01" = runif(20, -1, 1),
                  "TCGA-AA-3663-11" = runif(20, -1, 1),
                  "TCGA-AD-6901-01" = runif(20, -1, 1),
                  "TCGA-AZ-2511-01" = runif(20, -1, 1),
                  "TCGA-A6-A567-01" = runif(20, -1, 1)) 

colnames(mxy) <- gsub("[.]", "_",colnames(mxy))

# z-score of x against the mean and standard deviation of y (y is a one-row data frame here)
zScore <- function(x,y)((as.numeric(x) - as.numeric(rowMeans(y,na.rm=TRUE)))/as.numeric(sd(y,na.rm=TRUE)))

    dfs <- lapply(listx, function(colName) {
      do.call(rbind, lapply(seq(nrow(mxy)), function(i) {
        zvalues <- zScore(gex[i,colName], mxy[i,])
        geneexptest <- data.frame(gex$sample[i], zvalues, row.names = NULL, stringsAsFactors = TRUE)
        geneexptest$zvalues <- as.numeric(as.character(geneexptest$zvalues))
        is.na(geneexptest) <- sapply(geneexptest, is.infinite)
        geneexptestapp <- na.omit(geneexptest)
        geneexptestorder <- geneexptestapp[order(geneexptestapp$zvalues, decreasing = FALSE, na.last = NA), ]
        geneexpa <- geneexptestorder[1:((0.05)*nrow(geneexptest)),]
        geneexpz <- geneexptestorder[(nrow(geneexptestorder)-((0.05)*nrow(geneexptest))):nrow(geneexptestorder),]
        geneexptotal <- rbind(geneexpa, geneexpz)
        data.frame(geneexptotal$gex.sample, row.names = NULL, stringsAsFactors = TRUE)
      }))
    })

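Your code currently runs fine. You get unexpected output because of the way the data is manipulated inside the loop. I modified your code slightly to improve readability. I made two new functions, fun1 and fun2: fun2 is your inner function and fun1 is the outer one. fun2 takes colName as an argument so that it can be passed through.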

fun2 = function(i,colName) {
  zvalues <- zScore(gex[i,colName], mxy[i,])
  geneexptest <- data.frame(gex$sample[i], zvalues, row.names = NULL, stringsAsFactors = TRUE)
  geneexptest$zvalues <- as.numeric(as.character(geneexptest$zvalues))
  is.na(geneexptest) <- sapply(geneexptest, is.infinite)
  geneexptestapp <- na.omit(geneexptest)
  geneexptestorder <- geneexptestapp[order(geneexptestapp$zvalues, decreasing = FALSE, na.last = NA), ]
  geneexpa <- geneexptestorder[1:((0.05)*nrow(geneexptest)),]
  geneexpz <- geneexptestorder[(nrow(geneexptestorder)-((0.05)*nrow(geneexptest))):nrow(geneexptestorder),]
  geneexptotal <- rbind(geneexpa, geneexpz)
  data.frame(geneexptotal, row.names = NULL, stringsAsFactors = TRUE)
}

fun1 = function(colName) {
  do.call(rbind, lapply(seq(nrow(mxy)), fun2, colName=colName))
}

dfs <- lapply(listx, fun1)

Next come some checks that nothing is infinite; anything infinite is removed. Now we have:

> geneexptestapp
  gex.sample.i.    zvalues
1           BIX -0.6955057
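Now a one-row data frame is sorted, so nothing changes. The problem here is that nrow(geneexptest) = 1. For geneexpa you are asking for rows 1:0.05, which is the same as 1, and for geneexpz you are asking for rows 0.95:1, which is 0.95. There are no fractional rows. This results in: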

> geneexpa;geneexpz
  gex.sample.i.    zvalues
1           BIX -0.6955057
[1] gex.sample.i. zvalues      
<0 rows> (or 0-length row.names)
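
The indexing behaviour described above can be checked in isolation. This is only a minimal sketch: the data frame one_row is a made-up stand-in for the single-row geneexptestorder, not the original data.

```r
# Hypothetical one-row data frame standing in for geneexptestorder
one_row <- data.frame(sample = "BIX", zvalues = -0.6955057)

n <- nrow(one_row)                # 1

# 1:0.05 counts down from 1 in steps of 1, so the sequence is just 1
idx_a <- 1:(0.05 * n)

# 0.95:1 starts at 0.95; the next value (1.95) would overshoot 1, so it is just 0.95
idx_z <- (n - 0.05 * n):n

# Fractional indices are truncated toward zero: 0.95 becomes 0, which selects no rows
rows_a <- one_row[idx_a, ]
rows_z <- one_row[idx_z, ]

nrow(rows_a)
nrow(rows_z)
```

Because the slicing happens inside the per-row function, every call sees a single row, so the "top" slice is always that row and the "bottom" slice is always empty.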
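Pass the function getExtremeValues a data frame x whose second column holds the z-values, together with the proportion p to take from the top and from the bottom (0.05 by default). It returns the entries of the first column whose z-values fall in the top and bottom p fraction.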

How to make all of this work:

fun2 = function(i,colName) {
  zvalues <- zScore(gex[i,colName], mxy[i,])
  geneexptest <- data.frame(gex$sample[i], zvalues, row.names = NULL, stringsAsFactors = TRUE)
  geneexptest$zvalues <- as.numeric(as.character(geneexptest$zvalues))
  is.na(geneexptest) <- sapply(geneexptest, is.infinite)
  return(na.omit(geneexptest))
}

fun1 = function(colName) {
  getExtremeValues(do.call(rbind, lapply(seq(nrow(mxy)), fun2, colName=colName)))
}

dfs <- lapply(listx, fun1)

With 20 samples, 1 falls in the top 5% and 1 in the bottom 5%, and listx holds two column names, so 4 samples are returned in total.
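
The count above can be reproduced on a small toy table. This is a sketch under stated assumptions: the data frame toy, its sample names S1 to S20, and its z-values are invented for illustration and are not the values from the question.

```r
# Hypothetical table of 20 samples with hand-picked z-values
toy <- data.frame(
  sample  = paste0("S", 1:20),
  zvalues = c(0.4, -1.3, 0.9, 2.2, -0.1, 0.7, -2.5, 1.1, 0.0, -0.8,
              1.6, -0.3, 0.2, -1.7, 0.5, 1.9, -0.6, 0.3, -1.0, 1.3),
  stringsAsFactors = FALSE
)

p <- 0.05
n_each <- ceiling(nrow(toy) * p)           # ceiling(20 * 0.05) = 1 row from each end

ranked  <- toy$sample[order(toy$zvalues)]  # sample names by ascending z-value
lowest  <- head(ranked, n_each)            # most negative z-value
highest <- tail(ranked, n_each)            # most positive z-value
picked  <- c(lowest, highest)

# Two column names in listx, two samples per column
total_returned <- length(picked) * 2

picked
total_returned
```

Here picked holds S7 (z = -2.5) and S4 (z = 2.2), and total_returned is 4, matching the reasoning above.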

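Comments:

Can you post your expected output?

Thanks for your reply. I updated the OP to extend the sample data. Hope that helps. The aim here is to extract the sample items that correspond to the top and bottom 5% of Z-values. I hope that makes sense.

In fun2, I tried to generate geneexpa and geneexpz after all rows had been bound into geneexptest. I think that is where the problem lies.

Henry - are you trying to get the 5% extreme values from the samples, e.g. BIX, HEF…?

I updated the OP. In the last line I am trying to extract geneexptotal$gex.sample, not just geneexptotal. So for the first column name, I want to extract the top and bottom 5% of the most negative and most positive Z-values from the total, which means 8 values, and finally extract only the sample names corresponding to those 8 rows. Then move on to the next column name. So at the end of dfs I want two lists of samples, the first corresponding to TCGA_DM_A28E_01 and the second to TCGA_A6_5657_01. I hope this clarifies things.
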
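# Return the names (column 1) whose z-values (column 2) lie in the bottom and top fraction p
getExtremeValues = function(x,p=0.05){
  z = x[,2]                   # z-values
  n = ceiling(nrow(x)*p)      # number of rows to take from each end, rounded up
  r = x[order(z),1]           # names sorted by ascending z-value
  # the n lowest first, then the n highest (highest first)
  return(as.character(r[c(1:n,length(r):(length(r)-n+1))]))
}
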
> dfs
[[1]]
[1] "BIX" "TUY"

[[2]]
[1] "BIX" "TUR"
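
As an alternative to calling a function once per row, the z-scores for a whole column can be computed in one step and then ranked. The following is a rough sketch, not the answer's method: the helper extreme_samples, the toy tables gex_toy and mxy_toy, and their column names are all hypothetical, and the z-score is assumed to be the row-wise (value - mean) / sd used by zScore above.

```r
set.seed(1)  # fixed seed so the toy data is reproducible

# Hypothetical stand-ins for gex and mxy
gex_toy <- data.frame(sample = paste0("S", 1:20),
                      colA = runif(20, -1, 1),
                      colB = runif(20, -1, 1),
                      stringsAsFactors = FALSE)
mxy_toy <- data.frame(r1 = runif(20, -1, 1),
                      r2 = runif(20, -1, 1),
                      r3 = runif(20, -1, 1))

# Hypothetical helper: names of the samples with the lowest and highest z-values
extreme_samples <- function(col, gex, ref, p = 0.05) {
  ref_mat <- as.matrix(ref)
  # One z-score per row: value minus the row mean, divided by the row sd
  z <- (gex[[col]] - rowMeans(ref_mat, na.rm = TRUE)) /
    apply(ref_mat, 1, sd, na.rm = TRUE)
  keep   <- is.finite(z)                    # drops NA, NaN and infinite z-scores
  ranked <- gex$sample[keep][order(z[keep])]
  n      <- ceiling(length(ranked) * p)     # rows to take from each end
  as.character(c(head(ranked, n), rev(tail(ranked, n))))
}

result <- lapply(c("colA", "colB"), extreme_samples, gex = gex_toy, ref = mxy_toy)
result
```

The result is a list with one character vector per column name, in the same shape as dfs above. The ranking happens only after all z-scores for a column exist, which is the step the original loop was missing.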