R 数据帧的若干子集的重叠_R_Lapply

R 数据帧的若干子集的重叠

R 数据帧的若干子集的重叠,r,lapply,R,Lapply,我有一个数据框数据，包含基因组内突变核苷酸的染色体和位置： structure(list(chrom = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L), pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 20L, 33L, 40L, 45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", "pos"), class = "data.frame

我有一个数据框

数据

，包含基因组内突变核苷酸的

染色体

和

位置

：

structure(list(chrom = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 
3L, 4L, 4L, 4L, 4L), pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 
20L, 33L, 40L, 45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", 
"pos"), class = "data.frame", row.names = c(NA, -14L))

和另一个

tss\U位置

，包含

基因

内特征（

tss

）的位置及其所在的

染色体

：

structure(list(gene = structure(c(1L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 2L, 3L), .Label = c("gene1", "gene10", "gene11", "gene2", 
"gene3", "gene4", "gene5", "gene6", "gene7", "gene8", "gene9"
), class = "factor"), chrom = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 4L, 4L), tss = c(5L, 10L, 23L, 1340L, 313L, 88L, 44L, 57L, 
88L, 74L, 127L)), .Names = c("gene", "chrom", "tss"), class = "data.frame", row.names = c(NA, 
-11L))

我正试图为

数据中的每个pos
计算到同一染色体上最近的tss
的距离
到目前为止，我可以计算从每个数据$pos
到任何tss\U位置$tss
（即，与每个pos
最近的tss
，与染色体无关）：
但是，我希望能够为每个pos
找到同一染色体上最接近的tss

我知道我可以将其单独应用于每个染色体，但我希望看到全局距离（跨越所有染色体），但只在同一染色体上的pos
itions和tss
s之间进行比较
我如何调整这一点以实现这一目标？按染色体对两个数据帧进行子集划分并合并结果
到目前为止，这是正确的方法吗
 类似的方法可能会在数据
数据框中为每个染色体获得最接近的tss
data <- structure(list(chrom = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 
                                 3L, 4L, 4L, 4L, 4L), pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 
                                                              20L, 33L, 40L, 45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", 
                                                                                                                     "pos"), class = "data.frame", row.names = c(NA, -14L))

tss_locations <- structure(list(gene = structure(c(1L, 4L, 5L, 6L, 7L, 8L, 9L, 
                                                  10L, 11L, 2L, 3L), .Label = c("gene1", "gene10", "gene11", "gene2", 
                                                                                "gene3", "gene4", "gene5", "gene6", "gene7", "gene8", "gene9"
                                                  ), class = "factor"), chrom = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
                                                                                  3L, 4L, 4L), tss = c(5L, 10L, 23L, 1340L, 313L, 88L, 44L, 57L, 
                                                                                                       88L, 74L, 127L)), .Names = c("gene", "chrom", "tss"), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                 -11L))

# Generate needed values by applying function to all rows and transposing t() the results
data[,c("closest_gene", "closest_tss", "min_dist")] <- t(apply(data, 1, function(x){
   # Get subset of tss_locations where the chromosome matches the current row
   genes <- tss_locations[tss_locations$chrom == x["chrom"], ]

   # Find the minimum distance from the current row's pos to the nearest tss location
   min.dist <- min(abs(genes$tss - x["pos"]))

   # Find the closest tss location to the current row's pos
   closest_tss <- genes[which.min(abs(genes$tss - x["pos"])), "tss"]

   # Check if closest tss location is less than pos and set min.dist to negative if true
   min.dist <- ifelse(closest_tss < x["pos"], min.dist * -1, min.dist)

   # Find the closest gene to the current row's pos
   closest_gene <- as.character(genes[which.min(abs(genes$tss - x["pos"])), "gene"])

   # Return the values to the matrix
   return(c(closest_gene, closest_tss, min.dist))
}))

数据谢谢-似乎很接近，但输出略有下降。例如，在运行它时，数据的第一行看起来像：110 gene150
-其中最近的tss
是5
，但是min\u dist
是0
你是对的，我已经编辑了答案来更正它。太好了！我在问题中提供的数据是玩具数据。我的真实数据是相同的，只是数据
有几个额外的列。我如何调整它以适应这种情况（并明确选择pos
列）？data
中的其他列根本不会影响功能。代码已经在显式检查chrom
和pos
列（即x[“chrom”]
和x[“pos”]
），您能否在回答中添加几行解释？我不想要与tss
的绝对距离，但是当我更改min.dist时
   gene chrom  tss
1 gene1     1    5
2 gene2     1   10
3 gene3     1   23
4 gene4     2 1340
5 gene5     2  313
6 gene6     2   88

fun <- function(p) {
  # Get index of nearest tss
  index<-which.min(abs(tss_locations$tss - p))
  # Lookup the value
  closestTss<-tss_locations$tss[[index]]
  # Calculate the distance
  dist<-(closestTss-p)
  list(snp=p, closest=closestTss, distance2nearest=dist)
}

# Run function for each 'pos' in data
dist2tss<-lapply(data$pos, fun)

# Convert to data frame and sort descending:
dist2tss<-do.call(rbind, dist2tss)
dist2tss<-as.data.frame(dist2tss)

dist2tss<-arrange(dist2tss,(as.numeric(distance2nearest)))
dist2tss$distance2nearest<-as.numeric(dist2tss$distance2nearest)

head(dist2tss)

  snp closest distance2nearest
1 600     313             -287
2 400     313              -87
3 200     127              -73
4 100      88              -12
5  33      23              -10
6 134     127               -7

data <- structure(list(chrom = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 
                                 3L, 4L, 4L, 4L, 4L), pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 
                                                              20L, 33L, 40L, 45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", 
                                                                                                                     "pos"), class = "data.frame", row.names = c(NA, -14L))

tss_locations <- structure(list(gene = structure(c(1L, 4L, 5L, 6L, 7L, 8L, 9L, 
                                                  10L, 11L, 2L, 3L), .Label = c("gene1", "gene10", "gene11", "gene2", 
                                                                                "gene3", "gene4", "gene5", "gene6", "gene7", "gene8", "gene9"
                                                  ), class = "factor"), chrom = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
                                                                                  3L, 4L, 4L), tss = c(5L, 10L, 23L, 1340L, 313L, 88L, 44L, 57L, 
                                                                                                       88L, 74L, 127L)), .Names = c("gene", "chrom", "tss"), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                 -11L))

# Generate needed values by applying function to all rows and transposing t() the results
data[,c("closest_gene", "closest_tss", "min_dist")] <- t(apply(data, 1, function(x){
   # Get subset of tss_locations where the chromosome matches the current row
   genes <- tss_locations[tss_locations$chrom == x["chrom"], ]

   # Find the minimum distance from the current row's pos to the nearest tss location
   min.dist <- min(abs(genes$tss - x["pos"]))

   # Find the closest tss location to the current row's pos
   closest_tss <- genes[which.min(abs(genes$tss - x["pos"])), "tss"]

   # Check if closest tss location is less than pos and set min.dist to negative if true
   min.dist <- ifelse(closest_tss < x["pos"], min.dist * -1, min.dist)

   # Find the closest gene to the current row's pos
   closest_gene <- as.character(genes[which.min(abs(genes$tss - x["pos"])), "gene"])

   # Return the values to the matrix
   return(c(closest_gene, closest_tss, min.dist))
}))