Extract information from columns based on the value of another column in an R data.frame


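I have a large file (~100k rows and 100 columns), and I want to create a file that extracts information from four columns based on another column. There is a column named Caller that tells you which of the .sample columns contain something other than noSample.

I have tried if and else if statements, but sometimes two conditions are met at once, and writing out every possible combination would take a lot of effort. I am quite sure there is a better way.

My real data.frame looks like this:

Edit

My desired output does not include the pipes between samples, but I can remove them with strsplit.


Since the data.frame is large, speed is critical.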
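As a minimal sketch of the strsplit step mentioned above (the variable names here are illustrative, not from the original post): the pipe has to be treated as a literal character, which `fixed = TRUE` takes care of.

```r
value <- "1234|567|87sd"

# fixed = TRUE treats "|" as a literal character; read as a regular
# expression it would mean "or" and split the string into single characters
samples <- strsplit(value, "|", fixed = TRUE)[[1]]
samples

# strsplit is vectorised, so a whole column can be split in one call
column <- c("3xd|432", "gfd3|123|456|789")
split_column <- strsplit(column, "|", fixed = TRUE)
lengths(split_column)
```

The result of splitting a column is a list with one character vector per row, so the row order is preserved.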

Here is a possible solution:

Df <- data.frame(A = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
                 B= c(10,12,13,14,15,16,17),
                 Caller = c("A", "B", "C",  "D", "A,C", "A,B,C", "B,D"),
                 A.sample = c("3xd|432", "noSample","noSample","noSample","1234|567|87sd","234|456|897a","noSample"),
                 B.sample = c("noSample", "456|789|asd", "noSample","noSample","noSample","674e|7892|123|432","bgcf|12er|567|zxs3|12ple"),
                 C.sample = c("noSample","noSample", "zxc|vbn|mn","noSample","gfd3|123|456|789","674e|7892|123","noSample" ),
                 D.sample = c("noSample","noSample", "noSample", "poi|uyh|gfrt|562", "noSample", "noSample", "567|zxs3|12ple"),
                 stringsAsFactors=FALSE)

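# first letter of each column name (the caller prefix)
# note: this variable shadows base::names(); calls to names() still work
names <- substr(names(Df), 1, 1)
# set the unwanted names (columns A, B, Caller) to NA so they can never match
names[-c(4:ncol(Df))] <- NA

# create a regular expression by replacing the comma with the "or" operator |
reg <- gsub(",", "|", Df$Caller, fixed = TRUE)

# find the column matches for each row
# (lapply always returns a list; sapply may simplify to a matrix)
columns <- lapply(reg, function(x) grep(x, names))

# extract the desired columns out into a list, one element per row
lapply(seq_along(columns), function(x) Df[x, columns[[x]]])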

I added stringsAsFactors=FALSE to the data frame definition to get rid of the baggage associated with factor levels.
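To illustrate what that baggage looks like, here is a small sketch on a two-row toy column (the object names are made up for this example). Since R 4.0.0 the default is already stringsAsFactors = FALSE, so the argument mainly matters on older versions.

```r
with_factors <- data.frame(A.sample = c("3xd|432", "noSample"),
                           stringsAsFactors = TRUE)
as_strings   <- data.frame(A.sample = c("3xd|432", "noSample"),
                           stringsAsFactors = FALSE)

class(with_factors$A.sample)  # a factor
class(as_strings$A.sample)    # a plain character vector

# a single extracted factor value still carries every level of the column
first_value <- with_factors$A.sample[1]
length(levels(first_value))

# string functions such as strsplit() expect character input,
# so a factor would need as.character() first
strsplit(as_strings$A.sample[1], "|", fixed = TRUE)[[1]]
```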

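This shows just one of many possible ways to achieve the desired result. Note that I use the same data frame as @Dave2e, i.e. I added stringsAsFactors=F to the data.frame call.

Now we can simply use subsetting to retrieve the desired results: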

out[out$rowid == 1,"value"]
[1] "3xd|432"
out[out$rowid == 5,"value"]
[1] "1234|567|87sd"    "gfd3|123|456|789"

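It looks like you are trying to get the band diagonal out of the data frame. You may want to format your data as a table/matrix so that this becomes clear.

@TimBiegeleisen, it is not always a perfect diagonal; in some cases every sample column in a row can contain a value. How is the formatting now?

Try to give us a minimal problem.

Sorry if I don't understand your point, but I just want to extract the sample information from those columns that are not noSample, and that information has to stay tied to the row index somehow.

How important is it that the samples are shown with [1] etc. in the output vector?

It works great, but in my real data the columns A.sample, B.sample, C.sample, D.sample are not consecutive; they are at positions c(8,10,12,14). I don't know how to fix the columns step to get the right columns, because you used +3 to get the right index, right?

@user2380782, I made an edit to handle non-consecutive columns; just substitute c(8,10,12,14) into the names[-c(...)] index.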
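One way to avoid hard-coding positions such as c(8,10,12,14) is to locate the sample columns by name. The sketch below is an assumption-laden variant of the answer's code, not the answerer's own edit: the interleaved `A.other` / `B.other` columns and the `prefix`, `sample_idx` and `result` names are hypothetical.

```r
# Hypothetical layout: the .sample columns are interleaved with other columns
Df <- data.frame(
  Caller   = c("A", "B", "A,B"),
  A.sample = c("3xd|432", "noSample", "234|456"),
  A.other  = c(1, 2, 3),
  B.sample = c("noSample", "456|789", "674e|7892"),
  B.other  = c(4, 5, 6),
  stringsAsFactors = FALSE
)

# positions of the sample columns, found by name instead of hard-coded
sample_idx <- grep("\\.sample$", names(Df))

# caller prefix for the sample columns, NA everywhere else
prefix <- rep(NA_character_, ncol(Df))
prefix[sample_idx] <- sub("\\.sample$", "", names(Df)[sample_idx])

# "A,B" becomes the anchored pattern "^(A|B)$"
reg <- paste0("^(", gsub(",", "|", Df$Caller, fixed = TRUE), ")$")

# matching column positions per row; grepl() returns FALSE for NA entries
columns <- lapply(reg, function(x) which(grepl(x, prefix)))

# row-wise extraction, flattened to a character vector per row
result <- lapply(seq_along(columns), function(i) {
  unlist(Df[i, columns[[i]]], use.names = FALSE)
})
```

Anchoring the pattern with `^( ... )$` also means caller names longer than one letter would match exactly, which the first-letter approach does not guarantee.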
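
# Code for the tidyverse answer; it builds the `out` table used in the subsetting example
library(tidyverse)
out <- Df %>% rowid_to_column() %>% # adding explicit row IDs
       # reshaping the dataframe to long format
       # (gather() still works but is superseded by tidyr::pivot_longer())
       gather(key, value, -rowid, -A, -B, -Caller) %>%
       filter(value != "noSample") # keep only cells with real sample information
out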
   rowid    A  B Caller      key                    value
1      1 chr1 10      A A.sample                  3xd|432
2      5 chr1 15    A,C A.sample            1234|567|87sd
3      6 chr1 16  A,B,C A.sample             234|456|897a
4      2 chr1 12      B B.sample              456|789|asd
5      6 chr1 16  A,B,C B.sample        674e|7892|123|432
6      7 chr1 17    B,D B.sample bgcf|12er|567|zxs3|12ple
7      3 chr1 13      C C.sample               zxc|vbn|mn
8      5 chr1 15    A,C C.sample         gfd3|123|456|789
9      6 chr1 16  A,B,C C.sample            674e|7892|123
10     4 chr1 14      D D.sample         poi|uyh|gfrt|562
11     7 chr1 17    B,D D.sample           567|zxs3|12ple
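Because speed was a stated concern, the same long table can also be sketched in base R with a logical matrix mask instead of a reshape. This is only an illustration on three rows taken from the example data; the names `sample_cols`, `keep`, `long` and `by_row` are made up here, and whether it is actually faster on ~100k rows would need benchmarking.

```r
# three rows taken from the example data (rows 1, 5 and 7)
Df <- data.frame(
  Caller   = c("A", "A,C", "B,D"),
  A.sample = c("3xd|432", "1234|567|87sd", "noSample"),
  B.sample = c("noSample", "noSample", "bgcf|12er|567|zxs3|12ple"),
  C.sample = c("noSample", "gfd3|123|456|789", "noSample"),
  D.sample = c("noSample", "noSample", "567|zxs3|12ple"),
  stringsAsFactors = FALSE
)

# locate the sample columns by name, so their positions do not matter
sample_cols <- grep("\\.sample$", names(Df), value = TRUE)
m <- as.matrix(Df[sample_cols])

# TRUE wherever a cell holds real sample information
keep <- m != "noSample"

# row and column index of every kept cell, found in one vectorised step
idx <- which(keep, arr.ind = TRUE)
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

# long table: one line per (row, caller) pair with its sample string
long <- data.frame(
  rowid  = idx[, "row"],
  caller = sub("\\.sample$", "", sample_cols[idx[, "col"]]),
  value  = m[idx],
  stringsAsFactors = FALSE
)

# sample strings grouped by original row
by_row <- split(long$value, long$rowid)
```

Here `by_row[["2"]]` holds both sample strings of the "A,C" row, mirroring the `out[out$rowid == 5, "value"]` lookup in the tidyverse answer.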