R: concatenating lines conditionally
I am trying to concatenate some lines of a file into a single line, but which lines get joined depends on the content and varies throughout the file. A simplified version of my data file:
>xy|number|Name
ABCABCABC
ABCABCABC
ABCABCABC
ABC
>xy|number2|Name2
ABCABCABC
ABCABC
>xy|number3|Name3
ABCABCABC
ABCABCABC
ABCABCABC
ABCAB
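I would like each record to end up on a single line, with the header fields and the joined sequence as separate columns (spaces denote the different columns).
Here is a solution similar to @MatthewLundberg's, but using cumsum to split the vector: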
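file <- scan('~/Desktop/data.txt', 'character')   # read whitespace-separated tokens as strings
h <- grepl('^>', file)                            # TRUE for header lines
file[h] <- gsub('^>', '', paste0(file[h], '|'))   # drop ">" and append "|" so the sequence becomes a 4th field
l <- split(file, cumsum(h))                       # one group per header line
do.call(rbind, strsplit(sapply(l, paste, collapse = ''), '[|]'))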
# [,1] [,2] [,3] [,4]
# 1 "xy" "number" "Name" "ABCABCABCABCABCABCABCABCABCABC"
# 2 "xy" "number2" "Name2" "ABCABCABCABCABC"
# 3 "xy" "number3" "Name3" "ABCABCABCABCABCABCABCABCABCABCAB"
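The grouping step is the core of this answer, so here is a minimal sketch of it in isolation. The toy vector `x` and the names `is_header`, `group`, and `joined` are made up for illustration and are not part of the original answer. Because `cumsum` of a logical vector increases by one at every header, each header and the lines that follow it share the same group number.

```r
# Hypothetical toy input: header lines start with ">"
x <- c(">a", "AB", "C", ">b", "DE")

is_header <- grepl("^>", x)   # TRUE FALSE FALSE TRUE FALSE
group <- cumsum(is_header)    # 1 1 1 2 2: the counter steps at each header

# Split into one chunk per record, then join everything except the header
chunks <- split(x, group)
joined <- vapply(chunks, function(g) paste(g[-1], collapse = ""), character(1))
print(joined)
```

The same idea should carry over to any file where a marker line opens a new record, as long as the first line of the file is itself a marker line.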
If you want a data.frame with the result, consider the following:
raw <- ">xy|number|Name
ABCABCABC
ABCABCABC
ABCABCABC
ABC
>xy|number2|Name2
ABCABCABC
ABCABC
>xy|number3|Name3
ABCABCABC
ABCABCABC
ABCABCABC
ABCAB"
s <- readLines(textConnection(raw))             # s is a vector of strings
first.line <- which(substr(s, 1, 1) == ">")     # index of each record's header line
N <- length(first.line)
first.line <- c(first.line, length(s) + 1)      # sentinel: one past the last line
# Preallocate data.frame (good idea if large)
d <- data.frame(X1=rep("",N), X2=rep("",N), X3=rep("",N), X4=rep("",N),
                stringsAsFactors=FALSE)
for (i in seq_len(N))                           # seq_len() is safe when N is 0
{
  w <- unlist(strsplit(s[first.line[i]], ">|\\|"))  # Parse header; w[1] is the empty string before ">"
  d$X1[i] <- w[2]
  d$X2[i] <- w[3]
  d$X3[i] <- w[4]
  # Join the sequence lines between this header and the next one
  d$X4[i] <- paste(s[(first.line[i]+1):(first.line[i+1]-1)], collapse="")
}
d
X1 X2 X3 X4
1 xy number Name ABCABCABCABCABCABCABCABCABCABC
2 xy number2 Name2 ABCABCABCABCABC
3 xy number3 Name3 ABCABCABCABCABCABCABCABCABCABCAB
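For comparison, the two answers can be combined into a loop-free version that still returns a data.frame. This is only a sketch under stated assumptions: the shortened input `lines_in` and the names `is_header`, `fields`, `seqs`, and `d2` are invented for the example, and it assumes every header has exactly three `|`-separated fields and is followed by at least one sequence line.

```r
# Hypothetical shortened input in the same format as the question
lines_in <- c(">xy|number|Name", "ABCABCABC", "ABC",
              ">xy|number2|Name2", "ABCABC")

is_header <- startsWith(lines_in, ">")

# Header fields: strip the leading ">" and split on the literal "|"
headers <- sub("^>", "", lines_in[is_header])
fields <- do.call(rbind, strsplit(headers, "|", fixed = TRUE))

# Sequence lines: group by the running header count and join each group
group <- cumsum(is_header)
seqs <- tapply(lines_in[!is_header], group[!is_header], paste, collapse = "")

# An unnamed matrix becomes columns X1..X3; the joined sequence is X4
d2 <- data.frame(fields, X4 = as.vector(seqs), stringsAsFactors = FALSE)
print(d2)
```

If a header could appear with no sequence lines after it, `tapply` would drop that group and the column lengths would no longer match, so the loop version may be the safer choice for messy files.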
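I'm sure this can be done in R, but it is almost certainly the wrong language for the task (how would you handle these structures in R?). Consider an imperative language such as Perl or C.
@MatthewLundberg, if he wants to do the post-processing in R and the file is not huge, I don't see why R is the wrong language for this.
"Depends on the content" is not a description!
Here, make sure file is not a factor.
I just wrote something similar, but without the second and fourth lines wrapped up... no point in posting it now...
Read the file with scan(..., what=character()) and this would be a complete answer.
Unfortunately it doesn't give me 4 columns, only 2 (xy, then the rest), but I have plenty to work with now, thanks!
That's strange. I copied your test data exactly and it works fine.
Sorry, my fault. I'm being silly today. Wrong input file. It works great, thank you very much!!
Somehow I lost the number and the name, but you gave me a lot of good information, thanks!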