R: joining lines conditionally


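I am trying to join some lines of a file into a single line, but which lines get joined depends on their content, and the number of lines per group varies throughout the file.

A simplified version of my data file: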

>xy|number|Name
ABCABCABC
ABCABCABC
ABCABCABC
ABC
>xy|number2|Name2
ABCABCABC
ABCABC
>xy|number3|Name3
ABCABCABC
ABCABCABC
ABCABCABC
ABCAB
I would like it to end up like this (spaces separate the columns):


Here is a solution similar to @MatthewLundberg's, but using cumsum to split the vector:

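# Read the file as a character vector (one element per whitespace-separated token)
file <- scan('~/Desktop/data.txt', what = character())
h <- grepl('^>', file)                            # TRUE on header lines
file[h] <- gsub('^>', '', paste0(file[h], '|'))   # drop '>' and append a field separator
l <- split(file, cumsum(h))                       # one group per header line
do.call(rbind, strsplit(sapply(l, paste, collapse = ''), '[|]'))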

#   [,1] [,2]      [,3]    [,4]                              
# 1 "xy" "number"  "Name"  "ABCABCABCABCABCABCABCABCABCABC"  
# 2 "xy" "number2" "Name2" "ABCABCABCABCABC"                 
# 3 "xy" "number3" "Name3" "ABCABCABCABCABCABCABCABCABCABCAB"
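To see why cumsum works as a grouping key here, the following minimal sketch runs the same idea on a small in-memory vector. The names x, g and groups are made up for this illustration and are not part of the answer above.

```r
# Toy input held in memory so the sketch runs without a data file
x <- c(">xy|number|Name", "ABCABCABC", "ABC",
       ">xy|number2|Name2", "ABCABC")

h <- grepl("^>", x)    # TRUE on header lines: TRUE FALSE FALSE TRUE FALSE
g <- cumsum(h)         # running count of headers seen so far: 1 1 1 2 2

groups <- split(x, g)  # one list element per record, header first
groups
```

Each header line bumps the running count by one, so every line after a header carries that header's count until the next header appears.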

If you want a data.frame with the results, consider the following:

raw <- ">xy|number|Name
ABCABCABC
ABCABCABC
ABCABCABC
ABC
>xy|number2|Name2
ABCABCABC
ABCABC
>xy|number3|Name3
ABCABCABC
ABCABCABC
ABCABCABC
ABCAB"

s <- readLines(textConnection(raw))        # s is vector of strings

first.line <- which(substr(s,1,1) == ">")  # find first line of set
N <- length(first.line)
first.line <- c(first.line, length(s)+1)   # add first line past end

# Preallocate data.frame (good idea if large)
d <- data.frame(X1=rep("",N), X2=rep("",N), X3=rep("",N), X4=rep("",N),
                stringsAsFactors=FALSE)

for (i in 1:N)
{
  w <- unlist(strsplit(s[first.line[i]],">|\\|"))  # Parse 1st line
  d$X1[i] <- w[2]
  d$X2[i] <- w[3]
  d$X3[i] <- w[4]
  d$X4[i] <- paste(s[ (first.line[i]+1) : (first.line[i+1]-1) ], collapse="")
}


d
  X1      X2    X3                               X4
1 xy  number  Name   ABCABCABCABCABCABCABCABCABCABC
2 xy number2 Name2                  ABCABCABCABCABC
3 xy number3 Name3 ABCABCABCABCABCABCABCABCABCABCAB
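For comparison, the same data.frame can probably be built without an explicit loop by combining the header test with cumsum. This is only a sketch: the names is.header, group, fields, body and d2 are invented here, and it assumes every header is followed by at least one sequence line and always has exactly three fields.

```r
s <- c(">xy|number|Name", "ABCABCABC", "ABCABCABC", "ABCABCABC", "ABC",
       ">xy|number2|Name2", "ABCABCABC", "ABCABC",
       ">xy|number3|Name3", "ABCABCABC", "ABCABCABC", "ABCABCABC", "ABCAB")

is.header <- substr(s, 1, 1) == ">"   # TRUE on the first line of each record
group <- cumsum(is.header)            # record number for every line

# Drop the leading ">" and split each header on "|" into a 3-column matrix
fields <- do.call(rbind,
                  strsplit(sub("^>", "", s[is.header]), "|", fixed = TRUE))

# Concatenate the sequence lines belonging to each record
body <- vapply(split(s[!is.header], group[!is.header]),
               paste, character(1), collapse = "")

d2 <- data.frame(X1 = fields[, 1], X2 = fields[, 2], X3 = fields[, 3],
                 X4 = unname(body), stringsAsFactors = FALSE)
d2
```

If a record could have a header with no sequence lines, body would come out shorter than fields and the data.frame call would fail, so that case would need separate handling.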

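I'm sure this can be done in R, but it is almost certainly the wrong language for the task (how would you handle these structures in R?). Consider an imperative language such as Perl or C.

@MatthewLundberg, if he wants to do the post-processing in R and the file isn't huge, I don't see why R would be the wrong language for this.

"Depends on the content" is not a description!

Here, make sure file is not a factor. I had just written something similar, but without packing the second and fourth lines... no point posting it now... Read the file with scan and what=character(), and this will be a complete answer.

Unfortunately it doesn't give me 4 columns, only 2 (xy, then the rest), but I have plenty to work with now, thanks.

That's strange. I copied your test data exactly and it works fine.

Sorry, my mistake. I'm being silly today. Wrong input file. Works great, thank you very much!!

Somehow I lost the numbers and names, but you gave me a lot of good information, thanks!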
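As a small illustration of the scan suggestion in the comments, the sketch below writes a few lines to a temporary file and reads them back as plain character strings rather than factors. The temporary file and the names tmp, x and y are assumptions made for this example only.

```r
# Write a tiny sample to a temporary file so the sketch is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c(">xy|number|Name", "ABCABCABC", "ABC"), tmp)

# scan() with what = character() returns a character vector, never a factor
x <- scan(tmp, what = character(), quiet = TRUE)
is.character(x)

# readLines() gives the same result here because no line contains whitespace
y <- readLines(tmp)
identical(x, y)

unlink(tmp)
```

Note that scan splits on whitespace by default, so for files whose lines may contain spaces readLines is likely the safer choice.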