R: Convert a columnar table of contract information to rows


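I have a large table of contract information (several hundred contracts), all collected in a single column of one table. Each contract occupies 6 consecutive rows. I have been able to add another column (CAT) that indicates what each row contains: Company, Address, CityStZip, Contact, Contract, Title. [Using R]

A reproducible version of the data, after adding the second column to represent the column names, looks like this: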

textFile <- "col1|col2
XYZCo|Company
123 Main Street|Address
Yourtown, MA 12345|CityStZip
Joe Smith|Contact
20-234-56/3|Contract
Process for Work|Title
ZZTop Co|Company
123 Jefferson Street|Address
Chicago, IL 60636|CityStZip
Jane Doe|Contact
23-274-11/3|Contract
Yet Another One|Title"

data <- read.csv(text=textFile,header = TRUE,sep="|")
data


                   col1      col2
1                 XYZCo   Company
2       123 Main Street   Address
3    Yourtown, MA 12345 CityStZip
4             Joe Smith   Contact
5           20-234-56/3  Contract
6      Process for Work     Title
7              ZZTop Co   Company
8  123 Jefferson Street   Address
9     Chicago, IL 60636 CityStZip
10             Jane Doe   Contact
11          23-274-11/3  Contract
12      Yet Another One     Title

The desired output looks like this:

Company   Address           CityStZip            Contact     Contract      Title
XYZCo     123 Main Street   Yourtown, MA 12345   Joe Smith   20-234-56/3   Process for Work

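After posting my original answer, I realized that the data might differ from what I had assumed, because the original post refers to column 1 and column 2 of the raw data. If the data looks like the following, there is a relatively simple answer that combines dplyr with tidyr::pivot_wider().

First, we read the data and print the resulting data frame, which has two columns: the data values and the column names.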

textFile <- "col1|col2
XYZCo|Company
123 Main Street|Address
Yourtown, MA 12345|CityStZip
Joe Smith|Contact
20-234-56/3|Contract
Process for Work|Title
ZZTop Co|Company
123 Jefferson Street|Address
Chicago, IL 60636|CityStZip
Jane Doe|Contact
23-274-11/3|Contract
Yet Another One|Title"
data <- read.csv(text = textFile, header = TRUE, sep = "|")
data

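To convert the data frame to wide-format tidy data, we need to add an ID column that distinguishes one observation from the others. To do this, we can use dplyr::mutate() together with the ceiling() function. ceiling() is needed because we want the ID value to stay constant across each group of 6 input rows: dividing the result of seq_along() by 6 and rounding up generates the required vector.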
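
To make the ID calculation concrete, here is a minimal sketch on a 12-row example. The names rows_per_contract, row_position, and id are illustrative only, and the sketch assumes every contract occupies exactly 6 consecutive rows.

```r
# Assumption: every contract occupies exactly 6 consecutive rows
rows_per_contract <- 6
row_position <- seq_len(12)   # row positions of a 12-row example

# Rounding up position / 6 keeps the id constant within each block of 6 rows
id <- ceiling(row_position / rows_per_contract)
id
# returns 1 1 1 1 1 1 2 2 2 2 2 2
```

If the row count is not a multiple of 6, the last ID covers an incomplete contract, so checking nrow(data) %% 6 == 0 before pivoting is probably worthwhile.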

Once we have added the ID column, pivoting to wide format is relatively simple.

library(dplyr)
library(tidyr)

# add a contract id that changes every 6 rows, then spread the labels into columns
data %>%
    mutate(id = ceiling(seq_along(col1) / 6)) %>%
    pivot_wider(id_cols = id, names_from = col2, values_from = col1)
…and the output:

# A tibble: 2 x 7
     id Company  Address          CityStZip       Contact  Contract   Title       
  <dbl> <chr>    <chr>            <chr>           <chr>    <chr>      <chr>       
1     1 XYZCo    123 Main Street  Yourtown, MA 1… Joe Smi… 20-234-56… Process for…
2     2 ZZTop Co 123 Jefferson S… Chicago, IL 60… Jane Doe 23-274-11… Yet Another…
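
The ceiling() approach depends on every contract having all 6 rows. If some contracts might be missing a row, one possible variation (not part of the answer above) is to start a new ID at every Company row instead. The sketch below assumes each contract begins with a Company row; the data frame is hypothetical, with the Contact row of the second contract left out on purpose.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the second contract has no Contact row
data <- data.frame(
  col1 = c("XYZCo", "123 Main Street", "Yourtown, MA 12345", "Joe Smith",
           "20-234-56/3", "Process for Work",
           "ZZTop Co", "123 Jefferson Street", "Chicago, IL 60636",
           "23-274-11/3", "Yet Another One"),
  col2 = c("Company", "Address", "CityStZip", "Contact", "Contract", "Title",
           "Company", "Address", "CityStZip", "Contract", "Title"),
  stringsAsFactors = FALSE
)

wide <- data %>%
  mutate(id = cumsum(col2 == "Company")) %>%   # new id at every Company row
  pivot_wider(id_cols = id, names_from = col2, values_from = col1)

wide
```

With this ID, the missing Contact value for the second contract should come out as NA rather than shifting the later rows into the wrong contract.
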
In the original answer, we first use readLines() to read the data into a character vector.
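
A minimal sketch of this reading step, assuming the raw data is plain text with one value per line and no label column. The textFile contents are taken from the sample data in the question, and reading from a text connection rather than a file on disk is an assumption made to keep the example self-contained.

```r
# Assumed raw input: one value per line, six lines per contract
textFile <- "XYZCo
123 Main Street
Yourtown, MA 12345
Joe Smith
20-234-56/3
Process for Work
ZZTop Co
123 Jefferson Street
Chicago, IL 60636
Jane Doe
23-274-11/3
Yet Another One"

# Read the text line by line into a character vector
con <- textConnection(textFile)
dataVector <- readLines(con)
close(con)

length(dataVector)
# returns 12
```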

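Next, we loop through the vector and combine every 6 lines into a single output record, using the pipe character as the separator because the data contains commas in the CityStZip field.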

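# write to tempfile as pipe separated values
# dataVector is the character vector returned by readLines() in the previous step
tmpFile <- "./data/tmpfile.csv"
unlink(tmpFile)   # start from an empty file, because cat() below appends
counter <- 0
outLine <- NULL
for (i in seq_along(dataVector)) {
    counter <- counter + 1
    if (counter == 1) outLine <- dataVector[i]
    else outLine <- paste(outLine, dataVector[i], sep = "|")
    if (counter == 6) {
        # a complete 6-line record: append it to the file and reset
        cat(outLine, file = tmpFile, sep = "\n", append = TRUE)
        counter <- 0
        outLine <- NULL
    }
}

# read the combined records back in as pipe separated values
colNames <- c("Company", "Address", "CityStZip", "Contact", "Contract", "Title")
data <- read.csv(tmpFile, header = FALSE, sep = "|",
                 col.names = colNames)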
…and the output:

> data
    Company               Address           CityStZip    Contact     Contract
1    XYZCo       123 Main Street  Yourtown, MA 12345  Joe Smith  20-234-56/3 
2 ZZTop Co  123 Jefferson Street   Chicago, IL 60636   Jane Doe  23-274-11/3 
              Title
1 Process for Work 
2  Yet Another One 
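
For the same single-column case, the temporary file can probably be avoided altogether. The sketch below is a base R alternative, not the method used above: it reshapes the vector with matrix(byrow = TRUE). It assumes the vector length is an exact multiple of 6 and that the six fields always appear in the same order; dataVector is rebuilt here from the sample data so the example runs on its own.

```r
# Assumed input: one value per element, six elements per contract
dataVector <- c("XYZCo", "123 Main Street", "Yourtown, MA 12345",
                "Joe Smith", "20-234-56/3", "Process for Work",
                "ZZTop Co", "123 Jefferson Street", "Chicago, IL 60636",
                "Jane Doe", "23-274-11/3", "Yet Another One")
colNames <- c("Company", "Address", "CityStZip", "Contact", "Contract", "Title")

# Stop early if the data does not split evenly into 6-field records
stopifnot(length(dataVector) %% length(colNames) == 0)

# Fill a matrix row by row, so each row holds one contract
wide <- as.data.frame(
  matrix(dataVector, ncol = length(colNames), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(wide) <- colNames
wide
```

Because nothing is written to disk, the commas in the CityStZip field need no special separator here.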

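Can't seem to get the data to display in columns in the post :(

Welcome to StackOverflow, @Xtfer. Is either of the two potential data formats posted in my answer correct? If not, please use dput() to post a small sample of your data.

Given your comment below, I updated your question to make it reproducible. Please edit that section with the for() loop and post the actual code where you got stuck, along with any error messages your code produced.

mutate! Adding an index is the trick. I tried a long "for loop", but the horizontal and vertical indexing kept getting more complicated. Always amazed at how R and its distinguished practitioners come up with elegant answers. Thanks!

@xtufer - if you find the answer useful, please accept it by clicking the check mark next to it, and upvote it.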