R: Convert a column-wise table of contract information to rows
I have a large table of contract information (several hundred contracts), all collected in a single column of a table. Each contract occupies 6 consecutive rows. I have been able to add another column (CAT) that indicates what each row of the column contains: Company, Address, CityStZip, Contact, Contract, Title. [Using R]
A reproducible version of the data, after adding the second column to represent the column names, looks like this:
textFile <- "col1|col2
XYZCo|Company
123 Main Street|Address
Yourtown, MA 12345|CityStZip
Joe Smith|Contact
20-234-56/3|Contract
Process for Work|Title
ZZTop Co|Company
123 Jefferson Street|Address
Chicago, IL 60636|CityStZip
Jane Doe|Contact
23-274-11/3|Contract
Yet Another One|Title"
data <- read.csv(text=textFile,header = TRUE,sep="|")
data
col1 col2
1 XYZCo Company
2 123 Main Street Address
3 Yourtown, MA 12345 CityStZip
4 Joe Smith Contact
5 20-234-56/3 Contract
6 Process for Work Title
7 ZZTop Co Company
8 123 Jefferson Street Address
9 Chicago, IL 60636 CityStZip
10 Jane Doe Contact
11 23-274-11/3 Contract
12 Yet Another One Title
The desired output looks like this:
Company  Address          CitySTZip           Contact    Contract     Title
XYZCo    123 Main Street  Yourtown, MA 12345  Joe Smith  20-234-56/3  Process for Work
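After posting my original answer, I realized that the data might differ from what I had assumed, because the original post refers to column 1 and column 2 in the raw data. If the data looks like the following, there is a relatively simple answer that combines dplyr with tidyr::pivot_wider().
First, we read the data and print the resulting data frame, which has two columns: one holding the data values and one holding the column names.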
textFile <- "col1|col2
XYZCo|Company
123 Main Street|Address
Yourtown, MA 12345|CityStZip
Joe Smith|Contact
20-234-56/3|Contract
Process for Work|Title
ZZTop Co|Company
123 Jefferson Street|Address
Chicago, IL 60636|CityStZip
Jane Doe|Contact
23-274-11/3|Contract
Yet Another One|Title"
data <- read.csv(text = textFile, header = TRUE, sep = "|")
data
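To convert the data frame to wide-format tidy data, we need to add an ID column that distinguishes one observation from another. To do this we can use dplyr::mutate() together with the ceiling() function. ceiling() is needed because we want the ID value to stay constant across each group of 6 input rows. When we divide the result of seq_along() by 6 and round up, we get the required vector.
Once the ID column has been added, pivoting to wide format is relatively simple.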
library(dplyr)
library(tidyr)
data %>%
  mutate(id = ceiling(seq_along(col1) / 6)) %>%
  # name id_cols explicitly; recent tidyr versions no longer accept it positionally
  pivot_wider(id_cols = id, names_from = col2, values_from = col1)
...and the output:
# A tibble: 2 x 7
id Company Address CityStZip Contact Contract Title
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 XYZCo 123 Main Street Yourtown, MA 1… Joe Smi… 20-234-56… Process for…
2 2 ZZTop Co 123 Jefferson S… Chicago, IL 60… Jane Doe 23-274-11… Yet Another…
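To make the grouping step concrete, here is a minimal sketch of how the id vector comes out for 12 rows (two contracts of 6 fields each). The second half is a hypothetical variation, not part of the answer above: it starts a new id at every "Company" label, so it does not rely on every contract having exactly 6 rows. The names n_fields, row_index, id_fixed, labels, and id_by_label are illustrative only.

```r
# Illustration only: the grouping id for 12 rows (2 contracts x 6 fields)
n_fields <- 6
row_index <- seq_len(12)
id_fixed <- ceiling(row_index / n_fields)  # rows 1-6 map to 1, rows 7-12 map to 2
print(id_fixed)

# Hypothetical alternative (an assumption, not from the answer):
# increment the id each time a "Company" label appears
labels <- rep(c("Company", "Address", "CityStZip",
                "Contact", "Contract", "Title"), times = 2)
id_by_label <- cumsum(labels == "Company")
print(id_by_label)
```

With complete data both vectors agree. If some contracts were missing a field, the label-based id would still keep each contract together, and pivot_wider() would fill the missing cell with NA.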
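The original answer, which treats the data as plain text with one value per line:
First, we use readLines() to read the data into a character vector.
Next, we loop through the vector and combine every 6 lines into one output record, using the pipe character as the separator because the data contains commas in the CityStZip field.
# read the raw data, one value per line
# (hypothetical input path; the source does not show this step)
dataVector <- readLines("./data/rawdata.txt")
# write to tempfile as pipe separated values
tmpFile <- "./data/tmpfile.csv"
# start from an empty file, since append = TRUE below would duplicate rows on a rerun
if (file.exists(tmpFile)) file.remove(tmpFile)
counter <- 0
outLine <- NULL
for (i in seq_along(dataVector)) {
  counter <- counter + 1
  if (counter == 1) outLine <- dataVector[i]
  else outLine <- paste(outLine, dataVector[i], sep = "|")
  if (counter == 6) {
    # a full record has been assembled: write it and reset
    cat(outLine, file = tmpFile, sep = "\n", append = TRUE)
    counter <- 0
    outLine <- NULL
  }
}
Finally, we read the pipe-separated file back in with read.csv() and supply the column names.
colNames <- c("Company", "Address", "CityStZip", "Contact", "Contract", "Title")
data <- read.csv(tmpFile, header = FALSE, sep = "|",
                 col.names = colNames)
data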
...and the output:
> data
Company Address CityStZip Contact Contract
1 XYZCo 123 Main Street Yourtown, MA 12345 Joe Smith 20-234-56/3
2 ZZTop Co 123 Jefferson Street Chicago, IL 60636 Jane Doe 23-274-11/3
Title
1 Process for Work
2 Yet Another One
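As a side note, when the values really do arrive as exactly 6 lines per contract, the same reshaping can be sketched in base R without a loop or a temporary file by filling a matrix row by row. This is an alternative technique, not the method used in the answer above. The sample vector is copied from the question's data, and the name wide is illustrative.

```r
# Sample values, one per line, taken from the question (labels omitted)
dataVector <- c("XYZCo", "123 Main Street", "Yourtown, MA 12345",
                "Joe Smith", "20-234-56/3", "Process for Work",
                "ZZTop Co", "123 Jefferson Street", "Chicago, IL 60636",
                "Jane Doe", "23-274-11/3", "Yet Another One")
colNames <- c("Company", "Address", "CityStZip", "Contact", "Contract", "Title")

# This shortcut only works when the length is an exact multiple of 6
stopifnot(length(dataVector) %% length(colNames) == 0)

# Fill the matrix row by row so each group of 6 values becomes one row
wide <- as.data.frame(
  matrix(dataVector, ncol = length(colNames), byrow = TRUE,
         dimnames = list(NULL, colNames)),
  stringsAsFactors = FALSE
)
print(wide)
```

The stopifnot() guard matters here: if a contract is missing a line, matrix() would silently shift every later value into the wrong column, so it is safer to fail early.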
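Can't seem to get the data to display in columns in the post :(
Welcome to StackOverflow, @Xtfer. Is either of the two potential data formats posted in my answer correct? If not, please use dput() to post a small sample of your data.
Given your comment below, I updated your question to make it reproducible. Please edit it to add the part with the for() loop, posting the actual code where you got stuck along with any error messages your code produced.
mutate! Adding an index was the trick. I tried a long "for loop", but the horizontal and vertical indexing kept getting more complicated. Always amazed at how R and its outstanding practitioners come up with elegant answers. Thanks!
@xtufer - if you found the answer useful, please accept it by clicking the check mark next to it, and upvote it.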