R: convert one column with multiple rows of data into multiple columns and rows
I have the output of web-scraped data in R, which looks like this:
Name1
Email: email1@xyz.com
City/Town: Location1
Name2
Email: email2@abc.com
City/Town: Location2
Name3
Email: email3@pqr.com
City/Town: Location3
Some names may have no email or no location. I want to convert the data above into a tabular format. The output should look like this:
Name   Email           City/Town
Name1  email1@xyz.com  Location1
Name2  email2@abc.com  Location2
Name3  email3@pqr.com  Location3
Name4                  Location4
Name5  email5@abc.com
Using:
This also works with real names. With @UweBlock's data, you get:
And with multiple keys per section, again with @UweBlock's data:
Used data:
# read the lines into a character vector (a bare connection cannot be used as a column)
txt <- readLines(textConnection("Name1
Email: email1@xyz.com
City/Town: Location1
Name2
Email: email2@abc.com
City/Town: Location2
Name3
Email: email3@pqr.com
City/Town: Location3
Name4
City/Town: Location4
Name5
Email: email5@abc.com"))
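The input data poses several challenges:
- The data is given as a plain character vector, not as a data.frame with predefined columns.
- Some rows consist of key/value pairs, with key and value separated by ": ".
- Other rows act as section headers. All key/value pairs in the rows that follow belong to one section until the next header is reached.
The code below relies on only two assumptions:
- A key/value pair contains one and only one ": ".
- Section headers contain none at all.
Multiple keys in a section (e.g., several rows with email addresses) are handled by specifying toString as the aggregation function to dcast.
Data with multiple keys per section: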
txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund",
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com",
"City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane",
"Email: email5@abc.com")
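The two assumptions above are enough to assign every row to a section. The following is a minimal base R sketch of that idea, not the answer's data.table code; the variable names (x, is_pair, section, long) and the shortened input are illustrative assumptions.

```r
# Shortened, hypothetical input in the same shape as the question's data
x <- c("Name1", "Email: email1@xyz.com", "City/Town: Location1",
       "Name4", "City/Town: Location4")

# Assumption 1: key/value rows contain ": ", section headers do not
is_pair <- grepl(": ", x, fixed = TRUE)

# Each header starts a new section, so a running count of headers
# gives the section id of every row
section <- cumsum(!is_pair)

# Look up the header text that belongs to each row
name <- x[!is_pair][section]

# Split the key/value rows into two columns (exactly one ": " per row assumed)
kv <- do.call(rbind, strsplit(x[is_pair], ": ", fixed = TRUE))

# Long format: one row per key/value pair, labelled with its section header
long <- data.frame(Name = name[is_pair], key = kv[, 1], value = kv[, 2],
                   stringsAsFactors = FALSE)
long
```

From this long format, any long-to-wide reshaping function can produce the requested table.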
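Using dplyr and tidyr, tested on the data provided by @Jaap (txt) and @UweBlock (txt1):
Notes:
The rn column (row number) is needed because spread() requires every row to be uniquely identified; without it, repeated keys cause an error.
Hopefully someone can suggest better/simpler code using only the tidyverse.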
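To illustrate the note about rn, here is a small sketch assuming tidyr is installed. The data frame long and the result names are made up for the illustration; the column names mytype and mytext follow the answer's code.

```r
library(tidyr)

# Hypothetical long data: the keys "Name" and "Email" each occur twice
long <- data.frame(
  mytype = c("Name", "Email", "Name", "Email"),
  mytext = c("Name1", "email1@xyz.com", "Name2", "email2@abc.com"),
  stringsAsFactors = FALSE
)

# Without a row id, nothing distinguishes the repeated keys, so spread() stops
res_no_rn <- tryCatch(
  spread(long, key = mytype, value = mytext),
  error = function(e) "error"
)

# With a unique row id, every input row becomes its own output row
long$rn <- seq_len(nrow(long))
res_rn <- spread(long, key = mytype, value = mytext)
res_rn
```

The result with rn has one row per input row and is mostly NA, which is why the answer fills the values within each name and then removes duplicate rows.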
Benchmark:
Code:
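Insert "\nName:" before each name and then read the result in with read.dcf. If the data comes from a file, replace textConnection(Lines) in the first line of code with the file name, e.g. "myfile.dat". No packages are used.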
L <- trimws(readLines(textConnection(Lines)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))
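Note: The input used is shown below. It is slightly modified from the question to show that the code works when Email or City/Town is missing: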
Lines <- "Name1
Email: email1@xyz.com
City/Town: Location1
Name2
City/Town: Location2
Name3
Email: email3@pqr.com"
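For the file case mentioned above, a sketch might look like the following. The temporary file stands in for a real file such as "myfile.dat"; the names tmp, con and res are assumptions made for this illustration.

```r
# Write the sample input to a temporary file (stand-in for a real data file)
tmp <- tempfile(fileext = ".dat")
writeLines(c("Name1", "Email: email1@xyz.com", "City/Town: Location1",
             "Name2", "City/Town: Location2",
             "Name3", "Email: email3@pqr.com"), tmp)

# Read from the file instead of from textConnection(Lines)
L <- trimws(readLines(tmp))

# Lines without a colon are names: turn them into a "Name:" field that is
# preceded by a blank line, so that each name starts a new DCF record
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])

con <- textConnection(L)
res <- read.dcf(con)   # character matrix, one row per record
close(con)
unlink(tmp)
res
```

Without all = TRUE, read.dcf returns a character matrix, so wrap the result in as.data.frame() if a data frame is needed.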
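Comments:

The source data looks like this: Name1 Email: email1@abc.com City/Town: Location1 Name2 Email: email2@xyz.com City/Town: Location2 Name3 Email: email3@pqr.com City/Town: Location3

Can you provide it?

Simpler: data_frame(text = txt) %>% separate_rows(text, sep = '\n') %>% separate(text, c('var', 'val'), sep = ': ', fill = 'left') %>% mutate(entry = cumsum(is.na(var)), var = coalesce(var, 'Name')) %>% spread(var, val) %>% select(4:2), where txt is a character vector or a path. Or use data_frame(text = read_lines(txt)) instead of separate_rows.

@alistaire That does not work with txt2: duplicate rows error. Maybe add rn? Also, maybe post it as a new answer, or I could add it to my answer?

You could group and summarise with toString, e.g. data_frame(text = txt2) %>% separate_rows(text, sep = '\n') %>% separate(text, c('var', 'val'), sep = ': ', fill = 'left') %>% mutate(entry = cumsum(is.na(var)), var = coalesce(var, 'Name')) %>% group_by(entry, var) %>% summarise(val = toString(val)) %>% spread(var, val) %>% ungroup() %>% select(4:2), or add an index to the keys, but I don't really like either option. Go ahead and add it if you like.

All the other answers solve the Y problem, this answer solves X! Great answer!

Although the read.dcf function is available in base R, I guess it is not very well known. It nicely shows the benefit of learning base R.

If the argument all = TRUE is used with read.dcf, the solution can also handle duplicate entries, e.g. multiple email addresses, like those in the txt2 sample data set.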
library(data.table)
# coerce to data.table
data.table(text = txt)[
# split key/value pairs in columns
, tstrsplit(text, ": ")][
# pick section headers and create new column
is.na(V2), Name := V1][
# fill in Name into the rows below
, Name := zoo::na.locf(Name)][
# reshape key/value pairs from long to wide format using Name as row id
!is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")]
Name City/Town Email
1: Name1 Location1 email1@xyz.com
2: Name2 Location2 email2@abc.com
3: Name3 Location3 email3@pqr.com
4: Name4 Location4 NA
5: Name5 NA email5@abc.com
txt <- c("Name1", "Email: email1@xyz.com", "City/Town: Location1", "Name2",
"Email: email2@abc.com", "City/Town: Location2", "Name3", "Email: email3@pqr.com",
"City/Town: Location3", "Name4", "City/Town: Location4", "Name5",
"Email: email5@abc.com")
txt1 <- c("John Doe", "Email: email1@xyz.com", "City/Town: Location1", "Save the World Fund",
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com",
"City/Town: Location3", "Mother", "City/Town: Location4", "Jane",
"Email: email5@abc.com")
Name City/Town Email
1: Best Shoes Ltd. Location3 email3@pqr.com
2: Jane NA email5@abc.com
3: John Doe Location1 email1@xyz.com
4: Mother Location4 NA
5: Save the World Fund Location2 email2@abc.com
txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund",
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com",
"City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane",
"Email: email5@abc.com")
                  Name             City/Town                          Email
1:     Best Shoes Ltd.             Location3                 email3@pqr.com
2:                Jane                                       email5@abc.com
3:            John Doe             Location1 email1@xyz.com, email1@abc.com
4:              Mother Location4, everywhere
5: Save the World Fund             Location2                 email2@abc.com
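library(dplyr)
library(tidyr)
# tibble(txt = txt1) %>%
tibble(txt = txt) %>%
  # prefix header rows with "Name:" so that every row is a key/value pair
  mutate(txt = if_else(grepl(":", txt), txt, paste("Name:", txt)),
         # unique row id, required by spread()
         rn = row_number()) %>%
  # split on ": " so the values carry no leading space
  separate(txt, into = c("mytype", "mytext"), sep = ": ") %>%
  spread(key = mytype, value = mytext) %>%
  select(-rn) %>%
  # carry each name down to the rows of its section
  fill(Name) %>%
  group_by(Name) %>%
  # copy the values to all rows of a name, then drop the duplicate rows
  fill(1:2, .direction = "down") %>%
  fill(1:2, .direction = "up") %>%
  unique() %>%
  ungroup() %>%
  select(3:1)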
# # A tibble: 5 x 3
# Name Email `City/Town`
# <chr> <chr> <chr>
# 1 Name1 email1@xyz.com Location1
# 2 Name2 email2@abc.com Location2
# 3 Name3 email3@pqr.com Location3
# 4 Name4 <NA> Location4
# 5 Name5 email5@abc.com <NA>
library(microbenchmark)
library(data.table)
library(dplyr)
library(tidyr)
microbenchmark(ans.uwe = data.table(text = txt2)[, tstrsplit(text, ": ")
][is.na(V2), Name := V1
][, Name := zoo::na.locf(Name)
][!is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")],
ans.zx8754 = data_frame(txt = txt2) %>%
mutate(txt = ifelse(grepl(":", txt), txt, paste("Name:", txt)),
rn = row_number()) %>%
separate(txt, into = c("mytype", "mytext"), sep = ":") %>%
spread(key = mytype, value = mytext) %>%
select(-rn) %>%
fill(Name) %>%
group_by(Name) %>%
fill(1:2, .direction = "down") %>%
fill(1:2, .direction = "up") %>%
unique() %>%
ungroup() %>%
select(3:1),
ans.jaap = data.table(txt = txt2)[!grepl(':', txt), name := txt
][, name := zoo::na.locf(name)
][grepl('^Email:', txt), email := sub('Email: ','',txt)
][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')],
ans.G.Grothendieck = {
L <- trimws(readLines(textConnection(txt2)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))},
times = 1000)
Unit: microseconds
expr min lq mean median uq max neval cld
ans.uwe 4243.754 4885.4765 5305.8688 5139.0580 5390.360 92604.820 1000 c
ans.zx8754 39683.911 41771.2925 43940.7646 43168.4870 45291.504 130965.088 1000 d
ans.jaap 2153.521 2488.0665 2788.8250 2640.1580 2773.150 91862.177 1000 b
ans.G.Grothendieck 266.268 304.0415 332.6255 331.8375 349.797 721.261 1000 a
L <- trimws(readLines(textConnection(Lines)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))
Name Email City/Town
[1,] "Name1" "email1@xyz.com" "Location1"
[2,] "Name2" NA "Location2"
[3,] "Name3" "email3@pqr.com" NA
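One of the comments notes that read.dcf with all = TRUE can also handle repeated fields such as several email addresses. The following is a sketch under that assumption, using a shortened version of the txt2 data; the names con and res are illustrative.

```r
# Shortened txt2-style input with repeated Email and City/Town fields
txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com",
          "City/Town: Location1", "Mother", "City/Town: Location4",
          "City/Town: everywhere", "Jane", "Email: email5@abc.com")

L <- trimws(txt2)
ix <- !grepl(":", L)               # lines without a colon are names
L[ix] <- paste("\nName:", L[ix])   # start a new record at each name

con <- textConnection(L)
# all = TRUE gathers every occurrence of a field within a record and
# returns a data frame rather than a character matrix
res <- read.dcf(con, all = TRUE)
close(con)
res
```

Fields that occur more than once in a record should come back as list elements, so a column such as Email may need unlist() or sapply(res$Email, toString) before further processing.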