R: Convert single-column, multi-row data into multi-column, multi-row data (tags: r, reshape)


I have the output of web-scraped data in R that looks like this:

Name1
Email: email1@xyz.com
City/Town: Location1
Name2
Email: email2@abc.com
City/Town: Location2
Name3
Email: email3@pqr.com
City/Town: Location3

Some names may have no email or no location. I want to convert the data above into a tabular format. The output should look like this:

Name      Email           City/Town
Name1   email1@xyz.com  Location1
Name2   email2@abc.com  Location2
Name3   email3@pqr.com  Location3
Name4                   Location4
Name5   email5@abc.com

Using:

This also works with real names. With @UweBlock's data you get:

And with multiple keys per section, again with @UweBlock's data:

Used data:

# read the lines into a character vector (a bare connection cannot be used as a column)
txt <- readLines(textConnection("Name1
Email: email1@xyz.com
City/Town: Location1
Name2
Email: email2@abc.com
City/Town: Location2
Name3
Email: email3@pqr.com
City/Town: Location3
Name4
City/Town: Location4
Name5
Email: email5@abc.com"))

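The input data pose several challenges:

- The data are given as a plain character vector rather than as a data.frame with predefined columns.
- Some rows consist of key/value pairs, with key and value separated by ": ".
- The other rows act as section headers. All key/value pairs in the rows that follow belong to one section until the next header is reached.

The code below relies on only two assumptions:

- Key/value pairs contain one and only one ":".
- Section headers contain none at all.

Multiple keys within a section (e.g., several rows with email addresses) are handled by specifying toString as the aggregation function for dcast().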
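The two assumptions above can be illustrated with a minimal base R sketch. This is not the data.table code from the answer; the sample vector and the variable names (is_header, section, wide) are made up for illustration, and tapply() stands in for dcast().

```r
# Small sample in the same layout as the question (headers have no colon).
txt <- c("Name1", "Email: email1@xyz.com", "City/Town: Location1",
         "Name4", "City/Town: Location4",
         "Name5", "Email: email5@abc.com", "Email: other5@abc.com")

# Assumption: a line without ":" is a section header.
is_header <- !grepl(":", txt, fixed = TRUE)

# Every line belongs to the most recent header seen so far.
section <- cumsum(is_header)
name <- txt[is_header][section]

# Assumption: key/value lines contain exactly one ":"; split on it.
kv <- txt[!is_header]
key <- trimws(sub(":.*$", "", kv))
value <- trimws(sub("^[^:]*:", "", kv))

# Long-to-wide reshape; toString collapses repeated keys within a section.
wide <- tapply(value, list(Name = name[!is_header], Key = key), toString)
print(wide)
```

The result is a character matrix with one row per section and NA where a key is absent. In this sketch a header that has no key/value lines at all would not appear in the result.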

Data

Or, with multiple keys per section:

txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund", 
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com", 
"City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane", 
"Email: email5@abc.com")

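Using dplyr and tidyr, tested on the data provided by @Jaap (txt) and @UweBlock (txt1):

Note:

The rn (row number) column is needed so that spread() gets a unique identifier for each row. Hopefully someone can suggest better or simpler code that uses only the tidyverse.

Benchmark:

Code:

Insert \nName: before each name and then read the result in with read.dcf. If the data comes from a file, replace textConnection(Lines) in the first line of the code with the file name, e.g. "myfile.dat". No packages are used.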

L <- trimws(readLines(textConnection(Lines)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))

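Note: The input used is shown below. It is slightly modified from the question to show that the code works when Email or City/Town is missing: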

Lines <- "Name1
Email: email1@xyz.com
City/Town: Location1
Name2
City/Town: Location2
Name3
Email: email3@pqr.com"

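Comments:

Does the source data look like this? Name1 Email: email1@abc.com City/Town: Location1 Name2 Email: email2@xyz.com City/Town: Location2 Name3 Email: email3@pqr.com City/Town: Location3. Could you provide it?

Simpler: data_frame(text = txt) %>% separate_rows(text, sep = '\n') %>% separate(text, c('var', 'val'), sep = ': ', fill = 'left') %>% mutate(entry = cumsum(is.na(var)), var = coalesce(var, 'Name')) %>% spread(var, val) %>% select(4:2), where txt is a character vector or a path. Or use data_frame(text = read_lines(txt)) instead of separate_rows.

@alistaire That does not work with txt2: it fails with a duplicate rows error. Maybe add rn? Also, maybe post it as a new answer, or I can add it to mine?

You could group and summarise with toString, e.g. data_frame(text = txt2) %>% separate_rows(text, sep = '\n') %>% separate(text, c('var', 'val'), sep = ': ', fill = 'left') %>% mutate(entry = cumsum(is.na(var)), var = coalesce(var, 'Name')) %>% group_by(entry, var) %>% summarise(val = toString(val)) %>% spread(var, val) %>% ungroup() %>% select(4:2), or add an index to the keys, but I am not fond of either option. Go ahead and add it if you like.

All the other answers solve the Y problem; this one solves X! Great answer!

Although read.dcf is available in base R, I guess it is not a very well-known function. It nicely shows the benefit of learning base R.

If the argument all = TRUE is used with read.dcf, the solution can also handle duplicate entries, e.g. multiple email addresses as in the txt2 sample data set.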
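The last comment, about all = TRUE, can be sketched as follows. The input is a shortened version of the txt2 data from the thread. The collapse_cell helper and the flat data frame are hypothetical additions for illustration and are not part of the original answer.

```r
# Sample with repeated keys, shortened from the txt2 data in the thread.
txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com",
          "City/Town: Location1", "Mother", "City/Town: Location4",
          "City/Town: everywhere", "Jane", "Email: email5@abc.com")

# Same preprocessing as in the answer: prefix each header with a blank
# line and a "Name:" key so that every section becomes one DCF record.
L <- trimws(txt2)
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])

# all = TRUE keeps every occurrence of a repeated field; the result is a
# data frame in which such fields are stored as list columns.
con <- textConnection(L)
res <- read.dcf(con, all = TRUE)
close(con)

# Hypothetical helper (not from the thread): collapse a cell to one string,
# keeping NA for fields that are missing from a record.
collapse_cell <- function(x) if (all(is.na(x))) NA_character_ else toString(x)

flat <- res
flat[] <- lapply(res, function(col) {
  vapply(col, collapse_cell, character(1), USE.NAMES = FALSE)
})
print(flat)
```

Flattening with toString mirrors what the data.table answer does with its aggregation function; keeping the list columns in res is an alternative if the individual values are needed later.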

# read the lines into a character vector so they can be used as a data.table column
txt <- readLines(textConnection("Name1
Email: email1@xyz.com
City/Town: Location1
Name2
Email: email2@abc.com
City/Town: Location2
Name3
Email: email3@pqr.com
City/Town: Location3
Name4
City/Town: Location4
Name5
Email: email5@abc.com"))

library(data.table)
# coerce to data.table
data.table(text = txt)[
  # split key/value pairs in columns
  , tstrsplit(text, ": ")][
    # pick section headers and create new column 
    is.na(V2), Name := V1][
      # fill in Name into the rows below
      , Name := zoo::na.locf(Name)][
        # reshape key/value pairs from long to wide format using Name as row id
        !is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")]

    Name City/Town          Email
1: Name1 Location1 email1@xyz.com
2: Name2 Location2 email2@abc.com
3: Name3 Location3 email3@pqr.com
4: Name4 Location4             NA
5: Name5        NA email5@abc.com

txt <- c("Name1", "Email: email1@xyz.com", "City/Town: Location1", "Name2", 
"Email: email2@abc.com", "City/Town: Location2", "Name3", "Email: email3@pqr.com", 
"City/Town: Location3", "Name4", "City/Town: Location4", "Name5", 
"Email: email5@abc.com")

txt1 <- c("John Doe", "Email: email1@xyz.com", "City/Town: Location1", "Save the World Fund", 
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com", 
"City/Town: Location3", "Mother", "City/Town: Location4", "Jane", 
"Email: email5@abc.com")

                  Name City/Town          Email
1:     Best Shoes Ltd. Location3 email3@pqr.com
2:                Jane        NA email5@abc.com
3:            John Doe Location1 email1@xyz.com
4:              Mother Location4             NA
5: Save the World Fund Location2 email2@abc.com

txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund", 
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com", 
"City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane", 
"Email: email5@abc.com")

                  Name             City/Town                          Email
1:     Best Shoes Ltd.             Location3                 email3@pqr.com
2:                Jane                                       email5@abc.com
3:            John Doe             Location1 email1@xyz.com, email1@abc.com
4:              Mother Location4, everywhere                               
5: Save the World Fund             Location2                 email2@abc.com

library(dplyr)
library(tidyr)

# tibble() replaces the deprecated data_frame()
# tibble(txt = txt1) %>%
tibble(txt = txt) %>% 
  mutate(txt = if_else(grepl(":", txt), txt, paste("Name:", txt)),
         rn = row_number()) %>% 
  separate(txt, into = c("mytype", "mytext"), sep = ":") %>% 
  spread(key = mytype, value = mytext) %>% 
  select(-rn) %>% 
  fill(Name) %>% 
  group_by(Name) %>% 
  fill(1:2, .direction = "down") %>% 
  fill(1:2, .direction = "up") %>% 
  unique() %>% 
  ungroup() %>% 
  select(3:1)

# # A tibble: 5 x 3
#     Name           Email `City/Town`
#    <chr>           <chr>       <chr>
# 1  Name1  email1@xyz.com   Location1
# 2  Name2  email2@abc.com   Location2
# 3  Name3  email3@pqr.com   Location3
# 4  Name4            <NA>   Location4
# 5  Name5  email5@abc.com        <NA>

txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund", 
          "Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com", 
          "City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane", 
          "Email: email5@abc.com")

library(microbenchmark)
library(data.table)
library(dplyr)
library(tidyr)

microbenchmark(ans.uwe = data.table(text = txt2)[, tstrsplit(text, ": ")
                                                 ][is.na(V2), Name := V1
                                                   ][, Name := zoo::na.locf(Name)
                                                     ][!is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")],
               ans.zx8754 = data_frame(txt = txt2) %>% 
                 mutate(txt = ifelse(grepl(":", txt), txt, paste("Name:", txt)),
                        rn = row_number()) %>% 
                 separate(txt, into = c("mytype", "mytext"), sep = ":") %>% 
                 spread(key = mytype, value = mytext) %>% 
                 select(-rn) %>% 
                 fill(Name) %>% 
                 group_by(Name) %>% 
                 fill(1:2, .direction = "down") %>% 
                 fill(1:2, .direction = "up") %>% 
                 unique() %>% 
                 ungroup() %>% 
                 select(3:1),
               ans.jaap = data.table(txt = txt2)[!grepl(':', txt), name := txt
                                                 ][, name := zoo::na.locf(name)
                                                   ][grepl('^Email:', txt), email := sub('Email: ','',txt)
                                                     ][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
                                                       ][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')],
               ans.G.Grothendieck = {
                 L <- trimws(readLines(textConnection(txt2)))
                 ix <- !grepl(":", L)
                 L[ix] <- paste("\nName:", L[ix])
                 read.dcf(textConnection(L))},
               times = 1000)

Unit: microseconds
               expr       min         lq       mean     median        uq        max neval  cld
            ans.uwe  4243.754  4885.4765  5305.8688  5139.0580  5390.360  92604.820  1000   c 
         ans.zx8754 39683.911 41771.2925 43940.7646 43168.4870 45291.504 130965.088  1000    d
           ans.jaap  2153.521  2488.0665  2788.8250  2640.1580  2773.150  91862.177  1000  b  
 ans.G.Grothendieck   266.268   304.0415   332.6255   331.8375   349.797    721.261  1000 a

L <- trimws(readLines(textConnection(Lines)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))

     Name    Email            City/Town  
[1,] "Name1" "email1@xyz.com" "Location1"
[2,] "Name2" NA               "Location2"
[3,] "Name3" "email3@pqr.com" NA
