Extract numbers and the text that follows them, and create multiple new columns in R

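I have a lot of free-text data about specific issues that I would like to organize as shown below.

If a respondent refers to a topic by number, I can create columns recording which topics a comment mentions, but I would like a way to extract all of the text that follows each number, up to the point where another number appears.

Thanks in advance for your help.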

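library(tidyverse, warn.conflicts = F)

Data

df <- [...] %>% unnest_wider(col = mention, names_sep = "_")

Ideal output

df_ideal [...]

One option uses strsplit to split at one or more spaces (\\s+) that follow a dot (\\. - the . is a metacharacter, so it is escaped) and precede a digit or a # (regex lookarounds). We then loop over the list output with lapply, use sub to remove all non-digits (\\D+) from the start of each string, rbind the list elements, and assign the result to new 'comment_' columns in the original dataset 'df'.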

df[paste0('comment_', 1:3)] <- do.call(rbind, lapply(strsplit(df$comment, 
      "(?<=\\.)\\s+(?=[0-9#])", perl = TRUE), function(x) sub("^\\D+", "", x)))
-output

df
# A tibble: 2 x 7
  comment                                                      mention_1 mention_2 mention_3 comment_1          comment_2                   comment_3    
  <chr>                                                            <dbl>     <dbl>     <dbl> <chr>              <chr>                       <chr>        
1 topic 1: this is fine. 4 this is fine too. #9 not so good            1         4         9 1: this is fine.   4 this is fine too.         9 not so good
2 1 ok this is fine. 17 i do not like this idea. 25 great idea         1        17        25 1 ok this is fine. 17 i do not like this idea. 25 great idea
df
# A tibble: 2 x 9
  comment                                                             mention_1 mention_2 mention_3 mention_4 comment_1          comment_2                 comment_3    comment_4  
  <chr>                                                                   <dbl>     <dbl>     <dbl>     <dbl> <chr>              <chr>                     <chr>        <chr>      
1 topic 1: this is fine. 4 this is fine too. #9 not so good                   1         4         9        NA 1: this is fine.   4 this is fine too.       9 not so go… <NA>       
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 co…         1        17        25        43 1 ok this is fine. 17 i do not like this id… 25 great id… 43 cool id…
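The hard-coded `1:3` in the code above fits only the three-topic example. The following is a minimal base R sketch of one way to cope with a varying number of topics per comment. The split pattern `\\s+(?=#?[0-9])` and the names `parts`, `n_max`, and `padded` are assumptions made for illustration, not the answerer's updated code, and any number that appears inside a comment's text would also start a new piece.

```r
# Example data copied from the printed output above
df <- data.frame(
  comment = c(
    "topic 1: this is fine. 4 this is fine too. #9 not so good",
    "1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"
  ),
  stringsAsFactors = FALSE
)

# Assumed pattern: split at whitespace that is followed by a digit,
# optionally preceded by '#'
parts <- strsplit(df$comment, "\\s+(?=#?[0-9])", perl = TRUE)

# Strip leading non-digits from each piece, then drop pieces that
# end up empty (for example the leading word "topic")
parts <- lapply(parts, function(x) {
  x <- sub("^\\D+", "", x)
  x[nzchar(x)]
})

# Pad every row with NA up to the longest row so rbind() receives
# vectors of equal length
n_max <- max(lengths(parts))
padded <- lapply(parts, function(x) {
  length(x) <- n_max
  x
})

df[paste0("comment_", seq_len(n_max))] <- do.call(rbind, padded)
```

Rows with fewer topics than the longest row get `NA` in the trailing `comment_` columns, which matches the shape of the four-column output shown above.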

A base R approach is simply to read the data in again:

read.table(text = gsub("(\\d+)","&\\1",df$comment), sep = "&", fill = TRUE,
           comment.char = "", header = FALSE, strip.white = TRUE, na.strings = "")[,-1]
                  V2                          V3            V4           V5
1   1: this is fine.       4 this is fine too. # 9 not so good         <NA>
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea
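To see why this works, it may help to look at the intermediate string that `gsub()` produces before `read.table()` parses it. A small sketch using one comment from the example (the variable names `x` and `marked` are illustrative only):

```r
x <- "1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"

# Put "&" in front of every run of digits; read.table() later
# splits the text on that character via sep = "&"
marked <- gsub("(\\d+)", "&\\1", x)
print(marked)
# "&1 ok this is fine. &17 i do not like this idea. &25 great idea &43 cool idea"
```

The first field holds whatever precedes the first number (nothing here, `topic` in the other row), which is why the answer drops it with `[,-1]`. The approach also assumes that `&` never occurs in the comments themselves.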

You can stay with your own extract-and-unnest approach:

library(tidyverse)


df %>%
  mutate(mention = map(str_extract_all(comment, "[0-9]+"), as.numeric), 
         new_comment = str_extract_all(comment, '\\d+.*?(?=\\d|$)')) %>%
  unnest_wider(col = new_comment, names_sep = "_") %>%
  unnest_wider(col = mention, names_sep = "_")

#                                                                    comment
#1                 topic 1: this is fine. 4 this is fine too. #9 not so good
#2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea

#  mention_1 mention_2 mention_3 mention_4       new_comment_1
#1         1         4         9        NA   1: this is fine. 
#2         1        17        25        43 1 ok this is fine. 

#                 new_comment_2  new_comment_3 new_comment_4
#1        4 this is fine too. #  9 not so good          <NA>
#2 17 i do not like this idea.  25 great idea   43 cool idea
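The pattern `\\d+.*?(?=\\d|$)` reads as: one or more digits, then as few characters as possible, up to (but not including) the next digit or the end of the string. Below is a minimal base R sketch of the same pattern on a single string. It uses `regmatches()` and `gregexpr()` in place of `str_extract_all()`, a swap made here only so that it runs without extra packages; the names `x` and `pieces` are illustrative.

```r
x <- "1 ok this is fine. 17 i do not like this idea. 25 great idea"

# perl = TRUE is required for the lookahead (?=...)
pieces <- regmatches(x, gregexpr("\\d+.*?(?=\\d|$)", x, perl = TRUE))[[1]]

# Each piece keeps the space before the next number, so trim it
pieces <- trimws(pieces)
print(pieces)
```

Because the lookahead stops at any digit, a number inside the comment text itself would also end the current piece.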

Thanks for the quick answer; I always appreciate your fast and impressive solutions. This does work, but my data will have a varying number of responses. Some people may respond to just one issue, while others may respond to 10 different issues, so it does not work on my real data. I am updating the example.

@Matt please check the update

@Matt in the update, are you also splitting at a space, and not only after the `.`?

@Matt i.e. 25 great idea 43 cool idea

That is exactly what I had in mind. I really appreciate your help. I will probably be back soon with another question...