R: extract text substrings between two repeated strings [r, regex, stringr]


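I created a data frame using readtext(). It has two columns: doc_id and text. For each row (doc_id), I want to extract a substring (in my case, the name of a government department) that sits between two strings which repeat n times in the text column. For example: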

documents <- data.frame(doc_id = c("doc_1", "doc_2"),
                        text = c("PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 2 Department of Forestry \n Matters \n Blah blah blah", "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 3 Department of Health \n Matters \n Blah blah blah \n PART 5 Department of Sport \n Matters \n Blah blah"))

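Essentially, I want to extract the strings between PART and Matters. I would like to use a dplyr::rowwise operation on the data frame, but I don't know how to extract multiple times between two repeated strings.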

I can't think of a rowwise solution right now, but maybe this helps as well:

library(dplyr)
documents %>%
  mutate(text=strsplit(as.character(text), 'PART ')) %>%
  tidyr::unnest(text) %>%
  mutate(text=trimws(sub('\\d+ (.*) Matters.*', '\\1', text))) %>%
  filter(text != '') %>%
  group_by(doc_id) %>%
  summarise(text=paste(text, collapse=', '))
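
As a rough illustration of what the split and cut steps in the pipeline above do to a single document, the following base R sketch applies the same strsplit and sub calls to one string from the question. The variable names txt, pieces and departments are illustrative only and do not come from the original answer.

```r
# One document's text, copied from the example data in the question
txt <- "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 2 Department of Forestry \n Matters \n Blah blah blah"

# Step 1: split at "PART "; the first piece is empty because the text
# starts with the delimiter
pieces <- strsplit(txt, "PART ", fixed = TRUE)[[1]]

# Step 2: keep only what sits between the part number and " Matters"
# (with R's default regex engine, "." also matches a newline, so the
# trailing ".*" swallows the rest of the piece)
departments <- trimws(sub("\\d+ (.*) Matters.*", "\\1", pieces))

# Step 3: drop the empty leading piece
departments <- departments[departments != ""]

print(departments)
```

In the pipeline itself, tidyr::unnest plays the role of the `[[1]]` indexing here by spreading the pieces of every document over separate rows.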

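It basically splits all the text at PART, so that each element can be handled separately and the relevant text cut out of the longer string. Afterwards, everything is joined back together by doc_id.

We can use str_match_all from stringr to extract the words between "PART" and "Matters". It returns a list of two-column matrices; we select the second column, which holds the capture group, and use toString to combine the values into a single comma-separated string.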

out <- stringr::str_match_all(documents$text, "PART \\d+ (.*) \n Matters")
sapply(out, function(x) toString(x[, 2]))

#[1] "Department of Communications, Department of Forestry"                   
#[2] "Department of Communications, Department of Health, Department of Sport"
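
For comparison, here is a roughly equivalent sketch in base R for the case where stringr is not available. It stores the result as a new column of the data frame. The column name departments and the helper extract_departments are assumptions made for illustration, not part of the original answers.

```r
# Example data from the question
documents <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c(
    "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 2 Department of Forestry \n Matters \n Blah blah blah",
    "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 3 Department of Health \n Matters \n Blah blah blah \n PART 5 Department of Sport \n Matters \n Blah blah"
  ),
  stringsAsFactors = FALSE
)

# Find every "PART <n> <name> \n Matters" block; [^\n]* keeps the name on one line
hits <- regmatches(
  documents$text,
  gregexpr("PART \\d+ [^\n]* \n Matters", documents$text)
)

# Hypothetical helper: strip the surrounding markers from each block and
# join the remaining names into one comma-separated string
extract_departments <- function(blocks) {
  toString(sub("^PART \\d+ (.*) \n Matters$", "\\1", blocks))
}

# vapply returns exactly one string per row, so no rowwise() is needed
documents$departments <- vapply(hits, extract_departments, character(1))

print(documents[, c("doc_id", "departments")])
```

A document without any match ends up with an empty string in the new column, since toString on an empty vector returns "".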

# Import the tidyverse (provides stringr and dplyr)
library(tidyverse)

# Use a helper variable to store the departments extracted with the pattern
Helper <- str_extract_all(string = documents$text, pattern = "Department.*\\n")

# Clean up: turn each trailing " \n" into ", ", paste the pieces together, and drop the final comma
Helper1 <- lapply(Helper, FUN = str_replace_all, pattern = " \\n", replacement = ", ")
documents$Departments <- str_replace(str_trim(unlist(lapply(Helper1, FUN = paste, collapse = ""))), pattern = ",$", replacement = "")

# Remove the original text column
documents <- select(documents, -c("text"))