R: extract sentences containing a pattern


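I have a dataset containing unstructured text data.

From the text, I want to extract the sentences that contain any of the following words: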

education_vector <- c("university", "academy", "school", "college")

Here is one approach using grep:

education <- c("university", "academy", "school", "college")

str1 <- "I am a student at the University of Wyoming. My major is biology."
str2 <- "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College"
str1 <- tolower(str1) # we use tolower because "university" != "University"
str2 <- tolower(str2)

grep(paste(education, collapse = "|"), unlist(strsplit(str1, "(?<=\\.)\\s+",
                                                       perl = TRUE)),
     value = TRUE)

grep(paste(education, collapse = "|"), unlist(strsplit(str2, "(?<=\\.)\\s+",
                                                       perl = TRUE)),
     value = TRUE)
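
The same split-then-filter idea can be wrapped in a small helper so that it runs over a whole vector of texts. This is only a sketch: the function name extract_sentences and its first_only argument are hypothetical, and it uses ignore.case = TRUE instead of tolower so that the returned sentences keep their original capitalization.

```r
education <- c("university", "academy", "school", "college")

# Hypothetical helper: split each text into sentences and keep those that
# mention any keyword. With first_only = TRUE, keep only the first hit.
extract_sentences <- function(texts, keywords, first_only = FALSE) {
  pattern <- paste(keywords, collapse = "|")
  lapply(texts, function(txt) {
    # Split on whitespace that follows a period (lookbehind needs perl = TRUE)
    sentences <- unlist(strsplit(txt, "(?<=\\.)\\s+", perl = TRUE))
    hits <- grep(pattern, sentences, value = TRUE, ignore.case = TRUE)
    if (first_only) head(hits, 1) else hits
  })
}

texts <- c("I am a student at the University of Wyoming. My major is biology.",
           "First, I went to the Bowdoin College. Then I went to the University of California.")

result <- extract_sentences(texts, education, first_only = TRUE)
```

The result is a list with one element per input text, so a text without any matching sentence yields character(0) rather than being dropped.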
Answer edited to select the first match.

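Explanation:

.*? is a non-greedy match that runs up to the rest of the pattern. It removes any sentences that come before the relevant one.

([^\\.]*(university|academy|school|college)[^\\.]*) matches any run of characters other than a period immediately before and after one of the keywords.

.* handles anything that follows the relevant sentence.

Together, this replaces the entire string with only the relevant part.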
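
To avoid hard-coding the keywords, the pattern described above could be assembled from the keyword vector. The following is a minimal sketch, assuming the keywords contain no regex metacharacters. The name build_pattern is hypothetical, and perl = TRUE plus trimws are choices made here for illustration, not part of the original answer.

```r
education <- c("university", "academy", "school", "college")

# Hypothetical helper: build the three-part pattern from a keyword vector
build_pattern <- function(keywords) {
  alternation <- paste(keywords, collapse = "|")
  paste0(".*?",                                   # lazily skip earlier sentences
         "([^\\.]*(", alternation, ")[^\\.]*)",   # period-free run around a keyword
         ".*")                                    # consume the remainder
}

pattern <- build_pattern(education)
text <- "My major is biology. I study at a college. It is a school."

# Keep only capture group 1, then strip the leading space left by the split
first_hit <- trimws(sub(pattern, "\\1", text, ignore.case = TRUE, perl = TRUE))
```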

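Please include in your question how you are calling grep.
grep(paste(education_vector, collapse = "|"), unlist(strsplit(str1, "(... How do I get the first match?
My example is just a demonstration. Your code works, but in reality my text is large, and sometimes it does not return the first match.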
texts = c("I am a student at the University of Wyoming. My major is biology.",
"I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College",
"First, I went to the Bowdoin College. Then I went to the University of California.")

gsub(".*?([^\\.]*(university|academy|school|college)[^\\.]*).*", 
    "\\1", texts, ignore.case=TRUE)

[1] "I am a student at the University of Wyoming"   
[2] " I graduated from Walla Wall Community College"
[3] "First, I went to the Bowdoin College"
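
One thing to watch with the gsub approach: when a text contains none of the keywords, gsub returns that text unchanged, which can be mistaken for a match. A possible alternative, sketched here with base R's regexpr and regmatches, returns NA for such texts. The variable names are illustrative, and perl = TRUE and trimws are assumptions added for this sketch.

```r
texts <- c("I am a student at the University of Wyoming. My major is biology.",
           "I like numbers. I like statistics.")

# A period-free run of characters that contains one of the keywords
pattern <- "[^.]*(university|academy|school|college)[^.]*"

# regexpr reports the first match per element, or -1 when there is none
m <- regexpr(pattern, texts, ignore.case = TRUE, perl = TRUE)

first_match <- rep(NA_character_, length(texts))
# regmatches returns matches only for elements that matched
first_match[m != -1] <- trimws(regmatches(texts, m))
```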