R: extracting sentences that match a pattern

r regex


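I have a dataset containing unstructured text data.

From the text, I want to extract the sentences that contain any of the following words: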

education_vector <- c("university", "academy", "school", "college")

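Here is an approach using grep. Each string is split into sentences after every period, and only the sentences that contain one of the keywords are kept: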

education <- c("university", "academy", "school", "college")

str1 <- "I am a student at the University of Wyoming. My major is biology."
str2 <- "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College"
str1 <- tolower(str1) # we use tolower because "university" != "University"
str2 <- tolower(str2)

grep(paste(education, collapse = "|"), unlist(strsplit(str1, "(?<=\\.)\\s+",
                                                       perl = TRUE)),
     value = TRUE)

grep(paste(education, collapse = "|"), unlist(strsplit(str2, "(?<=\\.)\\s+",
                                                       perl = TRUE)),
     value = TRUE)
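
The same idea can be wrapped in a small function and applied to a whole vector of texts. The following is only a sketch: the name extract_sentences is hypothetical and not part of the original answer, and it swaps tolower for ignore.case = TRUE so that the returned sentences keep their original casing.

```r
# Hypothetical helper (not from the original answer): returns every sentence
# in `text` that mentions one of `keywords`, keeping the original casing.
extract_sentences <- function(text, keywords) {
  # Split on whitespace that follows a period, as in the grep calls above.
  sentences <- unlist(strsplit(text, "(?<=\\.)\\s+", perl = TRUE))
  pattern <- paste(keywords, collapse = "|")
  grep(pattern, sentences, value = TRUE, ignore.case = TRUE)
}

education <- c("university", "academy", "school", "college")
texts <- c("I am a student at the University of Wyoming. My major is biology.",
           "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College")

# One character vector of matching sentences per input text.
matches <- lapply(texts, extract_sentences, keywords = education)
```

Because the split relies on a period followed by whitespace, abbreviations such as "Dr. Smith" would be cut in two; real text may need a more careful sentence splitter.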

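Edited the answer to select the first match.

Explanation of the pattern used in the gsub call:

.*? is a non-greedy match placed in front of the rest of the pattern. It consumes any sentences that come before the relevant sentence.
([^\\.]*(university|academy|school|college)[^\\.]*) matches a run of characters that are not periods immediately before and after one of the keywords.
.* consumes anything that follows the relevant sentence.

This replaces the whole string with only the relevant part.

Comments:

Please include in your question how you called grep.

grep(paste(education_vector, collapse = "|"), unlist(strsplit(str1, …

How do I get the first match? My example is only a demonstration. Your code works, but in reality my texts are large, and sometimes it does not return the first match.
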
texts <- c("I am a student at the University of Wyoming. My major is biology.",
           "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College",
           "First, I went to the Bowdoin College. Then I went to the University of California.")

gsub(".*?([^\\.]*(university|academy|school|college)[^\\.]*).*",
     "\\1", texts, ignore.case = TRUE)

[1] "I am a student at the University of Wyoming"   
[2] " I graduated from Walla Wall Community College"
[3] "First, I went to the Bowdoin College"
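
One thing to keep in mind with gsub is that a string containing none of the keywords is returned unchanged, which can look like a wrong match in a large dataset. A possible alternative, sketched here under the assumption that NA is an acceptable result for texts without a match, is to locate the first match with regexpr and pull it out with regmatches. The names pattern, m and first_sentence are illustrative only, and the fourth text is made up for the example.

```r
texts <- c("I am a student at the University of Wyoming. My major is biology.",
           "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College",
           "First, I went to the Bowdoin College. Then I went to the University of California.",
           "This text mentions none of the keywords.")

# A keyword surrounded by any number of non-period characters.
pattern <- "[^.]*(university|academy|school|college)[^.]*"

# regexpr() reports only the first (leftmost) match in each string, or -1.
m <- regexpr(pattern, texts, ignore.case = TRUE, perl = TRUE)

# Start with NA everywhere, then fill in the texts that did match.
first_sentence <- rep(NA_character_, length(texts))
first_sentence[m != -1] <- trimws(regmatches(texts, m))

first_sentence
```

trimws also removes the leading space that the gsub version leaves in the second result.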