How to extract sentences containing specific person names using R

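I am using R to extract sentences containing specific person names from a text. Here is a sample paragraph:

Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.

This short paragraph contains several person names, such as Johann Reuchlin, Melanchthon and Johann Eck. With the help of the openNLP package, three person names (Martin Luther, Paul and Melanchthon) are correctly extracted and recognized. I have two questions:

How can I extract the sentences containing these names?

Since the output of the named entity recognizer is not ideal, if I add "[[ ]]" around each name, such as [[Johann Reuchlin]] and [[Melanchthon]], how can I extract the sentences containing these name expressions?

Edit 5: to answer your other question, given: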

sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"

gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
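The call above pulls the tagged names out of a single sentence. To get the sentences themselves, the same regmatches/gsub idea can be applied sentence by sentence. What follows is only a minimal base R sketch, assuming the [[...]] tags were added by hand as proposed in the question. The `tagged` text, the naive sentence splitter, and the names `tags`, `tagged_sentences` and `by_name` are illustrative and not part of the original answer.

```r
# Hypothetical tagged text: the [[...]] markers are assumed to be added by hand.
tagged <- paste(
  "[[Melanchthon]] became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of [[Paul]], and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
  "[[Johann Eck]] having attacked his views, [[Melanchthon]] replied based on the authority of Scripture."
)

# Naive splitter: break on whitespace that follows ., ! or ? (keeps the punctuation).
sentences <- strsplit(tagged, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Collect the tagged names of each sentence and strip the brackets.
tags <- regmatches(sentences, gregexpr("\\[\\[.*?\\]\\]", sentences))
tags <- lapply(tags, function(x) gsub("\\[\\[|\\]\\]", "", x))

# Sentences that carry at least one tag.
tagged_sentences <- sentences[lengths(tags) > 0]

# One entry per distinct name; a sentence with two names appears under both.
all_names <- unique(unlist(tags))
by_name <- lapply(all_names, function(nm) {
  sentences[vapply(tags, function(x) nm %in% x, logical(1))]
})
names(by_name) <- all_names

by_name[["Johann Eck"]]
```

The splitter will break on abbreviations such as "Dr. Eck", so a real tokenizer may be a better choice for longer texts.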

Here is a fairly simple approach using two packages, quanteda and stringi:

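# txt: the sample paragraph as one string. tokenize() is from an older quanteda API;
# newer releases may need tokens(txt, what = "sentence") (assumption, check your version).
sents <- unlist(quanteda::tokenize(txt, what = "sentence"))
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")
# One name per sentence (NA where none matches); split() then groups sentences by name.
namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|")))
sentList <- split(sents, list(namesFound))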

sentList[["Melanchthon"]]
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."                                                   
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."

sentList
## $`Martin Luther`
## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin."
## 
## $Melanchthon
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."                                                   
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
## 
## $Paul
## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."

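Many thx, but I noticed that the first and fourth sentences each contain two person names. If I add names such as "Johann Eck" or "Johann Reuchlin" to "toMatch" and run the code above, I still get four sentences as output. My new question is how to get the sentences for each person separately (with overlap)?

I don't quite understand. Do you want a) only the sentences that contain all of the person names, or b) a separate return for each person name (the sentences containing Martin Luther, then the sentences containing Paul, and so on)?

@hui Let me know if the new code answers your question.

It works!!! Sorry for the ambiguous question. I meant the latter: a separate return for each name (the sentences with Martin Luther, then all sentences with Paul, and so on). Also, is there a way to label the different sentences with the respective person names, for example "[[2]] Paul [1] He studied the Scripture, especially of Paul, and Evangelical doctrine"?

My gratitude to you cannot be expressed in words :) One last question, corresponding to the second question in my post: how can I extract sentences containing "[[person A]]", "[[person B]]", and so on? Many thx.

I have not used these two packages before, but they seem very handy in this case :)

# para holds the sample paragraph as a single character string
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
# split into sentences on every period (the periods themselves are dropped)
sentences <- unlist(strsplit(para, split = "\\."))
# return every sentence that matches one name
foo <- function(Match) { sentences[grep(Match, sentences)] }
lapply(toMatch, foo)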

[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
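The output above keeps a leading space on most sentences, and the list entries are identified only by position. A possible variation, sketched here in base R, trims the sentences, matches each name literally, and names the list entries so that a sentence with two names shows up under both. The shortened `para` and the name `byName` are assumptions made for this illustration.

```r
# Shortened sample text (an assumption for this sketch, taken from the paragraph in the question).
para <- paste(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)
toMatch <- c("Paul", "Melanchthon", "Johann Eck")

# Split on a period plus any following whitespace, then trim what is left.
sentences <- trimws(unlist(strsplit(para, split = "\\.\\s*")))

# fixed = TRUE treats each name as a literal string rather than a regex.
byName <- lapply(toMatch, function(nm) sentences[grepl(nm, sentences, fixed = TRUE)])
names(byName) <- toMatch

# The last sentence is listed under both "Melanchthon" and "Johann Eck".
byName
```

Literal matching is a design choice here: it avoids surprises if a name contains characters such as "." that have a special meaning in a regex.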

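# Prepend the name to its sentences so that each result is labeled
foo <- function(Match) { c(Match, sentences[grep(Match, sentences)]) }

# The lookahead pattern selects sentences that contain both words
toMatch <- c("Martin Luther", "Paul", "Melanchthon", "(?=.*Melanchthon)(?=.*Scripture)")

# perl = TRUE is needed because lookaheads are a Perl-style regex feature
foo <- function(Match) { c(Match, sentences[grep(Match, sentences, perl = TRUE)]) }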


> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"                                                                                                                                         
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] "Paul"                                                                   
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] "Melanchthon"                                                                                                                          
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"                                                                                                     
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
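The fourth pattern works because each `(?=.*word)` is a zero-width lookahead: all of them are tested from the same position, so the sentence matches only if every word occurs somewhere in it. Such a pattern can also be built from a vector of terms. The helper `allOf` below is hypothetical, a sketch rather than part of the original answer, and it assumes the terms contain no regex metacharacters.

```r
# Hypothetical helper: one lookahead per term, concatenated into a single pattern.
allOf <- function(terms) paste0("(?=.*", terms, ")", collapse = "")

sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture"
)

# Same pattern string as the hand-written one in toMatch.
pattern <- allOf(c("Melanchthon", "Scripture"))

# perl = TRUE is required for lookaheads; only the third sentence has both words.
both <- sentences[grepl(pattern, sentences, perl = TRUE)]
both
```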

> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen"        "Wittenberg"      "Martin Luther"   "Johann Reuchlin"
