Regex: how to extract sentences containing specific person names with R
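I am using R to extract sentences that contain specific person names from a text. Here is a sample paragraph:

Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.

This short paragraph contains several person names, for example Johann Reuchlin, Melanchthon and Johann Eck. With the help of the openNLP package, three person names (Martin Luther, Paul and Melanchthon) are correctly extracted and recognized. That leaves me with two questions: how to extract the sentences that contain these names, and how to extract sentences that contain names marked up as "[[person A]]", "[[person B]]" and so on.

Tags: regex, r, tm, opennlp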
Here is a fairly simple approach that uses two packages, quanteda and stringi:
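# `txt` holds the sample paragraph from the question as a single string.
# quanteda::tokenize() belongs to older quanteda releases; later ones offer tokens(txt, what = "sentence").
sents <- unlist(quanteda::tokenize(txt, what = "sentence"))
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")
# Join the names into one alternation and extract the matching name per sentence (NA if none).
namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|")))
# Group the sentences by the name found; sentences with NA are dropped.
sentList <- split(sents, list(namesFound))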
sentList[["Melanchthon"]]
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
sentList
## $`Martin Luther`
## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin."
##
## $Melanchthon
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
##
## $Paul
## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."
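The `split()` step above pairs each sentence with exactly one extracted name, which works for this sample only because no sentence matches more than one of the three patterns. The following is a minimal base-R sketch of one way to keep every name found in a sentence. It is not part of the original answer: the sentences are retyped from the output above, and the longer `namesToExtract` vector and the `found`, `pairs` and `sentList` variables are assumptions made for illustration.

```r
# Sentences as printed above; the fourth is rendered from the question's paragraph.
# "\u00fc" is the escaped form of the u-umlaut in the first place name.
sents <- c(
  "Opposed as a reformer at T\u00fcbingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin.",
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)

# Assumed, longer name list: two sentences now contain two names each.
namesToExtract <- c("Martin Luther", "Johann Reuchlin", "Paul", "Melanchthon", "Johann Eck")
pattern <- paste(namesToExtract, collapse = "|")

# One character vector of matched names per sentence (empty if no name occurs).
found <- regmatches(sents, gregexpr(pattern, sents))

# One row per (name, sentence) pair, so a sentence can appear under several names.
pairs <- data.frame(
  name = unlist(found),
  sentence = rep(sents, lengths(found)),
  stringsAsFactors = FALSE
)

# Group the sentences by name.
sentList <- split(pairs$sentence, pairs$name)
sentList[["Johann Reuchlin"]]
```

With this layout the first sentence is returned under both `Martin Luther` and `Johann Reuchlin`, and the sentence without any listed name simply contributes no rows.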
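Many thx, but I noticed that the first and the fourth sentence each contain two person names. If I add names such as "Johann Eck" or "Johann Reuchlin" to "toMatch" and run the code above, I still get four sentences as output. My new question is: how can I get the sentences for each person separately (with overlap)?
I don't quite understand. Do you want a) only the sentences that contain all of the person names, or b) a separate return for each person name (the sentences containing Martin Luther, then the sentences containing Paul, and so on)?
@hui please let me know if the new code answers your question.
It works!!! Sorry for the ambiguous question. I meant the latter: a separate return for each name (the sentences with Martin Luther, then all sentences with Paul, etc.). Also, is there a way to label the sentences with the respective person name, e.g. "[[2]] Paul [1] He studied the Scripture, especially of Paul, and Evangelical doctrine"?
I cannot express my gratitude in words :) One last question, which corresponds to the second question in my post: how can I extract sentences that contain "[[person A]]", "[[person B]]" and so on?
Many thx. I have not used these two packages before, but they seem very handy in this case :)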
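# `para` holds the sample paragraph from the question as a single string.
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
# Split on literal periods; every piece after the first keeps its leading space.
sentences <- unlist(strsplit(para, split = "\\."))
# Return every sentence that matches one pattern.
foo <- function(Match) { sentences[grep(Match, sentences)] }
# One list element per name, so a sentence naming two people appears under both.
lapply(toMatch, foo)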
[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
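The result above is an unnamed list, and every sentence after the first carries the leading space left behind by `strsplit()`. A small sketch of one way to tidy both, assuming a three-sentence excerpt of the paragraph is enough to show the idea; `byName` is a name introduced here, not something from the original answer.

```r
# Three sentences of the sample paragraph, joined with single spaces.
para <- paste(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)
toMatch <- c("Paul", "Melanchthon")

# trimws() removes the leading space that strsplit() leaves on later pieces.
sentences <- trimws(unlist(strsplit(para, split = "\\.")))

# setNames() labels each list element with the name it was matched on.
byName <- setNames(
  lapply(toMatch, function(m) sentences[grep(m, sentences)]),
  toMatch
)

byName$Paul
byName$Melanchthon
```

Named elements can then be addressed as `byName$Paul` instead of by position.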
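# Variant that prepends the pattern to its matches, so each result is labelled.
foo <- function(Match) { c(Match, sentences[grep(Match, sentences)]) }
# Chained lookaheads require both words in the same sentence; they need perl = TRUE.
toMatch <- c("Martin Luther", "Paul", "Melanchthon", "(?=.*Melanchthon)(?=.*Scripture)")
foo <- function(Match) { c(Match, sentences[grep(Match, sentences, perl = TRUE)]) }
lapply(toMatch, foo)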
[[1]]
[1] "Martin Luther"
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] "Paul"
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] "Melanchthon"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
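The fourth pattern above chains two lookaheads, so it matches only a sentence that contains both words, in either order. A sketch of how such a pattern could be assembled from a vector of words; `allOf` is a hypothetical helper introduced here, and the two sentences are retyped from the output above.

```r
# Two sentences retyped from the output above.
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)

# Hypothetical helper: one lookahead per word, concatenated into a single pattern.
allOf <- function(words) paste0("(?=.*", words, ")", collapse = "")

pattern <- allOf(c("Melanchthon", "Scripture"))
pattern
# [1] "(?=.*Melanchthon)(?=.*Scripture)"

# Lookaheads are a PCRE feature, so perl = TRUE is required.
grepl(pattern, sentences, perl = TRUE)
# [1] FALSE  TRUE
```

The helper inserts the words into the pattern unchanged, so a word containing regex metacharacters would have to be escaped first.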
# Names wrapped in [[ ]] can be found with a lazy match and then stripped of the brackets.
sentenceR <- "Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"
gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen" "Wittenberg" "Martin Luther" "Johann Reuchlin"
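The call above lists the bracketed names inside one sentence. To go the other way and pick the sentences that contain a given "[[name]]", a fixed-string search is one option. In the sketch below the `marked` vector is a hypothetical, shortened set of marked-up sentences in the style of `sentenceR`, and `withTag` is a helper name assumed for illustration.

```r
# Hypothetical marked-up sentences, shortened from the sample paragraph.
marked <- c(
  "he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]]",
  "[[Melanchthon]] became professor of the Greek language in [[Wittenberg]]",
  "He studied the Scripture, especially of [[Paul]]"
)

# Hypothetical helper: keep the sentences that contain the literal tag [[name]].
# fixed = TRUE treats the brackets as plain characters, so no escaping is needed.
withTag <- function(name, x) x[grepl(paste0("[[", name, "]]"), x, fixed = TRUE)]

withTag("Wittenberg", marked)
withTag("Paul", marked)
```

Because the whole tag is matched literally, `withTag("Luther", marked)` returns nothing here: only "[[Martin Luther]]" is tagged, not "[[Luther]]".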