Regex: How to extract sentences containing specific person names using R


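I am using R to extract sentences that contain specific person names from a text. Here is a sample paragraph:

Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.

This short paragraph contains several person names, for example Johann Reuchlin, Melanchthon, and Johann Eck. With the openNLP package, three person names are correctly extracted and identified: Martin Luther, Paul, and Melanchthon. I have two questions:

  • How can I extract the sentences that contain these names?
  • Because the output of the named entity recognizer is not ideal, suppose I mark each name with "[[ ]]", as in [[Johann Reuchlin]] and [[Melanchthon]]. How can I extract the sentences that contain these name expressions?

Edit 5, in answer to your other question. Given: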

    sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"
    
    gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
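
The call above pulls the bracketed names out of a single sentence. To get the sentences themselves, one possible sketch in base R is shown below. It splits the text into sentences, finds every `[[...]]` tag in each one, and groups the sentences by tag, so a sentence with two tagged names appears under both. The text in `para_marked` and the helper name `sentences_by_tag` are made up for illustration; only the `[[...]]` convention comes from the question.

```r
# Hypothetical marked-up text; only the [[...]] convention comes from the question.
para_marked <- paste(
  "He accepted a call to the University of Wittenberg by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]].",
  "[[Melanchthon]] became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "[[Johann Eck]] having attacked his views, [[Melanchthon]] replied based on the authority of Scripture."
)

sentences_by_tag <- function(text) {
  # Split on whitespace that follows sentence-final punctuation.
  sents <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  # All [[...]] tags found in each sentence (non-greedy match).
  tags <- regmatches(sents, gregexpr("\\[\\[.*?\\]\\]", sents))
  # Strip the brackets and drop repeated names within one sentence.
  tags <- lapply(tags, function(x) unique(gsub("\\[\\[|\\]\\]", "", x)))
  # Repeat each sentence once per name it contains, then group by name.
  split(rep(sents, lengths(tags)), unlist(tags))
}

tagged <- sentences_by_tag(para_marked)
names(tagged)
tagged[["Melanchthon"]]
```

Sentences without any tag (the third one here) simply drop out of the result.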
    

Here is a fairly simple method that uses two packages, quanteda and stringi:

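    # Split `txt` into sentences (tokenize() is from older quanteda releases;
    # newer ones would likely need as.character(quanteda::tokens(txt, what = "sentence")))
    sents <- unlist(quanteda::tokenize(txt, what = "sentence"))
    namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")
    # Names matched in each sentence (NA if none); assumes at most one match per sentence
    namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|")))
    # Group the sentences by the name found in them
    sentList <- split(sents, list(namesFound))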
    
    sentList[["Melanchthon"]]
    ## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."                                                   
    ## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
    
    sentList
    ## $`Martin Luther`
    ## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin."
    ## 
    ## $Melanchthon
    ## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."                                                   
    ## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
    ## 
    ## $Paul
    ## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."
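
The `split()` call above pairs sentences and matched names by position, which works here because no sentence matches more than one of the three names. A hedged base R variant that also copes with sentences mentioning several names is sketched below. The `sents` vector is abridged from the example paragraph so the snippet runs on its own, and adding "Johann Eck" to the names is an assumption for illustration.

```r
# Sentences abridged from the example paragraph.
sents <- c(
  "He accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin.",
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon", "Johann Eck")

# Whole-word alternation: \\b stops "Paul" from matching inside a longer word.
pattern <- paste0("\\b(", paste(namesToExtract, collapse = "|"), ")\\b")

# One character vector of distinct matched names per sentence (possibly empty).
hits <- lapply(regmatches(sents, gregexpr(pattern, sents, perl = TRUE)), unique)

# Repeat each sentence once per distinct name, then group by name.
sentList <- split(rep(sents, lengths(hits)), unlist(hits))
lengths(sentList)
```

With this grouping the last sentence is listed under both "Johann Eck" and "Melanchthon".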
    

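Many thanks, but I noticed that the first and fourth sentences each contain two person names. If I add names such as "Johann Eck" or "Johann Reuchlin" to toMatch and run the code above, I still get the same four sentences as output. My new question is how to get the sentences for each person separately (with overlaps).

I don't quite understand. Do you want (a) only the sentences that contain all of the names, or (b) a separate return for each name (the sentences containing Martin Luther, then the sentences containing Paul, and so on)?

@hui, let me know if the new code answers your question.

It works!!! Sorry for the ambiguous question. I meant the latter: a separate return for each name (the sentences with Martin Luther, then all the sentences with Paul, and so on). Also, is there a way to label each group of sentences with the corresponding person name, for example: [[2]] Paul [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine"?

I can't thank you enough :) One last question, which corresponds to the second question in my post: how can I extract the sentences that contain "[[person A]]", "[[person B]]", and so on? Many thanks.

I haven't used these two packages before, but they seem very handy in this case :)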
    
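    # `para` holds the example paragraph as one string
    toMatch <- c("Martin Luther", "Paul", "Melanchthon")
    # Split into sentences on literal periods
    sentences <- unlist(strsplit(para, split = "\\."))
    # Return every sentence that matches one name
    foo <- function(Match) { sentences[grep(Match, sentences)] }
    # One list element per name; a sentence can appear under several names
    lapply(toMatch, foo)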
    
    [[1]]
    [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
    
    [[2]]
    [1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
    
    [[3]]
    [1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
    [2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
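
The output above shows two side effects of splitting on "\\.": the final period is dropped and every later sentence keeps a leading space. A minimal alternative splits on the whitespace after the punctuation instead. It assumes sentences end in ".", "!" or "?" followed by whitespace, so abbreviations such as "Dr." would still be split incorrectly. The `para` below is an abridged stand-in for the example paragraph.

```r
# Abridged stand-in for the example paragraph.
para <- paste(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments."
)

# Split on whitespace that follows sentence-final punctuation
# (the lookbehind needs perl = TRUE).
sentences <- unlist(strsplit(para, "(?<=[.!?])\\s+", perl = TRUE))

# fixed = TRUE treats the name as a literal string rather than a regex.
sentences[grep("Paul", sentences, fixed = TRUE)]
```

Each sentence now keeps its period and has no leading space, which matches the form of the quanteda output.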
    
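    # Variant that prepends the name to its sentences so each result is labelled
    foo <- function(Match) { c(Match, sentences[grep(Match, sentences)]) }

    # Lookaheads require both "Melanchthon" and "Scripture" in the same sentence
    toMatch <- c("Martin Luther", "Paul", "Melanchthon", "(?=.*Melanchthon)(?=.*Scripture)")

    # Lookaheads need Perl-compatible regex, hence perl = TRUE
    foo <- function(Match) { c(Match, sentences[grep(Match, sentences, perl = TRUE)]) }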
    
    
    > lapply(toMatch,foo)
    [[1]]
    [1] "Martin Luther"                                                                                                                                         
    [2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
    
    [[2]]
    [1] "Paul"                                                                   
    [2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
    
    [[3]]
    [1] "Melanchthon"                                                                                                                          
    [2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
    [3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
    
    [[4]]
    [1] "(?=.*Melanchthon)(?=.*Scripture)"                                                                                                     
    [2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
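
Instead of prepending the pattern to each result, the list itself can carry the labels. The sketch below assumes the same `toMatch` and `sentences` idea as above. The sentences are retyped (abridged) so the snippet runs on its own, and the label "Melanchthon+Scripture" is made up for illustration.

```r
# Sentences abridged from the example paragraph, without trailing periods.
sentences <- c(
  "He accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin",
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture"
)

# Names of the vector are the labels; values are the (Perl) patterns.
toMatch <- c(
  "Martin Luther" = "Martin Luther",
  "Paul" = "Paul",
  "Melanchthon" = "Melanchthon",
  "Melanchthon+Scripture" = "(?=.*Melanchthon)(?=.*Scripture)"
)

# lapply() keeps the names of its input, so the result is a named list.
byName <- lapply(toMatch, function(p) grep(p, sentences, perl = TRUE, value = TRUE))
names(byName)
byName[["Paul"]]
```

Results can then be looked up by label, for example `byName[["Melanchthon+Scripture"]]`, rather than by position.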
    
    > gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
    [1] "Tübingen"        "Wittenberg"      "Martin Luther"   "Johann Reuchlin"
    