Regex: match two string vectors against a text, limited by the distance between the two matches

Tags: regex, string, r, match, string-matching

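I'm trying to work out the most efficient way to match two vectors of strings against a third string. I want to restrict the second match to a limited number of words or characters after the first match.

Suppose I have a data frame of names like this: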

signers <- data.frame(
    first = 
        c("Benjamin","Thomas","Robert","George","Thomas","Jared","James","John","James","George","George","James","Edmund","George") ,
    last = 
        c( "Franklin","Mifflin","Morris","Clymer","Fitzsimons","Ingersoll","Wilson","Blair","Madison","Washington","Mason","McClurg","Randolph","Wythe")
    )

and a paragraph of text like this:

text <-
"A lot of people attended the Constitutional Convention in Philadephia, including Alexander Hamilton, Benjamin Franklin and John Adams.  
Not everyone who attended the convention ended up signing the Constitution, including George Wythe, John F. Mercer and Edmund Jennings Randolph who abstained."

I would like to end up with this result:

      first       last inparagraph
1  Benjamin   Franklin           1
2    Thomas    Mifflin
3    Robert     Morris
4    George     Clymer
5    Thomas Fitzsimons
6     Jared  Ingersoll
7     James     Wilson
8      John      Blair
9     James    Madison
10   George Washington
11   George      Mason
12    James    McClurg
13   Edmund   Randolph           1
14   George      Wythe           1

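I had to resort to the lapply function to find the positions of the first names, but I'm not sure how to search near the position of each first name: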

namesfinds <- lapply( signers$first ,  grep, text )

Here is an option that uses a regular expression to allow up to three words or initials between the first name and the last name:

patterns <- paste0("(.*)(", signers$first, "(\\s+[[:alpha:].]+){,3}\\s+", signers$last, ")(.*)")
signers$inparagraph <- ifelse(sapply(patterns, grepl, text), "1", "")
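Note: John Blair matches because I modified text for testing purposes so that it includes him (see the data below). If you want to allow fewer intervening words, change {,3} to a lower number. If you want to extract the matched names themselves, you can do the following: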

unname(sapply(patterns, gsub, "\\2", text))[sapply(patterns, grepl, text)]
# [1] "Benjamin Franklin"        "John W. F. Blair"         "Edmund Jennings Randolph"
# [4] "George Wythe"     
Here is the text I used:

text <- 
  "A lot of people attended the Constitutional Convention in Philadephia, including Alexander Hamilton, Benjamin Franklin and John Adams.  
Not everyone who attended the convention ended up signing the Constitution, including George Wythe, John F. Mercer and Edmund Jennings Randolph who abstained and John W. F. Blair ate cake"
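
The pattern built above can be wrapped in a small helper so that the number of allowed intervening words becomes a parameter. This is only a minimal sketch: the function name match_within, its max_gap argument and demo_text are hypothetical, and writing the bound as {0,n} and adding \\b word boundaries are my own changes rather than part of the answer.

```r
# Hypothetical helper (not from the answer above): for each first/last pair,
# return the matched full name, or NA when the last name does not follow the
# first name within `max_gap` intervening words or initials.
match_within <- function(first, last, text, max_gap = 3) {
  patterns <- paste0(
    "\\b", first,
    "(\\s+[[:alpha:].]+){0,", max_gap, "}",  # up to max_gap middle words/initials
    "\\s+", last, "\\b"
  )
  vapply(patterns, function(p) {
    m <- regmatches(text, regexpr(p, text))  # first match only, character(0) if none
    if (length(m) == 0) NA_character_ else m
  }, character(1), USE.NAMES = FALSE)
}

# Shortened, made-up sample based on the test text above.
demo_text <- "George Wythe, John F. Mercer and Edmund Jennings Randolph abstained and John W. F. Blair ate cake"

found <- match_within(
  first = c("George", "Edmund", "John", "George"),
  last  = c("Wythe", "Randolph", "Blair", "Mason"),
  text  = demo_text
)
found
```

Returning the matched string instead of a flag means the same call can fill the inparagraph column (via !is.na(found)) and show what was actually matched, which makes false positives easier to spot.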

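It may not be pretty, but this seems to work. Pasting a regular expression together to capture the middle names is the trick I used. It looks like it can handle any of the names; hopefully it works across all of your data.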

> a <- paste(signers[,1], signers[,2])
> pst <- paste(signers$first, ".*", signers$last, sep = "")
> gg <- gsub("\\.\\*", " ", names(unlist(sapply(pst, grep, text))))
> signers$inparagraph <- ifelse(a %in% gg, "1", "")
> signers
##       first       last inparagraph
## 1  Benjamin   Franklin           1
## 2    Thomas    Mifflin           
## 3    Robert     Morris           
## 4    George     Clymer           
## 5    Thomas Fitzsimons           
## 6     Jared  Ingersoll           
## 7     James     Wilson           
## 8      John      Blair           
## 9     James    Madison           
## 10   George Washington           
## 11   George      Mason           
## 12    James    McClurg           
## 13   Edmund   Randolph           1
## 14   George      Wythe           1
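
One thing to keep in mind is that .* puts no limit on how far the last name may be from the first name, so a first name early in the text can pair up with an unrelated last name much later. If the distance should be capped in characters, as the question asks, one option is to swap .* for a bounded gap. A minimal sketch, assuming a hypothetical cutoff max_chars and a made-up example sentence:

```r
# Sketch only: `max_chars` and `example_text` are illustrative assumptions,
# not part of the answer above.
first <- c("George", "John")
last  <- c("Wythe", "Blair")
example_text <- "George Wythe and John F. Mercer abstained, while George Mason and Mr. Blair left."

max_chars <- 15  # most characters allowed between first and last name

unbounded <- paste0(first, ".*", last)                    # any distance
bounded   <- paste0(first, ".{1,", max_chars, "}", last)  # capped distance

hits_unbounded <- sapply(unbounded, grepl, example_text, USE.NAMES = FALSE)
hits_bounded   <- sapply(bounded, grepl, example_text, USE.NAMES = FALSE)

# "John ... Blair" matches only in the unbounded version, because the two
# words are far apart and belong to different people.
data.frame(first, last, hits_unbounded, hits_bounded)
```

A character cap is cruder than counting words, but it is simple to tune and does not depend on what counts as a word.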

I know it has been two years, but I really appreciate this answer.
@Matthew, no problem, and I appreciate the appreciation ;)