R: How to find two words in any order within period-delimited sentences

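I am trying to extract every sentence, defined as the text between two periods, that contains both of the words column and Barr in any order. This is tricky: the regular expression I have so far finds the two words in either order before a period, but when the two words occur in two different sentences, all the text between those sentences is selected as well. How can I make the regex more specific?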

Input

Attempt

str_extract_all(try, "\\..*column.*Barr.*?\\.|\\..*Barr.*column.*?\\.")

Current output


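This regular expression seems to do what you need:

(\\.[^.]*column[^.]*Barr[^.]*)|(\\.[^.]*Barr[^.]*column[^.]*)
It starts at a period and then grabs everything that is not a period, provided that stretch contains column followed by Barr. The second alternative does the same with the two words in the opposite order.

For example:

Result: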

[1] NA                                                    
[2] ". I am a sentence and I contain column but also Barr"
[3] ".I am a sentence and I contain column but also Barr" 
[4] ".I contain column and Barr"                          
[5] ". I contain Barr and column but also Barr"
If you use str_extract_all, keep in mind that it returns a list of matches:

[[1]]
character(0)

[[2]]
[1] ". I am a sentence and I contain column but also Barr"

[[3]]
[1] ".I am a sentence and I contain column but also Barr"

[[4]]
[1] ".I contain column and Barr" ". I have Barr and column"  

[[5]]
[1] ". I contain Barr and column but also Barr"

I added a paste0(".", x) so that sentences which contain both words but do not start with a period are detected as well.
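
As a rough illustration of the paste0(".", x) idea, here is a minimal sketch in base R. The sample vector x is hypothetical (the input from the original post is not shown here), and regmatches/gregexpr are used in place of stringr only so that the snippet runs without extra packages.

```r
# Hypothetical sample input; the data from the original post is not available.
x <- c("I contain column and Barr. I have neither",
       "Only column here. Only Barr here")

# The pattern from the answer: a period, then a period-free stretch
# that contains both words in either order.
pattern <- "(\\.[^.]*column[^.]*Barr[^.]*)|(\\.[^.]*Barr[^.]*column[^.]*)"

# Prepend a period so that the first sentence of each string can match too.
padded <- paste0(".", x)

# One character vector of matches per input string.
matches <- regmatches(padded, gregexpr(pattern, padded, perl = TRUE))

# Optionally drop the leading period and any whitespace after it.
cleaned <- lapply(matches, function(m) sub("^\\.\\s*", "", m, perl = TRUE))
```

Without the padding, the first sentence of the first string would be skipped, because the pattern requires a period before the sentence.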

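Here is a more general attempt that does not require writing out every permutation of the desired words, which helps when more than two words are needed.

The strategy is to find the sentences that contain each word and then take the intersection of the results.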

# Split the long text into individual sentences.
sentences <- strsplit(try, "\\.")

# Create a list of matching sentence indices for each desired word.
columnlist <- lapply(sentences, function(x) grep("column", x))
barrlist <- lapply(sentences, function(x) grep("Barr", x))

# Find the intersection between the lists.
intersection <- lapply(seq_along(columnlist), function(i) {
  intersect(columnlist[[i]], barrlist[[i]])
})

# Extract the matching sentences (NA where a string has none).
answer <- sapply(seq_along(intersection), function(i) {
  if (length(intersection[[i]])) {
    trimws(sentences[[i]][intersection[[i]]])
  } else {
    NA
  }
})
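
The same split-and-intersect strategy could be extended to any number of words along the following lines. This is only a sketch: the helper name sentences_with_all and the sample data are hypothetical and not part of the original answer, and words are matched as plain substrings.

```r
# Hypothetical helper: return, for each string, the sentences containing all words.
sentences_with_all <- function(text, words) {
  # Split each string into sentences on literal periods.
  sentences <- strsplit(text, ".", fixed = TRUE)
  lapply(sentences, function(s) {
    # For every word, the indices of the sentences that contain it.
    hits <- lapply(words, function(w) grep(w, s, fixed = TRUE))
    # Keep only the indices common to all words.
    keep <- Reduce(intersect, hits)
    trimws(s[keep])
  })
}

# Hypothetical sample data.
x <- c("Barr and column. But just Barr. Now column and Barr",
       "Nothing relevant here. Only column")
result <- sentences_with_all(x, c("column", "Barr"))
```

Because the matching is by substring, a word such as column would also match columns; word boundaries would need a regular expression instead of fixed = TRUE.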

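To find two words occurring in any order, you can use two positive lookaheads. For example, grepl("(?=.*Barr)(?=.*column)", x, perl=T) returns TRUE whenever both words occur, regardless of their order, and FALSE otherwise, but this does not take sentence structure into account. When you want to extract the text and need both words to appear between two periods, we can change it to: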

library(stringr)
## Example data
x <- c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.","Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.","Barr and column and also column. But just Barr. And just column. Now again column and Barr")
> x
[1] "I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well."
[2] "Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too."       
[3] "Barr and column and also column. But just Barr. And just column. Now again column and Barr"           

str_extract_all(x,"(\\.|^)(?=[^\\.]*Barr)(?=[^\\.]*column)[^\\.]*(\\.|$)")
[[1]]
character(0)

[[2]]
[1] ". I am a sentence and I contain column but also Barr."

[[3]]
[1] "Barr and column and also column." ". Now again column and Barr"
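
The lookahead pattern above consumes the period that ends each match, so a qualifying sentence that immediately follows another qualifying sentence no longer has a leading period to match and appears to be skipped. A possible variant, sketched below in base R, builds one lookahead per word and matches only the period-free text. The helper name extract_sentences and the second sample string are hypothetical, and the words are assumed to contain no regex metacharacters.

```r
# Hypothetical helper: extract sentences that contain every word, in any order.
extract_sentences <- function(text, words) {
  # One lookahead per word; [^.]* keeps each lookahead inside the current sentence.
  lookaheads <- paste0("(?=[^.]*", words, ")", collapse = "")
  # [^.]+ then consumes the sentence itself, leaving the periods unmatched.
  pattern <- paste0(lookaheads, "[^.]+")
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
  lapply(hits, trimws)
}

x <- c("Barr and column and also column. But just Barr. And just column. Now again column and Barr",
       "I contain column and Barr. So do I: Barr, column. Not me")
result <- extract_sentences(x, c("Barr", "column"))
```

Adding a third word only requires extending the words vector, since the pattern is assembled from it.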