Regex: extracting a sample of words around a specific word in R using stringr


I have seen a few similar questions about this on SO, but they seem to be either poorly worded or written for a different language.

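In my scenario, I treat anything surrounded by whitespace as a word. Emoji, numbers, strings of letters that aren't real words: I don't care. I just want to get some context around the string that was found, without having to read the whole file to work out whether it is a valid match.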

I tried the following, but it takes a while to run on a long text file:

text <- "He served both as Attorney General and Lord Chancellor of England. After his death, he remained extremely influential through his works, especially as philosophical advocate and practitioner of the scientific method during the scientific revolution. Bacon has been called the father of empiricism.[6] His works argued for the possibility of scientific knowledge based only upon inductive and careful observation of events in nature. Most importantly, he argued this could be achieved by use of a skeptical and methodical approach whereby scientists aim to avoid misleading themselves. While his own practical ideas about such a method, the Baconian method, did not have a long lasting influence, the general idea of the importance and possibility of a skeptical methodology makes Bacon the father of scientific method. This marked a new turn in the rhetorical and theoretical framework for science, the practical details of which are still central in debates about science and methodology today. Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621;[3][b] as he died without heirs, both titles became extinct upon his death. Bacon died of pneumonia in 1626, with one account by John Aubrey stating he contracted the condition while studying the effects of freezing on the preservation of meat."

stringr::str_extract(text, "(.*?\\s){1,10}Verulam(\\s.*?){1,10}")

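I would use unlist(strsplit()) and then index the resulting vector. You can wrap this in a function so that the number of words taken before and after the match is a flexible parameter: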
getContext <- function(text, look_for, pre = 3, post = pre) {
  # create vector of words (anything separated by whitespace)
  t_vec <- unlist(strsplit(text, '\\s'))

  # find position of matches
  matches <- which(t_vec == look_for)

  # look up words by position; positions before the start of the text are
  # set to NA (positions past the end already come back as NA)
  pick <- function(idx) {
    idx[idx < 1] <- NA
    t_vec[idx]
  }

  # return words before & after if any matches
  if (length(matches) > 0) {
    out <-
      list(before = sapply(matches, function(m) pick((m - pre):(m - 1))),
           after  = sapply(matches, function(m) pick((m + 1):(m + post))))

    return(out)
  } else {
    warning('No matches')
  }
}

It also works when there are multiple matches:
getContext(text, 'he')

# $before
#      [,1]     [,2]           [,3]          [,4]     
# [1,] "After"  "nature."      "in"          "John"   
# [2,] "his"    "Most"         "1621;[3][b]" "Aubrey" 
# [3,] "death," "importantly," "as"          "stating"
# 
# $after
#      [,1]          [,2]     [,3]      [,4]        
# [1,] "remained"    "argued" "died"    "contracted"
# [2,] "extremely"   "this"   "without" "the"       
# [3,] "influential" "could"  "heirs,"  "condition" 

getContext(text, 'fruitloops')
# Warning message:
#   In getContext(text, "fruitloops") : No matches
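
One edge case of the split-and-index approach is a match close to the start or end of the text, where the window would run past the first or last word. The following is a minimal sketch of one way to handle that, assuming it is acceptable to return a shorter window rather than pad it. The helper name window_words and the sample sentence are illustrative and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): returns up to `pre`
# words before and up to `post` words after position `m`, truncated at the
# boundaries of the text instead of being padded.
window_words <- function(words, m, pre = 3, post = pre) {
  n_before <- min(pre, m - 1)               # words available to the left
  n_after  <- min(post, length(words) - m)  # words available to the right
  list(
    before = words[seq(m - n_before, length.out = n_before)],
    after  = words[seq(m + 1, length.out = n_after)]
  )
}

# Illustrative sample, split on whitespace as in the answer above
sample_words <- unlist(strsplit("He served both as Attorney General", "\\s+"))

# Match at the very first word: nothing before, three words after
first <- window_words(sample_words, which(sample_words == "He"))

# Match at the very last word: two words before, nothing after
last <- window_words(sample_words, which(sample_words == "General"), pre = 2)

print(first)
print(last)
```

Truncating keeps every returned element a real word, at the cost of windows that no longer all have the same length, so the results cannot be bound into a matrix the way sapply does above.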

If you don't mind tripling the data, you can build a data.frame, which is usually the best structure to work with in R:
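context <- function(text) {
  # split on single spaces; fixed = TRUE treats ' ' literally, not as a regex
  splittedText <- strsplit(text, ' ', fixed = TRUE)[[1]]

  # one row per word, with its left and right neighbour ('' at the edges)
  data.frame(
    words  = splittedText,
    before = head(c('', splittedText), -1),
    after  = tail(c(splittedText, ''), -1)
  )
}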
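
To show how such a table might be queried, here is a small sketch that builds the same three columns for a short sample sentence and keeps only the row for the word of interest. The sample sentence and the variable names are assumptions made for illustration; the construction is repeated inline so the block runs on its own.

```r
# Illustrative sample, not the text from the question
sample_text <- "Bacon was created Baron Verulam in 1618"

# Same construction as in the answer: one row per word plus its neighbours
words <- strsplit(sample_text, " ", fixed = TRUE)[[1]]
neighbours <- data.frame(
  words  = words,
  before = head(c("", words), -1),
  after  = tail(c(words, ""), -1)
)

# Keep only the row(s) where the word matches exactly
hit <- neighbours[neighbours$words == "Verulam", ]
print(hit)
# the single matching row has before = "Baron" and after = "in"
```

This only gives one word of context on each side; wider windows would need additional shifted columns.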

Try this:
stringr::str_extract(text, "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}")
# alternately, if you like " " more than \\s:
# stringr::str_extract(text, "(?:[^ ]+ ){3}Verulam(?: [^ ]+){3}")

#[1] "and created Baron Verulam in 1618[4] and"

Change the numbers inside {} to suit your needs. You can also use non-capturing (?:) groups, though I'm not yet sure whether that improves speed:
stringr::str_extract(text, "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}")
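
Whether the non-capturing form is faster can be checked empirically. The sketch below is one rough way to do so, assuming stringr is installed. The repeated sample sentence and the repetition counts are arbitrary choices for illustration, and the timings will vary by machine and stringr/ICU version, so no result is claimed here.

```r
library(stringr)

# Illustrative input: one sentence repeated many times to get a long text
sentence  <- "He was created Baron Verulam in 1618 and Viscount St. Alban in 1621"
long_text <- paste(rep(sentence, 2000), collapse = " ")

capturing     <- "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}"
non_capturing <- "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}"

# Both patterns should return the same overall match
res_cap <- str_extract(long_text, capturing)
res_non <- str_extract(long_text, non_capturing)
stopifnot(identical(res_cap, res_non))

# Rough wall-clock comparison over repeated runs
time_cap <- system.time(for (i in 1:50) str_extract_all(long_text, capturing))
time_non <- system.time(for (i in 1:50) str_extract_all(long_text, non_capturing))
print(c(capturing     = time_cap[["elapsed"]],
        non_capturing = time_non[["elapsed"]]))
```

For a more careful comparison, a dedicated benchmarking package would give more stable numbers than a single system.time call.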

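Nice solution, but negative indices need to be handled, otherwise getContext(text, "He") won't work.

Yes, I really like this solution too, but the one-liner below, with a few edits, fits this case better.

@fishtank: good point, edited. I also considered using pmin(0, m-pre), but then the "out of bounds" result would be the same for both the "before" and "after" items (i.e. two NAs).

I really like the one-liner approach. It's clean, and the regex isn't hard to understand. For my use case I made a small modification to allow matches that may be inside parentheses or at the end of a sentence, and also to allow a varying number of words before and after, for cases where the word is close to the end of the text. It also matches every instance of the word rather than only the first: stringr::str_extract_all(text, "([^\\s]+\\s){1,5}Verulam(\\s[^\\s]+){1,5}")

That should read stringr::str_extract_all(text, "([^\\s]+\\s){1,5}Verulam.?(\\s[^\\s]+){1,5}") rather than what it says. I only just noticed and can no longer edit the comment. The added .? allows a period, comma, or parenthesis after the word.

@brittenb if you want words at the very beginning or end of the text, I think you need {0,5} rather than {1,5}. Do you only care about the first string match? I would think you want more than that.

@fishtank I want more than the first, which is why I adapted the answer below to use stringr::str_extract_all rather than stringr::str_extract.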
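
The suggestions in these comments (all matches, {0,n} windows, trailing punctuation) can be combined roughly as follows. This is a sketch, assuming stringr is installed, and it runs on a made-up sample sentence. One deliberate change: because . also matches a space, .? followed by a {0,n} group can swallow the separating space and leave the trailing context empty, so this sketch swaps in [[:punct:]]* instead. That substitution is my assumption, not the commenters' pattern.

```r
library(stringr)

# Illustrative sample with the word at the start, mid-sentence with a
# trailing comma, and at the very end of the text
sample_text <- "Verulam was a title. He became Baron Verulam, in 1618. Later known as Verulam"

# Up to two words on each side; {0,2} lets matches at the text edges through,
# and [[:punct:]]* tolerates punctuation attached to the word
pattern <- "(?:[^\\s]+\\s){0,2}Verulam[[:punct:]]*(?:\\s[^\\s]+){0,2}"

hits <- str_extract_all(sample_text, pattern)[[1]]
print(hits)
# [1] "Verulam was a"  "became Baron Verulam, in 1618."  "known as Verulam"
```

Note that matches do not overlap: once a stretch of text has been consumed as context for one hit, it cannot serve as context for the next.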