Are there text-processing functions in R that operate at the word level?

Tags: r, string, nlp, text-processing

I am trying to find a set of functions in R that operate at the word level, e.g. functions that can return the position of a word. For example, given the following sentence and query:

sentence <- "A sample sentence for demo"
query <- "for"

As I mentioned in the comments, stringr is useful in these cases:

library(stringr)

sentence <- "A sample sentence for demo"
wordNumber <- 4L

fourthWord <- word(string = sentence,
                   start = wordNumber)

previousWords <- word(string = sentence,
                       start = wordNumber - 1L,
                       end = wordNumber)

laterWords <- word(string = sentence,
                   start = wordNumber,
                   end = wordNumber + 1L)
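
If the word's position is not known in advance, one way to find it (a minimal sketch, assuming plain space-separated words and that library(stringr) is loaded as above) is to split the sentence and search the resulting vector:

query <- "for"
# position of the query word among the space-separated words
wordNumber <- which(str_split(sentence, fixed(" "))[[1]] == query)
wordNumber
## [1] 4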

I hope this helps.

If you use scan, it will split the input on whitespace:

> s.scan <- scan(text=sentence, what="")
Read 5 items
> which(s.scan == query)
[1] 4

The what = "" is needed to tell scan to expect character rather than numeric input. If the input were full English sentences, you might first need to remove the punctuation with gsub and patt = "[[:punct:]]". If you are trying to classify parts of speech or process large documents, you may also want to look at the tm (text mining) package.
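
For example, a small sketch of that punctuation idea (s2 here is just a made-up variant of the sample sentence with punctuation added):

s2 <- "A sample sentence, for demo!"
# strip punctuation, then split on whitespace as before
s2.clean <- gsub("[[:punct:]]", "", s2)
which(scan(text = s2.clean, what = "") == query)
## [1] 4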

I have written my own functions: indexOf returns the index of the word if it is found in the sentence and -1 otherwise, and extend returns the query word together with a neighbouring word. The code for both is listed further down, after the corpus-based answer.

The answer depends on what you mean by "word". If you mean whitespace-delimited tokens, then @imran ali's answer works fine. If you mean words as defined by Unicode, with punctuation handled properly, then you need something more sophisticated.

The following handles punctuation correctly:

library(corpus)
sentence <- "A sample sentence for demo"
query <- "for"

# use text_locate to find all instances of the query, with context
text_locate(sentence, query)
##   text             before              instance              after              
## 1 1                 A sample sentence    for     demo             

# find the number of tokens before, then add 1 to get the position
text_ntoken(text_locate(sentence, query)$before) + 1
## 4

Take a look at stringr::word. For example: word(string, start = 1L, end = start, sep = fixed(" ")). You can also use negative indices, e.g. start = -2L, end = -1L, to get the last two words.
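
A quick sketch of that suggestion applied to the example sentence (assuming library(stringr) is loaded):

word(sentence, start = 1L, end = 1L)    # "A"
word(sentence, start = -2L, end = -1L)  # "for demo", i.e. the last two words

The code for the indexOf and extend functions mentioned earlier is: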
# Returns the position of `word` in `sentence` (split on spaces),
# or -1 if the word is not present.
indexOf <- function(sentence, word){
  listOfWords <- strsplit(sentence, split = " ")
  sentenceAsVector <- unlist(listOfWords)

  if(word %in% sentenceAsVector == FALSE){
    result <- -1
  }
  else{
    result <- which(sentenceAsVector == word)
  }
  return(result)
}
# Returns the query word together with a neighbouring word:
# direction = "right" pastes the following word, "left" the preceding one.
# At the sentence boundaries the first or last two words are returned.
extend <- function(sentence, query, direction){
  listOfWords <- strsplit(sentence, split = " ")
  sentenceAsVector <- unlist(listOfWords)
  lengthOfSentence <- length(sentenceAsVector)
  location <- indexOf(sentence, query)

  # is the query the first or the last word of the sentence?
  boundary <- location == 1 | location == lengthOfSentence

  if(!boundary){
    if(location < lengthOfSentence & direction == "right"){
      return(paste(sentenceAsVector[location],
                   sentenceAsVector[location + 1],
                   sep = " "))
    }
    else if(location > 1 & direction == "left"){
      return(paste(sentenceAsVector[location - 1],
                   sentenceAsVector[location],
                   sep = " "))
    }
  }
  else{
    if(location == 1){
      return(paste(sentenceAsVector[1], sentenceAsVector[2], sep = " "))
    }
    if(location == lengthOfSentence){
      return(paste(sentenceAsVector[lengthOfSentence - 1],
                   sentenceAsVector[lengthOfSentence], sep = " "))
    }
  }
}
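
A quick usage sketch of the two functions above, using the question's sentence and query:

indexOf(sentence, query)
## [1] 4
extend(sentence, query, "right")
## [1] "for demo"
extend(sentence, query, "left")
## [1] "sentence for"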
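The corpus-based approach also works when the query occurs multiple times in the text: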
sentence2 <- "for one, for two! for three? for four"
text_ntoken(text_locate(sentence2, query)$before) + 1
## [1]  1  4  7 10
text_tokens(sentence2)[[1]][c(1, 4, 7, 10)]
## [1] "for" "for" "for" "for"