Are there text-processing functions in R that operate at the word level?

Tags: r, string, nlp, text-processing

I am trying to find a set of functions in R that operate at the word level, e.g. functions that can return the position of a word. For example, given the following sentence and query:

sentence <- "A sample sentence for demo"
query <- "for"

As I mentioned in the comments, stringr is useful in these cases:

library(stringr)

sentence <- "A sample sentence for demo"
wordNumber <- 4L

fourthWord <- word(string = sentence,
                   start = wordNumber)

previousWords <- word(string = sentence,
                       start = wordNumber - 1L,
                       end = wordNumber)

laterWords <- word(string = sentence,
                   start = wordNumber,
                   end = wordNumber + 1L)
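
If the word's position is not known in advance, one way to find it (a minimal sketch, assuming plain space-separated words and that library(stringr) is loaded as above) is to split the sentence and search the resulting vector:

query <- "for"
# position of the query word among the space-separated words
wordNumber <- which(str_split(sentence, fixed(" "))[[1]] == query)
wordNumber
## [1] 4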

I hope this helps.

If you use scan, it will split the input on whitespace:

> s.scan <- scan(text=sentence, what="")
Read 5 items
> which(s.scan == query)
[1] 4

The what = "" is needed to tell scan to expect character rather than numeric input. If the input were full English sentences, you might first need to remove the punctuation with gsub and patt = "[[:punct:]]". If you are trying to classify parts of speech or process large documents, you may also want to look at the tm (text mining) package.
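
For example, a small sketch of that punctuation idea (s2 here is just a made-up variant of the sample sentence with punctuation added):

s2 <- "A sample sentence, for demo!"
# strip punctuation, then split on whitespace as before
s2.clean <- gsub("[[:punct:]]", "", s2)
which(scan(text = s2.clean, what = "") == query)
## [1] 4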

I have written my own functions: indexOf returns the index of the word if it is found in the sentence and -1 otherwise, and extend returns the query word together with a neighbouring word. The code for both is listed further down, after the corpus-based answer.

The answer depends on what you mean by "word". If you mean whitespace-delimited tokens, then @imran ali's answer works fine. If you mean words as defined by Unicode, with punctuation handled properly, then you need something more sophisticated.

The following handles punctuation correctly:

library(corpus)
sentence <- "A sample sentence for demo"
query <- "for"

# use text_locate to find all instances of the query, with context
text_locate(sentence, query)
##   text             before              instance              after              
## 1 1                 A sample sentence    for     demo             

# find the number of tokens before, then add 1 to get the position
text_ntoken(text_locate(sentence, query)$before) + 1
## 4

Take a look at stringr::word. For example: word(string, start = 1L, end = start, sep = fixed(" ")). You can also use negative indices, e.g. start = -2L, end = -1L, to get the last two words.
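
A quick sketch of that suggestion applied to the example sentence (assuming library(stringr) is loaded):

word(sentence, start = 1L, end = 1L)    # "A"
word(sentence, start = -2L, end = -1L)  # "for demo", i.e. the last two words

The code for the indexOf and extend functions mentioned earlier is: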
# Returns the position of `word` in `sentence` (split on spaces),
# or -1 if the word is not present.
indexOf <- function(sentence, word){
  listOfWords <- strsplit(sentence, split = " ")
  sentenceAsVector <- unlist(listOfWords)

  if(word %in% sentenceAsVector == FALSE){
    result <- -1
  }
  else{
    result <- which(sentenceAsVector == word)
  }
  return(result)
}
# Returns the query word together with a neighbouring word:
# direction = "right" pastes the following word, "left" the preceding one.
# At the sentence boundaries the first or last two words are returned.
extend <- function(sentence, query, direction){
  listOfWords <- strsplit(sentence, split = " ")
  sentenceAsVector <- unlist(listOfWords)
  lengthOfSentence <- length(sentenceAsVector)
  location <- indexOf(sentence, query)

  # is the query the first or the last word of the sentence?
  boundary <- location == 1 | location == lengthOfSentence

  if(!boundary){
    if(location < lengthOfSentence & direction == "right"){
      return(paste(sentenceAsVector[location],
                   sentenceAsVector[location + 1],
                   sep = " "))
    }
    else if(location > 1 & direction == "left"){
      return(paste(sentenceAsVector[location - 1],
                   sentenceAsVector[location],
                   sep = " "))
    }
  }
  else{
    if(location == 1){
      return(paste(sentenceAsVector[1], sentenceAsVector[2], sep = " "))
    }
    if(location == lengthOfSentence){
      return(paste(sentenceAsVector[lengthOfSentence - 1],
                   sentenceAsVector[lengthOfSentence], sep = " "))
    }
  }
}
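
A quick usage sketch of the two functions above, using the question's sentence and query:

indexOf(sentence, query)
## [1] 4
extend(sentence, query, "right")
## [1] "for demo"
extend(sentence, query, "left")
## [1] "sentence for"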
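The corpus-based approach also works when the query occurs multiple times in the text: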
sentence2 <- "for one, for two! for three? for four"
text_ntoken(text_locate(sentence2, query)$before) + 1
## [1]  1  4  7 10
text_tokens(sentence2)[[1]][c(1, 4, 7, 10)]
## [1] "for" "for" "for" "for"