Is there text-processing functionality in R that operates at the word level?
I'm trying to find a set of functions in R that operate at the word level, e.g. a function that can return the position of a word. For example, given the following sentence and query:
sentence <- "A sample sentence for demo"
query <- "for"
As I mentioned in the comments, stringr is useful in these situations:
library(stringr)

sentence <- "A sample sentence for demo"
wordNumber <- 4L

fourthWord <- word(string = sentence,
                   start = wordNumber)
previousWords <- word(string = sentence,
                      start = wordNumber - 1L,
                      end = wordNumber)
laterWords <- word(string = sentence,
                   start = wordNumber,
                   end = wordNumber + 1L)
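As a cross-check, the same word positions can be recovered with base R alone, assuming simple whitespace tokenization (this snippet is not part of the original answer):

```r
sentence <- "A sample sentence for demo"

# Split on single spaces; strsplit returns a list, so take the first element
words <- strsplit(sentence, " ")[[1]]

words[4]                           # the fourth word: "for"
paste(words[3:4], collapse = " ")  # "sentence for"
paste(words[4:5], collapse = " ")  # "for demo"
```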
I hope this helps. If you use scan, it will split the input on whitespace:
> s.scan <- scan(text=sentence, what="")
Read 5 items
> which(s.scan == query)
[1] 4
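Note that scan keeps punctuation attached to the tokens, so for real sentences it helps to strip it first; a minimal base-R sketch (the punctuated sentence here is an invented example):

```r
sentence <- "A sample, sentence for demo!"

# Strip punctuation first, then let scan() split on whitespace;
# quiet = TRUE suppresses the "Read N items" message
tokens <- scan(text = gsub("[[:punct:]]", "", sentence),
               what = "", quiet = TRUE)
which(tokens == "for")
# [1] 4
```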
The what = "" is needed to tell scan to expect character rather than numeric input. If the input is full English sentences, you may need gsub with patt = "[[:punct:]]" to remove punctuation. If you are trying to classify parts of speech or process large documents, you may also want to look at the tm (text mining) package.

I've written my own function: an indexOf method that returns the index of a word if it is found in the sentence, and -1 otherwise.
The answer depends on what you mean by "word". If you mean whitespace-separated tokens, then @imran ali's answer works well. If you mean words as defined by Unicode, with particular attention to punctuation, then you need something more sophisticated. The following handles punctuation correctly:
library(corpus)
sentence <- "A sample sentence for demo"
query <- "for"
# use text_locate to find all instances of the query, with context
text_locate(sentence, query)
## text before instance after
## 1 1 A sample sentence for demo
# find the number of tokens before, then add 1 to get the position
text_ntoken(text_locate(sentence, query)$before) + 1
## 4
Check out stringr::word, e.g. word(string, start = 1L, end = start, sep = fixed(" ")). You can also use negative indices, e.g. start = -2L, end = -1L, to get the last two words.
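In base R, the last two words can be pulled out with tail() on the split vector; a sketch equivalent to stringr's negative indexing (not part of the original answer):

```r
sentence <- "A sample sentence for demo"
words <- strsplit(sentence, " ")[[1]]

# Last two words, counting from the end of the vector
paste(tail(words, 2), collapse = " ")
# [1] "for demo"
```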
indexOf <- function(sentence, word) {
  listOfWords <- strsplit(sentence, split = " ")
  sentenceAsVector <- unlist(listOfWords)
  if (!(word %in% sentenceAsVector)) {
    result <- -1
  } else {
    result <- which(sentenceAsVector == word)
  }
  return(result)
}
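For a single first match, base R's match() already follows a similar convention, returning NA instead of -1 when the word is absent. The wordIndex wrapper below is a hypothetical sketch, not part of the original answer; unlike which(), match() reports only the first occurrence:

```r
wordIndex <- function(sentence, word) {
  # match() gives the position of the first hit, or NA if absent
  pos <- match(word, strsplit(sentence, " ")[[1]])
  if (is.na(pos)) -1L else pos
}

wordIndex("A sample sentence for demo", "for")  # 4
wordIndex("A sample sentence for demo", "cat")  # -1
```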
extend <- function(sentence, query, direction) {
  listOfWords <- strsplit(sentence, split = " ")
  sentenceAsVector <- unlist(listOfWords)
  lengthOfSentence <- length(sentenceAsVector)
  location <- indexOf(sentence, query)
  boundary <- location == 1 | location == lengthOfSentence
  if (!boundary) {
    if (direction == "right") {
      return(paste(sentenceAsVector[location],
                   sentenceAsVector[location + 1],
                   sep = " "))
    } else if (direction == "left") {
      return(paste(sentenceAsVector[location - 1],
                   sentenceAsVector[location],
                   sep = " "))
    }
  } else {
    if (location == 1) {
      return(paste(sentenceAsVector[1], sentenceAsVector[2], sep = " "))
    }
    if (location == lengthOfSentence) {
      return(paste(sentenceAsVector[lengthOfSentence - 1],
                   sentenceAsVector[lengthOfSentence], sep = " "))
    }
  }
}
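The same behavior can be condensed with vector indexing; neighbors below is a hypothetical compact rewrite, a sketch rather than the original answer's code:

```r
neighbors <- function(sentence, query, direction = c("right", "left")) {
  direction <- match.arg(direction)
  words <- strsplit(sentence, " ")[[1]]
  i <- which(words == query)[1]  # first occurrence, NA if absent
  if (is.na(i)) return(NA_character_)
  n <- length(words)
  # At either boundary there is only one possible neighbor
  if (i == 1) return(paste(words[1:2], collapse = " "))
  if (i == n) return(paste(words[(n - 1):n], collapse = " "))
  if (direction == "right") {
    paste(words[i:(i + 1)], collapse = " ")
  } else {
    paste(words[(i - 1):i], collapse = " ")
  }
}

neighbors("A sample sentence for demo", "for", "right")  # "for demo"
neighbors("A sample sentence for demo", "for", "left")   # "sentence for"
```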
The corpus approach above also handles punctuation and multiple matches:
sentence2 <- "for one, for two! for three? for four"
text_ntoken(text_locate(sentence2, query)$before) + 1
## [1] 1 4 7 10
text_tokens(sentence2)[[1]][c(1, 4, 7, 10)]
## [1] "for" "for" "for" "for"
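A rough base-R approximation strips punctuation before tokenizing; note that the resulting positions differ from corpus's, since corpus counts the punctuation marks themselves as tokens (this comparison is a sketch, not part of the original answer):

```r
sentence2 <- "for one, for two! for three? for four"

# Drop punctuation, then split on runs of whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", sentence2), "\\s+")[[1]]
which(tokens == "for")
# [1] 1 3 5 7
```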