NLP - Identify and replace words (synonyms) in R
I have a problem with my code in R. I have a dataset (questions) with 4 columns and more than 600k observations; one of the columns is named "V3". This column contains questions such as "What is the day today?". I have a second dataset (voc) with two columns, one named "word" and the other named "synonyms". If a word from the "synonyms" column of the second dataset (voc) appears in my first dataset (questions), I want to replace it with the word from the "word" column.
questions = cbind(V3=c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
                      V3
1 What is the day today?
2     Tom has brown eyes
voc = cbind(word=c("weather", "a","blue"),synonyms=c("day", "the", "brown"))
voc <- data.frame(voc)
word synonyms
1 weather day
2 a the
3 blue brown
Desired output:
                      V3                       V5
1 What is the day today? What is a weather today?
2     Tom has brown eyes        Tom has blue eyes
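First, it is important to use the stringsAsFactors = FALSE option, either at the program level or during data import. Unless you specify otherwise, R versions before 4.0.0 convert strings to factors by default (from 4.0.0 onward the default is FALSE). Factors are useful in modelling, but you want to analyse the text itself, so you should make sure your text is not coerced to factors.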
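To make the factor issue concrete, here is a minimal sketch that builds the example column both ways. The object names as_factor and as_text are made up for illustration and do not appear in the original code.

# Build the same column twice: once as a factor, once as plain character
as_factor <- data.frame(V3 = c("What is the day today?", "Tom has brown eyes"),
                        stringsAsFactors = TRUE)
as_text <- data.frame(V3 = c("What is the day today?", "Tom has brown eyes"),
                      stringsAsFactors = FALSE)

class(as_factor$V3)
class(as_text$V3)

# strsplit() only accepts character input, so a factor column has to be
# converted with as.character() before it can be split into words
words <- strsplit(as.character(as_factor$V3), split = " ")

With the character version, the as.character() step is unnecessary and the column can be passed to strsplit() directly.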
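My approach is to write a function that "explodes" each string into a vector of words and then uses match to replace the words. The vector is then reassembled into a single string.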
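I'm not sure how well this will perform on your 600k records. You could look at some of the R packages for string handling, such as stringr or stringi, as they may have functions that do these tasks. match seems fine in terms of speed, but %in% can be a real beast, depending on the length of the strings and other factors.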
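The match and %in% step can be looked at in isolation before it is wrapped in a function. This is only a sketch on the first example sentence; the intermediate names hits and idx are introduced here for illustration.

orig_words <- c("What", "is", "the", "day", "today?")
words_to_repl <- c("day", "the", "brown")
repl_words <- c("weather", "a", "blue")

# %in% gives one logical flag per word: is it in the synonym list?
hits <- orig_words %in% words_to_repl

# match() gives the position of each word in the synonym list;
# nomatch = 0 means words that are not found get index 0
idx <- match(orig_words, words_to_repl, nomatch = 0)

# Indexing with 0 returns nothing, so repl_words[idx] contains
# exactly one replacement per flagged word, in sentence order
orig_words[hits] <- repl_words[idx]
new_sent <- paste(orig_words, collapse = " ")

Note that "today?" is left alone: the split is on spaces only, so a word with punctuation attached never equals a bare synonym.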
# Start with options to make sure strings are represented correctly
# The rest is your original code (mildly tidied to my own standard)
options(stringsAsFactors = FALSE)
questions <- cbind(V3 = c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
voc <- cbind(word = c("weather","a","blue"),
             synonyms = c("day","the","brown"))
voc <- data.frame(voc)
# This function takes:
# - an input string
# - a vector of words to replace
# - a vector of the words to use as replacements
# It returns a list of the original input and the changed version
uFunc_FindAndReplace <- function(input_string,words_to_repl,repl_words) {
  # Start by breaking the input string into a vector
  # Note that we use [[1]] to get first list element of strsplit output
  # Obviously this relies on breaking sentences by spacing
  orig_words <- strsplit(x = input_string,split = " ")[[1]]
  # If we find at least one of the words to replace in the original words, proceed
  if(sum(orig_words %in% words_to_repl) > 0) {
    # The left side selects the elements of orig_words that match words to be replaced
    # The right side uses match to find the numeric index of those words within the words_to_repl vector
    # (nomatch = 0 drops the words that are not found)
    # This numeric vector is used to select the values from repl_words
    # These then replace the values in orig_words
    orig_words[orig_words %in% words_to_repl] <- repl_words[match(x = orig_words,table = words_to_repl,nomatch = 0)]
    # We rebuild the sentence again, and return a list with original and new version
    new_sent <- paste(orig_words,collapse = " ")
    return(list(original = input_string,new = new_sent))
  } else {
    # Otherwise we return the original version since no changes are needed
    return(list(original = input_string,new = input_string))
  }
}
# Using do.call and rbind.data.frame, we can collapse the output of a lapply()
do.call(what = rbind.data.frame,
        args = lapply(X = questions$V3,
                      FUN = uFunc_FindAndReplace,
                      words_to_repl = voc$synonyms,
                      repl_words = voc$word))
                original                      new
1 What is the day today? What is a weather today?
2     Tom has brown eyes        Tom has blue eyes
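Since the question is tagged gsub, a regex-based variant may also be worth trying on a large dataset. The sketch below is not part of the answer above: replace_synonyms is a hypothetical helper, and it assumes the synonyms contain no regex metacharacters. It loops over the small vocabulary and lets gsub work on the whole column at once.

questions <- data.frame(V3 = c("What is the day today?", "Tom has brown eyes"),
                        stringsAsFactors = FALSE)
voc <- data.frame(word = c("weather", "a", "blue"),
                  synonyms = c("day", "the", "brown"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: one vectorised gsub() pass per synonym
replace_synonyms <- function(text, synonyms, words) {
  for (j in seq_along(synonyms)) {
    # \\b marks a word boundary, so "day" is not replaced inside "today"
    pattern <- paste0("\\b", synonyms[j], "\\b")
    text <- gsub(pattern, words[j], text, perl = TRUE)
  }
  text
}

questions$V5 <- replace_synonyms(questions$V3, voc$synonyms, voc$word)

Because the replacements are applied one after another, a replacement word that is itself listed as a synonym further down would be replaced again; the match-based function does not have that behaviour. Unlike splitting on spaces, the word-boundary pattern would also match a synonym that is followed by punctuation.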
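Your question is unlikely to attract answers as it stands. Please provide some sample data (the first few rows of the data frames involved); an example of the desired output would also be good.
OK! :) Thanks for the suggestion.
Well done! Thank you very much :) It works correctly on my big dataset.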
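# Loop-based version: check every question against every row of voc
questions$V5 <- as.character(questions$V3)
for (i in seq_len(nrow(questions)))
{
  for (j in seq_len(nrow(voc)))
  {
    # Split the current sentence into words and flag those equal to the synonym
    words <- strsplit(questions$V5[i], " ")[[1]]
    hits <- words == as.character(voc[j, 2])
    if (any(hits))
    {
      # Swap in the word from column 1 and rebuild the sentence
      words[hits] <- as.character(voc[j, 1])
      questions$V5[i] <- paste(words, collapse = " ")
    }
  }
}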