R: tracking word proximity


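I am working on a small project that involves dictionary-based text search over a collection of documents. My dictionary contains positive signal words (a.k.a. good words), but simply finding one of them in a document does not guarantee a positive result, because negative words, for example (no, not significant), may appear close to those positive words. I want to build a matrix that contains the document number, the positive word, and its proximity to the negative words.

Can anyone suggest a way to do this? My project is at a very early stage, so here is a basic example of my text: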

No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.

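This is my sample document, where candesartan cilexetil, glyburide, nifedipine, digoxin, warfarin, and hydrochlorothiazide are my positive words and "no significant" is my negative word. I want to build a word-based proximity mapping between my positive and negative words.

Can anyone give me some helpful pointers?

First of all, I would suggest not using R for this task. R is great for many things, but text manipulation is not one of them. Python might be a good alternative.
That said, if I were to implement this in R, I would probably do something like this (very, very roughly):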

    # You will probably read these from an external file or a database
    goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "blabla", "warfarin", "hydrochlorothiazide")
    badWords <- c("no significant", "other drugs")

    mytext <- "no significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide."
    mytext <- tolower(mytext) # Let's make life a little bit easier...

    goodPos <- NULL
    badPos <- NULL

    # First we find the good words.
    # regexpr returns the character offset of the first match, or -1 if there is none;
    # fixed = TRUE matches each word literally rather than as a regular expression.
    for (w in goodWords) {
      pos <- regexpr(w, mytext, fixed = TRUE)
      if (pos != -1) {
        cat(paste(w, "found at position", pos, "\n"))
      } else {
        pos <- NA
        cat(paste(w, "not found\n"))
      }
      goodPos <- c(goodPos, pos)
    }

    # And then the bad words
    for (w in badWords) {
      pos <- regexpr(w, mytext, fixed = TRUE)
      if (pos != -1) {
        cat(paste(w, "found at position", pos, "\n"))
      } else {
        pos <- NA
        cat(paste(w, "not found\n"))
      }
      badPos <- c(badPos, pos)
    }

    # Note that we use -badPos so that we can calculate the distance with rowSums.
    # The positions are character offsets, so the distance is measured in characters.
    comb <- expand.grid(goodPos, -badPos)
    wordcomb <- expand.grid(goodWords, badWords)
    dst <- cbind(wordcomb, abs(rowSums(comb)))

    # which.min skips the NA rows produced by words that were not found
    mn <- which.min(dst[,3])
    cat(paste("The closest good-bad word pair is: ", dst[mn, 1],"-", dst[mn, 2],"\n"))

Have you looked at either of these: the task view on CRAN, or the text mining package on CRAN?

Nice packages, I didn't know about them! Still, I don't think R is the best tool for this kind of analysis.

Yes, I use the tm package a lot! I almost got what I wanted. Thank you, nico!
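The answer's code measures distance in characters, whereas the question asks for word-based proximity and a table of document number, positive word, and distance. The following is only a rough sketch of one way that could look in base R, not the answerer's method. It reuses the word lists from the answer; the helper names `term_position` and `word_proximity`, the column names, and the simple tokenisation rule (split on anything that is not a letter or digit) are all assumptions made for illustration.

```r
# Hypothetical helper: index of the first token at which a (possibly
# multi-word) term starts, or NA if the term does not occur.
term_position <- function(term, tokens) {
  parts <- strsplit(term, " ", fixed = TRUE)[[1]]
  n <- length(parts)
  if (n > length(tokens)) return(NA_integer_)
  for (i in seq_len(length(tokens) - n + 1)) {
    if (all(tokens[i:(i + n - 1)] == parts)) return(i)
  }
  NA_integer_
}

# Hypothetical helper: one row per (document, good word found), giving the
# nearest bad word and the distance to it counted in words.
word_proximity <- function(docs, goodWords, badWords) {
  rows <- list()
  for (d in seq_along(docs)) {
    # Assumed tokenisation: lower-case, split on non-alphanumeric characters
    tokens <- strsplit(tolower(docs[d]), "[^a-z0-9]+")[[1]]
    tokens <- tokens[tokens != ""]
    badPos <- vapply(badWords, term_position, integer(1),
                     tokens = tokens, USE.NAMES = FALSE)
    for (g in goodWords) {
      gp <- term_position(g, tokens)
      # Skip good words that are absent, and documents with no bad word
      if (is.na(gp) || all(is.na(badPos))) next
      dist <- abs(gp - badPos)
      k <- which.min(dist)  # ignores NA entries
      rows[[length(rows) + 1]] <- data.frame(
        doc = d, good = g, nearest_bad = badWords[k], distance = dist[k],
        stringsAsFactors = FALSE)
    }
  }
  do.call(rbind, rows)  # NULL when nothing was found
}

goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin",
               "blabla", "warfarin", "hydrochlorothiazide")
badWords <- c("no significant", "other drugs")
docs <- c("No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.")

prox <- word_proximity(docs, goodWords, badWords)
print(prox)
```

Each term is located only at its first occurrence, as in the answer's code, so a document that repeats a word would need a loop over all matches. With more documents in `docs`, the `doc` column identifies which one each row came from.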