Python (biomedical) word stemming


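I am familiar with word stemming and stem completion from the tm package in R.

I'm trying to come up with a quick and dirty method for finding all variants of a given word (within some corpus). For example, if my input is "leukocyte", I'd like to get back its other forms, such as the plural "leukocytes".

If I had to do it right now, I would probably go with something like this: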

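library(tm)
library(RWeka)
data("crude")  # sample Reuters corpus shipped with tm
# Collect every distinct word in the corpus
dictionary <- unique(unlist(lapply(crude, words)))
# Keep the dictionary entries that contain the Lovins stem of the query word
grep(pattern = LovinsStemmer("company"),
    ignore.case = TRUE, x = dictionary, value = TRUE)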
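
For comparison, the same stem-then-grep idea can be sketched in Python using only the standard library. This is a rough illustration, not a port of the R code: naive_stem is a hypothetical stand-in for a real stemmer such as Lovins or Porter, and the small vocabulary list is made up for the example.

```python
import re


def naive_stem(word):
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ies", "es", "s", "ing", "ed", "y", "e"):
        # Only strip when at least three letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def grep_variants(word, vocabulary, stemmer=naive_stem):
    """Return vocabulary entries that contain the stem of `word`, ignoring case."""
    pattern = re.compile(re.escape(stemmer(word.lower())), re.IGNORECASE)
    return [entry for entry in vocabulary if pattern.search(entry)]


# Made-up vocabulary, standing in for the corpus dictionary.
vocabulary = ["company", "Companies", "accompany", "compare", "oil"]
print(grep_variants("company", vocabulary))
# ['company', 'Companies', 'accompany']
```

Because the stem is matched anywhere in the word, unrelated entries such as "accompany" come back too; anchoring the pattern at the start of the word would be one way to tighten it.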

The solution below requires preprocessing the corpus, but once that is done, finding the variants of a word is a quick dictionary lookup.
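from collections import defaultdict
from stemming.porter2 import stem

# Load the word list that serves as the corpus (one word per line)
with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()

# Preprocessing: map each stem to every word that reduces to it
stems = defaultdict(list)

for word in words:
    word_stem = stem(word)
    stems[word_stem].append(word)

if __name__ == '__main__':
    # Lookup: stem the query, then fetch all words sharing that stem
    word = 'leukocyte'
    word_stem = stem(word)
    print(stems[word_stem])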

For the /usr/share/dict/words corpus, this produces the result
['leukocyte', "leukocyte's", 'leukocytes']

It uses the stemming module, which can be installed with
pip install stemming
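
The same preprocess-once, look-up-many pattern also works when the corpus is running text rather than a word list. The sketch below is only an illustration under stated assumptions: strip_plural is a hypothetical stand-in so the example runs without third-party packages (a real stemmer could be passed in instead), and the sample sentence is invented.

```python
import re
from collections import defaultdict


def strip_plural(word):
    """Hypothetical stand-in stemmer: drop a possessive or plural ending."""
    word = word.lower()
    if word.endswith("'s"):
        word = word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def build_stem_index(text, stemmer=strip_plural):
    """Map each stem to the sorted surface forms seen in `text`."""
    index = defaultdict(set)
    # Tokens are runs of letters, optionally followed by an apostrophe suffix.
    for token in re.findall(r"[A-Za-z]+(?:'[a-z]+)?", text):
        index[stemmer(token)].add(token.lower())
    return {stem: sorted(forms) for stem, forms in index.items()}


# Invented sample text, standing in for a real corpus.
corpus = ("A leukocyte count was taken. Leukocytes were elevated; "
          "the leukocyte's nucleus was stained.")
index = build_stem_index(corpus)
print(index[strip_plural("leukocytes")])
# ['leukocyte', "leukocyte's", 'leukocytes']
```

Using a set per stem removes repeated tokens, which matters for real text where the same form appears many times.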

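Can you give an example where your proposed solution doesn't do what you want? What do you mean when you say the Porter stemmer is not aggressive enough?

Coming back to this. Running under Ubuntu 16.04.2 and Python 2.7.12, it doesn't return anything for "leukocyte". Maybe our words files are different?! With "jealous" I do get a list of three related forms. For "muscle" I get ["muscle", "muscled", "muscling"], but not "muscular". So I guess this is a partial solution.

Maybe I'm expecting too much from inverse stemming, especially for biomedical terms. I can get the plural and adjectival forms from it.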
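
The "muscle" versus "muscular" case comes down to how aggressively the stemmer conflates words. The sketch below makes that trade-off visible with a deliberately crude, hypothetical truncate_stem function that keeps only the first few letters. It is not what the Porter stemmer does and is not a recommendation, and the word list is made up.

```python
from collections import defaultdict


def truncate_stem(word, length=4):
    """Crude, deliberately aggressive stand-in stemmer: keep the first `length` letters."""
    return word.lower()[:length]


def group_by_stem(words, stemmer):
    """Map each stem to the words that reduce to it, in input order."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer(word)].append(word)
    return groups


# Made-up word list for illustration.
words = ["muscle", "muscled", "muscling", "muscular", "muscat", "music"]
groups = group_by_stem(words, truncate_stem)
print(groups[truncate_stem("muscle")])
# ['muscle', 'muscled', 'muscling', 'muscular', 'muscat']
```

The more aggressive stem now finds "muscular", but it also pulls in the unrelated "muscat", so more recall comes at the cost of precision.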