
Python NLTK - automatically translating similar words


Big-picture goal: I'm building an LDA model of product reviews in Python using NLTK and Gensim, and I want to run it on varying n-grams.

The problem: everything works fine with unigrams, but when I move to bigrams I start getting topics that repeat information. For example, Topic 1 might contain:

["good product", "good value"]

and Topic 4 might contain:

["great product", "great value"]

To a human these obviously convey the same information, but "good product" and "great product" are, of course, distinct bigrams. How do I algorithmically determine that "good product" and "great product" are similar enough that I can convert every occurrence of one into the other (probably whichever occurs more frequently in the corpus)?

What I've tried: I've played with WordNet's Synset trees, with little luck. It turns out good is an "adjective" but great is an "adjective satellite", so path similarity returns None. My thought process from there:

  • POS-tag the sentence
  • Use those POS tags to find the right Synsets
  • Compute the similarity of the two Synsets
  • If they are above some threshold, count the occurrences of the two words
  • Replace the word with fewer occurrences with the word with more occurrences (a rough sketch of these steps follows the question below)
Ideally, though, I'd love an algorithm that can determine that good and great are similar in my corpus (possibly in a co-occurrence sense), so that it extends to words that aren't part of general English but do appear in my corpus, and so that it extends to n-grams (maybe Oracle and terrible are synonymous in my corpus, or feature engineering and feature creation are similar).


Any advice on an algorithm, or on getting the WordNet Synsets to behave?
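
For reference (not part of the original question), here is a minimal sketch of the five-step pipeline above, using NLTK's WordNet interface; the helper names and the threshold value are illustrative. Note that path_similarity between adjective senses very often returns None (adjectives have no hypernym hierarchy), which is exactly the failure described, so the merge step rarely fires for adjectives:

    from collections import Counter
    from nltk.corpus import wordnet as wn

    def best_similarity(word1, word2, pos):
        # Best path similarity over all sense pairs; None scores
        # (common for adjectives) are dropped.
        scores = [s1.path_similarity(s2)
                  for s1 in wn.synsets(word1, pos)
                  for s2 in wn.synsets(word2, pos)]
        scores = [s for s in scores if s is not None]
        return max(scores) if scores else None

    def merge_if_similar(tokens, word1, word2, pos='a', threshold=0.5):
        # Steps 4-5: if similar enough, rewrite the rarer word as the commoner one.
        score = best_similarity(word1, word2, pos)
        if score is None or score < threshold:
            return tokens
        counts = Counter(tokens)
        keep, drop = (word1, word2) if counts[word1] >= counts[word2] else (word2, word1)
        return [keep if t == drop else t for t in tokens]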

    If you're going to use WordNet, you'll have to deal with:

    Problem 1: Word sense disambiguation (WSD), i.e. how do you automatically determine which synset to use?

    >>> for i in wn.synsets('good','a'):
    ...     print i.name, i.definition
    ... 
    good.a.01 having desirable or positive qualities especially those suitable for a thing specified
    full.s.06 having the normally expected amount
    good.a.03 morally admirable
    estimable.s.02 deserving of esteem and respect
    beneficial.s.01 promoting or enhancing well-being
    good.s.06 agreeable or pleasing
    good.s.07 of moral excellence
    adept.s.01 having or showing knowledge and skill and aptitude
    good.s.09 thorough
    dear.s.02 with or in a close or intimate relationship
    dependable.s.04 financially sound
    good.s.12 most suitable or right for a particular purpose
    good.s.13 resulting favorably
    effective.s.04 exerting force or influence
    good.s.15 capable of pleasing
    good.s.16 appealing to the mind
    good.s.17 in excellent physical condition
    good.s.18 tending to promote physical well-being; beneficial to health
    good.s.19 not forged
    good.s.20 not left to spoil
    good.s.21 generally admired
    
    >>> for i in wn.synsets('great','a'):
    ...     print i.name, i.definition
    ... 
    great.s.01 relatively large in size or number or extent; larger than others of its kind
    great.s.02 of major significance or importance
    great.s.03 remarkable or out of the ordinary in degree or magnitude or effect
    bang-up.s.01 very good
    capital.s.03 uppercase
    big.s.13 in an advanced stage of pregnancy
    
    Let's say you somehow get the correct sense, maybe you tried something like this (), and let's say you've got the right POS and synset:

    good (adjective): having desirable or positive qualities, especially those suitable for a thing specified
    great (adjective satellite): relatively large in size or number or extent; larger than others of its kind
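
    One concrete way to attempt that sense-picking step (my suggestion, not from the answer itself) is NLTK's simple built-in Lesk implementation, nltk.wsd.lesk, available in NLTK 3 (the transcripts here are NLTK 2-era). It picks a sense by definition overlap with the surrounding words; crude, but easy to try:

    from nltk.wsd import lesk

    sent = "this is a good product and a great value".split()
    good_sense = lesk(sent, 'good', 'a')    # some adjective Synset for 'good'
    great_sense = lesk(sent, 'great', 'a')  # some adjective Synset for 'great'
    print(good_sense, great_sense)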

    Problem 2: How do you compare the two synsets?

    Let's try the similarity functions, but you'll realize that they don't give you any score:

    >>> good = wn.synsets('good','a')[0]
    >>> great = wn.synsets('great','a')[0]
    >>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
    None
    >>> print max(wn.wup_similarity(good,great), wn.wup_similarity(great, good))
    
    >>> print max(wn.res_similarity(good,great,semcor_ic), wn.res_similarity(great, good,semcor_ic))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1312, in res_similarity
        return synset1.res_similarity(synset2, ic, verbose)
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 738, in res_similarity
        ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
        (synset1, synset2))
    nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
    >>> print max(wn.jcn_similarity(good,great,semcor_ic), wn.jcn_similarity(great, good,semcor_ic))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1316, in jcn_similarity
        return synset1.jcn_similarity(synset2, ic, verbose)
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 759, in jcn_similarity
        ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
        (synset1, synset2))
    nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
    >>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
        return synset1.lin_similarity(synset2, ic, verbose)
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
        ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
        (synset1, synset2))
    nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
    >>> print max(wn.lch_similarity(good,great), wn.lch_similarity(great, good))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1304, in lch_similarity
        return synset1.lch_similarity(synset2, verbose, simulate_root)
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 638, in lch_similarity
        (self, other))
    nltk.corpus.reader.wordnet.WordNetError: Computing the lch similarity requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
    
    You realize that there is still no similarity information between satellite adjectives to compare:

    >>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
        return synset1.lin_similarity(synset2, ic, verbose)
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
        ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1645, in _lcs_ic
        ic1 = information_content(synset1, ic)
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1666, in information_content
        raise WordNetError(msg % synset.pos)
    nltk.corpus.reader.wordnet.WordNetError: Information content file has no entries for part-of-speech: s
    >>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
    None
    
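    One partial workaround (my addition, not from the original answer): every satellite adjective ('s') in WordNet carries similar_to pointers to a head adjective ('a'), so you can map each sense up to its head cluster before comparing. There's no guarantee the heads end up comparable either, but it at least gets both sides into the 'a' part of speech:

    def head_adjectives(synset):
        # For a satellite adjective, follow the similar_to pointers to its
        # head adjective synset(s); anything else is returned unchanged.
        # (NLTK 3 API, where pos() and similar_tos() are methods.)
        if synset.pos() == 's':
            return synset.similar_tos()
        return [synset]
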
    By now it seems that WordNet creates more problems here than it solves, so let's try another approach: word clustering.

    At this point I'll also give up answering the broad, open-ended question the OP posted, because a LOT of work has been done on clustering that is automagical to mere mortals like me =)

    You said (emphasis added):

    Ideally, though, I'd love an algorithm that can determine that good and great are similar in my corpus (possibly in a co-occurrence sense).

    You can measure the similarity of words by how often they appear in the same sentence as other words (i.e. co-occurrence). To capture more semantic relatedness, you can also look at collocations, i.e. how often words appear within the same window around a given word.
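
    To make the co-occurrence idea concrete, here is a small self-contained sketch (the window size and the cosine comparison are my illustrative choices): count the words that appear within a fixed window of each target word, then compare two words by the cosine of their context-count vectors:

    import math
    from collections import Counter

    def context_counts(sentences, window=2):
        # Map each word to a Counter of the words seen within +/- `window` of it.
        counts = {}
        for sent in sentences:
            for i, word in enumerate(sent):
                ctx = counts.setdefault(word, Counter())
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        ctx[sent[j]] += 1
        return counts

    def cosine(c1, c2):
        # Cosine similarity between two context-count vectors.
        num = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
        den = (math.sqrt(sum(v * v for v in c1.values()))
               * math.sqrt(sum(v * v for v in c2.values())))
        return num / den if den else 0.0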

    There is work on word sense disambiguation (WSD) that uses collocations and surrounding words (co-occurrences) as part of its feature space, with pretty good results, so I'd guess you can use the same features for your problem.

    In Python you can use gensim for this; in particular, you may want to look at its tutorials (which come with sample code) to help you get started.
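
    For instance, a minimal word-vector run with gensim might look like the sketch below (gensim 4.x keyword names; older releases spell vector_size as size). The toy corpus stands in for your tokenized reviews, and on anything this tiny the score is noise; the point is only the shape of the API:

    from gensim.models import Word2Vec

    reviews = [["good", "product", "at", "a", "great", "price"],
               ["great", "product", "and", "good", "value"]]
    model = Word2Vec(reviews, vector_size=50, window=5, min_count=1, epochs=100)
    print(model.wv.similarity("good", "great"))  # cosine similarity of the two vectors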

    The general idea is as follows (a rough sketch follows the list):

  • Take the pair of bigrams you want to check for similarity
  • Using your corpus, generate collocation and co-occurrence features for each bigram
  • Train an SVM to learn the features of the first bigram
  • Run that SVM on occurrences of the other bigram (you get some score here)
  • Possibly, use the score to determine whether the two bigrams are similar
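
    A rough sketch of those steps (my interpretation; the answer does not name a library, so scikit-learn here is an assumption): the windows around each occurrence of bigram A train a one-class SVM, and scoring the windows around bigram B gives a crude signal of whether the two bigrams occur in similar contexts:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import OneClassSVM

    def contexts(sentences, bigram, window=3):
        # Collect the words within `window` tokens of each occurrence of the bigram.
        first, second = bigram
        out = []
        for sent in sentences:
            for i in range(len(sent) - 1):
                if sent[i] == first and sent[i + 1] == second:
                    ctx = sent[max(0, i - window):i] + sent[i + 2:i + 2 + window]
                    out.append(" ".join(ctx))
        return out

    def bigram_context_score(sentences, bigram_a, bigram_b):
        vec = CountVectorizer()
        X_a = vec.fit_transform(contexts(sentences, bigram_a))  # steps 2-3
        clf = OneClassSVM(gamma="auto").fit(X_a)
        X_b = vec.transform(contexts(sentences, bigram_b))      # step 4
        return clf.decision_function(X_b).mean()                # step 5: threshold yourself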

  • These don't convey the same information to me. Great is stronger than good. Also, "good value" implies an attractive price for the product's level of quality, while "good product" implies the product is high quality. The newest Mac Pro looks like a great product; I wouldn't say it's a great value. One approach would be to ask whether substituting great for good, or value for product, would actually change some outcome of interest. @ChrisP - I see your point. But here's a …