Python - Sentiment Analysis using Pointwise Mutual Information

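I need this code to compute pointwise mutual information (PMI), which can be used to classify reviews as positive or negative. Basically, I am using the technique specified by Turney (2002) as an example of an unsupervised classification method for sentiment analysis.

As explained in the paper, the semantic orientation (SO) of a phrase is negative if the phrase is more strongly associated with the word "poor", and positive if it is more strongly associated with the word "excellent".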

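from __future__ import division
import urllib
import json
from math import log


def hits(word1, word2=""):
    # Query the Google AJAX Search API and return the estimated result count.
    query = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%s"
    if word2 == "":
        results = urllib.urlopen(query % word1)
    else:
        # AROUND(10) asks for pages where the two terms occur close to each other.
        results = urllib.urlopen(query % word1+" "+"AROUND(10)"+" "+word2)
    json_res = json.loads(results.read())
    google_hits = int(json_res['responseData']['cursor']['estimatedResultCount'])
    return google_hits


def so(phrase):
    # Hits for the phrase near "excellent" versus near "poor".
    num = hits(phrase, "excellent")
    #print num
    den = hits(phrase, "poor")
    #print den
    ratio = num / den
    #print ratio
    sop = log(ratio)
    return sop

print so("ugly product")

The code above calculates the SO of a phrase. I use Google to get the number of hits and calculate the SO (since AltaVista no longer exists).

The computed values are very erratic and do not follow any particular pattern. For example, SO("ugly product") turns out to be 2.85462098541, while SO("beautiful product") is 1.71395061117, although the former is expected to be negative and the latter positive.

Is there something wrong with the code? Is there an easier way to calculate the SO of a phrase (using PMI) with any Python library, say NLTK? I tried NLTK but could not find any explicit method that computes PMI.

In general, calculating PMI is tricky because the formula changes depending on the size of the n-gram you want to take into consideration.

Mathematically, for bigrams, you can simply consider: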

log(p(a,b) / ( p(a) * p(b) ))

Programmatically, assuming you have already computed the frequencies of all the unigrams and bigrams in your corpus, you can do this:

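The function below is a code snippet from an MWE (multi-word expression) library that is still at a pre-development stage (). Note that the library is meant for parallel MWE extraction; the Python session further down shows how you can "hack" it to extract monolingual MWEs.

import math

def pmi(word1, word2, unigram_freq, bigram_freq):
    unigram_total = float(sum(unigram_freq.values()))
    bigram_total = float(sum(bigram_freq.values()))
    prob_word1 = unigram_freq[word1] / unigram_total
    prob_word2 = unigram_freq[word2] / unigram_total
    prob_word1_word2 = bigram_freq[" ".join([word1, word2])] / bigram_total
    # PMI in bits: log2 of the joint probability over the product of the marginals.
    return math.log(prob_word1_word2 / (prob_word1 * prob_word2), 2)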
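To make the role of the two frequency tables more concrete, here is a minimal, self-contained sketch of how they might be built from a tokenized corpus and fed into a bigram PMI computation. It is written for Python 3 and uses only the standard library. The helper names `count_ngrams` and `bigram_pmi` and the toy corpus are assumptions made up for illustration; they are not part of the MWE library mentioned above.

```python
import math
from collections import Counter


def count_ngrams(sentences):
    """Count unigrams and space-joined bigrams in tokenized sentences."""
    unigram_freq, bigram_freq = Counter(), Counter()
    for tokens in sentences:
        unigram_freq.update(tokens)
        bigram_freq.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return unigram_freq, bigram_freq


def bigram_pmi(word1, word2, unigram_freq, bigram_freq):
    """PMI in bits of the bigram "word1 word2"; None if it was never seen."""
    joint_count = bigram_freq[word1 + " " + word2]
    if joint_count == 0:
        return None  # log(0) is undefined, so report "no evidence" instead
    unigram_total = sum(unigram_freq.values())
    bigram_total = sum(bigram_freq.values())
    p_joint = joint_count / bigram_total
    p_word1 = unigram_freq[word1] / unigram_total
    p_word2 = unigram_freq[word2] / unigram_total
    return math.log2(p_joint / (p_word1 * p_word2))


# Hypothetical toy corpus, lowercased and whitespace-tokenized.
corpus = [
    "this is a foo bar sentence".split(),
    "more foo bar is needed".split(),
    "the text is a text".split(),
]
unigram_freq, bigram_freq = count_ngrams(corpus)
foo_bar = bigram_pmi("foo", "bar", unigram_freq, bigram_freq)
is_a = bigram_pmi("is", "a", unigram_freq, bigram_freq)
unseen = bigram_pmi("bar", "foo", unigram_freq, bigram_freq)
print(foo_bar, is_a, unseen)
```

In this toy corpus "foo bar" and "is a" both occur twice, but "foo bar" gets the higher PMI because "foo" and "bar" are rarer on their own than "is". Returning None for unseen bigrams is one possible design choice; smoothing the counts would be another.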

$ wget https://dl.dropboxusercontent.com/u/45771499/mwe.py
$ printf "This is a foo bar sentence .\nI need multi-word expression from this text file.\nThe text file is messed up , I know you foo bar multi-word expression thingy .\n More foo bar is needed , so that the text file is populated with some sort of foo bar bigrams to extract the multi-word expression ." > src.txt
$ printf "" > trg.txt
$ python
>>> import codecs
>>> from mwe import load_ngramfreq, extract_mwe

>>> # Calculates the unigrams and bigrams counts.
>>> # More superfluously, "Training a bigram 'language model'."
>>> unigram, bigram, _ , _ = load_ngramfreq('src.txt','trg.txt')

>>> sent = "This is another foo bar sentence not in the training corpus ."

>>> for threshold in range(-2, 5):
...     print threshold, [mwe for mwe in extract_mwe(sent.strip().lower(), unigram, bigram, threshold)]

[out]:

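-2 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
-1 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
0 ['this is', 'foo bar', 'bar sentence']
1 ['this is', 'foo bar', 'bar sentence']
2 ['this is', 'foo bar', 'bar sentence']
3 ['foo bar', 'bar sentence']
4 []

For further details, I found this paper a quick and easy introduction to MWE extraction: "Extending the Log Likelihood Measure to Improve Collocation Identification".

To answer why your results are unstable, it is important to know that Google Search is not a reliable source of word frequencies. The counts returned by the engine are mere estimates, which are particularly inaccurate, and possibly contradictory, when querying for multiple words. This is not to bash Google, but it is not a tool for counting frequencies. Therefore, your implementation may be fine, but results obtained on that basis can still be nonsensical.

For a more in-depth discussion of the matter, read "" by Adam Kilgarriff.

The Python library cited in the reference below includes functionality for working with co-occurrence matrices.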

Reference: Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. . In Proceedings of the System Demonstrations of ACL 2013, Sofia, Bulgaria.
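Returning to the original question, the following is a minimal sketch of how the semantic orientation could be computed once co-occurrence counts are available from some source, whether a search engine or a local corpus. It follows the definition in Turney (2002), SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor"), which collapses into a single log2 ratio. Unlike the `so()` function in the question, that ratio also includes the standalone counts of "excellent" and "poor". The function name `semantic_orientation`, its parameters and all counts in the usage lines are assumptions made up for illustration, and the 0.01 smoothing default mirrors the constant the paper adds to the hits to avoid division by zero. Nothing is queried over the network. The sketch targets Python 3.

```python
import math


def semantic_orientation(near_excellent, near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """SO = PMI(phrase, "excellent") - PMI(phrase, "poor"), in bits.

    near_excellent / near_poor: co-occurrence counts of the phrase with each
    reference word; hits_excellent / hits_poor: counts of the words alone.
    """
    if hits_excellent <= 0 or hits_poor <= 0:
        raise ValueError("reference word counts must be positive")
    # The phrase's own count cancels out of the difference of the two PMIs.
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


# Made-up counts, purely for illustration.
mostly_poor = semantic_orientation(30, 120, 1000, 1000)
mostly_excellent = semantic_orientation(120, 30, 1000, 1000)
# Same co-occurrence counts, but "excellent" is 8x more frequent overall.
normalized = semantic_orientation(120, 30, 8000, 1000)
print(mostly_poor, mostly_excellent, normalized)
```

In the third call the raw co-occurrence ratio favors "excellent", yet the score comes out negative because "excellent" is so much more frequent overall in these made-up counts. Leaving out that normalization may be one reason why both phrases in the question received positive values, in addition to the unreliable hit estimates discussed above.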