Python - evaluating bigrams doesn't work when creating a corpus with NLTK

Tags: python, twitter, nltk, sentiment-analysis

I am currently creating a custom corpus with NLTK to do sentiment analysis on Twitter messages.

My corpus contains positive and negative tweets. I gave the relevant folder the same structure as the original "movie_reviews" folder: it is called my_movie_reviews25K, with pos & neg subfolders, each containing 25K text files with one pos or neg tweet each.

Now, when I build and evaluate this custom corpus, it works perfectly with the following code:

#this code creates corpora of my own pos/neg tweets. 
import collections

import nltk.classify.util
import nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# raw strings keep the backslashes in the Windows path from being read as escapes
root_folder = r'C:\Users\gerbuiker\Desktop\my_movie_reviews25K'
movie_reviews = CategorizedPlaintextCorpusReader(root_folder, r'.*\.txt', cat_pattern=r'(\w+)')
movie_reviews.categories()

# define the split of % training / % test
SPLIT = 0.8

def word_feats(words):
    # bag-of-words features: map every token to True
    return dict([(word, True) for word in words])

posids = movie_reviews.fileids('pos')
negids = movie_reviews.fileids('neg')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

cutoff = int(len(posfeats) * SPLIT)

trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
testfeats = negfeats[cutoff:] + posfeats[cutoff:]

print 'Train on %d instances\nTest on %d instances' % (len(trainfeats),len(testfeats))

classifier = NaiveBayesClassifier.train(trainfeats)
print 'Accuracy:', nltk.classify.util.accuracy(classifier, testfeats)

classifier.show_most_informative_features()


refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
print 'pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
print 'neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg'])
Output:

Train on 40000 instances
Test on 10000 instances
Accuracy: 0.7449
Most Informative Features
            followfriday = True              pos : neg    =    161.0 : 1.0
                  bummed = True              neg : pos    =     27.7 : 1.0
                  female = True              neg : pos    =     22.2 : 1.0
                   hurts = True              neg : pos    =     20.5 : 1.0
                anywhere = True              neg : pos    =     19.7 : 1.0
                 snowing = True              neg : pos    =     19.0 : 1.0
                      ff = True              pos : neg    =     18.1 : 1.0
                  throat = True              neg : pos    =     17.2 : 1.0
                 hurting = True              neg : pos    =     17.0 : 1.0
                   essay = True              neg : pos    =     16.6 : 1.0
pos precision: 0.831393775372
pos recall: 0.6144
pos F-measure: 0.706612995975
neg precision: 0.694210943695
neg recall: 0.8754
neg F-measure: 0.77434763379
To improve accuracy, I want to include bigrams. For that I use the following code:

#this code creates corpora of my own pos/neg tweets. Includes bigrams
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
import collections, itertools
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist


root_folder = r'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews25K'
movie_reviews = CategorizedPlaintextCorpusReader(root_folder, r'.*\.txt', cat_pattern=r'(\w+)')
movie_reviews.categories()

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4  # hold out the last 25% of each class for testing
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
            refsets[label].add(i)
            observed = classifier.classify(feats)
            testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()

def word_feats(words):
    return dict([(word, True) for word in words])

print 'evaluating single word features'
evaluate_classifier(word_feats)

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for word in movie_reviews.words(categories=['pos']):
    word_fd[word.lower()] += 1
    label_word_fd['pos'][word.lower()] += 1

for word in movie_reviews.words(categories=['neg']):
    word_fd[word.lower()] += 1
    label_word_fd['neg'][word.lower()] += 1

# n_ii = label_word_fd[label][word]
# n_ix = word_fd[word]
# n_xi = label_word_fd[label].N()
# n_xx = label_word_fd.N()

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

word_scores = {}

for word, freq in word_fd.iteritems():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
        (freq, pos_word_count), total_word_count)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
        (freq, neg_word_count), total_word_count)
    word_scores[word] = pos_score + neg_score

best = sorted(word_scores.iteritems(), key=lambda (w,s): s, reverse=True)[:10000]
bestwords = set([w for w, s in best])

def best_word_feats(words):
    return dict([(word, True) for word in words if word in bestwords])

print 'evaluating best word features'
evaluate_classifier(best_word_feats)

def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    # keep the n highest-scoring bigrams in this document as features
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    d = dict([(bigram, True) for bigram in bigrams])
    d.update(best_word_feats(words))  # plus the filtered unigram features
    return d

print 'evaluating best words + bigram chi_sq word features'
evaluate_classifier(best_bigram_word_feats)
But now I get the following error message:

C:\Users\gerbuiker\Anaconda\python.exe E:/bigrams.py
Traceback (most recent call last):
  File "E:/bigrams.py", line 30, in <module>
    negfeats = [(bigram_word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
  File "E:/bigrams.py", line 24, in bigram_word_feats
    bigrams = bigram_finder.nbest(score_fn, n)
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\collocations.py", line 112, in nbest
    return [p for p, s in self.score_ngrams(score_fn)[:n]]
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\collocations.py", line 108, in score_ngrams
    return sorted(self._score_ngrams(score_fn), key=lambda t: (-t[1], t[0]))
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\collocations.py", line 100, in _score_ngrams
    score = self.score_ngram(score_fn, *tup)
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\collocations.py", line 169, in score_ngram
    return score_fn(n_ii, (n_ix, n_xi), n_all)
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\metrics\association.py", line 220, in chi_sq
    return n_xx * cls.phi_sq(n_ii, (n_ix, n_xi), n_xx)
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\metrics\association.py", line 212, in phi_sq
    ((n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)))
ZeroDivisionError: float division by zero

Process finished with exit code 1
Can anyone help me out?

Most of this code comes from:


And for the bigram case:

Hi, the problem is that you are not updating this line of code:

label_word_fd['pos'][word.lower()] += 1

Perhaps you are passing a dict or a list where a single word is expected. Hope this helps.

Try this, maybe:

for word in movie_reviews.words(categories=['neg']):
    print(word)  # put this line here to print out what is happening
    word_fd[word.lower()] += 1
    label_word_fd['neg'][word.lower()] += 1

I had a very similar situation, and this is what I found. It has nothing to do with the script itself. It may be unique to my case, but you might find it useful.

I split my text into sections to see whether something specific was causing the problem, because I had a training dataset that worked perfectly fine while my test dataset produced this error message. In the end I found the line of text that caused it: essentially a sentence consisting of one word repeated, e.g. "work work" or "hello hello".

Once I removed that line, the problem disappeared. Hope this helps.
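That observation matches the traceback: for a tweet like "work work", the only bigram is ("work", "work"), the contingency counts that chi_sq builds from n_ii, n_ix, n_xi and n_xx degenerate, and two factors of the denominator (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo) in phi_sq come out as zero. If you cannot delete such records, one option is to guard the feature extractor instead. A minimal sketch, assuming the best_word_feats defined in the question (the name safe_bigram_word_feats is mine, not from the original code):

def safe_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    words = list(words)
    d = best_word_feats(words)  # always keep the unigram features
    if len(set(words)) > 1:  # bigram scoring needs at least two distinct tokens
        try:
            bigram_finder = BigramCollocationFinder.from_words(words)
            bigrams = bigram_finder.nbest(score_fn, n)
            d.update(dict([(bigram, True) for bigram in bigrams]))
        except ZeroDivisionError:
            pass  # degenerate document: fall back to unigram features only
    return d

Passing safe_bigram_word_feats to evaluate_classifier should then run even when the corpus contains such one-word tweets.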

Yes, that was exactly my case. But I can't delete that record, so please suggest other options. @Sushankulkarni I still have the problem even after removing the repeated sentences. Did you ever solve this issue?
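If deleting the offending records is not an option, another approach is to locate the degenerate documents up front and handle them explicitly. A quick scan, as a sketch assuming the CategorizedPlaintextCorpusReader from the question:

for f in movie_reviews.fileids():
    tokens = [w.lower() for w in movie_reviews.words(fileids=[f])]
    if len(set(tokens)) <= 1:  # e.g. "work work", or a one-word tweet
        print 'degenerate document:', f

Tweets consisting of a single repeated word are the ones that trip the chi_sq scorer; you can skip them, or route them through the unigram-only fallback sketched above.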