Python 2.7: Testing an NLTK classifier on a specific file

The following code runs a Naive Bayes movie review classifier. The code generates a list of the most informative features.

Note: the **movie review** folder is inside nltk.

import string
import nltk
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
stop = stopwords.words('english')

documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]


word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents[numtrain:]]

classifier = NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

How can I test the classifier on a specific file?


Please let me know if my question is unclear or wrong.

You can test on one file with classifier.classify(). This method takes as its input a dictionary with the features as its keys and True or False as their values, depending on whether that feature occurs in the document or not. It outputs the most probable label for the file, according to the classifier. You can then compare this label with the correct label for the file to see whether the classification is correct.

In your training and test sets, the feature dictionary is always the first item in each tuple; the label is the second item.

Thus, you can classify the first document in the test set like so:

(my_document, my_label) = test_set[0]
if classifier.classify(my_document) == my_label:
    print "correct!"
else:
    print "incorrect!"

First, read these answers carefully; they contain parts of the answer you need and also briefly explain what the classifier does and how it works in NLTK:


Testing a classifier on annotated data

Now to answer your question. We assume that your question is a follow-up to that question:

If your test text is structured in the same way as the movie_review corpus, then you can simply read the test data as you would the training data:

In case the explanation of the code is unclear, here's a walkthrough:

traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
The two lines above read the directory my_movie_reviews, which has a structure like this:

\my_movie_reviews
    \pos
        123.txt
        234.txt
    \neg
        456.txt
        789.txt
    README
Then the next line extracts the documents together with their pos/neg tags, which come from the directory structure.

documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
Here's an explanation of the line above:

# This extracts the pos/neg tag
labels = [i.split('/')[0] for i in mr.fileids()]
# Reads the words from the corpus through the CategorizedPlaintextCorpusReader object
words = [w for w in mr.words(i)]
# Removes the stopwords
words = [w for w in mr.words(i) if w.lower() not in stop]
# Removes the punctuation
words = [w for w in mr.words(i) if w not in string.punctuation]
# Removes the stopwords and punctuations
words = [w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation]
# Removes the stopwords and punctuations and puts them in a tuple with the pos/neg labels
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
The same process should be applied when you read the test data.

Now on to the feature processing:

The following lines produce the top 100 features for the classifier:

# Extract the words features and put them into FreqDist
# object which records the no. of times each unique word occurs
word_features = FreqDist(chain(*[i for i,j in documents]))
# Cuts the FreqDist to the top 100 words in terms of their counts.
word_features = word_features.keys()[:100]
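One caveat: whether .keys()[:100] really gives the 100 most frequent words depends on your NLTK version (older FreqDist implementations keep their keys sorted by decreasing count, while in NLTK 3 FreqDist is a collections.Counter subclass and the keys are not sorted). A sketch of a version-safer alternative, assuming an NLTK release that provides FreqDist.most_common():

# most_common() returns (word, count) pairs sorted by decreasing count,
# so this keeps the 100 most frequent words regardless of dict key ordering.
word_features = [w for w, count in FreqDist(chain(*[i for i, j in documents])).most_common(100)]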
Next, process the documents into a classifiable format:

# Splits the training data into training size and testing size
numtrain = int(len(documents) * 90 / 100)
# Process the documents for training data
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
# Process the documents for testing data
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents[numtrain:]]
Now to explain the long list comprehensions for train_set and test_set:

# Take the first `numtrain` no. of documents
# as training documents
train_docs = documents[:numtrain]
# Takes the rest of the documents as test documents.
test_docs = documents[numtrain:]
# These extract the feature sets for the classifier
# please look at the full explanation on https://stackoverflow.com/questions/20827741/nltk-naivebayesclassifier-training-for-sentiment-analysis/
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in train_docs]
You need to process the documents in the same way to extract the features from the test documents as well.

So here's how you can read the test data:

stop = stopwords.words('english')

# Reads the training data.
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

# Converts training data into tuples of [(words,label), ...]
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

# Now do the same for the testing data.
testdir = '/home/alvas/test_reviews'
mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
# Converts testing data into tuples of [(words,label), ...]
test_documents = [([w for w in mr_test.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr_test.fileids()]
Then simply continue with the processing steps described above, and do the following to get the label for the test documents, as @yvespeirsman answered:

#### FOR TRAINING DATA ####
import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

stop = stopwords.words('english')

# Reads the training data.
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

# Converts training data into tuples of [(words,label), ...]
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
# Extract training features.
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
# Assuming that you're using full data set
# since your test set is different.
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents]

#### TRAINS THE TAGGER ####
# Train the tagger
classifier = NaiveBayesClassifier.train(train_set)

#### FOR TESTING DATA ####
# Now do the same reading and processing for the testing data.
testdir = '/home/alvas/test_reviews'
mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
# Converts testing data into tuples of [(words,label), ...]
test_documents = [([w for w in mr_test.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr_test.fileids()]
# Reads test data into features:
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in test_documents]

#### Evaluate the classifier ####
for doc, gold_label in test_set:
    tagged_label = classifier.classify(doc)
    if tagged_label == gold_label:
        print("Woohoo, correct")
    else:
        print("Boohoo, wrong")
If the code and explanation above make no sense to you, then you MUST read this tutorial before proceeding:


Now let's say you have no annotation in your test data, i.e. your test.txt is not in a directory structure like movie_review and is just a plain text file:

\test_movie_reviews
    \1.txt
    \2.txt
Then there's no point in reading it into a categorized corpus; you can simply read and tag the documents, i.e.:

import os   # word_tokenize comes from nltk (see the full example below)
for infile in os.listdir('test_movie_reviews'):
    for line in open(os.path.join('test_movie_reviews', infile), 'r'):
        doc = word_tokenize(line.lower())
        featurized_doc = {i:(i in doc) for i in word_features}
        tagged_label = classifier.classify(featurized_doc)
But without annotation you cannot evaluate the results, so there is no correct label to check the tag against in an if-else; also, you need to tokenize the text yourself if you are not using the CategorizedPlaintextCorpusReader.

If you just want to tag a plain text file test.txt:

import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk import word_tokenize

stop = stopwords.words('english')

# Extracts the documents.
documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]
# Extract the features.
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
# Converts documents to features.
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
# Train the classifier.
classifier = NaiveBayesClassifier.train(train_set)

# Tag the test file.
with open('test.txt', 'r') as fin:
    for test_sentence in fin:
        # Tokenize the line.
        doc = word_tokenize(test_sentence.lower())
        featurized_doc = {i:(i in doc) for i in word_features}
        tagged_label = classifier.classify(featurized_doc)
        print(tagged_label)
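The loop above assigns a label to every line of test.txt separately. If you want a single label for the whole file instead, a small variation (my own sketch, not part of the original answer) is to featurize all of its tokens at once:

# Classify the entire file as one document.
with open('test.txt', 'r') as fin:
    doc = word_tokenize(fin.read().lower())
    featurized_doc = {i:(i in doc) for i in word_features}
    print(classifier.classify(featurized_doc))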

And once again, please don't just copy and paste the solution; try to understand why and how it works.

Please give me a full example and, if possible, keep your example consistent with mine; I'm very new to Python. Could you also tell me why you write 0 in test_set[0]?

This is a full example: if you paste the code right after the code in your question, it will work. The 0 simply takes the first document in your test set (the first item in a list has index 0).

Thank you very much. Is there a way to write the file's name instead of 0 in test_set[0]? I don't know which file test_set refers to exactly, since we have 2 folders, pos|neg, and every folder has its own files. I'm asking because the most informative word was bad (that was the result of my example). The first file contains more than a hundred occurrences of "bad", yet the program prints incorrect in the output. Where is my mistake?

First, test_set doesn't contain the file names, so if you want to use it to identify a file, one way would be to read the file directly and pass it to the classifier as the feature dictionary I described above. Second, your current classifier uses binary features: it only checks whether a word occurs in a document and ignores how often the word occurs. That's probably why it misclassifies a file with many occurrences of bad.

Thank you for the full explanation; I'm trying to understand it all. But I often get the wrong result: it should be pos, but the program shows neg, and I don't know why.

There are many possible reasons, and it's not perfect; maybe (i) the data is insufficient, (ii) the features are not good enough, (iii) the classifier choice, etc. Do take this course for more info, and if you can, I strongly encourage you to attend. You evaluate the output by finding out how frequently it is correct. Classifiers learn which features to pay attention to and how to combine them when making a decision; there is no logical rule, it's all statistics and weights. Your file cv081.txt is classified as pos with your feature set; what else is there to understand? Go through the machine learning course on the coursera link and you will understand why and how the classifier works. I started out by using them as black boxes; once you understand how they produce their annotations, it becomes easier to code and to make adjustments.
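For the question raised in the comments about identifying which file receives which label, here is a minimal sketch of the approach mentioned above: keep the fileid next to each feature dictionary instead of relying on list positions. It assumes classifier, word_features and numtrain are defined as in the question's code:

# Build a test set that keeps the fileid next to each featurized document,
# so every prediction can be traced back to the file it came from.
named_test_set = []
for fileid in movie_reviews.fileids()[numtrain:]:
    doc_words = set(movie_reviews.words(fileid))
    featurized_doc = {w: (w in doc_words) for w in word_features}
    named_test_set.append((fileid, featurized_doc, fileid.split('/')[0]))

for fileid, featurized_doc, gold_label in named_test_set:
    print('{}: predicted={}, gold={}'.format(fileid, classifier.classify(featurized_doc), gold_label))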