Python – NLTK train/test split
I have been following SentDex's material on NLTK and Python and built a script that determines review sentiment using various models (such as logistic regression). My concern is that SentDex's approach includes the test set when deciding which words to use for training, which is obviously undesirable (the train/test split happens after feature selection).

(Edited per Mohammed Kashif's comments)

Full code:
import nltk
import numpy as np
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI
from nltk.corpus import movie_reviews
from sklearn.naive_bayes import MultinomialNB

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
np.random.shuffle(featuresets)
training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
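(A side note on `word_features = list(all_words.keys())[:3000]`: an `nltk.FreqDist`'s keys are not sorted by frequency, so this slice is an arbitrary 3000 words rather than the 3000 most common ones. `FreqDist.most_common()` is the documented way to get words sorted by count. A minimal sketch:)

```python
import nltk

fd = nltk.FreqDist("the quick brown fox the lazy dog the".split())

# keys() reflects insertion order, not frequency order, so slicing it
# does not select the most frequent words.
arbitrary = list(fd.keys())[:2]

# most_common() returns (word, count) pairs sorted by descending count.
top = [w for w, count in fd.most_common(2)]
print(top[0])  # 'the' occurs three times, more often than any other word
```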
Attempted:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
np.random.shuffle(documents)
training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []
for w in documents.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in training_set]
np.random.shuffle(featuresets)
training_set = featuresets
testing_set = testing_set

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
which produces the following error:
Traceback (most recent call last):
  File "", line 34, in <module>
    print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\scikitlearn.py", line 85, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 291, in transform
    return self._transform(X, fitting=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 166, in _transform
    for f, v in six.iteritems(x):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\six.py", line 439, in iteritems
    return iter(getattr(d, _iteritems)(**kw))
AttributeError: 'list' object has no attribute 'items'
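(For context: the final AttributeError is raised because the untransformed testing_set still holds raw (word list, label) tuples, while SklearnClassifier feeds each sample to sklearn's DictVectorizer, which calls .items() on it and therefore requires feature dicts. A minimal reproduction, independent of the code above:)

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
vec.fit([{"good": True, "bad": False}])

# A feature dict transforms fine:
vec.transform([{"good": True, "bad": False}])

# A raw word list -- like the untransformed testing_set above -- fails,
# because DictVectorizer iterates .items() on each sample.
try:
    vec.transform([["good", "bad"]])
    raised = False
except AttributeError as err:
    raised = True
    print(err)
```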
OK, there are a few errors in the code. We'll go through them one by one.

First, your `documents` list is a list of tuples, so it has no words() method. To access all the words, change the for loop like this:
all_words = []
for words_list, categ in documents:  # <-- each words_list is a list of words
    for w in words_list:             # <-- then access each word in the list
        all_words.append(w.lower())
So the final code becomes:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
np.random.shuffle(documents)
training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []
for words_list, categ in documents:
    for w in words_list:
        all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
np.random.shuffle(featuresets)
training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
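(For completeness: if you want the vocabulary itself chosen without touching the held-out reviews, split first and build the word features from the training documents only. A minimal self-contained sketch of that ordering, using a tiny made-up corpus in place of movie_reviews — the document lists and labels below are hypothetical:)

```python
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus standing in for movie_reviews: (words, label) pairs.
documents = [
    (["great", "film", "loved", "it"], "pos"),
    (["wonderful", "acting", "great", "plot"], "pos"),
    (["terrible", "film", "hated", "it"], "neg"),
    (["boring", "plot", "terrible", "acting"], "neg"),
] * 5  # repeat so both splits contain each class

train_docs, test_docs = documents[:16], documents[16:]

# Key point: the vocabulary comes from the TRAINING documents only,
# so nothing about the held-out reviews leaks into feature selection.
all_words = nltk.FreqDist(w.lower() for words, _ in train_docs for w in words)
word_features = [w for w, _ in all_words.most_common(3000)]

def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}

training_set = [(find_features(words), label) for words, label in train_docs]
testing_set = [(find_features(words), label) for words, label in test_docs]

classifier = SklearnClassifier(MultinomialNB())
classifier.train(training_set)
acc = nltk.classify.accuracy(classifier, testing_set)
print("accuracy:", acc * 100)
```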
Updated the post to include the full code of the attempted solution and the traceback. If you run nltk.download('all') you should be able to run the code as-is. I've also included a link to the video series. Thanks for your help! Does this approach resolve my concern about the test set being included in the training data, or am I missing something?

You're welcome! Yes, this approach also resolves your concern about the train/test split. If this answer helped you, please upvote it and mark it as the accepted answer :-)