Python: How do I add frequency to the nltk NaiveBayes classifier?
I am currently learning the naive Bayes classifier using nltk. In the documentation, section 1.3 (Document Classification), there is a featureset example:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
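So an example of the featuresets has the form [({'contains(waste)': False, 'contains(lot)': False, ...}, 'neg'), ...].

But I want to change the word dictionary from 'contains(waste)': False to 'contains(waste)': 2. I think that form ('contains(waste)': 2) describes the document better, because it counts the frequency of the word. The featuresets would then be [({'contains(waste)': 2, 'contains(lot)': 5, ...}, 'neg'), ...].

My concern is whether 'contains(waste)': 2 and 'contains(waste)': 1 are completely different values to the NaiveBayesClassifier. If they are, it cannot capture the similarity between 'contains(waste)': 2 and 'contains(waste)': 1.

In other words, {'contains(lot)': 1 and 'contains(waste)': 1} and {'contains(waste)': 2 and 'contains(waste)': 1} could look the same to the program.

Can nltk.NaiveBayesClassifier understand the frequency of words?

This is the code I am using: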
# Imports are assumed; they were not shown in the original post.
import random

import konlpy.tag
from nltk.classify import naivebayes

def split_and_count_word(data):
    # belongs_to : Main
    # Role : make featuresets from Korean words using konlpy.
    # Parameter : dict data (dict of contents, ex. {'politic': {'parliament': [content, content]}, ...})
    # Return : list featuresets ([({'word': True, ...}, 'politic'), ...] == featureset + category)
    featuresets = []
    twitter = konlpy.tag.Twitter()  # Korean word splitter
    for big_cat in data:
        for small_cat in data[big_cat]:
            # save category name needed in featuresets
            category = str(big_cat[0:3]) + '/' + str(small_cat)
            count = 0
            print(small_cat)
            for one_news in data[big_cat][small_cat]:
                # progress indicator: print the running count every 100 articles
                count += 1
                if count % 100 == 0:
                    print(count, end=' ')
                # one_news is list in list so open it!
                doc = one_news
                # split into words using konlpy
                list_of_splited_word = twitter.morphs(doc[:-63])  # delete useless sentences.
                # keep only the split words whose length is two or more
                list_of_up_two_word = [word for word in list_of_splited_word if len(word) > 1]
                dict_of_featuresets = make_featuresets(list_of_up_two_word)
                # save
                featuresets.append((dict_of_featuresets, category))
    return featuresets

def make_featuresets(data):
    # belongs_to : split_and_count_word
    # Role : make featuresets
    # Parameter : list list_of_up_two_word (ex. ['비누', '떨어', '지다'])
    # Return : dictionary {word: True for word in data}
    # PROBLEM :(
    # cannot consider the frequency of a word
    return {word: True for word in data}

def naive_train(featuresets):
    # belongs_to : Main
    # Role : learning by naive Bayes rule
    # Parameter : list featuresets ([({'word': True, ...}, 'pol/pal'), ...])
    # Return : object classifier (nltk NaiveBayesClassifier object),
    #          list test_set (the featuresets that are randomly selected)
    random.shuffle(featuresets)
    train_set, test_set = featuresets[1000:], featuresets[:1000]
    classifier = naivebayes.NaiveBayesClassifier.train(train_set)
    return classifier, test_set

# data is loaded elsewhere (not shown in the original post)
featuresets = split_and_count_word(data)
classifier, test_set = naive_train(featuresets)
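nltk's naive Bayes classifier treats feature values as logically distinct. Values are not restricted to True and False, but they are never treated as quantities. If you have features f=2 and f=3, they count as distinct values. The only way to add quantity to such a model is to sort the counts into "buckets", such as f=1, f="few" (2-5), f="several" (6-10), f="many" (11 or more). (Note: if you go this route, there are algorithms for choosing good value ranges for the buckets.) Even then, the model does not know that "few" lies between "one" and "several". You need a different machine learning tool to handle quantities directly.

Thanks for the idea. So you mean I can't add a word that is already in the feature dictionary? For example, the dictionary would be {"hello": True, "hello": True, "my": True, ...}. Then could you recommend other useful machine learning modules?

As you pointed out in your comment to @aberger, no, you can't have two identical keys in a dict. I can't point you directly to a quantitative solution, sorry. nltk uses numeric weights, but they are usually created by the API from the "nominal" features you supply, so you'll have to look around for the right way to use it. Also take a look at scikit-learn. The best classifier depends on your task, so try out a few!

Thanks, I'll try it!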
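
The bucketing idea from the answer can be sketched roughly as follows. This is only an illustration, not the answerer's code: the helper names `bucket` and `make_bucketed_featureset`, the `count(...)` key format, and the label "one" for a count of 1 are assumptions made for the example, and the thresholds simply copy the ranges mentioned in the answer rather than being tuned values.

```python
from collections import Counter

def bucket(count):
    """Map a raw word count to a nominal label (illustrative thresholds)."""
    if count == 1:
        return "one"
    if count <= 5:
        return "few"       # 2-5 occurrences
    if count <= 10:
        return "several"   # 6-10 occurrences
    return "many"          # 11 or more occurrences

def make_bucketed_featureset(words):
    """Hypothetical replacement for make_featuresets: word -> bucket label."""
    counts = Counter(words)
    return {"count({})".format(word): bucket(n) for word, n in counts.items()}

# Toy document: 'waste' twice, 'lot' seven times, 'plot' once.
example_words = ["waste"] * 2 + ["lot"] * 7 + ["plot"]
features = make_bucketed_featureset(example_words)
print(features)
```

The resulting dict has the same shape as the output of `make_featuresets` in the question, so (features, category) pairs built this way could presumably be passed to `naive_train` unchanged. As the answer notes, the classifier would still see "few" and "several" as unrelated labels, not as neighbouring points on a scale.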