
Python MultinomialNB() predicts the same class for all test documents

Tags: python, scikit-learn, tf-idf

I have a set of documents divided into roughly 350 classes, and I am trying to build a TF-IDF + multinomial Naive Bayes model to predict the class of new documents. Everything seems to work, except that the test predictions all come out as a single class, even when I run the test on thousands of documents. What am I missing?

Here is the relevant code:

import re

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer("english")

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer(norm='l1', use_idf=True, smooth_idf=False, sublinear_tf=False)
clf = MultinomialNB()

mycsv = pd.read_csv("C:/DocumentsToClassify.csv", encoding='latin-1')

Document_text = mycsv.document.str.lower()
y = mycsv.document_group

# Training set: every second document (indices 0, 2, 4, ...).
Y = []
stemmed_documents = []
for i in range(0, 50000, 2):
    tokenized_document = tokenizer.tokenize(Document_text[i])
    stemmed_document = ""
    for w in tokenized_document:
        if w not in stop_words:
            w = re.sub(r'\d+', '', w)  # strip digits
            if w:  # re.sub returns a string, never None; skip tokens that became empty
                stemmed_document = stemmed_document + " " + stemmer.stem(w)
    stemmed_documents = np.append(stemmed_documents, stemmed_document)
    Y = np.append(Y, y[i])

# Test set: every fourth document, starting at index 1.
Y_correct = []
test_documents = []
for i in range(1, 50000, 4):
    tokenized_document = tokenizer.tokenize(Document_text[i])
    stemmed_document = ""
    for w in tokenized_document:
        if w not in stop_words:
            w = re.sub(r'\d+', '', w)
            if w:
                stemmed_document = stemmed_document + " " + stemmer.stem(w)
    test_documents = np.append(test_documents, stemmed_document)
    Y_correct = np.append(Y_correct, y[i])

# Vectorize: fit on the training documents, then only transform the test documents.
Word_counts = count_vect.fit_transform(stemmed_documents)
Words_tfidf = tfidf_transformer.fit_transform(Word_counts)

Word_counts_test = count_vect.transform(test_documents)
Words_tfidf_test = tfidf_transformer.transform(Word_counts_test)

# Training
clf.fit(Words_tfidf, Y)

# Test
Ynew = clf.predict(Words_tfidf_test)
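A quick way to see what is going on is to inspect the predicted class distribution and the posterior probabilities. This is a minimal diagnostic sketch against the variables above (only the Counter import is new; it is not from the original post):

from collections import Counter

print(Counter(Ynew))                        # how many distinct classes actually get predicted?
proba = clf.predict_proba(Words_tfidf_test)
print(proba.max(axis=1)[:10])               # nearly flat rows mean the likelihoods barely differ
print(Counter(Y).most_common(5))            # is the single predicted class just the majority class?

If every prediction is one class and the per-row probabilities are nearly uniform, the class prior is effectively deciding everything.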

After struggling with this for a while yesterday, I found a solution: switching from MultinomialNB to SGDClassifier. I don't know why it didn't work with MultinomialNB, but SGD works very well. Below is the relevant code, now also considerably shortened.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer(norm='l1', use_idf=True, smooth_idf=True, sublinear_tf=False)),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42)),
])

# Training dataset
train_data = pd.read_csv("A:/DocumentsWithGroupTrain.csv", encoding='latin-1')

# Test dataset
test_data = pd.read_csv("A:/DocumentsWithGroupTest.csv", encoding='latin-1')

text_clf.fit(train_data.document, train_data.doc_group)
predicted = text_clf.predict(test_data.document)
print(np.mean(predicted == test_data.doc_group))  # accuracy on the held-out set
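A likely explanation for the original symptom, though the poster never confirmed it: with norm='l1' every tf-idf row sums to 1, so MultinomialNB's default smoothing (alpha=1.0, applied to every vocabulary term) swamps the tiny feature values; the per-class likelihoods come out nearly identical and the fitted class prior picks the same majority class for every document. Under that assumption, keeping Naive Bayes but weakening the smoothing and dropping the fitted prior should behave sensibly; a minimal sketch reusing train_data and test_data from above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Assumption: the collapse came from alpha=1.0 smoothing plus the fitted prior.
nb_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),                         # default norm='l2'
    ('clf', MultinomialNB(alpha=0.01, fit_prior=False)),  # weaker smoothing, uniform prior
])
nb_clf.fit(train_data.document, train_data.doc_group)
print(np.mean(nb_clf.predict(test_data.document) == test_data.doc_group))

This is a sketch, not a guaranteed fix; if it also collapses to one class, the problem lies elsewhere (for example in the labels or the train/test split).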