Python: how to predict the topic of a single text file after training a Naive Bayes text classifier
I trained and tested a Naive Bayes algorithm on my text training and test data. Now I want to predict the topic of a single text file. This is my code:
# Import the training and test data
import sklearn.datasets as skd
categories = ['business', 'entertainment', 'local', 'sports', 'world']
# Raw strings keep the backslash in the Windows paths from being read as an escape sequence
sinhala_train = skd.load_files(r'Cleant data\stemmed_filtered_sinhala-set1', categories=categories, encoding='utf-8')
sinhala_test = skd.load_files(r'Cleant data\stemmed_filtered_sinhala-set2', categories=categories, encoding='utf-8')
# Read the single file whose topic should be predicted
name_file = "adaderana_67571.txt"
with open(name_file, encoding='utf-8') as A:
    new_file = A.read()
from sklearn.feature_extraction.text import CountVectorizer
count_vectorization = CountVectorizer()
train_data_tf = count_vectorization.fit_transform(sinhala_train.data)
train_data_tf.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_trans = TfidfTransformer()
train_data_tfidf = tfidf_trans.fit_transform(train_data_tf)
train_data_tfidf.shape
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(train_data_tfidf, sinhala_train.target)
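# Reuse the vectorizer and TF-IDF weights fitted on the training data; do not refit them on the test set
test_data_tf = count_vectorization.transform(sinhala_test.data)
test_data_tfidf = tfidf_trans.transform(test_data_tf)
predicted = clf.predict(test_data_tfidf)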
from sklearn import metrics
from sklearn.metrics import accuracy_score
print("Accuracy of the model:", accuracy_score(sinhala_test.target, predicted))
print(metrics.classification_report(sinhala_test.target, predicted, target_names=sinhala_test.target_names))
metrics.confusion_matrix(sinhala_test.target, predicted)
This is my output:
Accuracy of the model: 0.864
               precision    recall  f1-score   support

     business       0.78      0.94      0.85       100
entertainment       0.95      0.86      0.90       100
        local       0.89      0.65      0.75       100
       sports       0.91      0.93      0.92       100
        world       0.83      0.94      0.88       100

    micro avg       0.86      0.86      0.86       500
    macro avg       0.87      0.86      0.86       500
 weighted avg       0.87      0.86      0.86       500
array([[94,  2,  4,  0,  0],
       [ 2, 86,  2,  4,  6],
       [19,  0, 65,  5, 11],
       [ 1,  3,  1, 93,  2],
       [ 5,  0,  1,  0, 94]], dtype=int64)
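As a sanity check, the accuracy and the per-class precision and recall in the report can be recomputed from the confusion matrix above. The sketch below copies the matrix by hand and assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are only illustrative.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
labels = ["business", "entertainment", "local", "sports", "world"]
cm = [
    [94, 2, 4, 0, 0],
    [2, 86, 2, 4, 6],
    [19, 0, 65, 5, 11],
    [1, 3, 1, 93, 2],
    [5, 0, 1, 0, 94],
]

n = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))  # diagonal = correct predictions
accuracy = correct / total

# Recall: correct predictions divided by the row sum (all true members of the class)
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(n)}
# Precision: correct predictions divided by the column sum (everything predicted as the class)
precision = {labels[j]: cm[j][j] / sum(row[j] for row in cm) for j in range(n)}

print(f"accuracy = {accuracy:.3f}")
for name in labels:
    print(f"{name}: precision={precision[name]:.2f} recall={recall[name]:.2f}")
```

The third row shows where the low recall for `local` comes from: 19 of its 100 documents were predicted as `business`.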
Now I want to predict the topic of the text file new_file.
Can someone help me write the code to predict the topic of this text file?

I solved my problem. This is the code I used to predict the topic:
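def predict_topic(sinhala_test_1):
    # sinhala_test_1 is assumed to be the raw text of the document to classify, e.g. new_file
    docs_new = [sinhala_test_1]
    # Apply the vectorizer and TF-IDF transformer that were fitted on the training data
    X_new_counts = count_vectorization.transform(docs_new)
    X_new_tfidf = tfidf_trans.transform(X_new_counts)
    predicted_topic = clf.predict(X_new_tfidf)
    # Map the predicted class index back to its category name
    return sinhala_train.target_names[predicted_topic[0]]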
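
The same steps can also be bundled into a scikit-learn `Pipeline`, so that a single raw string goes through counting, TF-IDF weighting and classification in one call. This is only a minimal sketch: the tiny English corpus, the two category names and the `predict_topic` helper are made up for illustration and stand in for `sinhala_train.data`, `sinhala_train.target` and `sinhala_train.target_names`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the real training data
train_texts = [
    "stock market shares profit bank",
    "bank loan interest profit economy",
    "cricket match team score wicket",
    "football team goal match player",
]
train_targets = [0, 0, 1, 1]
target_names = ["business", "sports"]

# The pipeline fits each step on the training data once and
# only calls transform() on anything passed to predict()
model = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
model.fit(train_texts, train_targets)


def predict_topic(text):
    """Return the category name predicted for one raw text string."""
    label = model.predict([text])[0]  # predict() expects a list of documents
    return target_names[label]


print(predict_topic("the team won the cricket match"))
```

To classify a file rather than a string, read it first (for example with `open(name_file, encoding='utf-8').read()`) and pass the resulting text to the helper.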