Machine learning / NLP text classification: training a model from a corpus of text files with scikit-learn
machine-learning, scikit-learn, nlp, text-classification
I am very new to machine learning, and I was wondering whether someone could walk me through this code and explain why it does not work. It is my own variation of the scikit-learn tutorial, which is basically what I am trying to do. I need to train a model on a labelled training set so that, when I give it a test set, it can predict the labels of the test set. It would also be very helpful if someone could show me how to save and load the model. Thank you very much. This is what I have done so far:
import codecs
import os
import numpy as np
import pandas as pd
from Text_Pre_Processing import Pre_Processing
filenames = os.listdir(
    "...scikit-machine-learning/training_set")
files = []
array_data = []
array_label = []
for file in filenames:
    with codecs.open("...scikit-machine-learning/training_set/" + file, "r",
                     encoding='utf-8', errors='ignore') as file_data:
        open_file = file_data.read()
        open_file = Pre_Processing.lower_case(open_file)
        open_file = Pre_Processing.remove_punctuation(open_file)
        open_file = Pre_Processing.clean_text(open_file)
        files.append(open_file)
# ----------------------------------------------------
# PUTTING LABELS INTO LIST
for file in files:
    if 'inheritance' in file:
        array_data.append(file)
        array_label.append('Inheritance (object-oriented programming)')
    elif 'pagerank' in file:
        array_data.append(file)
        array_label.append('PageRank')
    elif 'vector space model' in file:
        array_data.append(file)
        array_label.append('Vector Space Model')
    elif 'bayes' in file:
        array_data.append(file)
        array_label.append('Bayes' + "'" + ' Theorem')
    else:
        array_data.append(file)
        array_label.append('Dynamic programming')
# ----------------------------------------------------
csv_array = []
for i in range(0, len(array_data)):
    csv_array.append([array_data[i], array_label[i]])
# format of array [[string, label], [string, label], [string, label]]
import csv
with open('data.csv', 'w') as target:
    writer = csv.writer(target)
    writer.writerows(zip(csv_array))
data = pd.read_csv('data.csv')
numpy_array = data.as_matrix()
X = numpy_array[:, 0]
Y = numpy_array[:, 1]
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
text_clf = Pipeline(['vect', CountVectorizer(stop_words='english'), 'tfidf', TfidfTransformer(),
'clf', MultinomialNB()])
text_clf = text_clf.fit(X_train, Y_train)
predicted = text_clf.predict(X_test)
np.mean(predicted == Y_test)
I saw people online using csv files to feed in their data, so I tried that too. I may not need it, so I apologise if this is wrong.
The error shown:
C:.../scikit-machine-learning/train.py:63: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  numpy_array = data.as_matrix()
Traceback (most recent call last):
  File "C:/...scikit-machine-learning/train.py", line 66, in <module>
    Y = numpy_array[:,1]
IndexError: index 1 is out of bounds for axis 1 with size 1
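You need to strip the characters [' and '] from the csv contents, because read_csv treats each row as a single string (one column) rather than as a two-column data frame. There was also a typo on the text_clf = Pipeline line (each step must be a (name, estimator) tuple), so I fixed that as well. Good luck.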
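data = pd.read_csv('data.csv', header=None)
# .values replaces the deprecated .as_matrix(), as the FutureWarning suggests
numpy_array = data.values
strarr = numpy_array[:, 0]
# Each cell looks like "['text', 'label']": split on the comma, then strip brackets and quotes
X=[strarr[i].split(",")[0].replace("[",'').replace("'",'') for i in range(len(strarr))]
Y=[strarr[i].split(",")[1].replace("]",'').replace("'",'') for i in range(len(strarr))]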
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, Y_train)
predicted = text_clf.predict(X_test)
np.mean(predicted == Y_test)
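A possible alternative to stripping the brackets afterwards is to write two real columns in the first place. The sketch below is only an illustration: the two rows in csv_array are made-up stand-ins for the question's data, and an in-memory buffer is used instead of data.csv. It shows that wrapping the rows in zip() produces a single column, while passing the rows directly produces two.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the csv_array built in the question.
csv_array = [
    ["inheritance lets a class reuse behaviour", "Inheritance (object-oriented programming)"],
    ["pagerank ranks pages by incoming links", "PageRank"],
]

# zip(csv_array) yields 1-tuples, so every row is written as one cell
# that holds the string form of the whole [text, label] list.
one_col = io.StringIO()
csv.writer(one_col).writerows(zip(csv_array))
one_col.seek(0)
bad = pd.read_csv(one_col, header=None)
print(bad.shape)   # a single column, hence the IndexError on [:, 1]

# Passing the rows directly writes one cell per list element.
two_col = io.StringIO()
csv.writer(two_col).writerows(csv_array)
two_col.seek(0)
data = pd.read_csv(two_col, header=None)
print(data.shape)

X = data[0].values   # texts
Y = data[1].values   # labels
```

If the csv file is not needed for anything else, array_data and array_label from the question could presumably be passed to train_test_split directly, skipping the round trip through disk.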
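Please give us some sample data from data.csv. Thanks.
@Xmo Added :) Thanks.
You can use Pickle to save and load the model.
This works great! Thank you very much. Also, for anyone else using this: when you want to pass your own text to the predict method, you need to put square brackets around it: str = "hello world"; text_clf.predict([str])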
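The two comments about Pickle and about wrapping your own text in a list can be sketched together as follows. This is a minimal illustration under stated assumptions, not the poster's actual setup: the training sentences, the labels and the file name text_clf.pkl are all made up, and only the pipeline definition is taken from the answer.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data, invented for illustration only.
texts = [
    "pagerank ranks web pages by their incoming links",
    "pagerank link analysis algorithm for web graphs",
    "bayes theorem relates conditional probabilities",
    "posterior probability follows from bayes theorem and a prior",
]
labels = ["PageRank", "PageRank", "Bayes' Theorem", "Bayes' Theorem"]

# Same pipeline as in the answer: each step is a (name, estimator) tuple.
text_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(texts, labels)

# Save the whole fitted pipeline (vocabulary, tf-idf weights, classifier).
# The file name is an arbitrary choice.
with open("text_clf.pkl", "wb") as fh:
    pickle.dump(text_clf, fh)

# Load it back, for example in another script.
with open("text_clf.pkl", "rb") as fh:
    loaded_clf = pickle.load(fh)

# predict expects an iterable of documents, so a single string goes in a list.
new_text = "web pages and incoming links"
print(loaded_clf.predict([new_text]))
```

Pickling the pipeline rather than only the classifier keeps the fitted vocabulary with the model, so raw strings can be passed to predict after loading. A pickle file should only be loaded if it comes from a trusted source, and it is safest to load it with the same scikit-learn version that created it.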