Machine learning/NLP text classification: training a model from a corpus of text files with scikit-learn

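I am very new to machine learning, and I was wondering if someone could walk me through this code and explain why it does not work. It is my own variation of the scikit-learn tutorial, which is essentially what I am trying to do. I need to train a model on a labelled training set so that, when I give it a test set, it can predict the labels of the test set. It would also be very helpful if someone could show me how to save and load the model. Thank you very much. This is what I have so far: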

import codecs
import os

import numpy as np
import pandas as pd

from Text_Pre_Processing import Pre_Processing

filenames = os.listdir(
    "...scikit-machine-learning/training_set")
files = []
array_data = []
array_label = []
for file in filenames:
    with codecs.open("...scikit-machine-learning/training_set/" + file, "r",
                     encoding='utf-8', errors='ignore') as file_data:
        open_file = file_data.read()
        open_file = Pre_Processing.lower_case(open_file)
        open_file = Pre_Processing.remove_punctuation(open_file)
        open_file = Pre_Processing.clean_text(open_file)
        files.append(open_file)
# ----------------------------------------------------
# PUTTING LABELS INTO LIST
for file in files:
    if 'inheritance' in file:
        array_data.append(file)
        array_label.append('Inheritance (object-oriented programming)')
    elif 'pagerank' in file:
        array_data.append(file)
        array_label.append('PageRank')
    elif 'vector space model' in file:
        array_data.append(file)
        array_label.append('Vector Space Model')
    elif 'bayes' in file:
        array_data.append(file)
        array_label.append('Bayes' + "'" + ' Theorem')
    else:
        array_data.append(file)
        array_label.append('Dynamic programming')
#----------------------------------------------------------

csv_array = []
for i in range(0, len(array_data)):
    csv_array.append([array_data[i], array_label[i]])

# format of array [[string, label], [string, label], [string, label]]
import csv

with open('data.csv', 'w') as target:
    writer = csv.writer(target)
    writer.writerows(zip(test_array))

data = pd.read_csv('data.csv')
numpy_array = data.as_matrix()

X = numpy_array[:, 0]
Y = numpy_array[:, 1]

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=42)

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline

text_clf = Pipeline(['vect', CountVectorizer(stop_words='english'), 'tfidf', TfidfTransformer(),
                     'clf', MultinomialNB()])

text_clf = text_clf.fit(X_train, Y_train)

predicted = text_clf.predict(X_test)
np.mean(predicted == Y_test)
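I saw people online feeding their data in through a csv file, so I tried that as well. I may not need it at all, so I apologise if it is wrong.

The error shown: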

C:.../scikit-machine-learning/train.py:63: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  numpy_array = data.as_matrix()
Traceback (most recent call last):
  File "C:/...scikit-machine-learning/train.py", line 66, in <module>
    Y = numpy_array[:,1]
IndexError: index 1 is out of bounds for axis 1 with size 1

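You need to strip the characters [' and '] from the csv, because read_csv treats each row as a single string (one column) rather than as a two-column data frame. There was also a typo on the text_clf = Pipeline line: the steps must be (name, estimator) tuples, so I fixed that as well. Good luck.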

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# header=None: data.csv has no header row, so the first line is data too
data = pd.read_csv('data.csv', header=None)
# .values replaces the deprecated .as_matrix() named in the FutureWarning
numpy_array = data.values

# Each row is one string that looks like "['text', 'label']":
# split on the comma, then strip the brackets and quotes.
strarr = numpy_array[:, 0]
X = [s.split(",")[0].replace("[", '').replace("'", '') for s in strarr]
Y = [s.split(",")[1].replace("]", '').replace("'", '') for s in strarr]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=42)

# Pipeline steps must be a list of (name, estimator) tuples
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

text_clf = text_clf.fit(X_train, Y_train)

predicted = text_clf.predict(X_test)
# Fraction of test documents whose label was predicted correctly
print(np.mean(predicted == Y_test))
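
An alternative to stripping the brackets afterwards is to write the csv with two real columns in the first place. The sketch below is only an illustration and makes some assumptions: the `rows` list is a made-up stand-in for the question's `csv_array`, and in-memory `io.StringIO` buffers are used instead of `data.csv` so that nothing is written to disk. It shows that wrapping the rows in `zip(...)` produces a single column, which is consistent with the IndexError in the question, while passing the rows directly produces two.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the question's csv_array: [[text, label], ...]
rows = [
    ["pagerank ranks web pages by their links", "PageRank"],
    ["inheritance lets a class reuse another class", "Inheritance"],
]

# zip(rows) wraps every row in a 1-tuple, so each CSV line ends up with
# ONE cell holding the string form of the whole [text, label] list.
one_col = io.StringIO()
csv.writer(one_col).writerows(zip(rows))
one_col.seek(0)
bad = pd.read_csv(one_col, header=None)

# Passing the rows directly writes two cells per line: text and label.
two_col = io.StringIO()
csv.writer(two_col).writerows(rows)
two_col.seek(0)
good = pd.read_csv(two_col, header=None)

print(bad.shape)   # number of columns is 1, so column index 1 does not exist
print(good.shape)  # number of columns is 2

# With two real columns, no string surgery is needed.
X = good[0].to_numpy()
Y = good[1].to_numpy()
```

Since the texts and labels already live in Python lists, the csv round trip may not be needed at all: `train_test_split` accepts the two lists directly.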

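Please give us some sample data from data.csv. Thanks.

@Xmo Added :) Thanks.

You can use Pickle to save and load the model.

This works perfectly! Thank you very much.

For anyone else using this approach: when you want to pass your own text to the predict method, you need to wrap it in square brackets: str = "hello world" and then text_clf.predict([str])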
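
The two suggestions in the comments (saving the model with pickle, and wrapping your own text in a list before calling predict) can be sketched together as follows. This is a minimal illustration rather than the thread's exact code: the four training sentences, the two labels, and the file name `text_clf.pkl` are invented for the example, and a temporary directory is used so that nothing is left behind on disk.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set, for illustration only.
texts = [
    "pagerank ranks web pages by incoming links",
    "pagerank link analysis random surfer",
    "inheritance lets a subclass reuse a base class",
    "subclass inheritance object oriented class",
]
labels = ["PageRank", "PageRank", "Inheritance", "Inheritance"]

# Same three-step pipeline as in the answer.
text_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "text_clf.pkl")  # hypothetical file name

    # Save the whole fitted pipeline (vectorizer + tf-idf + classifier).
    with open(model_path, "wb") as fh:
        pickle.dump(text_clf, fh)

    # Load it back, for example in another script.
    with open(model_path, "rb") as fh:
        loaded_clf = pickle.load(fh)

# predict expects an iterable of documents, so a single string
# has to be wrapped in a list.
new_text = "the random surfer follows links between pages"
prediction = loaded_clf.predict([new_text])
print(prediction[0])
```

Pickling the pipeline as a whole keeps the fitted vocabulary together with the classifier, so the loaded object can be applied to raw text directly. Only unpickle files from sources you trust, and load them with the same scikit-learn version that wrote them where possible.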