Python machine-learning trained classifier error: index is out of bounds


python machine-learning


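I have a trained classifier that has been working fine.

I tried to modify it so that it handles multiple .csv files using a loop, but this has broken it, to the point where the original code (which was working fine) now returns the same error on .csv files that it previously processed without any problems.

I'm very confused and can't see what would suddenly cause this error to appear when everything worked before. The original (working) code is: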

    # -*- coding: utf-8 -*-
    import csv
    import pandas
    import numpy as np
    import sklearn.ensemble as ske
    import re
    import os
    import collections
    import pickle
    from sklearn.externals import joblib
    from sklearn import model_selection, tree, linear_model, svm

    # Load dataset
    url = 'test_6_During_100.csv'
    dataset = pandas.read_csv(url)
    dataset.set_index('Name', inplace = True)
    ##dataset = dataset[['ProcessorAffinity','ProductVersion','Handle','Company',
    ##            'UserProcessorTime','Path','Product','Description',]]

    # Open file to output everything to
    new_url = re.sub(r'\.csv$', '', url)
    f = open(new_url + " output report", 'w')
    f.write(new_url + " output report\n")
    f.write("\n")

    # shape
    print(dataset.shape)
    print("\n")
    f.write("Dataset shape " + str(dataset.shape) + "\n")
    f.write("\n")

    # Load the previously trained classifier from disk
    clf = joblib.load(os.path.join(
            os.path.dirname(os.path.realpath(__file__)),
            'classifier/classifier.pkl'))

    Class_0 = []
    Class_1 = []
    prob = []

    # Classify each row; ask the user about rows predicted as class 0
    # that have not been seen before
    for index, row in dataset.iterrows():
        res = clf.predict([row])
        if res == 0:
            # posted as `malware`, which is never defined in this snippet
            if index in Class_0:
                Class_0.append(index)
            elif index in Class_1:
                Class_1.append(index)
            else:
                print "Is ", index, " recognised?"
                designation = raw_input()

                if designation == "No":
                    Class_0.append(index)
                else:
                    Class_1.append(index)

    # Label every row 1, then relabel the rows confirmed as class 0
    dataset['Type'] = 1
    dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0

    print "\n"

    results = []

    results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0]))
    print (results)

    X = dataset.drop(['Type'], axis=1).values
    Y = dataset['Type'].values

    # Add 40 trees to the existing forest and refit on the new labels
    clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
    clf.fit(X, Y)
    joblib.dump(clf, 'classifier/classifier.pkl')

    output = collections.Counter(Class_0)

    print "Class_0; \n"
    f.write ("Class_0; \n")

    for key, value in output.items():
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n"
    f.write ("\n")

    output_1 = collections.Counter(Class_1)

    print "Class_1; \n"
    f.write ("Class_1; \n")

    for key, value in output_1.items():
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n"

    f.close()

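My new code is the same, but wrapped in two nested loops so that the script keeps running while there are files in the folder to process. The new code (the one causing the error) is below: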

    # -*- coding: utf-8 -*-
    import csv
    import pandas
    import numpy as np
    import sklearn.ensemble as ske
    import re
    import os
    import time
    import collections
    import pickle
    from sklearn.externals import joblib
    from sklearn import model_selection, tree, linear_model, svm

    # Our arrays which we'll store our process details in and then later print out data for
    Class_0 = []
    Class_1 = []
    prob = []
    results = []

    # Open file to output our report to
    timestr = time.strftime("%Y%m%d%H%M%S")

    f = open(timestr + " output report.txt", 'w')
    f.write(timestr + " output report\n")
    f.write("\n")

    count = len(os.listdir('.'))

    while (count > 0):
        # Load dataset
        for filename in os.listdir('.'):
            if filename.endswith('.csv') and filename.startswith("processes_"):
                url = filename

                dataset = pandas.read_csv(url)
                dataset.set_index('Name', inplace = True)

                clf = joblib.load(os.path.join(
                        os.path.dirname(os.path.realpath(__file__)),
                        'classifier/classifier.pkl'))

                for index, row in dataset.iterrows():
                    res = clf.predict([row])
                    if res == 0:
                        if index in Class_0:
                            Class_0.append(index)
                        elif index in Class_1:
                            Class_1.append(index)
                        else:
                            print "Is ", index, " recognised?"
                            designation = raw_input()

                            if designation == "No":
                                Class_0.append(index)
                            else:
                                Class_1.append(index)

                dataset['Type'] = 1
                dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0

                print "\n"

                results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0]))
                print (results)

                X = dataset.drop(['Type'], axis=1).values
                Y = dataset['Type'].values

                clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
                clf.fit(X, Y)
                joblib.dump(clf, 'classifier/classifier.pkl')

                os.remove(filename)

    output = collections.Counter(Class_0)

    print "Class_0; \n"
    f.write ("Class_0; \n")

    for key, value in output.items():
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n"
    f.write ("\n")

    output_1 = collections.Counter(Class_1)

    print "Class_1; \n"
    f.write ("Class_1; \n")

    for key, value in output_1.items():
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n"

    f.close()

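A side note on the loop itself, separate from the error: in the code as posted, count is computed once and never updated, so the while loop has no exit condition of its own. One possible way to "keep going while matching files remain" is sketched below. The helper names pending_csv_files and process_all, and the handle_file callback, are hypothetical and not part of the original script; only the processes_ prefix and the .csv suffix come from the question.

```python
import glob
import os


def pending_csv_files(folder='.', prefix='processes_'):
    """Return the sorted paths of the CSV files still waiting in `folder`."""
    pattern = os.path.join(folder, prefix + '*.csv')
    return sorted(glob.glob(pattern))


def process_all(folder, handle_file):
    """Call handle_file(path) on each pending file, then delete the file.

    The folder is re-scanned after every batch, so files that arrive while
    a batch is being processed are picked up on the next pass. The loop
    ends once a scan finds no matching files.
    """
    processed = []
    while True:
        batch = pending_csv_files(folder)
        if not batch:
            break
        for path in batch:
            handle_file(path)   # e.g. load, classify and report on one file
            os.remove(path)     # only reached if handle_file did not raise
            processed.append(os.path.basename(path))
    return processed
```

Here handle_file would wrap the per-file body of the loop above (read the CSV, predict, update the report). If it raises, the file is left in place and the exception propagates.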
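The error (IndexError: index 1 is out of bounds for size 1) references the predict line, res = clf.predict([row]). As far as I understand it, the problem is that there aren't enough "classes" or label types for the data (I'm going for a binary classifier)? But I have been using this exact method (outside the nested loops) without any issues. The .csv data for the file mentioned above was shared via a code-sharing link.

The problem is that [row] is an array of length 1. Your program tries to access index 1, which doesn't exist (indices start at 0). It looks like you may need to do res = clf.predict(row), or take a look at the row variable. Hope this helps.

So I realised what the problem was. I had set things up so that the classifier is loaded and then refit on the new data using warm_start, to update the classifier in an attempt to emulate incremental/online learning. This works very well when the data I process contains both types of class. However, if the data contains only positives, refitting the classifier breaks it.

For now I have commented out the following:

    clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
    clf.fit(X, Y)
    joblib.dump(clf, 'classifier/classifier.pkl')

which fixes the problem. Going forward, I will probably add (yet another!) conditional statement to check whether the data should be refit.

I was tempted to delete this question, but since I didn't find anything covering this during my searching, I figured I would leave it up along with the answer in case anyone else runs into the same issue.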
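The conditional refit mentioned above could look something like the following minimal sketch. It assumes the two labels are 0 and 1, as in the Type column from the question, and that the classifier is a forest exposing estimators_, set_params and fit, as in the original code. The helper names labels_cover_all_classes and refit_if_safe are hypothetical. The idea is simply to skip the warm_start refit whenever a batch is missing one of the classes, so a single-class batch can never be used to update the saved model.

```python
def labels_cover_all_classes(y, expected_classes=(0, 1)):
    """Return True only if every expected label occurs at least once in y."""
    # Accept both NumPy arrays (which have .tolist()) and plain sequences.
    values = y.tolist() if hasattr(y, "tolist") else list(y)
    return set(expected_classes).issubset(set(values))


def refit_if_safe(clf, X, y, extra_trees=40, expected_classes=(0, 1)):
    """Grow the forest with warm_start only when y contains every class.

    Returns True if the classifier was refit, False if the batch was skipped.
    """
    if not labels_cover_all_classes(y, expected_classes):
        return False
    clf.set_params(n_estimators=len(clf.estimators_) + extra_trees,
                   warm_start=True)
    clf.fit(X, y)
    return True
```

In the script, the three commented-out lines would then become a call to refit_if_safe(clf, X, Y), with joblib.dump(clf, 'classifier/classifier.pkl') run only when it returns True. A classifier file that has already been refit on a single-class batch would presumably still need to be restored from a backup or retrained, since this check only prevents future damage.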