Pandas: scikit-learn pipeline returns IndexError: too many indices for array
I'm trying to get the hang of scikit-learn for some simple machine learning projects, but I can't figure out what I'm doing wrong. I'm working through a problem. Here is my code:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline

train = pd.read_csv(local path to training data)
train_labels = pd.read_csv(local path to labels)

pca = PCA()
clf = LinearSVC()

# candidate values for the grid search
n_components = np.arange(1, 39)
loss = ['l1', 'l2']
penalty = ['l1', 'l2']
C = np.arange(0, 1, .1)
whiten = [True, False]

# set up pipeline
pipe = Pipeline(steps=[('pca', pca), ('clf', clf)])

# set up GridSearchCV
estimator = GridSearchCV(pipe, dict(pca__n_components=n_components, pca__whiten=whiten,
                                    clf__loss=loss, clf__penalty=penalty, clf__C=C))
estimator
This returns:
GridSearchCV(cv=None,
estimator=Pipeline(steps=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('clf', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
random_state=None, tol=0.0001, verbose=0))]),
fit_params={}, iid=True, loss_func=None, n_jobs=1,
param_grid={'clf__penalty': ['l1', 'l2'], 'clf__loss': ['l1', 'l2'], 'clf__C': array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]), 'pca__n_components': array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38]), 'pca__whiten': [True, False]},
pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
verbose=0)
But when I try to fit the training data:
estimator.fit(train, train_labels)
the error is:
428 for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
429 for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 430 label_test_folds = test_folds[y == label]
431 # the test split can be too big because we used
432 # KFold(max(c, self.n_folds), self.n_folds) instead of
IndexError: too many indices for array
Can anyone point me in the right direction?
It turns out the pandas DataFrame had the wrong shape.
estimator.fit(train.values, train_labels[0].values)
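This worked, although I also had to drop the penalty term.
No, this is an issue with scikit-learn/numpy not calling the array attribute correctly; sometimes it works and sometimes it doesn't. I find it best to call the .values attribute explicitly so that a numpy array is always passed as the argument, which ensures compatibility.
I'll remember that.
I think train_labels should be a pd.Series rather than a pd.DataFrame (it is a 1D array of target values, which is not the same as a 2D array that happens to have only one column). That is why it works when you index the column with train_labels[0]; the .values is not needed.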
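The 1D-versus-2D distinction can be reproduced without scikit-learn. The sketch below is only an illustration under assumptions: the data is made up, and the names labels_df and test_folds are chosen to mirror the traceback line `test_folds[y == label]`, not taken from the asker's files.

```python
import numpy as np
import pandas as pd

# Made-up labels stored the way read_csv would give them: one column in a DataFrame.
labels_df = pd.DataFrame({0: [0, 1, 0, 1]})

y_2d = labels_df.values        # shape (4, 1): a 2D array with a single column
y_1d = labels_df[0].values     # shape (4,): a true 1D array of targets

# Stand-in for the 1D array that the cross-validation code indexes.
test_folds = np.zeros(len(labels_df), dtype=int)

# A 2D boolean mask cannot index a 1D array, so this raises IndexError.
try:
    test_folds[y_2d == 1]
    raised_index_error = False
except IndexError:
    raised_index_error = True

# The 1D mask selects the matching entries as intended.
selected = test_folds[y_1d == 1]
```

Printing y_2d.shape and y_1d.shape before calling fit is a quick way to see which of the two cases you are in.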
@EdChum, I have already tried that, but neither .values nor [0] works. I get a ValueError: Found input variables with inconsistent numbers of samples: [6800, 1]
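The pair [6800, 1] in that message reads as 6800 samples in X against 1 sample in y, which would be consistent with the labels having been loaded as a single row rather than a single column. That is a guess, since the commenter's data is not shown. A minimal sketch of a shape check that flattens either layout before fitting follows; as_1d_labels is a hypothetical helper, not part of pandas or scikit-learn, and the arrays are made up.

```python
import numpy as np
import pandas as pd


def as_1d_labels(labels, n_samples):
    """Hypothetical helper: flatten single-row or single-column labels to 1D
    and check that there is one label per sample."""
    y = np.asarray(labels)
    if y.ndim == 2 and 1 in y.shape:
        y = y.ravel()  # (n, 1) or (1, n) becomes (n,)
    if y.ndim != 1:
        raise ValueError("expected 1D labels, got shape %s" % (y.shape,))
    if y.shape[0] != n_samples:
        raise ValueError("X has %d samples but y has %d" % (n_samples, y.shape[0]))
    return y


# Made-up data: six samples, with labels stored as a column and as a row.
X = np.zeros((6, 3))
column_labels = pd.DataFrame({0: [0, 1, 0, 1, 0, 1]})  # shape (6, 1)
row_labels = pd.DataFrame([[0, 1, 0, 1, 0, 1]])        # shape (1, 6)

y_from_column = as_1d_labels(column_labels, len(X))
y_from_row = as_1d_labels(row_labels, len(X))
```

The returned array could then be passed as the second argument to estimator.fit in place of the DataFrame.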