Predicting with a cutoff in Python using scikit-learn
When I first started data mining in Python, I ran into the problem of tuning parameters and finding the best values for them (cutoff, classwt, sampsize). I am trying to find the cutoff value for the different classes using a random forest in scikit-learn. I am using the following code:
import numpy as np
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score

def cutoff_predict(rf, trainArr, cutoff):
    # predict class 1 when its predicted probability exceeds the cutoff
    return (rf.predict_proba(trainArr)[:, 1] > cutoff).astype(int)

score = []

def custom_f1(cutoff):
    # cross_val_score calls the scorer as scorer(estimator, X, y)
    def f1_cutoff(rf, trainArr, y):
        ypred = cutoff_predict(rf, trainArr, cutoff)
        return sklearn.metrics.f1_score(y, ypred)
    return f1_cutoff

for cutoff in np.arange(0.1, 0.9, 0.1):
    rf = RandomForestClassifier(n_estimators=100)  # random forest for classification
    rf.fit(trainArr, trainRes)  # fit the random forest model
    validated = cross_val_score(rf, trainArr, trainRes, cv=10, scoring=custom_f1(cutoff))
    score.append(validated)
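For reference, here is a minimal self-contained sketch of the same thresholding idea that runs end to end. The data (`make_classification`) is synthetic, the estimator sizes are reduced for speed, and it uses the current `sklearn.model_selection` import path rather than the old `sklearn.cross_validation` shown in my traceback; `X`/`y` stand in for my `trainArr`/`trainRes`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

def cutoff_predict(clf, X, cutoff):
    # Predict class 1 when its predicted probability exceeds the cutoff
    return (clf.predict_proba(X)[:, 1] > cutoff).astype(int)

def custom_f1(cutoff):
    # cross_val_score accepts a callable scorer with signature (estimator, X, y)
    def f1_cutoff(clf, X, y):
        ypred = cutoff_predict(clf, X, cutoff)
        return f1_score(y, ypred)
    return f1_cutoff

X, y = make_classification(n_samples=300, random_state=0)

scores = []
for cutoff in np.arange(0.1, 0.9, 0.1):
    # no need to fit here: cross_val_score clones and fits the estimator per fold
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    validated = cross_val_score(rf, X, y, cv=5, scoring=custom_f1(cutoff))
    scores.append(validated.mean())

print(len(scores))  # one mean cross-validated F1 per cutoff
```

Note that `rf.fit(...)` before `cross_val_score` is unnecessary, since `cross_val_score` clones and refits the estimator on each fold.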
But I get the following error:
IndexError Traceback (most recent call last)
<ipython-input-14-f8b808ce9a4d> in <module>()
94 rf = RandomForestClassifier(n_estimators=100) #Random forest generation for Classification
95 rf.fit(trainArr, trainRes) #Fit the random forest model
---> 96 validated=cross_val_score(rf,trainArr,trainRes,cv=10,scoring=custom_f1(cutoff))
97 score.append(validated)
C:\Python27\Anaconda\lib\site-packages\sklearn\cross_validation.pyc in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
1350 X, y = indexable(X, y)
1351
-> 1352 cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
1353 scorer = check_scoring(estimator, scoring=scoring)
1354 # We clone the estimator to make sure that all the folds are
C:\Python27\Anaconda\lib\site-packages\sklearn\cross_validation.pyc in _check_cv(cv, X, y, classifier, warn_mask)
1604 if classifier:
1605 if type_of_target(y) in ['binary', 'multiclass']:
-> 1606 cv = StratifiedKFold(y, cv, indices=needs_indices)
1607 else:
1608 cv = KFold(_num_samples(y), cv, indices=needs_indices)
C:\Python27\Anaconda\lib\site-packages\sklearn\cross_validation.pyc in __init__(self, y, n_folds, indices, shuffle, random_state)
432 for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
433 for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 434 label_test_folds = test_folds[y == label]
435 # the test split can be too big because we used
436 # KFold(max(c, self.n_folds), self.n_folds) instead of
IndexError: too many indices for array
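For what it's worth, one common trigger for this exact IndexError (an assumption here, since `trainRes` itself isn't shown) is passing the labels as a 2D column vector: `StratifiedKFold`'s boolean indexing (`test_folds[y == label]`) then indexes a 1D array with a 2D mask. A minimal sketch of the shape problem and the `np.ravel` fix:

```python
import numpy as np

trainRes = np.array([[0], [1], [0], [1]])  # 2D column vector, as often loaded from file
print(trainRes.shape)  # (4, 1)

y = np.ravel(trainRes)  # flatten to the 1D shape the CV splitters expect
print(y.shape)  # (4,)

mask = (y == 1)  # boolean indexing now works against a 1D array
print(mask.sum())  # 2
```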
What is the problem here? Also: in R, we can choose to tune the 'cutoff' parameter (cutoff = 1/(number of classes)). Is there a similar parameter in the random forest of the scikit-learn package that can be tuned in Python?

Comments:
What error are you getting? Your post doesn't specify. – @ASCIITHENASI
Sorry, I updated the question. It looks much better now :-)
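On the R comparison: as far as I know, `RandomForestClassifier` has no direct `cutoff` argument; the closest built-in knob is `class_weight`, which reweights the classes during training rather than moving the decision threshold (thresholding `predict_proba`, as in the code above, is the other option). A hedged sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic 80/20 imbalanced binary problem
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# class_weight="balanced" weights classes inversely to their frequencies
# during training, instead of shifting the prediction cutoff afterwards
rf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
rf.fit(X, y)

proba = rf.predict_proba(X)
print(proba.shape)  # (200, 2): one probability column per class
```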