Scikit-learn: training a decision tree with sklearn's ID3 algorithm
Tags: scikit-learn, python-3.5, decision-tree, cross-validation, confusion-matrix

I am trying to train a decision tree using the ID3 algorithm. The goal is to get the indices of the selected features, estimate the accuracy, and build a total confusion matrix.

The algorithm should split the dataset into a training set and a test set, and use 4-fold cross-validation.

I am new to this. I have read the sklearn tutorials and the theory behind the learning process, but I am still confused.

What I tried to do:
from sklearn import tree
from sklearn.model_selection import (cross_val_predict, KFold, cross_val_score,
                                     train_test_split, learning_curve)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# x (feature DataFrame) and y (labels) are assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train)
results = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=4)
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))
y_pred = cross_val_predict(estimator=clf, X=x, y=y, cv=4)
conf_mat = confusion_matrix(y, y_pred)
print(conf_mat)
dot_data = tree.export_graphviz(clf, out_file='tree.dot')
I have a few questions:
feature_list = x.columns
As you know, not every feature is useful for prediction. After training the model, you can see which ones matter with
clf.feature_importances_
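To make the link between column names and importances concrete, here is a minimal sketch. The iris dataset is only stand-in data (an assumption, the question does not say which dataset is used), and `importances`, `ranked` and `selected_idx` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (an assumption): the iris dataset loaded as a DataFrame.
iris = load_iris(as_frame=True)
x, y = iris.data, iris.target

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(x, y)

# Position i in feature_importances_ corresponds to column i of x.
importances = pd.Series(clf.feature_importances_, index=x.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)

# Column indices of the features the tree actually split on.
selected_idx = [i for i, w in enumerate(clf.feature_importances_) if w > 0]
print("Selected feature indices:", selected_idx)
```

Features with an importance of zero were never used in a split, so the non-zero entries are one reasonable reading of "the selected features".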
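The index of a feature in feature_list is the same as its index in feature_importances_.

cross_val_score does the job, but a better way to obtain scores is cross_validate. It works the same way as cross_val_score, but it lets you retrieve several score values at once: create each score you need with make_scorer and pass them in. Here is an example: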
import math
import pandas as pd, numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, make_scorer, recall_score

# X (features) and y (binary labels 0/1) are assumed to be loaded already
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
dtc = DecisionTreeClassifier()
dtc_fit = dtc.fit(x_train, y_train)

# One scorer per confusion-matrix cell (rows: true class, columns: predicted class)
def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

scoring = {
    'tp' : make_scorer(tp),
    'tn' : make_scorer(tn),
    'fp' : make_scorer(fp),
    'fn' : make_scorer(fn),
    'accuracy' : make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'f1_score' : make_scorer(f1_score),
    'recall' : make_scorer(recall_score)
}

# cross_validate clones the estimator and refits it on every fold
sc = cross_validate(dtc_fit, x_train, y_train, cv=5, scoring=scoring)
print("Accuracy: %0.2f (+/- %0.2f)" % (sc['test_accuracy'].mean(), sc['test_accuracy'].std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (sc['test_precision'].mean(), sc['test_precision'].std() * 2))
print("f1_score: %0.2f (+/- %0.2f)" % (sc['test_f1_score'].mean(), sc['test_f1_score'].std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (sc['test_recall'].mean(), sc['test_recall'].std() * 2), "\n")

# Mean count per fold for each cell, rounded up
stp = math.ceil(sc['test_tp'].mean())
stn = math.ceil(sc['test_tn'].mean())
sfp = math.ceil(sc['test_fp'].mean())
sfn = math.ceil(sc['test_fn'].mean())

# Named conf_m so that it does not shadow sklearn's confusion_matrix function
conf_m = pd.DataFrame(
    [[stn, sfp], [sfn, stp]],
    columns=['Predicted 0', 'Predicted 1'],
    index=['True 0', 'True 1']
)
print(conf_m)
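The example above averages the per-fold counts and rounds them up, so the four cells describe a typical fold rather than the whole dataset. If a single "total" confusion matrix is what is wanted, one possible alternative is to pool the out-of-fold predictions with cross_val_predict, as the question already attempts. This is a sketch only; the breast cancer dataset is a stand-in (an assumption) and `total_cm` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary dataset (an assumption).
X, y = load_breast_cancer(return_X_y=True)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Every sample is predicted exactly once, by a model that did not train on it.
y_pred = cross_val_predict(clf, X, y, cv=cv)

# Because each sample appears once, the cells sum to the dataset size.
total_cm = confusion_matrix(y, y_pred)
print(total_cm)
```

The pooled matrix counts every sample once, which makes it easier to compare against the class sizes than a rounded per-fold average.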
If you need to keep the classes balanced, which is always necessary to score a decision tree classifier properly, you must use stratified folds. If you also want to shuffle the samples that go into each fold, you can use StratifiedShuffleSplit. Here is an example:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, make_scorer, recall_score
import pandas as pd, numpy as np

precision = []; recall = []; f1score = []; accuracy = []
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2)
dtc = DecisionTreeClassifier()

# X and y are assumed to be a pandas DataFrame and Series (needed for .iloc)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    dtc.fit(X_train, y_train)
    pred = dtc.predict(X_test)
    precision.append(precision_score(y_test, pred))
    recall.append(recall_score(y_test, pred))
    f1score.append(f1_score(y_test, pred))
    accuracy.append(accuracy_score(y_test, pred))

print("Accuracy: %0.2f (+/- %0.2f)" % (np.mean(accuracy), np.std(accuracy) * 2))
print("Precision: %0.2f (+/- %0.2f)" % (np.mean(precision), np.std(precision) * 2))
print("f1_score: %0.2f (+/- %0.2f)" % (np.mean(f1score), np.std(f1score) * 2))
print("Recall: %0.2f (+/- %0.2f)" % (np.mean(recall), np.std(recall) * 2))
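To see what "keeping the classes balanced" means in practice, the sketch below checks the share of the positive class in each test fold of a StratifiedKFold. The labels are synthetic (an assumption, 80 zeros and 20 ones) and `positive_rates` is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels (an assumption for illustration): 80 zeros, 20 ones.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # the feature values play no role in how folds are built

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

positive_rates = []
for train_idx, test_idx in skf.split(X, y):
    # Share of class 1 in this test fold; stratification keeps it near the global 0.2.
    rate = y[test_idx].mean()
    positive_rates.append(rate)
    print("test size:", len(test_idx), "share of class 1:", rate)
```

With a plain, unshuffled KFold on labels sorted like these, the first folds would contain no positive samples at all, which is the situation stratification is meant to avoid.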
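I hope I have answered everything you need.

Thank you very much! 1. I want to get the selected features in the result, so I assume I can use feature_importances_? 2. If I use the 'accuracy' parameter for scoring, how does that change the algorithm? Don't I need to supply a scoring function that implements information gain using entropy, to make it "id3"? 3. Does cross-validation keep the classes balanced?

1. Yes, you are right. Just by looking at feature_importances_ you can pick the features that matter most for the prediction, which reduces the complexity of the predictive model (this step is also called feature selection). 2. When you evaluate a model, you should use not only accuracy but also precision, recall and other scores. These scores can change your feature selection process. 3. Cross-validation supports class balance. With the solution I posted you can also retrieve your own confusion matrix. In short, I have shown you how cross-validation works on this point.

Did you import pandas as pd? I get the error "AttributeError: module 'pandas' has no attribute 'DataFrame'".

This helped a lot, thanks. One more question: you said that precision, recall and the other scores are relevant to the learning process, so why are they chosen after fitting? Are they also specifically relevant to id3?

Yes, I import pandas as pd; sorry, I forgot to include it. In any case, precision, recall and the other scores are measured after the validation step. They are relevant for every predictive model, ID3 included. Some models, such as DecisionTree, can overfit. A model with an accuracy of 0.99 may well be overfitted. You can spot these things just by looking at the scores.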
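On the second question in the comments, the sketch below separates the two knobs: `criterion` controls how the tree chooses its splits, while `scoring` only decides what is measured on the held-out folds. Note that scikit-learn documents its trees as an optimized version of CART, so `criterion='entropy'` gives information-gain splitting rather than a textbook ID3. The breast cancer dataset is stand-in data (an assumption) and `results` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary dataset (an assumption).
X, y = load_breast_cancer(return_X_y=True)

results = {}
for criterion in ("entropy", "gini"):
    # `criterion` changes how splits are chosen while the tree is grown.
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    # `scoring` only selects the metrics computed on each fold after fitting.
    sc = cross_validate(clf, X, y, cv=4, scoring=["accuracy", "recall"],
                        return_train_score=True)
    results[criterion] = sc
    print(criterion,
          "train acc: %0.2f" % sc["train_accuracy"].mean(),
          "test acc: %0.2f" % sc["test_accuracy"].mean())
```

A large gap between the training and test scores is one simple sign of the overfitting mentioned in the last comment; limiting `max_depth` or `min_samples_leaf` is a common way to narrow it.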