Python: decision tree classification report metrics (precision, recall, f1-score, support) are all 1.0

python machine-learning scikit-learn


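I have a data frame with 21 columns and 260616 rows. I want to build a decision tree classification model for the status variable. I have cleaned the data and converted all variables to appropriate data types. Here is a summary of all columns in the data:

Data columns (total 21 columns):

Brand (260616 non-null object)
Order Line Id (260616 non-null object)
Order Type (260616 non-null object)
OLSC_FC_Name (260616 non-null object)
FC_Id (260616 non-null object)
OLSC Day of Week (260616 non-null object)
OLSAC_Reason (260616 non-null object)
Stock Quantity (260616 non-null int64)
Gender (260616 non-null object)
Category (260616 non-null object)
Sub Category (260616 non-null object)
Sub Brand (260616 non-null object)
Season (260616 non-null object)
FC_Type (260616 non-null object)
Order Month (260616 non-null int64)
OLSC_Month (260616 non-null int64)
OLSC_Hour (260616 non-null int64)
Allocation Time (260616 non-null int64)
Allocation Day of Week (260616 non-null object)
A2MPF (260616 non-null float64)
OLSC_Status (260616 non-null object)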

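OLSC_FC_Name has 540 unique values and Brand has about 26.
OLSC_Status has two values and is my categorical dependent variable.

I chose a decision tree as a starting point, to see whether it gives any meaningful results that I could then improve on.

I encoded all categorical variables with LabelEncoder, set the dtype of OLSC_Status to 'category', split the data into train and test sets at a 70:30 ratio, and fitted a DecisionTreeClassifier.

But the precision, recall and f1-score in the classification report all come out as 1.0, which is suspicious.

The tree turned out to have only two levels.

This cannot be right. I need help understanding what I did wrong. Also, which algorithm is best suited to this kind of problem?

Below is the code I used for the decision tree: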

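# Splitting the data into train and test sets (70:30).
# X (features) and y (OLSC_Status) are assumed to be defined earlier; that step is not shown.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.30,
                                                    random_state=99)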
X_train.head()

y_train.value_counts()

y_test.value_counts()

# Importing decision tree classifier from sklearn library
from sklearn.tree import DecisionTreeClassifier

# Fitting the decision tree with default hyperparameters, apart from
# max_depth, which is capped at 10 so that the tree stays small enough to plot and read.
dt_default = DecisionTreeClassifier(max_depth=10)
dt_default.fit(X_train, y_train)

# Let's check the evaluation metrics of our default model

# Importing classification report and confusion matrix from sklearn metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Making predictions
y_pred_default = dt_default.predict(X_test)

# Printing classification report
print(classification_report(y_test, y_pred_default))

# Printing confusion matrix and accuracy
print(confusion_matrix(y_test,y_pred_default))
print(accuracy_score(y_test,y_pred_default))

# Importing required packages for visualization
from io import StringIO  # sklearn.externals.six has been removed from scikit-learn
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
import graphviz

# Collecting the feature names for the plot
# (these must match the columns of X, in the same order)
features = list(df_ca.columns[1:])
features

# plotting tree with max_depth=10
dot_data = StringIO()  
export_graphviz(dt_default, out_file=dot_data,
                feature_names=features, filled=True,rounded=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
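
A side note on inspecting the tree: the snippet below is a minimal sketch of a text-only alternative that needs neither graphviz nor pydotplus. It uses `export_text` together with `get_depth()` and `get_n_leaves()` from scikit-learn. The data is a synthetic stand-in built with `make_classification`, and the names `X_demo`, `y_demo` and `demo_features` are illustrative assumptions, not part of the question's code.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the real X_train / y_train would be used instead.
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
demo_features = [f"feature_{i}" for i in range(X_demo.shape[1])]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_demo, y_demo)

# Plain-text rendering of the fitted tree, one line per node.
tree_text = export_text(tree, feature_names=demo_features)
print(tree_text)

# Quick size check: a very shallow tree with perfect scores is a warning sign.
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Printing the depth and leaf count alongside the metrics makes a case like "only two levels, yet every score is 1.0" visible without rendering an image.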

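Can you tell us more about the target values? For a classification problem you could try LGBM, which will hopefully give you better results, and try encoding the variables with OneHotEncoder instead of LabelEncoder. With 8 categories, LabelEncoder assigns the integers 0 to 7, which imposes an artificial order that the model may treat as meaningful. Another tweak is to set max_depth to a smaller value and look at the results.

I hope that helps.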
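
To make the difference between the two encoders concrete, here is a small sketch on a toy column. The `toy` frame and its values are hypothetical and only serve as an illustration; they are not taken from the question's data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy column, for illustration only.
toy = pd.DataFrame({"Season": ["SS", "AW", "SS", "Core", "AW"]})

# LabelEncoder maps categories to integers in sorted order: AW=0, Core=1, SS=2.
label_codes = LabelEncoder().fit_transform(toy["Season"])
print(label_codes)

# OneHotEncoder creates one 0/1 column per category instead of a single ordered integer.
ohe = OneHotEncoder(handle_unknown="ignore")
one_hot = ohe.fit_transform(toy[["Season"]]).toarray()
print(ohe.categories_)
print(one_hot)
```

One trade-off to keep in mind: one-hot encoding adds a column per category, so a feature like OLSC_FC_Name with 540 unique values would expand into 540 columns.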

I used LabelEncoder to convert the categorical variables into numeric labels:

le = preprocessing.LabelEncoder()
df_categorical.apply(le.fit_transform)
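
One thing to be aware of with this pattern: the single `LabelEncoder` is refit on every column, so afterwards it only remembers the mapping of the last one. scikit-learn documents `LabelEncoder` as intended for the target, and provides `OrdinalEncoder` for feature columns. The following is a sketch of that alternative, using a hypothetical `df_toy` in place of `df_categorical`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame standing in for df_categorical.
df_toy = pd.DataFrame({
    "Brand": ["A", "B", "A", "C"],
    "Gender": ["M", "F", "F", "M"],
})

# One encoder handles all columns and keeps a separate mapping for each.
encoder = OrdinalEncoder()
encoded = pd.DataFrame(
    encoder.fit_transform(df_toy),
    columns=df_toy.columns,
    index=df_toy.index,
)
print(encoded)

# categories_ holds the per-column category lists ...
mapping = dict(zip(df_toy.columns, encoder.categories_))
print(mapping)

# ... so the original labels can be recovered later.
restored = encoder.inverse_transform(encoded.to_numpy())
```

Keeping the fitted encoder around also means the same mapping can be applied to new data with `encoder.transform(...)`.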

Check the feature importances; it may be that a single variable is enough to classify the data correctly.
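
A minimal sketch of that check, on synthetic data: the `reason` column here is a hypothetical feature that is only filled in once the status is known, so it leaks the target. Which column, if any, plays that role in the question's data is not known; all names and values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic target and features; 'reason' equals 3 exactly when status is 1.
status = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "hour": rng.integers(0, 24, size=n),
    "stock": rng.integers(0, 50, size=n),
    "reason": np.where(status == 1, 3, rng.integers(0, 3, size=n)),
})

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, status, test_size=0.30, random_state=99
)
clf = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_tr, y_tr)

# Importance concentrated in one feature points at the column to investigate.
importances = pd.Series(clf.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
print("test accuracy:", clf.score(X_te, y_te), "depth:", clf.get_depth())

# Cross-check: a column whose every value maps to exactly one class
# determines the target on its own.
leaky = []
for col in demo.columns:
    classes_per_value = pd.Series(status).groupby(demo[col]).nunique()
    if classes_per_value.max() == 1:
        leaky.append(col)
print("columns that determine the target:", leaky)
```

If a column like this turns up in the real data, the usual next step is to ask whether its value would actually be available at prediction time, and to drop it from X if not.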