Python: My decision tree implementation is too slow. How can I make it faster, or am I doing something wrong?

python numpy machine-learning

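I am trying to implement a decision tree from scratch, but it is far too slow when I test it on Kaggle's bluebook dataset. Here is the code I have written so far. The Node class only holds attributes: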

class Node():
    # the node of the decision tree
    # goes left if less than or equal to threshold and right if greater than threshold
    # column is the 
    def __init__(self, average, column=None, threshold=None, error=None, parent=None, left=None, right=None, is_leaf=False):
        self.average = average
        self.column = column
        self.threshold = threshold
        self.error = error
        self.parent = parent
        self.left = left
        self.right = right
        self.is_leaf = is_leaf
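To make the "left if less than or equal to the threshold, right if greater" rule concrete, here is a minimal sketch of how a prediction could walk such a tree. `SketchNode` and `predict_row` are hypothetical names that are not part of the code in the question, and the "age" column is made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchNode:
    # Hypothetical stand-in for Node, reduced to the fields needed for prediction.
    average: float
    column: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["SketchNode"] = None
    right: Optional["SketchNode"] = None
    is_leaf: bool = False


def predict_row(node, row):
    """Walk from node down to a leaf and return that leaf's average."""
    while not node.is_leaf:
        # Go left if the value is <= threshold, right otherwise.
        node = node.left if row[node.column] <= node.threshold else node.right
    return node.average


# A hand-built tree with one split on a made-up "age" column.
root = SketchNode(
    average=15.0,
    column="age",
    threshold=5,
    left=SketchNode(average=20.0, is_leaf=True),
    right=SketchNode(average=10.0, is_leaf=True),
)
pred_young = predict_row(root, {"age": 3})
pred_old = predict_row(root, {"age": 8})
```

Here `row` only needs to support lookup by column name, so a dict or a pandas row would both work under this assumption.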

The DecisionTree class builds the tree. The important functions are create_tree and split. Here is the code:

import math
import sys

import numpy as np

class DecisionTree():

    def __init__(self, x, y, min_samples_leaf=1):
        #x is the dataframe with independent variables
        self.x = x
        #y is the dependent variable (price)
        self.y = y
        #min_samples_leaf is the minimum number of items in the leaf
        self.min_samples_leaf = min_samples_leaf
        self.create_tree()

    def create_tree(self):     
        initial_avg = self.y.mean()
        initial_error = math.sqrt(((self.y - initial_avg)**2).sum()/len(self.y))
        initial_idx = np.arange(len(self.x))
        self.root = Node(average=initial_avg, error=initial_error)
        self.split(initial_idx, self.root)

    def split(self, idx, parent_node):
        #idx are the indexes of the rows that are a part of the node
        columns = self.x.columns
        parent_is_leaf = True
        best_error = parent_node.error
        best_column = columns[0]
        best_threshold = 0
        best_left_error = sys.float_info.max
        best_right_error = sys.float_info.max
        best_left_idx = np.array([], dtype=int)
        best_right_idx = np.array([], dtype=int)
        #loop through columns
        for col in columns:
            for value in self.x[col].unique():
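                left_idx = np.where(self.x[col].iloc[idx] <= value)
                right_idx = np.where(self.x[col].iloc[idx] > value)
                # np.where returns a tuple of positions relative to idx, so take
                # element [0] and map it back to absolute row positions
                left_idx, right_idx = idx[left_idx[0]], idx[right_idx[0]]
                if len(left_idx) < self.min_samples_leaf or len(right_idx) < self.min_samples_leaf: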
                    continue
                left_error = self.find_error(left_idx)
                right_error = self.find_error(right_idx)
                if left_error < best_error and right_error < best_error:
                    best_error = (left_error + right_error)/2
                    best_column = col
                    best_threshold = value
                    best_left_error = left_error
                    best_right_error = right_error
                    best_left_idx = left_idx
                    best_right_idx = right_idx
                    parent_is_leaf = False

        if parent_is_leaf:
            parent_node.is_leaf = True
        else:
            parent_node.column = best_column
            parent_node.threshold = best_threshold
            left_node = Node(average=self.find_average(best_left_idx), error=best_left_error, parent=parent_node)
            right_node = Node(average=self.find_average(best_right_idx), error=best_right_error, parent=parent_node)
            parent_node.left = left_node
            parent_node.right = right_node
            self.split(best_left_idx, left_node)
            self.split(best_right_idx, right_node)

    def find_error(self, idx):
        #rmse
        avg = self.y.iloc[idx].mean()
        return math.sqrt(((self.y.iloc[idx] - avg)**2).sum()/len(idx))

    def find_average(self, idx):
        #average
        return self.y.iloc[idx].mean()

The problem is in the split function. Specifically, these lines:

 left_idx = np.where(self.x[col].iloc[idx] <= value)
 right_idx = np.where(self.x[col].iloc[idx] > value)

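Iterating over every possible value of a column to split the data takes a very long time. It takes so long that, when I run the code, the recursion never even starts. Is there a way to speed this up, or am I going about this the wrong way altogether?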
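One common way to avoid re-filtering the rows for every candidate value is to sort the column once and score all thresholds from prefix sums. The following is only a sketch under stated assumptions: `best_split_1d` is a hypothetical helper, it scores splits by the summed squared error of both sides (not the averaged RMSE used in the question), and it assumes the column is numeric with no missing values.

```python
import numpy as np


def best_split_1d(x, y, min_samples_leaf=1):
    """Return (threshold, sse) of the best `x <= threshold` split, or None."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]

    # Prefix sums let every candidate split be scored without another pass.
    csum = np.cumsum(ys)
    csum2 = np.cumsum(ys ** 2)

    # Candidate i puts the sorted rows 0..i on the left side.
    i = np.arange(min_samples_leaf - 1, n - min_samples_leaf)
    # Only split between distinct values so that `<= threshold` is exact.
    i = i[xs[i] < xs[i + 1]]
    if len(i) == 0:
        return None

    n_left = i + 1
    n_right = n - n_left
    sum_left, sum2_left = csum[i], csum2[i]
    sum_right, sum2_right = csum[-1] - sum_left, csum2[-1] - sum2_left
    # Squared error of a group = sum(y^2) - (sum(y))^2 / count
    sse = (sum2_left - sum_left ** 2 / n_left) + (sum2_right - sum_right ** 2 / n_right)

    k = int(np.argmin(sse))
    return float(xs[i[k]]), float(sse[k])


# Toy data: the target jumps between x = 3 and x = 10.
threshold, sse = best_split_1d([1, 2, 3, 10, 11, 12], [1, 1, 1, 5, 5, 5])
```

With this approach the cost per column at a node is dominated by the sort, rather than growing with the number of unique values times the number of rows.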

Here is a diagram of what I am trying to do, except that it is regression instead of classification.

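I don't understand what you're doing. You're not looking at every value in a column, you're looking at every value in EVERY column. Isn't a tree like this usually based on just one column?

@TimRoberts I think this type of tree is supposed to be based on all columns. It should be able to predict a value from a set of independent values, e.g. column A, column B and column C.

You're right, I withdraw my objection.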
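The point settled in the comments can be illustrated with a small sketch: each node ends up storing a single column and threshold, but the search that picks them scans every column. The helper names and the toy columns "A" and "B" are hypothetical, and the scoring uses summed squared error as an assumption.

```python
import numpy as np


def split_sse(x, y, threshold):
    """Sum of squared errors around each side's mean for `x <= threshold`."""
    left, right = y[x <= threshold], y[x > threshold]
    return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()


def best_split_all_columns(columns, y):
    """Scan every column, but return the single best (column, threshold, score)."""
    best = None
    for name, x in columns.items():
        # Skip the largest value: it would leave the right side empty.
        for threshold in np.unique(x)[:-1]:
            score = split_sse(x, y, threshold)
            if best is None or score < best[2]:
                best = (name, float(threshold), float(score))
    return best


# Toy data with made-up columns: only "B" separates the two price levels.
columns = {
    "A": np.array([1.0, 2.0, 1.0, 2.0]),
    "B": np.array([10.0, 10.0, 20.0, 20.0]),
}
y = np.array([100.0, 100.0, 300.0, 300.0])
best_column, best_threshold, best_score = best_split_all_columns(columns, y)
```

Child nodes repeat the same search on their own rows, so different levels of the tree can end up splitting on different columns.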