Python: trouble implementing the branching of a decision tree from scratch (long)
I am currently implementing a decision tree algorithm from scratch in Python. I'm having trouble with the branching of the tree. In the current implementation I am not using a depth parameter.
What happens is that either the branching ends too early (if I use a flag to prevent infinite recursion) or, if I remove the flag, I run into infinite recursion. I also have a hard time telling whether I am in the main loop or in the recursive loop.
My data is quite simple:
import numpy as np
import pandas as pd

d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
df = pd.DataFrame(d)
df['three'] = (0, 0, 1, 1)
df = np.array(df)
This yields the following output:
array([[ 1., 4., 0.],
[ 2., 3., 0.],
[ 3., 2., 1.],
[ 4., 1., 1.]])
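I will be using the Gini index for the splits. That function isn't integral to solving my problem, so I'll put it at the end of this question to aid reproducibility.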
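Since the split function itself is not shown at this point, here is a minimal sketch of what a Gini-based check_split could look like. It is an assumption, not the asker's actual code: the helper name gini_index, the `<` / `>=` threshold rule, and the tie-breaking (the first best candidate wins) are all illustrative. The returned keys 'Index', 'Value' and 'Groups' mirror the ones visible in the printed tree further down.

```python
import numpy as np

def gini_index(groups, classes):
    """Weighted Gini impurity of a candidate split (lower is purer)."""
    n_total = sum(len(g) for g in groups)
    gini = 0.0
    for group in groups:
        if len(group) == 0:
            continue  # an empty group contributes nothing
        labels = group[:, -1]
        score = sum((np.sum(labels == c) / len(group)) ** 2 for c in classes)
        gini += (1.0 - score) * (len(group) / n_total)
    return gini

def check_split(data):
    """Try every (column, value) threshold and keep the lowest-Gini split."""
    data = np.asarray(data, dtype=float)
    classes = np.unique(data[:, -1])
    best = {'Index': None, 'Value': None, 'Groups': None}
    best_score = float('inf')
    for index in range(data.shape[1] - 1):  # the last column is the label
        for value in data[:, index]:
            left = data[data[:, index] < value]
            right = data[data[:, index] >= value]
            score = gini_index((left, right), classes)
            if score < best_score:  # strict: the first best candidate wins
                best_score = score
                best = {'Index': index, 'Value': value, 'Groups': (left, right)}
    return best

data = np.array([[1., 4., 0.], [2., 3., 0.], [3., 2., 1.], [4., 1., 1.]])
split = check_split(data)      # splits column 0 at 3.0
pure = check_split(data[:2])   # both rows share label 0
```

With this sketch, a group whose rows all share one label scores 0 for every candidate, so the first candidate is kept and one side comes back empty. That is the situation the rest of the question runs into.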
I am using a dictionary object y, which will keep accumulating nested dictionaries as the branches expand:
                        y
                  /           \
          y['left']            y['right']
          /       \                     \
y['left']['left']  y['left']['right']    y['right']['right']
Next, I'll break down the function that creates the tree, which is where I am running into problems.
def create_tree2(node, flag):
    # node is a dictionary containing the root; it will contain nested
    # dictionaries as this function recursively calls itself.
    left, right = node['Groups']  # 'Groups' holds the two groups used for the next split
    left, right = np.array(left), np.array(right)  # my other functions expect arrays
    print('left_group', left)  # for debugging purposes
    print('right_group', right)
    if flag == True and (right.size == 0 or left.size == 0):
        node['left'] = left
        node['right'] = right
        flag = False
        return
    # The portion above is meant to prevent infinite loops.
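Regarding the infinite recursion, what happens is this: if I have two rows of data, instead of splitting those two rows into two different nodes, I get one node with no rows and another node with both rows.
My loop normally stops when a node holds fewer than two rows of data, so the empty node terminates. The node containing two rows, however, is split again into an empty node and a two-row node, and this process continues forever.
So I tried using a flag to stop this infinite loop. The only problem with the flag is that it seems to activate one step too early: it does not check whether the split would lead to two nodes or to an infinite loop. For example:
A split leads to
left = []
right = [[ 3., 2., 1.],
         [ 4., 1., 1.]]
Now, instead of checking whether right can split further
(left = [3., 2., 1.], right = [4., 1., 1.]),
the flag stops at the step above, one step too early.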
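One common way to sidestep the flag is sketched here. It rests on assumptions and is not taken from the original code. The idea is to stop before splitting when a group is too small or already pure, and to stop after splitting when one side came back empty, since recursing on the other side would repeat the same split forever. The names build, is_pure and threshold_splitter are hypothetical, and threshold_splitter is only a toy stand-in for check_split. Unlike create_tree2, this version returns the subtree instead of mutating its argument.

```python
import numpy as np

def is_pure(rows):
    """True when every row carries the same class label (last column)."""
    return len(np.unique(rows[:, -1])) <= 1

def build(rows, splitter, min_rows=2):
    """Grow a subtree; `splitter` must return a dict with a 'Groups' pair."""
    rows = np.asarray(rows, dtype=float)
    # Leaf: too few rows to split, or nothing left to separate.
    if len(rows) < min_rows or is_pure(rows):
        return rows
    node = splitter(rows)
    left, right = node['Groups']
    # Leaf: the split made no progress (one side received every row).
    if len(left) == 0 or len(right) == 0:
        return rows
    node['left'] = build(left, splitter, min_rows)
    node['right'] = build(right, splitter, min_rows)
    return node

def threshold_splitter(rows):
    """Toy stand-in for check_split: split column 0 at its mean."""
    value = rows[:, 0].mean()
    return {'Index': 0, 'Value': value,
            'Groups': (rows[rows[:, 0] < value], rows[rows[:, 0] >= value])}

data = np.array([[1., 4., 0.], [2., 3., 0.], [3., 2., 1.], [4., 1., 1.]])
tree = build(data, threshold_splitter)
```

Because both checks depend only on the data in the current call, no state has to be passed between recursive calls, which removes the question of when the flag should be switched on.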
    if len(left) < 2:
        # End the node here if it holds fewer than 2 rows of data.
        node['left'] = left
        return
    else:
        # Split the data, then recursively call create_tree2, given that the
        # group does NOT have fewer than two rows. The flag is activated to
        # prevent infinite looping. Notice that node['left'] is passed as the
        # node parameter of the recursive call.
        node['left'] = check_split(left)
        print('after left split', node['left']['Groups'])  # for debugging purposes
        create_tree2(node['left'], True)

    if len(right) < 2:
        node['right'] = right
        return
    else:
        # Doing the same thing with the right side.
        node['right'] = check_split(right)
        print('right_check_split')
        create_tree(node['right'], False)
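Another thing that confuses me is how the dictionary can continue from the main right node rather than from the dictionary nested on the left.
Recall that the recursion first happens in the main left branch,
create_tree2(node['left'], True),
which would update the values of left and right, and I would expect those values to carry over when we hit this part of the function:
    if len(right) < 2:
        node['right'] = right
        return
    else:
        node['right'] = check_split(right)  # Which right value is this, the updated one?
        print('right_check_split')
        create_tree(node['right'], False)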
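On the question of which right is used: in Python, every call gets its own local variables, so the left and right assigned inside the recursive call do not overwrite the caller's. The small sketch below only demonstrates that behavior. The function visit and its log are hypothetical and not part of the original code.

```python
def visit(node, depth=0, log=None):
    """Record the group sizes each call sees before and after recursing."""
    if log is None:
        log = []
    left, right = node['Groups']  # local to THIS call only
    log.append((depth, 'before', len(left), len(right)))
    if isinstance(node.get('left'), dict):
        # The callee binds its own left/right in a separate frame.
        visit(node['left'], depth + 1, log)
    # Back in the caller: left and right still hold the caller's groups.
    log.append((depth, 'after', len(left), len(right)))
    return log

child = {'Groups': ([], [[1., 4., 0.], [2., 3., 0.]])}
root = {'Groups': ([[1., 4., 0.], [2., 3., 0.]],
                   [[3., 2., 1.], [4., 1., 1.]]),
        'left': child}
log = visit(root)
for entry in log:
    print(entry)
```

The depth-0 call sees group sizes (2, 2) both before and after the depth-1 call, which sees (0, 2). This is why the right-hand branch of the root is processed with the root's own right group.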
I made the following changes to your function:
def create_tree2(node, flag=False):
    left, right = node['Groups']
    left, right = np.array(left), np.array(right)
    print('left_group', left)
    print('right_group', right)
    if flag == True and (right.size == 0 or left.size == 0):
        node['left'] = left
        node['right'] = right
        flag = False
        return
    if len(left) < 2:
        node['left'] = left
        flag = True
        print('too-small left. flag=True')
    else:
        node['left'] = check_split(left)
        print('after left split', node['left']['Groups'])
        create_tree2(node['left'], flag)
    if len(right) < 2:
        node['right'] = right
        print('too-small right. flag=True')
        flag = True
    else:
        node['right'] = check_split(right)
        print('after right split', node['right']['Groups'])
        create_tree2(node['right'], flag)
    return node

d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
df = pd.DataFrame(d)
df['three'] = (0, 0, 1, 1)
df = np.array(df)
root = check_split(df)
y = create_tree2(root)
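In addition, I think you could optimize this function by checking at the end whether the node's left or right side is empty, and pulling the other node up one level. Something like:
if node['left'] is empty:
    kid = node['right']
    node.clear()
    for k, v in kid.items():
        node[k] = v
elif node['right'] is empty:
    same basic thing, with left kid
Checking for empty is the tricky part, because sometimes the child is a dict and sometimes it is not.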
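A runnable version of that idea might look like the sketch below. It assumes that leaves are NumPy arrays and inner nodes are dicts, as in the tree printed for this data. The helper names is_empty and pull_up are hypothetical. A non-empty child is pulled up only when it is itself a dict, because a leaf array has no items to copy into the parent.

```python
import numpy as np

def is_empty(child):
    """A child counts as empty only if it is a leaf array with no rows."""
    return not isinstance(child, dict) and np.asarray(child).size == 0

def pull_up(node):
    """If one child is empty and the other is a dict, replace `node`
    in place with that other child. Returns True if anything changed."""
    for empty_side, other_side in (('left', 'right'), ('right', 'left')):
        kid = node[other_side]
        if is_empty(node[empty_side]) and isinstance(kid, dict):
            node.clear()
            node.update(kid)  # kid is a separate dict, so it survives clear()
            return True
    return False

leaf = np.array([[1., 4., 0.], [2., 3., 0.]])
inner = {'Index': 0, 'Value': 1.0, 'left': np.empty((0, 3)), 'right': leaf}
node = {'Index': 0, 'Value': 1.0, 'left': np.empty((0, 3)), 'right': inner}

changed = pull_up(node)       # node now holds inner's keys
changed_again = pull_up(node) # right child is a leaf array, nothing to pull up
```

Testing `isinstance(child, dict)` first avoids calling array operations on a dict, which is the "sometimes a dict, sometimes not" difficulty mentioned above.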
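Finally, you don't seem to be storing the actual split information. Isn't that the point of a decision tree, knowing which factors to compare? Shouldn't you be recording the column and value at each node?
In y = create_tree2(root, False) (the last line), what is root?
So root is the first dictionary created: root = check_split(df).
This looks like a bug: if len(left) < 2: node['left'] = left; return, without setting node['right'] to anything. I see the same thing for len(right) lower down. Is that right?
Ah! That seems to be my programming error. I could use continue, right?
The information should be stored by the check_split function; that function's output is the set of key-value pairs holding all the information: 'Value', 'Index' and 'Groups'. I'll think your suggestions over carefully, though it will take some time (I'm still a newbie). Thank you very much for reading through my code. I spent all of yesterday trying to understand what was going on. Still, I learned a lot. =)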
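To illustrate why the stored 'Index' and 'Value' matter, here is a minimal sketch of how a prediction routine could read them. It assumes that rows with a column value below 'Value' go left, which is consistent with the root split in the printed tree (Value 3.0, with the rows 1 and 2 on the left). It also assumes leaves are arrays whose last column is the label. The function name predict is hypothetical.

```python
import numpy as np

def predict(node, row):
    """Walk the nested dicts until a leaf array is reached, then vote."""
    while isinstance(node, dict):
        side = 'left' if row[node['Index']] < node['Value'] else 'right'
        node = node[side]
    leaf = np.asarray(node)
    if leaf.size == 0:
        return None  # empty leaf: no training rows to vote with
    labels, counts = np.unique(leaf[:, -1], return_counts=True)
    return labels[np.argmax(counts)]  # majority class in the leaf

tree = {'Index': 0, 'Value': 3.0,
        'left': np.array([[1., 4., 0.], [2., 3., 0.]]),
        'right': np.array([[3., 2., 1.], [4., 1., 1.]])}

low = predict(tree, [1.5, 3.5])   # column 0 is below 3.0, so go left
high = predict(tree, [3.5, 1.0])  # column 0 is at least 3.0, so go right
```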
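Basically, I used the len check. I don't think I need to worry about len == 1, because if len == 1, then after the recursion there always has to be an empty group. If I have one row on the left and one on the right, it will simply pass through the function without recursing.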
Running the modified function on the sample data prints:
left_group [[ 1. 4. 0.]
[ 2. 3. 0.]]
right_group [[ 3. 2. 1.]
[ 4. 1. 1.]]
after left split [array([], shape=(0, 3), dtype=float64)
array([[ 1., 4., 0.],
[ 2., 3., 0.]])]
left_group []
right_group [[ 1. 4. 0.]
[ 2. 3. 0.]]
too-small left. flag=True
after right split [array([], shape=(0, 3), dtype=float64)
array([[ 1., 4., 0.],
[ 2., 3., 0.]])]
left_group []
right_group [[ 1. 4. 0.]
[ 2. 3. 0.]]
after right split [array([], shape=(0, 3), dtype=float64)
array([[ 3., 2., 1.],
[ 4., 1., 1.]])]
left_group []
right_group [[ 3. 2. 1.]
[ 4. 1. 1.]]
too-small left. flag=True
after right split [array([], shape=(0, 3), dtype=float64)
array([[ 3., 2., 1.],
[ 4., 1., 1.]])]
left_group []
right_group [[ 3. 2. 1.]
[ 4. 1. 1.]]
Y= {'Groups': array([[[ 1., 4., 0.],
[ 2., 3., 0.]],
[[ 3., 2., 1.],
[ 4., 1., 1.]]]), 'Index': 0, 'right': {'Groups': array([array([], shape=(0, 3), dtype=float64),
array([[ 3., 2., 1.],
[ 4., 1., 1.]])], dtype=object), 'Index': 0, 'right': {'Groups': array([array([], shape=(0, 3), dtype=float64),
array([[ 3., 2., 1.],
[ 4., 1., 1.]])], dtype=object), 'Index': 0, 'right': array([[ 3., 2., 1.],
[ 4., 1., 1.]]), 'Value': 3.0, 'left': array([], shape=(0, 3), dtype=float64)}, 'Value': 3.0, 'left': array([], shape=(0, 3), dtype=float64)}, 'Value': 3.0, 'left': {'Groups': array([array([], shape=(0, 3), dtype=float64),
array([[ 1., 4., 0.],
[ 2., 3., 0.]])], dtype=object), 'Index': 0, 'right': {'Groups': array([array([], shape=(0, 3), dtype=float64),
array([[ 1., 4., 0.],
[ 2., 3., 0.]])], dtype=object), 'Index': 0, 'right': array([[ 1., 4., 0.],
[ 2., 3., 0.]]), 'Value': 1.0, 'left': array([], shape=(0, 3), dtype=float64)}, 'Value': 1.0, 'left': array([], shape=(0, 3), dtype=float64)}}