Python: TypeError when trying to build a decision tree

python pandas machine-learning

I'm trying to build a decision tree. Here is my data:

import pandas as pd

d = {'height':[0,0,1,1,1],'length':[1,1,0,0,1],'width':[0,0,1,1,1],'label':['Apple','Apple','Grape','Grape','Lemon']}

training_data = pd.DataFrame(d)
training_data

This is the code I use to set up the questions that partition the data:

class Question:
    
    #used for the threshold used to partition the data
    def __init__(self, column, value):
        self.column = column #storing a column number
        self.value = value #storing a column value
        
    def match(self,example):
        
        #comparing feature value in an example to the 
        #feature value in the question
        
        val = example[self.column]
        if is_numeric(val):

            if val == 0:
                return int(val) >= self.value

            if val == 1:
                return val <= self.value
        
        #if the value is numeric, see if the value is greater than or
        #equal to three for example, return this in a separate branch
        else:
            return val == self.value
        #if the value is not numeric return it in the other branch
        #with things that aren't numeric and aren't greater than or
        #equal to three, for example

    def __repr__(self):
        
        #printing the question in a readable format
        
        condition = '=='
        
        if is_numeric(self.value):
            condition = '>='
        return "Is %s %s %s?" % (
            header[self.column], condition, int(self.value))
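
The class above calls `is_numeric` and reads `header`, neither of which is defined in the post. A minimal sketch of what they might look like follows; both definitions are assumptions for illustration, not the asker's actual code.

```python
import numbers

# Assumed column names, matching the keys of the dict `d` in the question.
header = ['height', 'length', 'width', 'label']


def is_numeric(value):
    """Assumed helper: True for int/float values, False for strings."""
    # bool is a subclass of int, so it is excluded explicitly.
    return isinstance(value, numbers.Number) and not isinstance(value, bool)
```

With a helper like this, `is_numeric(1)` is true and `is_numeric('Apple')` is false, which is the distinction `match` and `__repr__` rely on.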

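This is the error I get; please help! I know my data is all numeric, but I don't understand why a value is being classified as a string. I tried converting the values to float and then to int, and I also tried pd.to_numeric().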

TypeError: '>=' not supported between instances of 'int' and 'str'

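Where in the data flow does it become a string? I don't see that you have traced its history. Initial debugging of 70 lines of code is mostly your job. Please provide the expected minimal, reproducible example. Show where the intermediate results differ from what you expected. We should be able to copy and paste a contiguous block of your code, execute that file, and reproduce your problem along with tracing output for the problem points. This lets us test our suggestions against your test data and desired output.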
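
The tracing the comment asks for could start with something like the sketch below. It checks which columns pandas treats as numeric and reports the types on both sides of a comparison. `operand_types` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

d = {'height': [0, 0, 1, 1, 1], 'length': [1, 1, 0, 0, 1],
     'width': [0, 0, 1, 1, 1],
     'label': ['Apple', 'Apple', 'Grape', 'Grape', 'Lemon']}
training_data = pd.DataFrame(d)

# Which columns does pandas itself consider numeric?
numeric_columns = {
    name: pd.api.types.is_numeric_dtype(training_data[name])
    for name in training_data.columns
}
# The three feature columns are numeric; only 'label' holds strings.


def operand_types(left, right):
    """Hypothetical tracing helper: name the types on each side of a comparison."""
    return type(left).__name__, type(right).__name__
```

The error message names the left operand as `int` and the right operand as `str`. In `int(val) >= self.value` the left side is `int(val)`, so the string is `self.value`, the threshold passed to `Question`, rather than anything read from the row. Calling `operand_types(int(val), self.value)` just before the comparison would confirm this.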

def find_best_split(df):
    #keeping track of best information gain
    best_gain = 0
    #keep track of the feature/value that produced it
    best_question = None
    current_uncertainty = gini(df)
    n_features = len(df.columns[0:-1]) #number of columns, goes from 0 to x
    
    #iterating through the "features"(columns) in the range of columns
    for col in range(n_features):
        
        #set() builds an unordered collection of unique elements
        #unique values in the columns
        values = set([row[col] for row in df])
        
        #iterating through all values
        for val in values:
            
            question = Question(col, val)
            
            #try splitting the dataset
            true_rows, false_rows = partition(df, question)
            
            #allowing this to skip the previous step if the partitioning
            #question doesn't end up separating the data
            
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue
                
            gain = info_gain(true_rows, false_rows, current_uncertainty)
            
            #can normally just use >, but >= is specific to this example
            #and we will see why
            if gain >= best_gain:
                best_gain, best_question = gain, question
                
    return best_gain, best_question

best_gain, best_question = find_best_split(training_data)
best_question

TypeError                                 Traceback (most recent call last)
<ipython-input-79-48db17d94fd7> in <module>
      1 #example: find the best question to ask for this training dataset
      2 
----> 3 best_gain, best_question = find_best_split(training_data)
      4 best_question

<ipython-input-78-776718e68801> in find_best_split(df)
     20 
     21             #try splitting the dataset
---> 22             true_rows, false_rows = partition(df, question)
     23 
     24             #allowing this to skip the previous step if the partitioning

<ipython-input-64-c3975579f55f> in partition(df, question)
     18         true_rows, false_rows = [],[]
     19         for loc, row in df.iterrows():
---> 20             if question.match(row):
     21                 true_rows.append(row)
     22             else:

<ipython-input-59-30d3649cbfba> in match(self, example)
     18 
     19             if val == 0:
---> 20                 return int(val) >= self.value
     21 
     22             if val == 1:

TypeError: '>=' not supported between instances of 'int' and 'str'
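
One likely source of the string, traced with the data from the question: in `find_best_split`, the line `values = set([row[col] for row in df])` iterates over the DataFrame itself. Iterating over a DataFrame yields its column labels, not its rows, so `row[col]` picks a single character out of each label. The sketch below shows this. The `.iloc`-based alternative at the end is one possible fix under that assumption, not necessarily the only one.

```python
import pandas as pd

d = {'height': [0, 0, 1, 1, 1], 'length': [1, 1, 0, 0, 1],
     'width': [0, 0, 1, 1, 1],
     'label': ['Apple', 'Apple', 'Grape', 'Grape', 'Lemon']}
training_data = pd.DataFrame(d)

col = 0

# Iterating over a DataFrame yields its column labels, not its rows.
labels = [row for row in training_data]
# -> ['height', 'length', 'width', 'label']

# So row[col] indexes into each label string and picks one character.
buggy_values = set(row[col] for row in training_data)
# -> {'h', 'l', 'w'}

# Reading the unique values from the column itself keeps them numeric.
fixed_values = set(training_data.iloc[:, col].tolist())
# -> {0, 1}
```

With the character values, `Question(col, val)` stores a string such as `'h'` as its threshold, and `int(val) >= self.value` then compares an `int` with a `str`, which matches the traceback. Converting the DataFrame's contents with `pd.to_numeric()` has no effect on this because the strings never come from the data.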