Python: TypeError when trying to build a decision tree
I'm trying to build a decision tree. This is my data:
import pandas as pd

d = {'height':[0,0,1,1,1],'length':[1,1,0,0,1],'width':[0,0,1,1,1],'label':['Apple','Apple','Grape','Grape','Lemon']}
training_data = pd.DataFrame(d)
training_data
This is the code I use to define the questions that partition the data:
class Question:
    #used for the threshold used to partition the data
    def __init__(self, column, value):
        self.column = column #storing a column number
        self.value = value #storing a column value

    def match(self, example):
        #comparing feature value in an example to the
        #feature value in the question
        val = example[self.column]
        if is_numeric(val):
            if val == 0:
                return int(val) >= self.value
            if val == 1:
                return val <= self.value
            #if the value is numeric, see if the value is greater than or
            #equal to three for example, return this in a separate branch
        else:
            return val == self.value
            #if the value is not numeric return it in the other branch
            #with things that aren't numeric and aren't greater than or
            #equal to three, for example

    def __repr__(self):
        #printing the question in a readable format
        condition = '=='
        if is_numeric(self.value):
            condition = '>='
        return "Is %s %s %s?" % (
            header[self.column], condition, int(self.value))
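This code raises a TypeError (full traceback below). Please help!
I know my data is all numeric, so I don't understand why one of the values is being treated as a string. I tried converting the values to float and then to int, and I also tried pd.to_numeric().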
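The claim that the feature columns are numeric can be checked directly. The following is a minimal sketch, assuming the same `training_data` as above; `numeric_flags` is a name introduced here for illustration only. If the three feature columns already report a numeric dtype, converting them again with `pd.to_numeric()` would not be expected to change anything, which suggests the string originates somewhere other than the stored data.

```python
import pandas as pd

d = {'height': [0, 0, 1, 1, 1], 'length': [1, 1, 0, 0, 1],
     'width': [0, 0, 1, 1, 1],
     'label': ['Apple', 'Apple', 'Grape', 'Grape', 'Lemon']}
training_data = pd.DataFrame(d)

# Ask pandas which columns it considers numeric.
# `numeric_flags` is a hypothetical helper name, not part of the original post.
numeric_flags = {col: bool(pd.api.types.is_numeric_dtype(training_data[col]))
                 for col in training_data.columns}
print(numeric_flags)
```

Only the `label` column should be flagged as non-numeric here.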
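Where in the data flow does it become a string? I don't see that you have traced its history. With 70 lines of code in a first attempt, that tracing is mostly your job. Please provide the expected minimal, reproducible example. Show where the intermediate results differ from what you expected. We should be able to copy and paste a contiguous block of your code, execute that file, and reproduce your problem along with tracing output at the problem points. This lets us test our suggestions against your test data and desired output.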
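One way to trace where a string enters the data flow is to run the value-collecting expression from `find_best_split` on its own. The sketch below assumes the same `training_data` as in the question; `column_labels`, `values`, and `message` are names introduced here for illustration. It relies on the fact that iterating over a pandas DataFrame yields its column labels rather than its rows, so `row[col]` indexes into a label string.

```python
import pandas as pd

d = {'height': [0, 0, 1, 1, 1], 'length': [1, 1, 0, 0, 1],
     'width': [0, 0, 1, 1, 1],
     'label': ['Apple', 'Apple', 'Grape', 'Grape', 'Lemon']}
training_data = pd.DataFrame(d)

# Iterating over a DataFrame yields column labels, not rows.
column_labels = [row for row in training_data]
print(column_labels)   # ['height', 'length', 'width', 'label']

# The same expression used in find_best_split, for col = 0:
# each "row" is a label string, so row[0] is its first character.
col = 0
values = set([row[col] for row in training_data])
print(sorted(values))  # ['h', 'l', 'w']

# Comparing an int against one of those characters reproduces the error text.
try:
    0 >= 'h'
except TypeError as exc:
    message = str(exc)
print(message)
```

Under this reading, `Question(col, val)` receives a single character such as `'h'` as its value, and `int(val) >= self.value` in `match` then compares an `int` with a `str`.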
def find_best_split(df):
    #keeping track of best information gain
    best_gain = 0
    #keep track of the feature/value that produced it
    best_question = None
    current_uncertainty = gini(df)
    n_features = len(df.columns[0:-1]) #number of feature columns (every column except the last)
    #iterating through the "features"(columns) in the range of columns
    for col in range(n_features):
        #set() builds an unordered collection of unique elements
        #unique values in the columns
        values = set([row[col] for row in df])
        #iterating through all values
        for val in values:
            question = Question(col, val)

            #try splitting the dataset
            true_rows, false_rows = partition(df, question)

            #allowing this to skip the previous step if the partitioning
            #question doesn't end up separating the data
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue
            gain = info_gain(true_rows, false_rows, current_uncertainty)
            #can normally just use >, but >= is specific to this example
            #and we will see why
            if gain >= best_gain:
                best_gain, best_question = gain, question
    return best_gain, best_question
best_gain, best_question = find_best_split(training_data)
best_question
TypeError Traceback (most recent call last)
<ipython-input-79-48db17d94fd7> in <module>
1 #example: find the best question to ask for this training dataset
2
----> 3 best_gain, best_question = find_best_split(training_data)
4 best_question
<ipython-input-78-776718e68801> in find_best_split(df)
20
21 #try splitting the dataset
---> 22 true_rows, false_rows = partition(df, question)
23
24 #allowing this to skip the previous step if the partitioning
<ipython-input-64-c3975579f55f> in partition(df, question)
18 true_rows, false_rows = [],[]
19 for loc, row in df.iterrows():
---> 20 if question.match(row):
21 true_rows.append(row)
22 else:
<ipython-input-59-30d3649cbfba> in match(self, example)
18
19 if val == 0:
---> 20 return int(val) >= self.value
21
22 if val == 1:
TypeError: '>=' not supported between instances of 'int' and 'str'
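The `str` in the error message is consistent with `set([row[col] for row in df])` iterating over column labels instead of rows. A possible direction for a fix, sketched under assumptions: select the column by position with `DataFrame.iloc` and take its distinct values. The helper names `unique_feature_values` and `partition_by_value` are hypothetical and not part of the original code, and the threshold split shown here is a simplification, not the `Question.match` logic from the post.

```python
import pandas as pd

d = {'height': [0, 0, 1, 1, 1], 'length': [1, 1, 0, 0, 1],
     'width': [0, 0, 1, 1, 1],
     'label': ['Apple', 'Apple', 'Grape', 'Grape', 'Lemon']}
training_data = pd.DataFrame(d)


def unique_feature_values(df, col):
    """Return the distinct values stored in the column at position `col`."""
    # .iloc[:, col] selects a column by integer position;
    # .tolist() converts NumPy scalars to plain Python values.
    return set(df.iloc[:, col].tolist())


def partition_by_value(df, col, value):
    """Split df into rows where column `col` is >= value, and the rest."""
    mask = df.iloc[:, col] >= value
    return df[mask], df[~mask]


print(unique_feature_values(training_data, 0))  # {0, 1}

true_rows, false_rows = partition_by_value(training_data, 0, 1)
print(len(true_rows), len(false_rows))          # 3 2
```

With values collected this way, each candidate question would carry an integer, so the comparison in `match` would no longer mix `int` and `str`. Inside `match`, `example.iloc[self.column]` may also be a safer way to read a value by position from a row whose index consists of column names.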