Python sklearn random forest model does not remove the header from a DataFrame


I am trying to feed the data below into a random forest algorithm using sklearn.

Data (shown in CSV format):

My code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

master_training_set_path = "data_bank/cleaning_data/master_training_data_id/master_train_one_hot.csv"
df = pd.read_csv(master_training_set_path)
labels = np.array(df["labels"].values)

train, test, train_labels, test_labels = train_test_split(df, labels,
                                                      stratify=labels,
                                                      test_size=0.3)
model = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features='sqrt')

# this is the problematic line
model.fit(train, train_labels)

The problematic line is the last one. When I run it, it returns the following traceback:

Traceback (most recent call last):
  File "path\random_forest.py", line 39, in <module>
    model.fit(train, train_labels)
  File "path\sklearn\ensemble\forest.py", line 247, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "path\sklearn\utils\validation.py", line 434, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

ValueError: could not convert string to float: 'self-declared'
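
I have already tried making sure that the train and train_labels variables are 2D numpy arrays, but I still get the same error.

My confusion comes from the fact that 'self-declared' is not a value but the name of one of the features in the dataset. Why doesn't sklearn remove the header before training on the data?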

The code runs for me on scikit-learn version 0.23.1. If you are using an older version, you can try updating with:

conda install scikit-learn=0.23.1
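
Before updating, it may help to confirm which version is actually installed. A minimal sketch, assuming scikit-learn is importable in the same environment that runs the script:

```python
import sklearn

# Version string of the installed scikit-learn; the value depends on your environment.
installed_version = sklearn.__version__
print(installed_version)
```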
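
The problem may be that you are passing df to train_test_split. That call will run, but the train and test it creates are DataFrames (with headers) rather than feature matrices, which may cause problems for the model. So you can try replacing: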

train, test, train_labels, test_labels = train_test_split(df, labels,
                                                      stratify=labels,
                                                      test_size=0.3)

with this:

df.drop(['labels'],axis=1,inplace=True) #you have labels in the training set as well.
train, test, train_labels, test_labels = train_test_split(df.values, labels,
                                                      stratify=labels,
                                                      test_size=0.3)
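
The same idea can be sketched without mutating df in place, together with a quick check for string-valued columns before fitting. The small DataFrame below is a hypothetical stand-in for the one-hot encoded CSV from the question, and the column names other than labels and self-declared are assumptions for illustration only:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the one-hot encoded CSV from the question.
df = pd.DataFrame({
    "age": [34, 51, 29, 42, 38, 60, 25, 47, 33, 55],
    "self-declared": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "labels": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Keep the target column out of the feature matrix.
X = df.drop(columns=["labels"])
y = df["labels"].to_numpy()

# Any column listed here holds non-numeric data and would likely trigger
# "could not convert string to float" inside model.fit.
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
print("non-numeric columns:", non_numeric)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

If non_numeric is not empty, the offending columns are worth inspecting in the CSV itself before changing anything in the modelling code.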

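The code runs for me (maybe take another look at your CSV file). Side note: be aware that train and test contain the labels.

OK, I looked at the CSV file and there were a few small problems, namely header rows inside the dataset. Thanks for your help.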
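
A CSV with header lines repeated inside the data (for example after concatenating several files) can be checked for and cleaned roughly as follows. This is only a sketch under assumptions: the inline CSV is made-up sample data, and a row is treated as a stray header when its labels cell literally contains the text "labels".

```python
import io

import pandas as pd

# Made-up sample data: the header line appears a second time in the middle.
raw_csv = """age,self-declared,labels
34,1,0
age,self-declared,labels
51,0,1
29,1,1
"""

df = pd.read_csv(io.StringIO(raw_csv))

# The stray header row forces every column to be read as strings.
# Flag rows whose "labels" cell is the column name itself.
is_header_row = df["labels"].astype(str) == "labels"
stray_count = int(is_header_row.sum())
print("stray header rows:", stray_count)

# Drop the flagged rows, then convert the remaining strings back to numbers.
clean = df.loc[~is_header_row].apply(pd.to_numeric).reset_index(drop=True)
print(clean.dtypes)
```

After this cleanup the columns are numeric again, so the string 'self-declared' no longer reaches model.fit as a data value.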