Python: label encoding multiple categorical columns
I have a DataFrame with a mix of int, float, category and bool dtypes, and I am trying to use LabelEncoder.fit_transform to convert the category and bool data to int. It works perfectly on a single column, but when I loop over the DataFrame with a for loop, I get the error shown below:
from sklearn import preprocessing

relabel = preprocessing.LabelEncoder()
for i in first_buyer.columns:
    # Skip float, int and bool columns; label-encode everything else
    if str(first_buyer[i].dtypes) not in ["float64","int64","bool"]:
        first_buyer[i] = relabel.fit_transform(first_buyer[i])
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py in _encode(values, uniques, encode)
104 try:
--> 105 res = _encode_python(values, uniques, encode)
106 except TypeError:
~\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py in _encode_python(values, uniques, encode)
58 if uniques is None:
---> 59 uniques = sorted(set(values))
60 uniques = np.array(uniques, dtype=values.dtype)
TypeError: '<' not supported between instances of 'str' and 'int'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-17-42e60975f0b6> in <module>
4 for i in first_buyer.columns:
5 if str(first_buyer[i].dtypes) not in ["float64","int64","bool"]:
----> 6 first_buyer[i] = relabel.fit_transform(first_buyer[i])
~\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py in fit_transform(self, y)
234 """
235 y = column_or_1d(y, warn=True)
--> 236 self.classes_, y = _encode(y, encode=True)
237 return y
238
~\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py in _encode(values, uniques, encode)
105 res = _encode_python(values, uniques, encode)
106 except TypeError:
--> 107 raise TypeError("argument must be a string or number")
108 return res
109 else:
TypeError: argument must be a string or number
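I expect my code to convert all categorical variables to numbers so that I can train a model on my dataset.

Although I have not seen your DataFrame, the error is most likely caused by NAs in one of the columns, or by a column that holds mixed types. Try the following example: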
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
relabel = LabelEncoder()
# NaN in the DataFrame
data1 = pd.DataFrame([['a', 'b', 'c'] ,['1', '2', np.nan]], columns=['A', 'B', 'C'])
# Will raise an error
relabel.fit_transform(data1['C'])
# Mixed types
data2 = pd.DataFrame([['a', 'b', 'c'], ['1', '2', 3]], columns=['A', 'B', 'C'])
# Will raise an error
relabel.fit_transform(data2['C'])
# Clean data
data3 = pd.DataFrame([['a', 'b', 'c'], ['1', '2', '3']], columns=['A', 'B', 'C'])
# Will work
relabel.fit_transform(data3['C'])
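A column with mixed types gets past your filter because its dtype is "object".
Before using LabelEncoder, you should do some extra preprocessing to make sure each column has no missing values and holds values of a single type.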
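One way to find the columns that need this preprocessing is to count missing values and distinct Python types per column. This is only a sketch: the frame df and its column names are made up for illustration and are not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frame (not the asker's data): one clean column,
# one with a missing value, one with mixed str/int values
df = pd.DataFrame({
    "clean": ["a", "b", "c"],
    "with_na": ["a", "b", np.nan],
    "mixed": pd.Series(["a", "2", 3], dtype=object),
})

# Number of missing values in each column
na_counts = df.isna().sum()

# Number of distinct Python types among the non-missing values of each column
type_counts = df.apply(lambda col: col.dropna().map(type).nunique())

print(na_counts)
print(type_counts)
```

Any column with a non-zero count in na_counts, or a value above 1 in type_counts, is a candidate for the cleanup described here.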
If you know the dtypes of all the columns you want to convert, you should also filter with "in" rather than "not in".
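A whitelist filter of that kind might look like the sketch below. The sample frame is a hypothetical stand-in for first_buyer, and the list of dtypes to encode ("object", "category", "bool") is an assumption about what you want converted.

```python
import pandas as pd

# Hypothetical stand-in for first_buyer (not the asker's data)
first_buyer = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "count": [1, 2, 3],
    "is_new": [True, False, True],
    "city": pd.Series(["Oslo", "Rome", "Oslo"], dtype=object),
    "grade": pd.Categorical(["low", "high", "low"]),
})

# Whitelist: name the dtypes to encode instead of the ones to skip
to_encode = [
    col for col in first_buyer.columns
    if str(first_buyer[col].dtype) in ["object", "category", "bool"]
]

print(to_encode)
```

With a whitelist, a column of an unexpected dtype (for example datetime64) is left alone instead of being passed to LabelEncoder by accident.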
To fill in missing values and force a single type, you can use the fillna() and astype() methods of the column (a pandas Series):
# Define a dummy variable for missing values that is of the same type as the column
data1['C'] = data1['C'].fillna('DUMMY_VARIABLE_FOR_NA')
# Will work now
relabel.fit_transform(data1['C'])
data2['C'] = data2['C'].astype(str)
# Will work now
relabel.fit_transform(data2['C'])
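Putting the pieces together, the original loop could be rewritten roughly as follows. This is a sketch under assumptions: the sample frame stands in for first_buyer, the whitelist of dtypes is a guess at what should be encoded, and keeping one LabelEncoder per column in a dict named encoders is a design choice of this example, not something the question requires.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for first_buyer (not the asker's data)
first_buyer = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "city": pd.Series(["Oslo", np.nan, "Rome"], dtype=object),  # missing value
    "code": pd.Series(["a", "2", 3], dtype=object),             # mixed str/int
})

encoders = {}  # one fitted encoder per column, kept for inverse_transform
for col in first_buyer.columns:
    if str(first_buyer[col].dtype) in ["object", "category", "bool"]:
        # Fill missing values with a placeholder, then force everything to str
        cleaned = (
            first_buyer[col]
            .astype(object)
            .fillna("DUMMY_VARIABLE_FOR_NA")
            .astype(str)
        )
        encoders[col] = LabelEncoder()
        first_buyer[col] = encoders[col].fit_transform(cleaned)

print(first_buyer)
```

A separate encoder per column matters because refitting a single shared LabelEncoder overwrites its classes_ on every iteration, so only the last column could be decoded later with inverse_transform.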