Python: How to use LabelEncoder and Random Forest in a pipeline
I am trying to use Pipeline from scikit-learn. Currently I am doing the following:

Applying LabelEncoder to some of the features
Building a random forest regressor

The code is:

x['zipcode'] = labelencoder.fit_transform(x['zipcode'])
rfr = RandomForestRegressor(n_estimators=20, random_state=0)
rfr.fit(x, y)

How can I build a pipeline so that future unseen data goes through the same transformation?

You don't need to put the LabelEncoder transformation inside the sklearn Pipeline. A possible solution is to fit the LabelEncoder separately and save its classes, for example:
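import numpy as np
from sklearn.preprocessing import LabelEncoder

lbl = LabelEncoder()
lbl.fit(X)  # X is the 1-D column to encode, e.g. x['zipcode']
# Save the learned classes (use lbl here; `encoder` was never defined)
np.save('lbl_encoder.npy', lbl.classes_)

and load it when needed:

lbl = LabelEncoder()  # instantiate the class; the parentheses are required
# allow_pickle=True is needed when classes_ has object dtype (e.g. strings from pandas)
lbl.classes_ = np.load('lbl_encoder.npy', allow_pickle=True)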
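As a rough sketch of the save-and-reload round trip described above, the following fits an encoder on a few made-up zipcode strings, stores `classes_`, and restores it into a fresh `LabelEncoder`. The sample data, the temporary directory, and the variable names are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical training column; any 1-D array of labels works.
train_zipcodes = np.array(["10001", "94105", "60601", "10001"])

lbl = LabelEncoder()
lbl.fit(train_zipcodes)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lbl_encoder.npy")
    np.save(path, lbl.classes_)

    restored = LabelEncoder()
    # allow_pickle=True only matters for object-dtype classes; it is harmless here.
    restored.classes_ = np.load(path, allow_pickle=True)

# The restored encoder maps labels to the same integers as the original one.
new_zipcodes = np.array(["94105", "10001"])
encoded = restored.transform(new_zipcodes)
print(encoded)
```

Because `classes_` is stored in sorted order, the restored encoder assigns the same integer to each label as the encoder that was originally fitted.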
To solve this problem, I would create my own pipeline.
Consider this simple example, which you can customize and extend to suit your requirements:
import copy
import pickle
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np


class MyPipeline:
    def __init__(self, column='zipcode', n_estimators=10, random_state=0):
        # add any parameters as required
        # customize as required
        assert isinstance(column, str)
        self.column = column
        self.label_enc = LabelEncoder()
        self.model = RandomForestRegressor(n_estimators=n_estimators,
                                           random_state=random_state,
                                           n_jobs=1)

    def preprocess(self, x):
        processed_x = copy.deepcopy(x)
        processed_x[self.column] = self.label_enc.fit_transform(processed_x[self.column])
        return np.array(processed_x)

    def fit(self, x, y):
        # also customize as required
        transformed_x = self.preprocess(x)
        print("X After Transform: \n{}\n".format(transformed_x))
        self.model.fit(transformed_x, y)

    def predict(self, unseen_x):
        # also customize as required
        processed_x = copy.deepcopy(unseen_x)
        processed_x[self.column] = self.label_enc.transform(processed_x[self.column])
        print("Unseen Data After Transform: \n{}\n".format(np.array(processed_x)))
        return self.model.predict(np.array(processed_x))
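One caveat with this approach: `LabelEncoder.transform` raises a `ValueError` when it meets a label that was not present during `fit`, so the `predict` method above would fail on a brand-new zipcode. A minimal sketch of one possible workaround follows. The helper `encode_with_unknown`, the `-1` fallback code, and the sample labels are hypothetical choices and not part of the original answer.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def encode_with_unknown(encoder, values, unknown_code=-1):
    """Encode values with a fitted LabelEncoder, mapping unseen labels to unknown_code.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values)
    known = np.isin(values, encoder.classes_)
    encoded = np.full(values.shape, unknown_code, dtype=int)
    if known.any():
        # Only labels seen during fit are passed to transform.
        encoded[known] = encoder.transform(values[known])
    return encoded


enc = LabelEncoder().fit(["zip_a", "zip_b", "zip_c"])

# Plain transform fails on a label that was not present during fit.
try:
    enc.transform(["zip_z"])
    raised = False
except ValueError:
    raised = True

codes = encode_with_unknown(enc, ["zip_b", "zip_z", "zip_a"])
print(raised, codes)
```

Whether a sentinel such as `-1` is appropriate depends on the model; tree-based regressors will at least accept it as an ordinary numeric value.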
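Thanks. However, I plan to deploy the final model. With a pipeline, you can save the model as a pkl file, which will also include all the transformers.

Test, followed by its output: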
x = pd.DataFrame(columns=['blabla', 'zipcode'],
                 data=[[1, 'zipecode1'], [2, 'zipecode2'], [3, 'zipecode3'],
                       [4, 'zipecode4'], [5, 'zipecode5'], [6, 'zipecode6'],
                       [7, 'zipecode7'], [8, 'zipecode8'], [9, 'zipecode9']])
y = [10, 20, 30, 40, 50, 60, 70, 80, 90]

mypipeline = MyPipeline()
mypipeline.fit(x, y)

# save it for future work
with open('mypipeline.dat', 'wb') as pickle_file:
    pickle.dump(mypipeline, pickle_file)

# retrieve it
with open('mypipeline.dat', 'rb') as pickle_file:
    mypipeline_ = pickle.load(pickle_file)

# Here I am passing the same x just to make sure it's doing the proper transformation
result = mypipeline_.predict(x)

# the result
print("Results: {}".format(result))
X After Transform:
[[1 0]
 [2 1]
 [3 2]
 [4 3]
 [5 4]
 [6 5]
 [7 6]
 [8 7]
 [9 8]]

Unseen Data After Transform:
[[1 0]
 [2 1]
 [3 2]
 [4 3]
 [5 4]
 [6 5]
 [7 6]
 [8 7]
 [9 8]]

Results: [12. 18. 28. 37. 47. 56. 69. 79. 82.]
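
The comment above notes that a scikit-learn pipeline can be pickled together with all of its transformers. A minimal sketch of that idea follows. `LabelEncoder` is intended for target labels and its `fit_transform` accepts a single argument, so it does not drop into a `Pipeline` step directly. This sketch therefore swaps in `OrdinalEncoder` inside a `ColumnTransformer`. That substitution, the `-1` code for unseen zipcodes (which needs scikit-learn 0.24 or later), and the sample data are assumptions rather than something taken from the answers above.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data with one categorical and one numeric column.
x = pd.DataFrame({
    "blabla": [1, 2, 3, 4, 5, 6],
    "zipcode": ["z1", "z2", "z3", "z4", "z5", "z6"],
})
y = [10, 20, 30, 40, 50, 60]

# Encode only the zipcode column; pass the remaining columns through unchanged.
encode_zipcode = ColumnTransformer(
    transformers=[
        ("zipcode",
         OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         ["zipcode"]),
    ],
    remainder="passthrough",
)

pipe = Pipeline(steps=[
    ("encode", encode_zipcode),
    ("rfr", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipe.fit(x, y)

# Pickling the pipeline stores the fitted encoder and the model together.
restored = pickle.loads(pickle.dumps(pipe))

# Unseen data goes through the same transformation; "z_new" is encoded as -1.
unseen = pd.DataFrame({"blabla": [2, 7], "zipcode": ["z2", "z_new"]})
predictions = restored.predict(unseen)
print(predictions)
```

With this layout a single `pickle.dump(pipe, file)` is enough for deployment, since the encoder no longer has to be saved and reloaded separately.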