Python: How to use LabelEncoder and Random Forest in a Pipeline


I am trying to use Pipeline from scikit-learn. Currently I am doing the following:

  • Apply LabelEncoder to some of the features
  • Build a random forest regressor

The code is:

    x['zipcode'] = labelencoder.fit_transform(x['zipcode'])
    
    rfr = RandomForestRegressor(n_estimators=20, random_state=0)
    
    rfr.fit(x, y)
    

    How can I build a Pipeline so that future unseen data goes through the same transformation?

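    You do not need to put the LabelEncoder transformation inside the sklearn Pipeline. (LabelEncoder is designed for target labels and accepts a single 1-D array, so it does not match the transformer interface a Pipeline step expects.) A possible solution is to fit the LabelEncoder separately and save its classes, for example: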

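    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    lbl = LabelEncoder()
    # LabelEncoder expects a 1-D array, so fit it on the single column to encode
    lbl.fit(x['zipcode'])
    # save the learned classes so the encoder can be rebuilt later
    np.save('lbl_encoder.npy', lbl.classes_)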
    
    
    
    and load it when needed:

    
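    lbl = LabelEncoder()
    # allow_pickle=True is required when the classes are strings stored as an object array
    lbl.classes_ = np.load('lbl_encoder.npy', allow_pickle=True)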
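
To make the save-and-reload idea concrete, here is a minimal round-trip sketch. The zipcode values and the temporary file location are made up for illustration and are not from the original question. It also shows a limitation worth knowing: a label that was not present at fit time makes `transform` raise a `ValueError`.

```python
import os
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical training values for the 'zipcode' column (object dtype, as pandas would give).
train_zipcodes = np.array(["98101", "98052", "98101", "10001"], dtype=object)

lbl = LabelEncoder()
lbl.fit(train_zipcodes)

# Persist only the learned classes (the sorted unique labels).
path = os.path.join(tempfile.mkdtemp(), "lbl_encoder.npy")
np.save(path, lbl.classes_)

# Later, possibly in another process: rebuild the encoder from the saved classes.
restored = LabelEncoder()
restored.classes_ = np.load(path, allow_pickle=True)

# Known labels map to the same integer codes as before.
codes = restored.transform(np.array(["10001", "98101"], dtype=object))
print(codes)  # [0 2]

# A label that was never seen during fit is rejected.
unseen_error = None
try:
    restored.transform(np.array(["99999"], dtype=object))
except ValueError as exc:
    unseen_error = exc
    print("unseen label:", exc)
```

If unseen categories are expected in production, this approach needs an explicit fallback, since the encoder alone will not handle them.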
    
    
    

    To solve this, I would create my own pipeline class. Consider this simple example, which you can customize and extend to fit your requirements:

    import copy
    import pickle
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.preprocessing import LabelEncoder
    import pandas as pd
    import numpy as np
    
    class MyPipeline:
        def __init__(self, column='zipcode', n_estimators=10, random_state=0):
            # add any parameters as required
            # customize as required
            assert isinstance(column, str)
            self.column = column
            self.label_enc = LabelEncoder()
            self.model = RandomForestRegressor(n_estimators=n_estimators,
                                               random_state=random_state,
                                               n_jobs=1)
    
        def preprocess(self, x):
            processed_x = copy.deepcopy(x)
            processed_x[self.column] = self.label_enc.fit_transform(processed_x[self.column])
            return np.array(processed_x)
    
        def fit(self, x, y):
            # also customize as required
            transformed_x = self.preprocess(x)
            print("X After Transform: \n{}\n".format(transformed_x))
            self.model.fit(transformed_x, y)
    
        def predict(self, unseen_x):
            # also customize as required
            processed_x = copy.deepcopy(unseen_x)
            processed_x[self.column] = self.label_enc.transform(processed_x[self.column])
            print("Unseen Data After Transform: \n{}\n".format(np.array(processed_x)))
            return self.model.predict(np.array(processed_x))
    

    Comment: Thanks, but I plan to deploy the final model. With a Pipeline, you can save the model as a pkl file that also includes all the transformers.

    Test:

    x = pd.DataFrame(columns=['blabla','zipcode'],
                     data=[[1, 'zipecode1'], [2,'zipecode2'], [3,'zipecode3'],
                           [4, 'zipecode4'], [5, 'zipecode5'], [6, 'zipecode6'],
                           [7, 'zipecode7'], [8, 'zipecode8'], [9, 'zipecode9']])
    y = [10,20,30,40,50,60,70,80,90]
    
    mypipeline = MyPipeline()
    mypipeline.fit(x,y)
    
    # save it for future work
    with open('mypipeline.dat', 'wb') as pickle_file:
        pickle.dump(mypipeline, pickle_file)
    
    # retrieve it
    with open('mypipeline.dat', 'rb') as pickle_file:
        mypipeline_ = pickle.load(pickle_file)
    
    # Here I am passing same x just to make sure it's doing proper transformation
    result = mypipeline_.predict(x)
    
    # the result
    print("Results: {}".format(result))
    
    Output:

    X After Transform: 
    [[1 0]
     [2 1]
     [3 2]
     [4 3]
     [5 4]
     [6 5]
     [7 6]
     [8 7]
     [9 8]]
    
    Unseen Data After Transform: 
    [[1 0]
     [2 1]
     [3 2]
     [4 3]
     [5 4]
     [6 5]
     [7 6]
     [8 7]
     [9 8]]
    
    Results: [12. 18. 28. 37. 47. 56. 69. 79. 82.]
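
Regarding the comment about deploying a single pkl file: one way to get that with scikit-learn's own Pipeline is sketched below. Note that this swaps LabelEncoder for OrdinalEncoder inside a ColumnTransformer, which is a different technique from both answers above, chosen because OrdinalEncoder works on feature columns. The sketch assumes scikit-learn 0.24 or newer (for `handle_unknown="use_encoded_value"`); the toy data, the step names, and the choice of `-1` for unknown categories are illustrative assumptions.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data in the same shape as the example above.
x = pd.DataFrame({
    "blabla": [1, 2, 3, 4, 5, 6],
    "zipcode": ["z1", "z2", "z3", "z1", "z2", "z3"],
})
y = [10, 20, 30, 40, 50, 60]

# Encode only 'zipcode'; pass the remaining columns through unchanged.
# Categories not seen during fit are mapped to -1 instead of raising an error.
encode_zip = ColumnTransformer(
    transformers=[
        ("zip",
         OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         ["zipcode"]),
    ],
    remainder="passthrough",
)

pipe = Pipeline(steps=[
    ("encode", encode_zip),
    ("rfr", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipe.fit(x, y)

# The whole pipeline (encoder and model) serializes as one object.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

# Unseen data, including a zipcode that was not in the training set.
unseen = pd.DataFrame({"blabla": [7], "zipcode": ["z9"]})
encoded_unseen = restored.named_steps["encode"].transform(unseen)
prediction = restored.predict(unseen)
print(encoded_unseen)
print(prediction)
```

Because the encoder is a fitted step of the pipeline, calling `predict` on new data applies the same transformation automatically, and writing `blob` to a file gives the single pkl artifact the comment asks for.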