Python: "cannot perform reduce with flexible type" in an sklearn pipeline


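I am trying to implement an sklearn pipeline; my code is below. This is the tips dataset. I want to label-encode the binary features, one-hot encode the day column, and scale an entire column. Below is one of my classes (the other two have almost the same structure, so I won't post them; they raise the same error as this one).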

When I try to fit the pipeline, I get the following error:

ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

cannot perform reduce with flexible type

I tried changing transform as follows:

def transform(self, X):
    encoder = LabelEncoder()
    return encoder.fit_transform(X[[self.column]])

When I do this, I get the following error:

ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

cannot perform reduce with flexible type

Can anyone help me? I did search for the errors above, but I could not fix them.

Thanks.

The first problem I noticed is the __init__ method in your class:

  def __init__(self, column=None):
      for column in cols_to_encode:
          self.column = column

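As I understand it, you are trying to pass in the list of columns to encode, but no loop is needed there (for either encoder): each pass overwrites self.column, so only the last column name is kept. You can simply store the list: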

  def __init__(self, columns):
      self.columns = columns
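To make the difference concrete, here is a minimal sketch comparing the two constructors. The class names `LoopInit` and `ListInit` and the column names in `cols_to_encode` are assumptions made for illustration; they are not taken from the question.

```python
# Assumed column names, for illustration only.
cols_to_encode = ["sex", "smoker", "time"]


class LoopInit:
    """Mimics the original __init__: the loop overwrites self.column on each pass."""

    def __init__(self, column=None):
        for column in cols_to_encode:
            self.column = column


class ListInit:
    """Stores the whole list, as suggested above."""

    def __init__(self, columns):
        self.columns = columns


print(LoopInit().column)                 # only the last name survives the loop
print(ListInit(cols_to_encode).columns)  # the full list is kept
```

The looping version also ignores its `column` argument entirely and depends on a global variable, which makes the transformer hard to reuse.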

For the one-hot encoding, I think pd.get_dummies() is more elegant than OneHotEncoder, so the transform function would be:

    def transform(self, X):
        '''
        1.Copy the information from original df, 
        2.Drop the old column from new df, 
        3.Return the new df with one hot encoded columns'''
        new_df = X.copy(deep=True)
        new_df.drop(self._cols_one_hot,axis=1,inplace=True)
        return new_df.join(pd.get_dummies(X[self._cols_one_hot]))
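As a quick check of what this transform does, the sketch below applies the same drop-and-join steps to a tiny hand-made frame. The frame and its values are assumptions standing in for the tips data.

```python
import pandas as pd

# Hypothetical stand-in for the tips data; names and values are assumptions.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "day": ["Sun", "Sat", "Sun"],
})
cols_one_hot = ["day"]

# Drop the original column, then join one indicator column per category.
encoded = df.drop(columns=cols_one_hot).join(pd.get_dummies(df[cols_one_hot]))
print(encoded.columns.tolist())
```

pd.get_dummies names each new column after the source column and the category, so the day column becomes day_Sat and day_Sun here.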

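As for the label encoder, it will not work on several columns at once, because LabelEncoder does not support encoding multiple columns. You therefore have to visit each column and encode it separately: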

    def transform(self, X):
        '''
        1. Copy the information from original df,
        2. Label encode the columns,
        3. Drop the old columns,
        4. Return the new df with label encoded columns
        '''
        new_df = X.copy(deep=True)
        label_encoded_cols = new_df[self._cols_label_encode].apply(LabelEncoder().fit_transform)
        new_df.drop(self._cols_label_encode, axis=1, inplace=True)
        return new_df.join(label_encoded_cols)
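The sketch below illustrates both halves of that claim on a small hand-made frame: passing two columns to LabelEncoder at once is rejected, while DataFrame.apply hands it one column at a time. The frame and the `raised_for_2d` flag are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary columns; names and values are assumptions.
df = pd.DataFrame({
    "sex": ["Female", "Male", "Male"],
    "smoker": ["No", "No", "Yes"],
})

# LabelEncoder expects a 1-D input, so a two-column frame is rejected.
raised_for_2d = False
try:
    LabelEncoder().fit_transform(df[["sex", "smoker"]])
except ValueError:
    raised_for_2d = True

# DataFrame.apply passes each column as a 1-D Series,
# so a fresh encoder can be fitted per column.
label_encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))
print(label_encoded)
```

LabelEncoder assigns integer codes in sorted order of the class labels, so "Female" and "No" map to 0 in this frame.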

The final solution would be:

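import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder


class onehotencode(BaseEstimator, TransformerMixin):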
    def __init__(self, cols_one_hot):
        self._cols_one_hot = cols_one_hot

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        1.Copy the information from original df, 
        2.Drop the old column from new df, 
        3.Return the new df with one hot encoded columns
        '''
        new_df = X.copy(deep=True)
        new_df.drop(self._cols_one_hot,axis=1,inplace=True)
        return new_df.join(pd.get_dummies(X[self._cols_one_hot]))


class labelencode(BaseEstimator, TransformerMixin):
    def __init__(self, cols_label_encode):
        self._cols_label_encode = cols_label_encode

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        1.Copy the information from original df, 
        2.Label encode the columns,
        3.Drop the old columns, 
        4.Return the new df with label encoded columns
        '''
        new_df = X.copy(deep=True)
        label_encoded_cols = new_df[self._cols_label_encode].apply(LabelEncoder().fit_transform)
        new_df.drop(self._cols_label_encode,axis=1,inplace=True)
        return new_df.join(label_encoded_cols)

The pipeline would then be called as:

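from sklearn.pipeline import Pipeline

# cols_to_encode and cols_to_encode_label are the lists of column names
# to one-hot encode and to label-encode, respectively.
pipeline = Pipeline([('ohe', onehotencode(cols_to_encode)),
                     ('le', labelencode(cols_to_encode_label))])

df_transformed = pipeline.fit_transform(df)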

df_transformed.head() will print:
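A minimal end-to-end sketch follows, using a small hand-made frame in place of the tips dataset; its values are assumptions. The class names `OneHotEncodeColumns` and `LabelEncodeColumns` are hypothetical variants of the transformers above that store each constructor argument under its own name (the scikit-learn convention that get_params relies on) and list TransformerMixin before BaseEstimator.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder


class OneHotEncodeColumns(TransformerMixin, BaseEstimator):
    """One-hot encode the given columns with pd.get_dummies."""

    def __init__(self, cols_one_hot):
        self.cols_one_hot = cols_one_hot

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dummies = pd.get_dummies(X[self.cols_one_hot])
        return X.drop(columns=self.cols_one_hot).join(dummies)


class LabelEncodeColumns(TransformerMixin, BaseEstimator):
    """Label-encode the given columns, one column at a time."""

    def __init__(self, cols_label_encode):
        self.cols_label_encode = cols_label_encode

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        encoded = X[self.cols_label_encode].apply(
            lambda col: LabelEncoder().fit_transform(col)
        )
        return X.drop(columns=self.cols_label_encode).join(encoded)


# Hypothetical stand-in for the tips data; values are assumptions.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "sex": ["Female", "Male", "Male", "Male"],
    "smoker": ["No", "No", "Yes", "No"],
    "day": ["Sun", "Sat", "Sun", "Thur"],
})

pipeline = Pipeline([
    ("ohe", OneHotEncodeColumns(["day"])),
    ("le", LabelEncodeColumns(["sex", "smoker"])),
])

df_transformed = pipeline.fit_transform(df)
print(df_transformed.head())
```

Because each step drops its source columns and joins the encoded ones at the end, the resulting frame keeps total_bill first, followed by the day indicator columns and then the encoded sex and smoker columns.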