Python: "cannot perform reduce with flexible type" in an sklearn pipeline

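I am trying to implement an sklearn pipeline; my code is below. This is the tips dataset: I am trying to label-encode the binary features, one-hot encode the day column, and scale the numeric columns. Below is one of my classes (the other two have almost the same structure, so I won't post them; they raise the same error as this one).

When I try to fit the pipeline, I get the following error: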

ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
cannot perform reduce with flexible type
I tried changing transform as follows:

def transform(self, X):
    encoder = LabelEncoder()
    return encoder.fit_transform(X[[self.column]])
When I do this, I get the following error:

ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
cannot perform reduce with flexible type
Can anyone help me? I did search for the errors above but could not fix them.
Thanks.

The first problem I notice is the __init__ method in your class:

  def __init__(self, column=None):
      for column in cols_to_encode:
          self.column = column
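As far as I understand, you are trying to assign a list of columns to be encoded, but the loop is not needed there (for either encoder): each iteration overwrites self.column, so only the last column name is kept. You can simply assign the list: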

  def __init__(self, columns):
      self.columns = columns
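A small sketch of why the loop loses information. The class names LoopInit and ListInit are hypothetical and only serve the illustration; the column list is the one quoted in the comments on the question.

```python
cols_to_encode_label = ["sex", "smoker", "time"]


class LoopInit:
    """Hypothetical class mirroring the __init__ from the question."""

    def __init__(self, column=None):
        for column in cols_to_encode_label:
            # Each pass overwrites the attribute, so only the last name survives.
            self.column = column


class ListInit:
    """Hypothetical class storing the whole list, as suggested above."""

    def __init__(self, columns):
        self.columns = columns


loop_obj = LoopInit()
list_obj = ListInit(cols_to_encode_label)

print(loop_obj.column)    # only the last column name
print(list_obj.columns)   # the full list
```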
For one-hot encoding, I find pd.get_dummies() more elegant than OneHotEncoder, so the transform function would be:

    def transform(self, X):
        '''
        1.Copy the information from original df, 
        2.Drop the old column from new df, 
        3.Return the new df with one hot encoded columns'''
        new_df = X.copy(deep=True)
        new_df.drop(self._cols_one_hot,axis=1,inplace=True)
        return new_df.join(pd.get_dummies(X[self._cols_one_hot]))
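To see what this transform produces, here is a minimal sketch on a few made-up rows shaped like the tips data. The sample values are illustrative only, and the drop/join steps are written as a plain expression rather than inside the class.

```python
import pandas as pd

# Illustrative rows only; not taken from the real dataset.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "day": ["Sun", "Sat", "Sun"],
})
cols_one_hot = ["day"]

# Same steps as the transform above: drop the original column,
# then join the dummy columns back on the index.
dummies = pd.get_dummies(df[cols_one_hot])
new_df = df.drop(columns=cols_one_hot).join(dummies)

print(list(new_df.columns))
```

Each category of day becomes its own indicator column, prefixed with the original column name.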
The label-encoder part will not work on several columns as written, because LabelEncoder does not support encoding multiple columns at once. You therefore have to visit each column and encode it separately:

    def transform(self, X):
        '''
        1.Copy the information from original df, 
        2.Label encode the columns,
        3.Drop the old columns, 
        4.Return the new df with label encoded columns'''
        new_df = X.copy(deep=True)
        # apply() passes one column (a 1D Series) at a time to LabelEncoder
        label_encoded_cols = new_df[self._cols_label_encode].apply(LabelEncoder().fit_transform)
        new_df.drop(self._cols_label_encode,axis=1,inplace=True)
        return new_df.join(label_encoded_cols)
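A short sketch of the per-column idea, assuming a tiny made-up frame with two of the binary columns. LabelEncoder works on a single 1D column, and DataFrame.apply hands it one column at a time.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows only; not taken from the real dataset.
df = pd.DataFrame({
    "sex": ["Female", "Male", "Male"],
    "smoker": ["No", "No", "Yes"],
})

# One column: pass a 1D Series (single brackets), not a 2D DataFrame.
one_col = LabelEncoder().fit_transform(df["sex"])

# Several columns: encode each column separately with its own encoder.
encoded = df[["sex", "smoker"]].apply(lambda col: LabelEncoder().fit_transform(col))

print(one_col)
print(encoded)
```

LabelEncoder assigns integer codes in sorted order of the class labels, so here "Female" and "No" map to 0.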
The final solution would be:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder


class onehotencode(BaseEstimator, TransformerMixin):
    def __init__(self, cols_one_hot):
        self._cols_one_hot = cols_one_hot

    def fit(self, X, y=None):
        # Nothing to learn; the encoding is computed in transform()
        return self

    def transform(self, X):
        '''
        1.Copy the information from original df, 
        2.Drop the old column from new df, 
        3.Return the new df with one hot encoded columns
        '''
        new_df = X.copy(deep=True)
        new_df.drop(self._cols_one_hot,axis=1,inplace=True)
        return new_df.join(pd.get_dummies(X[self._cols_one_hot]))


class labelencode(BaseEstimator, TransformerMixin):
    def __init__(self, cols_label_encode):
        self._cols_label_encode = cols_label_encode

    def fit(self, X, y=None):
        # Nothing to learn; the encoding is computed in transform()
        return self

    def transform(self, X):
        '''
        1.Copy the information from original df, 
        2.Label encode the columns,
        3.Drop the old columns, 
        4.Return the new df with label encoded columns
        '''
        new_df = X.copy(deep=True)
        # apply() passes one column (a 1D Series) at a time to LabelEncoder
        label_encoded_cols = new_df[self._cols_label_encode].apply(LabelEncoder().fit_transform)
        new_df.drop(self._cols_label_encode,axis=1,inplace=True)
        return new_df.join(label_encoded_cols)
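The pipeline would then be called as:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([('ohe',onehotencode(cols_to_encode)),
                    ('le',labelencode(cols_to_encode_label))])

df_transformed = pipeline.fit_transform(df)
df_transformed.head() will then print the transformed DataFrame.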
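As an alternative to hand-written transformers, the same three steps (scaling, encoding the binary columns, one-hot encoding day) can be sketched with scikit-learn's built-in ColumnTransformer. This is a different technique from the classes above, not the answer's method. The column lists are the ones given in the comments on the question; the sample rows are made up for illustration, and OrdinalEncoder stands in for LabelEncoder because it accepts 2D input.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Illustrative rows only; not taken from the real dataset.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "size": [2, 3, 3, 2],
    "sex": ["Female", "Male", "Male", "Male"],
    "smoker": ["No", "No", "Yes", "No"],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch"],
    "day": ["Sun", "Sat", "Thur", "Fri"],
})

cols_to_scale = ["total_bill", "size"]
cols_to_encode = ["day"]
cols_to_encode_label = ["sex", "smoker", "time"]

# Each entry is (name, transformer, columns); every transformer
# receives its columns as a 2D block, which avoids the 1D/2D mismatch.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), cols_to_scale),
    ("label", OrdinalEncoder(), cols_to_encode_label),
    ("ohe", OneHotEncoder(handle_unknown="ignore"), cols_to_encode),
])

pipeline = Pipeline([("preprocess", preprocess)])
X_t = pipeline.fit_transform(df)

print(X_t.shape)
```

Unlike the custom classes, these encoders learn their categories in fit() and reuse them in transform(), so the same mapping is applied to training and test data.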


Can you edit the question and add the output of df.head()?
Instead, I have given a link to the dataset itself.
What are the values of cols_to_encode and cols_to_encode_label?
target = df['tip'], cols_to_scale = ['total_bill', 'size'], cols_to_encode = ['day'], cols_to_encode_label = ['sex', 'smoker', 'time']
Possible duplicate