Hello, two questions about sklearn.Pipeline and a custom transformer for time series


How should I modify the code below so that this works:

target, predicted = pipe.fit_predict(df)

Edit: my code:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
np.random.seed(1)

rows,cols = 100,1
data = np.random.randint(100, size = (rows,cols))
tidx = pd.date_range('2019-01-01', periods=rows, freq='20min') 
df = pd.DataFrame(data, columns=['num_orders'], index=tidx)
      


class MakeFeatures(BaseEstimator, TransformerMixin):

    def __init__(self, X, y = None, max_lag = None, rolling_mean_day = None, rolling_mean_month = None):
        self.X = X.resample('1H').sum()
        self.max_lag = max_lag
        self.rolling_mean_day = rolling_mean_day
        self.rolling_mean_month = rolling_mean_month

    def fit(self, X, y = None):
        return self

    def transform(self, X, y = None):
        data = pd.DataFrame(index = self.X.index)
        data['num_orders'] = self.X['num_orders']
        data['year'] = self.X.index.year
        data['month'] = self.X.index.month
        data['day'] = self.X.index.day
        data['dayofweek'] = self.X.index.dayofweek

        data['detrend'] = self.X.shift() - self.X

        if self.max_lag:
            for lag in range(1, self.max_lag + 1):
                data['lag_{}'.format(lag)] = data['detrend'].shift(lag)
        if self.rolling_mean_day:
            data['rolling_mean_24'] = data.detrend.shift().rolling(self.rolling_mean_day).mean()

        if self.rolling_mean_month:
            data['rolling_mean_24'] = data['detrend'].shift().rolling(self.rolling_mean_month).mean()

        if data['year'].mean() == data['year'][1]:
            data = data.drop('year', axis = 1)

        data = data.dropna()

        y = data.num_orders
        data = data.drop('num_orders', 1)

        return data, y

pipe = Pipeline([
                ('features', MakeFeatures(df, df, 2 , 24)),
                ('scaler', StandardScaler())  
    ])

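target, predicted = pipe.fit_transform(df, df)  # where 'target' is y, the output of the class
Output:

Each step of the pipeline works fine on its own.

I can run MakeFeatures(df, df) and StandardScaler() without any problem.

I can feed the output of MakeFeatures(df, df) into StandardScaler and it runs without an error.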

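You cannot use

target, predicted = pipe.fit_predict(df)

with the pipeline you defined, because the fit_predict() method is only available if the final estimator also implements it. The documentation says:

Valid only if the final estimator implements fit_predict.

Also, it returns only the predictions, so you cannot write
target, predicted =
and should write
predicted =
instead.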
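A minimal sketch of that restriction. The toy data and the two pipelines (`cluster_pipe`, ending in `KMeans`, which implements `fit_predict`, and `scale_pipe`, ending in `StandardScaler`, which does not) are made up for illustration and are not part of the question:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, for illustration only.
rng = np.random.RandomState(1)
X = rng.rand(20, 2)

# The final step is KMeans, which implements fit_predict.
cluster_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# The final step is StandardScaler, which has no fit_predict.
scale_pipe = Pipeline([("scaler", StandardScaler())])

has_cluster = hasattr(cluster_pipe, "fit_predict")
has_scale = hasattr(scale_pipe, "fit_predict")

# fit_predict returns one array of predictions, not a (target, predicted) pair.
predicted = cluster_pipe.fit_predict(X)
print(has_cluster, has_scale, predicted.shape)
```

The pipeline only exposes `fit_predict` when its last step has it, which is why a pipeline that ends in a scaler cannot be called that way.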

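You get the error

ValueError: setting an array element with a sequence.

because you are handing StandardScaler() the tuple (data, y) returned by MakeFeatures.transform, which contains a pandas Series, instead of a single 2-D array.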
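The failure can be reproduced without the pipeline. This is a small sketch with invented values (`features` and `target` are stand-ins for the two objects returned by `transform`); the exact wording of the error message may differ between NumPy versions:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-ins for the (data, y) pair returned by MakeFeatures.transform.
features = pd.DataFrame({"lag_1": [1.0, 2.0, 3.0], "lag_2": [4.0, 5.0, 6.0]})
target = pd.Series([0.0, 1.0, 0.0], name="num_orders")

# Passing the DataFrame alone works.
scaled = StandardScaler().fit_transform(features)

# Passing the (features, target) tuple cannot be converted
# into one 2-D numeric array, so the scaler raises.
try:
    StandardScaler().fit((features, target))
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__)
```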

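This happens because calling
pipe.fit_predict(df)
gives the pipeline only 'X', not 'y'. That is fine for the first step, 'MakeFeatures', which takes 'X' and returns 'data' and 'y', but the returned 'y' is never used by the pipeline, because 'y' has to be supplied in the fit_predict() call itself.

See the documentation of that method.

For the 'y' parameter it says:

Training targets. Must fulfill label requirements for all steps of the pipeline.

So that 'y' is passed as the 'y' of every step of the pipeline, but you did not supply one, so it is
None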
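To see which 'y' each step receives, here is a sketch using a hypothetical pass-through transformer, `RecordY`, that only remembers whether `fit` was given a target. The class and the random data are assumptions made for this illustration:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class RecordY(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through step that records whether fit received y."""

    def fit(self, X, y=None):
        self.saw_y_ = y is not None
        return self

    def transform(self, X):
        # Returns X only; a transformer has no way to hand a new y to later steps.
        return X


rng = np.random.RandomState(1)
X = rng.rand(10, 2)
y = rng.rand(10)

# y given to pipe.fit is forwarded to every step.
pipe = Pipeline([("record", RecordY()), ("model", LinearRegression())])
pipe.fit(X, y)
with_y = pipe.named_steps["record"].saw_y_

# Without y, every step sees None.
only_x = Pipeline([("record", RecordY())])
only_x.fit(X)
without_y = only_x.named_steps["record"].saw_y_

print(with_y, without_y)
```

Note that `fit` accepts `y=None`; a `fit(self, X)` signature would fail as soon as the pipeline is fitted with a target.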

So, basically, your current pipeline does the following:

makeF = MakeFeatures(df, 2 , 24)
transformed_df = makeF.fit_transform(df)

sc = StandardScaler()
sc.fit(transformed_df)
which results in
ValueError: setting an array element with a sequence.

I therefore suggest updating your code as follows:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LinearRegression

np.random.seed(1)

rows,cols = 100,1
data = np.random.randint(100, size = (rows,cols))
tidx = pd.date_range('2019-01-01', periods=rows, freq='20min') 
df = pd.DataFrame(data, columns=['num_orders'], index=tidx)
      
class MakeFeatures(BaseEstimator, TransformerMixin):

    def __init__(self, X, max_lag=None, rolling_mean_day=None, rolling_mean_month=None):
        # The data is resampled to hourly totals once, at construction time.
        self.X = X.resample('1h').sum()
        self.max_lag = max_lag
        self.rolling_mean_day = rolling_mean_day
        self.rolling_mean_month = rolling_mean_month

    def fit(self, X, y=None):
        # Nothing to learn; y is accepted to match the scikit-learn fit signature.
        return self

    def transform(self, X):
        # Features are built from the frame stored in __init__, not from the X argument.
        orders = self.X['num_orders']
        data = pd.DataFrame(index=self.X.index)
        data['num_orders'] = orders
        data['year'] = self.X.index.year
        data['month'] = self.X.index.month
        data['day'] = self.X.index.day
        data['dayofweek'] = self.X.index.dayofweek

        # Negative first difference of the hourly series.
        data['detrend'] = orders.shift() - orders

        if self.max_lag:
            for lag in range(1, self.max_lag + 1):
                data['lag_{}'.format(lag)] = data['detrend'].shift(lag)
        if self.rolling_mean_day:
            data['rolling_mean_24'] = data['detrend'].shift().rolling(self.rolling_mean_day).mean()

        # Same column name as above, so this overwrites it when both options are set.
        if self.rolling_mean_month:
            data['rolling_mean_24'] = data['detrend'].shift().rolling(self.rolling_mean_month).mean()

        # Drop 'year' when it is constant.
        if data['year'].nunique() == 1:
            data = data.drop('year', axis=1)

        data = data.dropna()

        y = data['num_orders']
        data = data.drop('num_orders', axis=1)

        return data, list(y)

# The pipeline now contains only steps that follow the transformer/estimator API.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('Model', LinearRegression())
])

# Feature creation happens outside the pipeline because it also returns y.
makeF = MakeFeatures(df, 2, 24)
makeF.fit(df)
data, y = makeF.transform(df)
pipe.fit(data, y)  # y is the target returned by MakeFeatures.transform
You can then use the pipeline to predict on the data and evaluate the performance, for example with r2_score:

from sklearn.metrics import r2_score

predictions = pipe.predict(data)
r2_score(y,predictions)
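If you would rather keep the feature step inside the pipeline, one possible variant (not the approach above) is a transformer that returns only X and keeps every row, with the target built outside the pipeline. This is a rough sketch under those assumptions: `LagFeatures`, its parameters and the use of `SimpleImputer` for the NaN values left by `shift()` are illustrative choices, not something taken from the question:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class LagFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: calendar and lag features from a datetime-indexed frame."""

    def __init__(self, column="num_orders", max_lag=2):
        # Only hyperparameters are stored here; the data arrives in fit/transform.
        self.column = column
        self.max_lag = max_lag

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = pd.DataFrame(index=X.index)
        out["dayofweek"] = X.index.dayofweek
        out["hour"] = X.index.hour
        for lag in range(1, self.max_lag + 1):
            out[f"lag_{lag}"] = X[self.column].shift(lag)
        # No rows are dropped, so X stays aligned with y.
        return out


np.random.seed(1)
tidx = pd.date_range("2019-01-01", periods=100, freq="20min")
df = pd.DataFrame(np.random.randint(100, size=(100, 1)), columns=["num_orders"], index=tidx)

hourly = df.resample("1h").sum()
y = hourly["num_orders"]  # the target is built outside the pipeline

pipe = Pipeline([
    ("features", LagFeatures(max_lag=2)),
    ("imputer", SimpleImputer(strategy="mean")),  # fills the NaN left by shift()
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(hourly, y)
predicted = pipe.predict(hourly)
print(predicted.shape)
```

Because the transformer takes its data in `transform` instead of `__init__`, the same fitted pipeline can also be applied to new data.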

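Comments:

Hi, your question is hard to understand as it stands. Could you provide a sample of your data, or some dummy data, so that your code can be tested against it?

@Kim Tang, thank you for the comment. At least now I understand where all the minuses came from. This is my first question and I am still learning how to ask.

Welcome to Stack Overflow! Take a look at the tutorial here, then update your question with a minimal reproducible example of code that reproduces your problem, so that others can help you better.

Let's see if this works. Thanks for the guidance.

Kim Tang, thanks for the answer, but StandardScaler and Pipeline both accept a label argument, so if I change the class to MakeFeatures(self, X, y, ...) and then call pipe.fit_transform(df, df) (where pipe = Pipeline([('features', MakeFeatures(df, df)), ('scaler', StandardScaler())])), why does it still not work?

I don't see your complete updated code here, but from what I can see, I think the error is that 'df' is used as the 'y' argument when StandardScaler is called, because you called pipe.fit_transform(df, df). But StandardScaler cannot use 'df' as 'y'.

First of all, thank you for your patience. I am not sure about the policy, but my question is different from the main one you already answered. StandardScaler().fit_transform(df, df) works fine, and MakeFeatures(df, df) works fine too. But together, with pipe = Pipeline([('features', MakeFeatures(df, df)), ('scaler', StandardScaler())]) and pipe.fit_transform(df, df), it does not work.

OK, then just ask a new question about that problem, including code and some data to reproduce and understand it. That way others can help you too.

Thanks, will do.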