Machine learning: Hi, two questions about sklearn.Pipeline with a custom transformer for time series
Tags: machine-learning, scikit-learn, python-3.7, pipeline, transformer
How should I modify the code below so that it works:
target, predicted = pipe.fit_predict(df)
Edit:
My code:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler  # needed for the 'scaler' step

np.random.seed(1)
rows, cols = 100, 1
data = np.random.randint(100, size=(rows, cols))
tidx = pd.date_range('2019-01-01', periods=rows, freq='20min')
df = pd.DataFrame(data, columns=['num_orders'], index=tidx)

class MakeFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, X, y=None, max_lag=None, rolling_mean_day=None, rolling_mean_month=None):
        # The data is resampled to hourly sums and stored on the instance
        self.X = X.resample('1H').sum()
        self.max_lag = max_lag
        self.rolling_mean_day = rolling_mean_day
        self.rolling_mean_month = rolling_mean_month

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        data = pd.DataFrame(index=self.X.index)
        data['num_orders'] = self.X['num_orders']
        data['year'] = self.X.index.year
        data['month'] = self.X.index.month
        data['day'] = self.X.index.day
        data['dayofweek'] = self.X.index.dayofweek
        data['detrend'] = self.X.shift() - self.X
        if self.max_lag:
            for lag in range(1, self.max_lag + 1):
                data['lag_{}'.format(lag)] = data['detrend'].shift(lag)
        if self.rolling_mean_day:
            data['rolling_mean_24'] = data.detrend.shift().rolling(self.rolling_mean_day).mean()
        if self.rolling_mean_month:
            data['rolling_mean_24'] = data['detrend'].shift().rolling(self.rolling_mean_month).mean()
        # Drop the year column when it is constant
        if data['year'].mean() == data['year'].iloc[1]:
            data = data.drop('year', axis=1)
        data = data.dropna()
        y = data.num_orders
        data = data.drop('num_orders', axis=1)
        # Returns a (features, target) tuple, not a single array
        return data, y

pipe = Pipeline([
    ('features', MakeFeatures(df, df, 2, 24)),
    ('scaler', StandardScaler())
])
target, predicted = pipe.fit_transform(df, df)  # where 'Target' is y - the output from the Class
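Output:
Each step inside the pipeline works fine on its own.
I can run MakeFeatures(df, df) and StandardScaler() without any problem.
I can feed the output of MakeFeatures(df, df) into StandardScaler and it raises no error.

You cannot use
target, predicted = pipe.fit_predict(df)
with the pipeline you defined, because the fit_predict() method can only be used if the final estimator also implements it. The documentation says:
Valid only if the final estimator implements fit_predict.
Moreover, it returns only the predictions, so you cannot write target, predicted = ... and should write predicted = ... instead.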
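As a small illustration of the point above, the sketch below checks whether a pipeline exposes fit_predict depending on its final estimator. The data and the choice of KMeans as a final step that implements fit_predict are assumptions made for this example, not part of the original question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative random data (not the data from the question)
rng = np.random.RandomState(0)
X = rng.rand(30, 2)

# LinearRegression has no fit_predict, so the pipeline does not expose it either
reg_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
reg_has_fit_predict = hasattr(reg_pipe, "fit_predict")

# KMeans implements fit_predict, so the pipeline exposes it as well
cluster_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
cluster_has_fit_predict = hasattr(cluster_pipe, "fit_predict")

# fit_predict returns a single array of predictions, not a (target, predicted) pair
labels = cluster_pipe.fit_predict(X)
print(reg_has_fit_predict, cluster_has_fit_predict, labels.shape)
```

For a regression pipeline, the usual route is pipe.fit(X, y) followed by pipe.predict(X).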
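You are getting the error
ValueError: setting an array element with a sequence.
because StandardScaler() is being fed a pandas time series (the tuple returned by MakeFeatures.transform(), which contains the target Series) rather than a single 2-D array.
This happens because the call pipe.fit_predict(df) gives the pipeline only "X" and no "y". That is fine for the first component of your pipeline, MakeFeatures, since it takes "X" and returns "data" and "y". But that "y" is never used by the pipeline, because "y" has to be supplied at the start, in the fit_predict() call.
See the documentation of that method. For the "y" parameter it states:
Training targets. Must fulfill label requirements for all steps of the pipeline.
So "y" is passed as "y" to every part of the pipeline, but yours is undefined and therefore None.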
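The way "y" travels through a pipeline can be sketched with a pass-through step that simply remembers what it was fitted with. RecordY is a hypothetical helper written for this illustration; it is not part of scikit-learn or of the original code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class RecordY(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through step that stores the y it was fitted with."""

    def fit(self, X, y=None):
        self.seen_y_ = y
        return self

    def transform(self, X):
        # Only X is returned; a transformer has no channel for handing back a new y
        return X


X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# The y given to pipe.fit is handed unchanged to every step
pipe = Pipeline([
    ("first", RecordY()),
    ("second", RecordY()),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
first_saw_y = np.array_equal(pipe.named_steps["first"].seen_y_, y)
second_saw_y = np.array_equal(pipe.named_steps["second"].seen_y_, y)

# When no y is given, every step receives None
pipe_no_y = Pipeline([("first", RecordY()), ("second", RecordY())])
pipe_no_y.fit(X)
y_was_none = pipe_no_y.named_steps["first"].seen_y_ is None
print(first_saw_y, second_saw_y, y_was_none)
```

This is why a "y" computed inside transform() never reaches the later steps: the pipeline forwards only the value returned by transform() as the next "X".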
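So what basically happens with your current pipeline is the following:
makeF = MakeFeatures(df, 2 , 24)
transformed_df = makeF.fit_transform(df)
sc = StandardScaler()
sc.fit(transformed_df)
which results in ValueError: setting an array element with a sequence.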
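The failure can be reproduced without the custom class. In the sketch below, the arrays are made-up stand-ins for the "data" and "y" returned by transform(). The exact error message may differ between NumPy and scikit-learn versions, so the example only records that an error is raised.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-ins for the (data, y) pair returned by the transformer (illustrative values)
features = np.random.RandomState(0).rand(5, 3)
target = np.arange(5, dtype=float)

# What the scaler expects: a single 2-D array
scaled = StandardScaler().fit_transform(features)

# What it gets when the previous step returns a tuple
try:
    StandardScaler().fit_transform((features, target))
    error = None
except (ValueError, TypeError) as exc:
    error = exc

print(scaled.shape)
print(type(error).__name__, error)
```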
Therefore I would suggest updating your code as follows:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

np.random.seed(1)
rows, cols = 100, 1
data = np.random.randint(100, size=(rows, cols))
tidx = pd.date_range('2019-01-01', periods=rows, freq='20min')
df = pd.DataFrame(data, columns=['num_orders'], index=tidx)

class MakeFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, X, max_lag=None, rolling_mean_day=None, rolling_mean_month=None):
        # The data is resampled to hourly sums and stored on the instance
        self.X = X.resample('1H').sum()
        self.max_lag = max_lag
        self.rolling_mean_day = rolling_mean_day
        self.rolling_mean_month = rolling_mean_month

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        data = pd.DataFrame(index=self.X.index)
        data['num_orders'] = self.X['num_orders']
        data['year'] = self.X.index.year
        data['month'] = self.X.index.month
        data['day'] = self.X.index.day
        data['dayofweek'] = self.X.index.dayofweek
        data['detrend'] = self.X.shift() - self.X
        if self.max_lag:
            for lag in range(1, self.max_lag + 1):
                data['lag_{}'.format(lag)] = data['detrend'].shift(lag)
        if self.rolling_mean_day:
            data['rolling_mean_24'] = data.detrend.shift().rolling(self.rolling_mean_day).mean()
        if self.rolling_mean_month:
            data['rolling_mean_24'] = data['detrend'].shift().rolling(self.rolling_mean_month).mean()
        # Drop the year column when it is constant
        if data['year'].mean() == data['year'].iloc[1]:
            data = data.drop('year', axis=1)
        data = data.dropna()
        y = data.num_orders
        data = data.drop('num_orders', axis=1)
        return data, list(y)

# The pipeline now contains only steps that take one array and return one array
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('Model', LinearRegression())
])

# Feature creation runs outside the pipeline, so both data and y are available
makeF = MakeFeatures(df, 2, 24)
makeF.fit(df)
data, y = makeF.transform(df)
pipe.fit(data, y)  # where 'Target' is y - the output from the Class
You can then use the pipeline to predict on the data and evaluate the performance, for example with r2_score:
from sklearn.metrics import r2_score
predictions = pipe.predict(data)
r2_score(y,predictions)
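The answer above keeps MakeFeatures outside the pipeline. As an alternative, a feature step can live inside the pipeline if it takes hyperparameters only in __init__, works on the X passed to transform(), returns a single frame, and keeps the row count unchanged so that X and y stay aligned. The following is a minimal sketch under those assumptions. LagFeatures, its parameters, the reduced feature set, and filling the leading gaps with 0 are illustrative choices, not the original author's method.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class LagFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: calendar and lag features from one column."""

    def __init__(self, column="num_orders", max_lag=2):
        # Only hyperparameters are stored here, never the data itself
        self.column = column
        self.max_lag = max_lag

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = pd.DataFrame(index=X.index)
        out["day"] = X.index.day
        out["dayofweek"] = X.index.dayofweek
        out["hour"] = X.index.hour
        for lag in range(1, self.max_lag + 1):
            # Only past values are used, so the current target does not leak in
            out[f"lag_{lag}"] = X[self.column].shift(lag)
        # Fill instead of dropping rows, so the output has as many rows as y
        return out.fillna(0.0)


np.random.seed(1)
tidx = pd.date_range("2019-01-01", periods=100, freq="20min")
df = pd.DataFrame(np.random.randint(100, size=(100, 1)),
                  columns=["num_orders"], index=tidx)

# Resampling changes the number of rows, so it is done before the pipeline
hourly = df.resample("60min").sum()
y = hourly["num_orders"]

pipe = Pipeline([
    ("features", LagFeatures(max_lag=2)),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(hourly, y)
predictions = pipe.predict(hourly)
print(predictions.shape)
```

Filling the first lagged rows with 0 is a simplification. Trimming those rows from both the resampled frame and the target before calling fit is another option.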
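Comments:
Hi, your question is hard to understand as it stands. Could you provide a sample of your data, or dummy data, so that your code can be tested with it?
@Kim Tang, thank you for your comment. At least now I understand where all the downvotes came from. This is my first question and I am still learning how to ask.
Welcome to Stack Overflow! Take a look at the tutorial here, then update your question with a "minimal reproducible example" of code that reproduces your problem, so that others can help you better.
Let's see if this works. Thanks for the guidance.
Kim Tang, thanks for your answer, but StandardScaler and Pipeline both accept a label argument. So if I modify the class to MakeFeatures(self, X, y, ...) and then call pipe.fit_transform(df, df) (where pipe = Pipeline([('features', MakeFeatures(df, df)), ('scaler', StandardScaler())])), why does it still not work?
I don't see your complete updated code here, but from looking at it I think the error is that StandardScaler will be called with 'df' as the 'y' argument, because you called pipe.fit_transform(df, df). But StandardScaler cannot use 'df' as 'y'.
First of all, thank you for your patience. I am not sure about the policy, but my question is different from the main question you already answered. StandardScaler().fit_transform(df, df) works fine, and MakeFeatures(df, df) works fine too. But together, with pipe = Pipeline([('features', MakeFeatures(df, df)), ('scaler', StandardScaler())]), pipe.fit_transform(df, df) does not work.
Well, just open a new question for your problem, including the code and some data to reproduce and understand it. That way others can help you too.
Thanks, will do.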