Python: ValueError when fitting a model

I ran this code just to see how a linear regression model works in Python:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt  # needed for the plt.* calls at the end

train = pd.read_csv('data/train.csv', parse_dates=[0])
test = pd.read_csv('data/test.csv', parse_dates=[0])

print train.head()

#Feature engineering
temp_train = pd.DatetimeIndex(train['datetime'])
train['year'] = temp_train.year
train['month'] = temp_train.month
train['hour'] = temp_train.hour
train['weekday'] = temp_train.weekday

temp_test = pd.DatetimeIndex(test['datetime'])
test['year'] = temp_test.year
test['month'] = temp_test.month
test['hour'] = temp_test.hour
test['weekday'] = temp_test.weekday

#Define features vector
features = ['season', 'holiday', 'workingday', 'weather',
            'temp', 'atemp', 'humidity', 'windspeed', 'year',
            'month', 'weekday', 'hour']

#The evaluation metric is the RMSE in the log domain,
#so we should transform the target columns into log domain as well.
for col in ['casual', 'registered', 'count']:
    train['log-' + col] = train[col].apply(lambda x: np.log1p(x))

#Split train data set into training and validation sets
training, validation = train[:int(0.8*len(train))], train[int(0.8*len(train)):]

# Create a linear model
X = sm.add_constant(training[features])
model = sm.OLS(training['log-count'],X) # OLS stands for Ordinary Least Squares
f = model.fit()

ypred = f.predict(sm.add_constant(validation[features]))
print(ypred)

plt.figure();
plt.plot(validation[features], ypred, 'o', validation[features], validation['log-count'], 'b-');
plt.title('blue: true,   red: OLS');

The following error message appears. What does it mean, and how can I fix it?

Traceback (most recent call last):
  File "C:/TestModel/linear_regression.py", line 99, in <module>
    ypred = f.predict(sm.add_constant(validation[features]))
  File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 749, in predict
    return self.model.predict(self.params, exog, *args, **kwargs)
  File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 359, in predict
    return np.dot(exog, params)
ValueError: shapes (2178,12) and (13,) not aligned: 12 (dim 1) != 13 (dim 0)

This looks like a design problem with the add_constant function for this use case.

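From the docstring:

"For ndarrays and pandas.DataFrames, checks to make sure a constant is not already included. If there is at least one column of ones then the original object is returned."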

I think it was defined this way to avoid estimating with a singular design matrix, but predict would also work with a singular matrix.

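My guess is that your validation data has a column in which all values are the same; for example, the rows may all come from the same year. add_constant then treats that column as an existing constant and returns the data unchanged, so the validation matrix has 12 columns while the fitted model has 13 parameters. If the constant column is intended, you need to add the constant to the DataFrame manually.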
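The mismatch in the traceback can be reproduced without the data set. The following is only a minimal sketch: it assumes 13 fitted parameters (the constant plus the 12 features) and a validation matrix to which no constant column was added. The arrays are made-up stand-ins, not the real data.

```python
import numpy as np

# Hypothetical stand-in for the 13 fitted coefficients (const + 12 features)
params = np.ones(13)
# Hypothetical validation matrix that came back without a constant column
exog_without_const = np.zeros((5, 12))

try:
    np.dot(exog_without_const, params)
    mismatch = False
except ValueError as err:
    # Same kind of "shapes not aligned" error as in the traceback
    mismatch = True
    print(err)

# Prepending a column of ones restores the expected 13 columns
exog_with_const = np.column_stack(
    [np.ones(len(exog_without_const)), exog_without_const]
)
pred = np.dot(exog_with_const, params)
print(exog_with_const.shape, pred.shape)
```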

It would be better if add_constant had an option to change this behavior.

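Thanks. Could you give a small example of how to add a constant to the validation set? Do you mean a new column? I tried this (it gives the same error):

    validation['intercept'] = pd.Series([0 for x in range(len(validation.index))], index=validation.index)
    ypred = f.predict(validation)

This almost seems to work, but it complains about a copy of a slice:

    validnew = validation[features]
    validnew['intercept'] = pd.Series([0 for x in range(len(validation.index))], index=validation.index)

This solved the problem:

    validnew = validation[features]
    validnew.insert(0, 'const', 1)
    ypred = f.predict(validnew)

Another possibility is to add the constant before doing the training/validation split and to add 'const' to the features list. Then you always have the full design matrix for all splits.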
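The last suggestion, adding the constant before the split, might look roughly like this. It is a sketch on a made-up five-row frame: the column names follow the question, but the values and the shortened features list are only for illustration.

```python
import pandas as pd

# Made-up stand-in for the training frame from the question
train = pd.DataFrame({
    "temp": [9.84, 9.02, 9.02, 9.84, 29.52],
    "year": [2011, 2011, 2011, 2012, 2012],
})
features = ["temp", "year"]

# Add the intercept column once, before splitting
train.insert(0, "const", 1.0)
features = ["const"] + features

# Same 80/20 split as in the question
split = int(0.8 * len(train))
training, validation = train[:split], train[split:]

# Both splits now carry identical columns, constant included
print(training[features].shape, validation[features].shape)
```

Because 'const' is an ordinary column here, add_constant is no longer needed, and a feature that is constant within one split cannot change the number of columns.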

print training.head()
             datetime  season  holiday  workingday  weather  temp   atemp  \
0 2011-01-01 00:00:00       1        0           0        1  9.84  14.395   
1 2011-01-01 01:00:00       1        0           0        1  9.02  13.635   
2 2011-01-01 02:00:00       1        0           0        1  9.02  13.635   
3 2011-01-01 03:00:00       1        0           0        1  9.84  14.395   
4 2011-01-01 04:00:00       1        0           0        1  9.84  14.395   

   humidity  windspeed  casual  registered  count  year  month  hour  weekday  \
0        81          0       3          13     16  2011      1     0        5   
1        80          0       8          32     40  2011      1     1        5   
2        80          0       5          27     32  2011      1     2        5   
3        75          0       3          10     13  2011      1     3        5   
4        75          0       0           1      1  2011      1     4        5   

   log-casual  log-registered  log-count  
0    1.386294        2.639057   2.833213  
1    2.197225        3.496508   3.713572  
2    1.791759        3.332205   3.496508  
3    1.386294        2.397895   2.639057  
4    0.000000        0.693147   0.693147  


print validation.head()
                datetime  season  holiday  workingday  weather   temp   atemp  \
8708 2012-08-05 05:00:00       3        0           0        1  29.52  34.850   
8709 2012-08-05 06:00:00       3        0           0        1  29.52  34.850   
8710 2012-08-05 07:00:00       3        0           0        1  30.34  35.605   
8711 2012-08-05 08:00:00       3        0           0        1  31.16  36.365   
8712 2012-08-05 09:00:00       3        0           0        1  32.80  38.635   

      humidity  windspeed  casual  registered  count  year  month  hour  \
8708        74    16.9979       1          18     19  2012      8     5   
8709        79    16.9979       7          12     19  2012      8     6   
8710        74    19.9995      18          50     68  2012      8     7   
8711        66    22.0028      27          81    108  2012      8     8   
8712        59    23.9994      61         168    229  2012      8     9   

      weekday  log-casual  log-registered  log-count  
8708        6    0.693147        2.944439   2.995732  
8709        6    2.079442        2.564949   2.995732  
8710        6    2.944439        3.931826   4.234107  
8711        6    3.332205        4.406719   4.691348  
8712        6    4.127134        5.129899   5.438079