Regression statistics on subsets of a Python DataFrame
I have a DataFrame made up of several years of data and multiple environmental parameters. The DataFrame looks like this:
import pandas as pd
import numpy as np
from scipy import stats
Parameters= ['Temperature','Rain', 'Pressure', 'Humidity']
nrows = 365
daterange = pd.date_range('1/1/2019', periods=nrows, freq='D')
Vals = pd.DataFrame(np.random.randint(10, 150, size=(nrows, len(Parameters))), columns=Parameters)
Vals = Vals.set_index(daterange)
print(Vals)
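I have created a month-name column with Vals['Month'] = Vals.index.month_name().str.slice(stop=3), and I want to compute the slope of a regression between two variables, Rain and Temperature, for each month and collect the results in a DataFrame. I tried the following: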
pd.DataFrame.from_dict({y:stats.linregress(Vals['Temperature'], Vals['Rain'])[:2] for y, x in
Vals.groupby('Month')},'index').\
rename(columns={0:'Slope',1:'Intercept'})
But the output is not what I expected. I want regression statistics for each month, but the result looks like this:
Slope Intercept
Apr -0.016868 81.723291
Aug -0.016868 81.723291
Dec -0.016868 81.723291
Feb -0.016868 81.723291
Jan -0.016868 81.723291
Jul -0.016868 81.723291
Jun -0.016868 81.723291
Mar -0.016868 81.723291
May -0.016868 81.723291
Nov -0.016868 81.723291
Oct -0.016868 81.723291
Sep -0.016868 81.723291
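The regression appears to be computed from the whole dataset and then stored under every month's index. How can I compute the per-month statistics with a similar approach?

Here is some code I have used in the past. I used scikit-learn's LinearRegression (from sklearn.linear_model) because I find it easier to use, but you can switch to scipy.stats if you prefer. This code uses apply and performs the linear regression inside the function linear_model.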
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
def linear_model(group):
    # Fit Rain as a linear function of Temperature for one group (one month)
    x, y = group.Temperature.values.reshape(-1, 1), group.Rain.values.reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    m = model.coef_
    i = model.intercept_
    r_sqd = model.score(x, y)
    # np.squeeze turns the 1x1 arrays returned by sklearn into scalars
    return pd.Series({'slope': np.squeeze(m), 'intercept': np.squeeze(i),
                      'r_sqd': np.squeeze(r_sqd)})
Parameters= ['Temperature','Rain', 'Pressure', 'Humidity']
nrows = 365
daterange = pd.date_range('1/1/2019', periods=nrows, freq='D')
Vals = pd.DataFrame(np.random.randint(10, 150, size=(nrows, len(Parameters))), columns=Parameters)
Vals = Vals.set_index(daterange)
Vals.groupby(Vals.index.month).apply(linear_model)
Result:
Vals.groupby(Vals.index.month).apply(linear_model)
Out[15]:
slope intercept r_sqd
1 -0.06334408633973578 80.98723450432585 0.003480
2 -0.1393001910724248 85.40023995141723 0.020435
3 -0.0535505295232336 69.09958112535743 0.003481
4 0.23187299827488306 57.866651248302546 0.048741
5 -0.04813654915436082 74.31295680099751 0.001867
6 0.31976921541526526 48.496345031992746 0.089027
7 -0.1979417421554613 94.84215558468942 0.052023
8 0.22239030327077666 68.62700822940076 0.061849
9 0.054607306452220644 72.0988798639258 0.002877
10 -0.07841007716276265 91.9211204014171 0.006085
11 -0.13517307855088803 100.44769438307809 0.016045
12 -0.1967407738498068 101.7393002049148 0.042255
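Since the answer mentions that scipy.stats could be used instead, here is a minimal sketch of the same apply pattern built on stats.linregress. The function name linregress_stats, the lowercase variable names, and the fixed random seed are assumptions added for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy import stats


def linregress_stats(group):
    # Regress Rain on Temperature within a single group (one month).
    res = stats.linregress(group["Temperature"], group["Rain"])
    return pd.Series({
        "slope": res.slope,
        "intercept": res.intercept,
        "r_sqd": res.rvalue ** 2,  # squared correlation coefficient
    })


# Same synthetic layout as in the question; the seed is an assumption
# made only so that repeated runs give the same numbers.
rng = np.random.default_rng(0)
parameters = ["Temperature", "Rain", "Pressure", "Humidity"]
nrows = 365
daterange = pd.date_range("1/1/2019", periods=nrows, freq="D")
vals = pd.DataFrame(
    rng.integers(10, 150, size=(nrows, len(parameters))),
    columns=parameters,
    index=daterange,
)

# Group by calendar month number and fit one regression per month.
monthly = vals.groupby(vals.index.month)[["Temperature", "Rain"]].apply(linregress_stats)
print(monthly)
```

The result should have one row per month and the same three columns as the sklearn version, without needing to reshape the inputs into 2-D arrays.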
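
Your attempt was close. When you use a for loop with a groupby object, each iteration yields the name of the group and the group's data. The typical convention is: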
for name, group in Vals.groupby('Month'):
    # do stuff with group
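Because you bound the name to y and the group to x, you can change Vals to x inside the linregress call, and the code will produce the same per-month results as above: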
pd.DataFrame.from_dict({y:stats.linregress(x['Temperature'], x['Rain'])[:2] for y, x in
Vals.groupby('Month')},'index').\
rename(columns={0:'Slope',1:'Intercept'})
Slope Intercept
Apr 0.231873 57.866651
Aug 0.222390 68.627008
Dec -0.196741 101.739300
Feb -0.139300 85.400240
Jan -0.063344 80.987235
Jul -0.197942 94.842156
Jun 0.319769 48.496345
Mar -0.053551 69.099581
May -0.048137 74.312957
Nov -0.135173 100.447694
Oct -0.078410 91.921120
Sep 0.054607 72.098880
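To make the difference between the two versions concrete, the following sketch runs both comprehensions on the same data and counts how many distinct slopes each one produces. The lowercase variable names, the name/group loop variables, and the fixed seed are assumptions for illustration; the two stats.linregress calls mirror the ones in the question and in this answer.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data in the same layout as the question (seed is an assumption).
rng = np.random.default_rng(0)
parameters = ["Temperature", "Rain", "Pressure", "Humidity"]
daterange = pd.date_range("1/1/2019", periods=365, freq="D")
vals = pd.DataFrame(
    rng.integers(10, 150, size=(365, len(parameters))),
    columns=parameters,
    index=daterange,
)
vals["Month"] = vals.index.month_name().str.slice(stop=3)

# Original attempt: the loop variables are ignored, so the full frame
# is regressed once per month and every entry is identical.
buggy = {
    name: stats.linregress(vals["Temperature"], vals["Rain"])[:2]
    for name, group in vals.groupby("Month")
}

# Corrected version: regress only the rows belonging to the current group.
fixed = {
    name: stats.linregress(group["Temperature"], group["Rain"])[:2]
    for name, group in vals.groupby("Month")
}

buggy_slopes = {slope for slope, _ in buggy.values()}
fixed_slopes = {slope for slope, _ in fixed.values()}
print("distinct slopes, original attempt:", len(buggy_slopes))
print("distinct slopes, corrected version:", len(fixed_slopes))
```

The original attempt yields a single repeated slope because nothing inside the comprehension depends on the loop variables, while the corrected version fits each month's rows separately.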
This worked for me. Thanks for pointing out the problem, it was very helpful :)