Python: probability prediction based on frequency of occurrence
I have a rainfall time series for 2011-2013 in which rain is coded as 1 (rain) and 0 (no rain). The raw data are hourly, from 10 am to 3 pm each day. I do not want to predict the amount of rainfall for 2014; instead, I want to predict the chance of rain for the whole year at the same hourly intervals, based on the occurrences of 1 or 0 in the rain column. At the moment I estimate the chance of rain by counting the occurrences of 1 and 0 with the following code:
import pandas as pd
b = {'year':[2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,
2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,
2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2013],
'month': [1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12],
'rain':[1,0,0,0,1,1,0,1,1,0,0,1,0,0,1,0,0,0,1,1,1,1,1,0,0,1,1,0,1,0,1,0,1,0,1,0]}
b = pd.DataFrame(b,columns = ['year','month','rain'])
def X(b):
    if (b['month'] == 1):
        return 'Jan'
    elif (b['month']==2):
        return 'Feb'
    elif (b['month']==3):
        return 'Mar'
    elif (b['month']==4):
        return 'Apr'
    elif (b['month']==5):
        return 'May'
    elif (b['month']==6):
        return 'Jun'
    elif (b['month']==7):
        return 'Jul'
    elif (b['month']==8):
        return 'Aug'
    elif (b['month']==9):
        return 'Sep'
    elif (b['month']==10):
        return 'Oct'
    elif (b['month']==11):
        return 'Nov'
    elif (b['month']==12):
        return 'Dec'
b['X'] = b.apply(X,axis =1)
mask_x = (b['X']=='Jul')
mask_y = b['rain'].loc[mask_x]
mask_y.value_counts()
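I don't think this approach is suitable for large datasets. Can anyone suggest an efficient and robust way to predict the chance of rain from this kind of dataset?

The data below is created by randomly choosing from [0,1] for every hour. We group by the time components of the date column to get the number of rain cases (sum) and the total number of cases (size). You can then compute the rain rate as the number of rain cases divided by the total count. I followed your code in creating the year, month and abbreviated month name, but that is not actually necessary.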
import pandas as pd
import random
random.seed(20200817)
# hourly timestamps; recent pandas versions deprecate the 'H' alias in favour of 'h'
date_rng = pd.date_range('2013-01-01', '2016-01-01', freq='h')
rain = random.choices([0,1], k=len(date_rng))
b = pd.DataFrame({'date':pd.to_datetime(date_rng), 'rain':rain})
# per (month, day, hour): 'sum' counts the rainy hours, 'size' counts all observations
hour_rain = b.groupby([b.date.dt.month, b.date.dt.day, b.date.dt.hour])['rain'].agg(['sum','size'])
hour_rain.index.names = ['month','day','hour']
hour_rain.reset_index()
      month  day  hour  sum  size
0         1    1     0    0     4
1         1    1     1    2     3
2         1    1     2    3     3
3         1    1     3    1     3
4         1    1     4    1     3
...     ...  ...   ...  ...   ...
8755     12   31    19    2     3
8756     12   31    20    2     3
8757     12   31    21    2     3
8758     12   31    22    0     3
8759     12   31    23    0     3
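
To turn these counts into a probability, one option is to divide sum by size for each (month, day, hour) slot. This is only a minimal sketch on the same kind of synthetic data; the column name p_rain and the 20 January 15:00 lookup are illustrative choices and not part of the original answer:

```python
import random

import pandas as pd

# Synthetic 0/1 rain flags for every hour (random values, for illustration only).
random.seed(20200817)
date_rng = pd.date_range('2013-01-01', '2016-01-01', freq='h')
b = pd.DataFrame({'date': date_rng,
                  'rain': random.choices([0, 1], k=len(date_rng))})

# Count rainy hours ('sum') and all observations ('size') per calendar slot.
keys = [b['date'].dt.month, b['date'].dt.day, b['date'].dt.hour]
hour_rain = b.groupby(keys)['rain'].agg(['sum', 'size'])
hour_rain.index.names = ['month', 'day', 'hour']

# Empirical probability of rain = rainy hours / total observations.
# 'p_rain' is a hypothetical column name chosen for this sketch.
hour_rain['p_rain'] = hour_rain['sum'] / hour_rain['size']

# Look up a single slot, e.g. 20 January at 15:00.
print(hour_rain.loc[(1, 20, 15)])
```

With only three years of history each slot holds three or four observations, so the estimate can only move in coarse steps such as 0, 1/3, 2/3 or 1; more years of data give finer steps.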
What I want to do looks like this:
import pandas as pd
import numpy as np
import random
random.seed(20200817)
date_rng = pd.date_range('2013-01-01', '2015-12-31', freq='h')
rain = random.choices([0,1], k=len(date_rng))
b = pd.DataFrame({'date':pd.to_datetime(date_rng), 'rain':rain})
b['year'] = b['date'].dt.year
b['month'] = b['date'].dt.month
b['day'] = b['date'].dt.day
b['hour'] = b['date'].dt.hour
b['X'] = b['date'].dt.strftime('%b')
b['hour']= b['hour'].astype(str).str.zfill(2)
b['day']= b['day'].astype(str).str.zfill(2)
# Join the month abbreviation, day and hour into one key, e.g. 'Jan2015'
b['var'] = b['X']+b['day'].astype(str)+b['hour'].astype(str)
cols = b.columns.tolist()
cols = cols[-1:] + cols[:-1]
b = b[cols]
# drop the unwanted columns
b = b.drop(["date","month","X","hour","day","year"], axis=1)
# now let's say I want to predict the chance of rain for 20 January at 15:00
mask_x = (b['var']=='Jan2015')
mask_y = b['rain'].loc[mask_x]
mask_y.value_counts()
output:
0 2
1 1
# i.e. the chance of rain is 33.33% and the chance of no rain is 66.67%
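When I run this on a large dataset (more than 20 years), I feel it does not perform very well.

How about simply taking the mean over the same period? I have also corrected the aggregation specification in the code.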
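
The mean-based suggestion could be sketched roughly as follows. Because rain is coded 0/1, the group mean is the fraction of rainy hours, and a single groupby avoids building string keys row by row. The history here is synthetic (random 2011-2013 values), and the names p_rain, target and forecast are assumptions made for this example only:

```python
import random

import pandas as pd

# Synthetic hourly history for 2011-2013 (random 0/1 flags, illustration only).
random.seed(20200817)
date_rng = pd.date_range('2011-01-01', '2013-12-31 23:00', freq='h')
b = pd.DataFrame({'date': date_rng,
                  'rain': random.choices([0, 1], k=len(date_rng))})

# Split the timestamp into the calendar parts used as the grouping key.
for part in ('month', 'day', 'hour'):
    b[part] = getattr(b['date'].dt, part)

# Mean of a 0/1 column = share of rainy observations in each slot.
p_rain = (b.groupby(['month', 'day', 'hour'])['rain']
            .mean()
            .rename('p_rain')
            .reset_index())

# Hypothetical target period: every hour of 2014.
target = pd.DataFrame(
    {'date': pd.date_range('2014-01-01', '2014-12-31 23:00', freq='h')})
for part in ('month', 'day', 'hour'):
    target[part] = getattr(target['date'].dt, part)

# Attach the historical rain frequency to each target hour.
forecast = target.merge(p_rain, on=['month', 'day', 'hour'], how='left')

# Chance of rain on 20 January 2014 at 15:00.
print(forecast.loc[forecast['date'] == '2014-01-20 15:00', ['date', 'p_rain']])
```

The grouping runs once over the whole table, so the cost grows with the number of rows rather than with the number of slots queried, which should matter for a history of 20 years or more.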