Python pandas cut on a non-bin-boundary condition
I have count rate data, and I want a minimum number of counts in each bin. The challenge is where to break the bin boundaries: with the approach below, no bin is guaranteed to reach the minimum number of counts. Here is the example I am working with:
import datetime

import numpy as np
import pandas as pd

# set up a time index for the dataframe
r_t = pd.date_range(start=datetime.datetime(2020, 1, 1, 0),
                    end=datetime.datetime(2020, 1, 1, 0, 5),
                    freq=datetime.timedelta(milliseconds=15))
# random Poisson statistics in each bin
np.random.seed(8675309)
df = pd.DataFrame({'cts': np.random.poisson(20, size=len(r_t))}, index=r_t)
# use a cumsum to define where to break; this is where more thinking is needed
df['cumsum'] = df['cts'].cumsum()
# do the cut; this does not give the answer I want
df['cut'] = pd.cut(df['cumsum'],
                   pd.interval_range(start=0,
                                     end=df['cumsum'].max() + 1,
                                     freq=50,
                                     closed='right'),
                   include_lowest=True)
print(df.head(10))
# mess around to make this into a dataframe of the shape I need
df = df.reset_index()
df = df.rename(columns={'index': 'time'})
# average the timestamps in each bin by going through Julian dates
df['julday'] = pd.DatetimeIndex(df['time']).to_julian_date()
df_cut = df.groupby('cut').agg({'julday': 'mean', 'cts': 'sum'})
df_cut['time'] = pd.to_datetime(df_cut['julday'], origin='julian', unit='D')
df_cut = df_cut.set_index('time')
df_cut.head(10)
Initial input data:
cts cumsum cut
2020-01-01 00:00:00.000 18 18 (0, 50]
2020-01-01 00:00:00.015 19 37 (0, 50]
2020-01-01 00:00:00.030 16 53 (50, 100]
2020-01-01 00:00:00.045 21 74 (50, 100]
2020-01-01 00:00:00.060 18 92 (50, 100]
2020-01-01 00:00:00.075 23 115 (100, 150]
2020-01-01 00:00:00.090 28 143 (100, 150]
2020-01-01 00:00:00.105 14 157 (150, 200]
2020-01-01 00:00:00.120 14 171 (150, 200]
2020-01-01 00:00:00.135 11 182 (150, 200]
Output:
julday cts
time
2020-01-01 00:00:00.007516800 2.458850e+06 37
2020-01-01 00:00:00.045014400 2.458850e+06 55
2020-01-01 00:00:00.082512000 2.458850e+06 51
2020-01-01 00:00:00.127526400 2.458850e+06 52
2020-01-01 00:00:00.172540800 2.458850e+06 44
2020-01-01 00:00:00.210038400 2.458850e+06 57
2020-01-01 00:00:00.247536000 2.458850e+06 43
2020-01-01 00:00:00.277516800 2.458850e+06 43
2020-01-01 00:00:00.315014400 2.458850e+06 63
2020-01-01 00:00:00.352512000 2.458850e+06 41
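As you can see, the first bin does not have at least 50 counts. The bin edges are correct for what pd.cut was asked to do, but they do not give what I really want. The correct answer for the first few bins is [53, 62, 56, …].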
I have looked at it, but it does not look quite right either.
Does anyone have any good ideas?
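The shortfall in the first bin can be reproduced by hand. The following is a minimal sketch that assumes only the first ten counts from the input table above; the names `cts_demo`, `edges`, and `fixed_sums` are illustrative and not part of the original code. It shows that the row whose cumulative sum first crosses an edge is assigned to the next interval, so a bin can close before it has collected 50 counts.

```python
import pandas as pd

# First ten counts copied from the input table above (assumption: nothing else is used).
cts_demo = pd.Series([18, 19, 16, 21, 18, 23, 28, 14, 14, 11])
cumsum_demo = cts_demo.cumsum()  # 18, 37, 53, 74, 92, 115, 143, 157, 171, 182

# Fixed edges at multiples of 50, as in the pd.cut call above.
edges = pd.interval_range(start=0, end=200, freq=50, closed='right')
labels = pd.cut(cumsum_demo, edges)

# Sum the counts that land in each fixed interval.
fixed_sums = cts_demo.groupby(labels, observed=True).sum().tolist()
print(fixed_sums)  # [37, 55, 51, 39]
```

The third row (cumulative sum 53) already belongs to (50, 100], so the first bin holds only 18 + 19 = 37 counts. The last value is partial because the sketch stops after ten rows.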
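Are you looking for qcut to cut the data into similarly sized partitions?
No, qcut splits the data into "equal-sized buckets based on rank or based on sample quantiles". I want to break wherever a bucket reaches at least 50 counts.
So you are fine with buckets of 100, 51, 50, 40, and so on? If so, you could merge the small bins with their neighbors to get the new bins.
To make it a bit clearer, I updated the code. I am really looking for the smallest bins that each have at least 50 counts.
Update: Here is a solution that works but is not pretty (and is slow):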
# set up a time index for the dataframe
r_t = pd.date_range(start=datetime.datetime(2020, 1, 1, 0),
                    end=datetime.datetime(2020, 1, 1, 0, 5),
                    freq=datetime.timedelta(milliseconds=15))
# random Poisson statistics in each bin
np.random.seed(8675309)
df = pd.DataFrame({'cts': np.random.poisson(20, size=len(r_t))}, index=r_t)
# every row starts in group 0; the loop below splits the groups up
df['group'] = 0
df.head(10)
cts group
2020-01-01 00:00:00.000 18 0
2020-01-01 00:00:00.015 19 0
2020-01-01 00:00:00.030 16 0
2020-01-01 00:00:00.045 21 0
2020-01-01 00:00:00.060 18 0
2020-01-01 00:00:00.075 23 0
2020-01-01 00:00:00.090 28 0
2020-01-01 00:00:00.105 14 0
2020-01-01 00:00:00.120 14 0
2020-01-01 00:00:00.135 11 0
# loop over the data, grouping it up.
# note: the test is "> 50" (strictly more than 50); use ">= 50" for "at least 50"
while df.loc[df['group'] == df['group'].max()]['cts'].cumsum().max() > 50:
    # running total within the last (still unsplit) group
    dft = df.loc[df['group'] == df['group'].max()]['cts'].cumsum()
    dft = dft.loc[dft > 50]
    # position of the row after the first one that crosses the threshold
    idx = dft.argmin() + 1
    if idx >= len(dft):
        # the threshold is crossed on the very last row; nothing left to split
        break
    df.loc[dft.index[idx]:, 'group'] += 1
# do the dataframe dance again to get it in the form I want
df = df.reset_index()
df = df.rename(columns={'index': 'time'})
df['julday'] = pd.DatetimeIndex(df['time']).to_julian_date()
df_cut = df.groupby('group').agg({'julday': 'mean', 'cts': 'sum'})
df_cut['time'] = pd.to_datetime(df_cut['julday'], origin='julian', unit='D')
df_cut = df_cut.set_index('time')
df_cut.head(10)
julday cts
time
2020-01-01 00:00:00.015033600 2.458850e+06 53
2020-01-01 00:00:00.059961600 2.458850e+06 62
2020-01-01 00:00:00.105062400 2.458850e+06 56
2020-01-01 00:00:00.157507200 2.458850e+06 68
2020-01-01 00:00:00.210038400 2.458850e+06 57
2020-01-01 00:00:00.255052800 2.458850e+06 64
2020-01-01 00:00:00.299980800 2.458850e+06 63
2020-01-01 00:00:00.344995200 2.458850e+06 63
2020-01-01 00:00:00.390009600 2.458850e+06 60
2020-01-01 00:00:00.427507200 2.458850e+06 51
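The while loop above rescans the whole dataframe on every pass, which is likely why it is slow. One possible alternative, sketched here and not taken from the original post, is a single pass that keeps a running total and starts a new group as soon as the total reaches the threshold. The helper `min_count_groups` and its `min_counts` parameter are hypothetical names, not a pandas API, and the demo uses only the first ten counts from the table above. The sketch closes a group at "at least 50" (`>=`), whereas the loop above uses a strict `> 50`; for integer counts, passing `min_counts=51` should mimic the strict version.

```python
import numpy as np
import pandas as pd


def min_count_groups(counts, min_counts=50):
    """Return one group label per row (hypothetical helper).

    A group is closed as soon as its running total reaches min_counts,
    so every group except possibly the last holds at least min_counts.
    """
    labels = np.empty(len(counts), dtype=np.int64)
    running = 0
    label = 0
    for i, c in enumerate(counts):
        labels[i] = label
        running += c
        if running >= min_counts:
            # this row completes the group; the next row starts a new one
            label += 1
            running = 0
    return labels


# First ten counts copied from the table above.
df_demo = pd.DataFrame({'cts': [18, 19, 16, 21, 18, 23, 28, 14, 14, 11]})
df_demo['group'] = min_count_groups(df_demo['cts'].to_numpy())
group_sums = df_demo.groupby('group')['cts'].sum().tolist()
print(group_sums)  # [53, 62, 56, 11]
```

The first three sums match the expected [53, 62, 56]. The trailing 11 is an unfinished group, because the sketch stops after ten rows; whether to drop or merge such a remainder is a choice left to the caller.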