Python pandas: cutting into bins at non-uniform boundaries


I have count-rate data, and I want a minimum number of counts in each bin. The challenge I face is choosing the bin boundaries: with fixed bins there is no guaranteed minimum count per bin. Here is the example I am working with:

import datetime

import numpy as np
import pandas as pd

# setup a time index for the dataframe
r_t = pd.date_range(start=datetime.datetime(2020, 1, 1, 0),
                    end=datetime.datetime(2020, 1, 1, 0, 5),
                    freq=datetime.timedelta(milliseconds=15))

# random poisson statistics in each bin
np.random.seed(8675309)
df = pd.DataFrame({'cts': np.random.poisson(20, size=len(r_t))}, index=r_t)
# use a cumsum to define where to break, this is where more thinking is needed
df['cumsum'] = df['cts'].cumsum()
# do the cut, this does not give the answer I want
df['cut'] = pd.cut(df['cumsum'],
                   pd.interval_range(start=0,
                                     end=df['cumsum'].max() + 1,
                                     freq=50,
                                     closed='right'),
                   include_lowest=True)
print(df.head())
# mess around to make this into a dataframe of the shape I need
df = df.reset_index()
df = df.rename(columns={'index': 'time'})
df['julday'] = pd.DatetimeIndex(df['time']).to_julian_date()
df_cut = df.groupby('cut').agg({'julday': 'mean', 'cts': 'sum'})
df_cut['time'] = pd.to_datetime(df_cut['julday'], origin='julian', unit='D')
df_cut = df_cut.set_index('time')
df_cut.head(10)
Initial input data:

                         cts  cumsum         cut
2020-01-01 00:00:00.000   18      18     (0, 50]
2020-01-01 00:00:00.015   19      37     (0, 50]
2020-01-01 00:00:00.030   16      53   (50, 100]
2020-01-01 00:00:00.045   21      74   (50, 100]
2020-01-01 00:00:00.060   18      92   (50, 100]
2020-01-01 00:00:00.075   23     115  (100, 150]
2020-01-01 00:00:00.090   28     143  (100, 150]
2020-01-01 00:00:00.105   14     157  (150, 200]
2020-01-01 00:00:00.120   14     171  (150, 200]
2020-01-01 00:00:00.135   11     182  (150, 200]
Output:

            julday  cts
time        
2020-01-01 00:00:00.007516800   2.458850e+06    37
2020-01-01 00:00:00.045014400   2.458850e+06    55
2020-01-01 00:00:00.082512000   2.458850e+06    51
2020-01-01 00:00:00.127526400   2.458850e+06    52
2020-01-01 00:00:00.172540800   2.458850e+06    44
2020-01-01 00:00:00.210038400   2.458850e+06    57
2020-01-01 00:00:00.247536000   2.458850e+06    43
2020-01-01 00:00:00.277516800   2.458850e+06    43
2020-01-01 00:00:00.315014400   2.458850e+06    63
2020-01-01 00:00:00.352512000   2.458850e+06    41
As you can see, the first bin does not have at least 50 counts. The bin edges are correct, but this is not what I actually want. The correct answer for the first few bins is

[53, 62, 56, …]

I have looked at this, but it does not seem right either.

Does anyone have any good ideas?
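For what it's worth, the grouping that yields those expected sums can be sketched as a greedy single pass over the counts (`greedy_groups` is a hypothetical helper name, and "at least 50" is taken as a running total ≥ 50):

```python
import numpy as np

def greedy_groups(cts, min_count=50):
    """Label consecutive rows so each group's count total
    reaches at least min_count (the last group may fall short)."""
    labels = np.empty(len(cts), dtype=int)
    group = total = 0
    for i, c in enumerate(cts):
        labels[i] = group
        total += c
        if total >= min_count:   # group complete, start a new one
            group += 1
            total = 0
    return labels

# the ten counts from the table above
cts = [18, 19, 16, 21, 18, 23, 28, 14, 14, 11]
labels = greedy_groups(cts)
print(labels)  # → [0 0 0 1 1 1 2 2 2 3]
# per-group sums: 53, 62, 56 (plus a trailing partial group of 11)
```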


Update:

Here is a working, but not pretty (and quite slow), solution:


Are you looking for `qcut` to cut the data into similarly sized partitions?

No, `qcut` splits into "equal-sized buckets based on rank or based on sample quantiles"; I want to break wherever each bucket has at least 50 counts.

So could you use buckets of 100, 51, 50, 40, and so on? If so, you could merge the small bins with their neighbors to get new bins.

To make it clearer, I have updated the code. I am really looking for the smallest bins such that each bin has at least 50 counts.
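To make the `qcut` distinction concrete: `qcut` buckets rows by the quantiles of the *values*, not by running totals over consecutive rows, so it cannot enforce a minimum count per time bin. A small sketch using the counts from the table above:

```python
import pandas as pd

# the ten counts from the table above
s = pd.Series([18, 19, 16, 21, 18, 23, 28, 14, 14, 11])

# qcut splits at sample quantiles of the values (here the median),
# so each bucket holds a similar number of rows regardless of row order
buckets = pd.qcut(s, 2)
print(buckets.value_counts())
```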
# setup a time index for the dataframe
r_t = pd.date_range(start=datetime.datetime(2020, 1, 1, 0),
                    end=datetime.datetime(2020, 1, 1, 0, 5),
                    freq=datetime.timedelta(milliseconds=15))
# random poisson statistics in each bin
np.random.seed(8675309)
df = pd.DataFrame({'cts': np.random.poisson(20, size=len(r_t))}, index=r_t)

df['group'] = 0
df.head(10)


                         cts  group
2020-01-01 00:00:00.000   18      0
2020-01-01 00:00:00.015   19      0
2020-01-01 00:00:00.030   16      0
2020-01-01 00:00:00.045   21      0
2020-01-01 00:00:00.060   18      0
2020-01-01 00:00:00.075   23      0
2020-01-01 00:00:00.090   28      0
2020-01-01 00:00:00.105   14      0
2020-01-01 00:00:00.120   14      0
2020-01-01 00:00:00.135   11      0


# loop over the data, peeling one group off the tail per iteration
while df.loc[df['group'] == df['group'].max()]['cts'].cumsum().max() > 50:
    # running total of counts within the current (last) group
    dft = df.loc[df['group'] == df['group'].max()]['cts'].cumsum()
    # rows at and past the point where the total first exceeds 50
    dft = dft.loc[dft > 50]
    # the first such row closes the group; the row after it starts a new one
    idx = dft.argmin() + 1
    df.loc[dft.index[idx]:, 'group'] += 1


# do the dataframe dance again to get it in the form I want
df = df.reset_index()
df = df.rename(columns={'index':'time'})
df['julday'] = pd.DatetimeIndex(df['time']).to_julian_date()
df_cut = df.groupby('group').agg({'julday': 'mean', 'cts': 'sum'})
df_cut['time'] = pd.to_datetime(df_cut['julday'], origin='julian', unit='D')
df_cut = df_cut.set_index('time')
df_cut.head(10)

    julday  cts
time        
2020-01-01 00:00:00.015033600   2.458850e+06    53
2020-01-01 00:00:00.059961600   2.458850e+06    62
2020-01-01 00:00:00.105062400   2.458850e+06    56
2020-01-01 00:00:00.157507200   2.458850e+06    68
2020-01-01 00:00:00.210038400   2.458850e+06    57
2020-01-01 00:00:00.255052800   2.458850e+06    64
2020-01-01 00:00:00.299980800   2.458850e+06    63
2020-01-01 00:00:00.344995200   2.458850e+06    63
2020-01-01 00:00:00.390009600   2.458850e+06    60
2020-01-01 00:00:00.427507200   2.458850e+06    51
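As an aside on speed: the while-loop above rescans the tail of the frame once per group, so it is roughly quadratic in the number of rows. The same kind of grouping can be computed in a single pass over the counts; a sketch (note it closes a group at `total >= 50`, i.e. "at least 50 counts", whereas the loop above breaks strictly after the total exceeds 50, so the two can differ when a running total lands exactly on 50):

```python
import datetime

import numpy as np
import pandas as pd

# same setup as above
r_t = pd.date_range(start=datetime.datetime(2020, 1, 1, 0),
                    end=datetime.datetime(2020, 1, 1, 0, 5),
                    freq=datetime.timedelta(milliseconds=15))
np.random.seed(8675309)
df = pd.DataFrame({'cts': np.random.poisson(20, size=len(r_t))}, index=r_t)

# single pass: close a group as soon as its running total reaches 50
labels = np.empty(len(df), dtype=int)
group = total = 0
for i, c in enumerate(df['cts'].to_numpy()):
    labels[i] = group
    total += c
    if total >= 50:
        group += 1
        total = 0
df['group'] = labels

df_cut = df.groupby('group').agg({'cts': 'sum'})
print(df_cut.head())
```

Every group except possibly the last is guaranteed to hold at least 50 counts, and no counts are dropped.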