Python: group rows in a DataFrame by time range

I have a large DataFrame indexed by timestamp, and I want to assign rows to groups based on a time range.

For example, in the data below, I group together the rows that fall within 1 ms of the first entry of the group:

                           groupid
1999-12-31 23:59:59.000107       1
1999-12-31 23:59:59.000385       1
1999-12-31 23:59:59.000404       1
1999-12-31 23:59:59.000704       1
1999-12-31 23:59:59.001281       2
1999-12-31 23:59:59.002211       2
1999-12-31 23:59:59.002367       3
I have working code that does this by iterating over the rows and slicing the DataFrame from the current row:

from datetime import datetime, timedelta

import numpy as np
import pandas as pd

dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x) for
              x in np.random.randint(1, 999999, 1000)])
df = pd.DataFrame({'groupid': None}, dts)

print(df.head(20))

groupid = 1
for dt, row in df.iterrows():
    if df.loc[row.name].groupid:        # row already assigned to an earlier group
        continue
    end = dt + timedelta(milliseconds=1)
    group = df.loc[dt:end]              # all rows within 1 ms of this row (inclusive)
    df.loc[group.index, 'groupid'] = groupid
    groupid += 1

print(df.head(20))

However, as with iterrows generally, this is slow on a large DataFrame. I have tried apply and groupby but without success. Is switching to itertuples the best way to improve performance (I am about to try it)? Can anyone suggest something?
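
For reference, a rough sketch of the itertuples variant mentioned above (hypothetical, assuming a unique index and the same imports as the loop above). It avoids building a Series per row, but the assignment is still a Python-level loop plus a label slice, so it only trims constant factors:

groupid = 1
for row in df.itertuples():
    dt = row.Index                          # the row's timestamp index
    if df.at[dt, 'groupid'] is not None:    # already assigned by an earlier slice
        continue
    end = dt + timedelta(milliseconds=1)
    members = df.loc[dt:end].index          # rows within 1 ms of the group's first row
    df.loc[members, 'groupid'] = groupid
    groupid += 1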

This is similar to a resample operation.

Create your data:

In [39]: pd.set_option('max_rows',12)

In [40]: np.random.seed(11111)

In [41]: dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x) for
              x in np.random.randint(1, 999999, 1000)])

In [42]: df = pd.DataFrame({'groupid': np.random.randn(len(dts))}, dts)

So simply grouping gives you the groups directly. You can iterate over the result, since it is a generator:

In [43]: list(df.groupby(pd.Grouper(freq='ms')))[0:3]
Out[43]: 
[(Timestamp('1999-12-31 23:59:59', offset='L'),
                               groupid
  1999-12-31 23:59:59.000789 -1.369503
  1999-12-31 23:59:59.000814  0.776049),
 (Timestamp('1999-12-31 23:59:59.001000', offset='L'),
                               groupid
  1999-12-31 23:59:59.001041 -0.374915
  1999-12-31 23:59:59.001062 -1.470845),
 (Timestamp('1999-12-31 23:59:59.002000', offset='L'),
                               groupid
  1999-12-31 23:59:59.002355 -0.240954)]
Just resampling may be even simpler. You can use a custom function for the how argument:

In [44]: df.resample('ms',how='sum')
Out[44]: 
                          groupid
1999-12-31 23:59:59.000 -0.593454
1999-12-31 23:59:59.001 -1.845759
1999-12-31 23:59:59.002 -0.240954
1999-12-31 23:59:59.003  1.291403
1999-12-31 23:59:59.004       NaN
1999-12-31 23:59:59.005  0.291484
...                           ...
1999-12-31 23:59:59.994       NaN
1999-12-31 23:59:59.995       NaN
1999-12-31 23:59:59.996       NaN
1999-12-31 23:59:59.997 -0.445052
1999-12-31 23:59:59.998       NaN
1999-12-31 23:59:59.999 -0.895305

[1000 rows x 1 columns]
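
If what is wanted is an integer label per row rather than the aggregated frame, the same Grouper can feed DataFrameGroupBy.ngroup() (available in modern pandas). Note this is still the fixed-clock-bucket grouping, not buckets anchored to each group's first row. A minimal sketch, with bucket as a hypothetical column name:

import numpy as np
import pandas as pd
from datetime import datetime

np.random.seed(11111)
dts = sorted([datetime(1999, 12, 31, 23, 59, 59, int(x)) for
              x in np.random.randint(1, 999999, 1000)])
df = pd.DataFrame({'groupid': np.random.randn(len(dts))}, dts)

# Number each non-empty 1 ms bucket consecutively (1-based, like the groupid in the question).
df['bucket'] = df.groupby(pd.Grouper(freq='ms')).ngroup() + 1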

OK, I think the following is what you want. It builds a timedelta from your index by subtracting the first value from every value, then takes the microseconds component, divides by 1000, and casts the resulting Series to int:

In [86]:

df['groupid'] = ((df.index.to_series() - df.index[0]).dt.microseconds / 1000).astype(np.int32)
df
Out[86]:
                            groupid
1999-12-31 23:59:59.000133        0
1999-12-31 23:59:59.000584        0
1999-12-31 23:59:59.003544        3
1999-12-31 23:59:59.009193        9
1999-12-31 23:59:59.010220       10
1999-12-31 23:59:59.010632       10
1999-12-31 23:59:59.010716       10
1999-12-31 23:59:59.011387       11
1999-12-31 23:59:59.011837       11
1999-12-31 23:59:59.013277       13
1999-12-31 23:59:59.013305       13
1999-12-31 23:59:59.014754       14
1999-12-31 23:59:59.016015       15
1999-12-31 23:59:59.016067       15
1999-12-31 23:59:59.017788       17
1999-12-31 23:59:59.018236       18
1999-12-31 23:59:59.021281       21
1999-12-31 23:59:59.021772       21
1999-12-31 23:59:59.021927       21
1999-12-31 23:59:59.022200       22
1999-12-31 23:59:59.023104       22
1999-12-31 23:59:59.023375       23
1999-12-31 23:59:59.023688       23
1999-12-31 23:59:59.023726       23
1999-12-31 23:59:59.025397       25
1999-12-31 23:59:59.026407       26
1999-12-31 23:59:59.026480       26
1999-12-31 23:59:59.027825       27
1999-12-31 23:59:59.028793       28
1999-12-31 23:59:59.030716       30
...                             ...
1999-12-31 23:59:59.975432      975
1999-12-31 23:59:59.976699      976
1999-12-31 23:59:59.977177      977
1999-12-31 23:59:59.979475      979
1999-12-31 23:59:59.980282      980
1999-12-31 23:59:59.980672      980
1999-12-31 23:59:59.983202      983
1999-12-31 23:59:59.984214      984
1999-12-31 23:59:59.984674      984
1999-12-31 23:59:59.984933      984
1999-12-31 23:59:59.985664      985
1999-12-31 23:59:59.985779      985
1999-12-31 23:59:59.988812      988
1999-12-31 23:59:59.989324      989
1999-12-31 23:59:59.990386      990
1999-12-31 23:59:59.990485      990
1999-12-31 23:59:59.990969      990
1999-12-31 23:59:59.991255      991
1999-12-31 23:59:59.991739      991
1999-12-31 23:59:59.993979      993
1999-12-31 23:59:59.994705      994
1999-12-31 23:59:59.994874      994
1999-12-31 23:59:59.995397      995
1999-12-31 23:59:59.995753      995
1999-12-31 23:59:59.995863      995
1999-12-31 23:59:59.996574      996
1999-12-31 23:59:59.998139      998
1999-12-31 23:59:59.998533      998
1999-12-31 23:59:59.998778      998
1999-12-31 23:59:59.999915      999
Thanks to @Jeff for pointing out a cleaner way:

In [96]:
df['groupid'] = (df.index-df.index[0]).astype('timedelta64[ms]')
df

Out[96]:
                            groupid
1999-12-31 23:59:59.000884        0
1999-12-31 23:59:59.001175        0
1999-12-31 23:59:59.001262        0
1999-12-31 23:59:59.001540        0
1999-12-31 23:59:59.001769        0
1999-12-31 23:59:59.002478        1
1999-12-31 23:59:59.005001        4
1999-12-31 23:59:59.005497        4
1999-12-31 23:59:59.006908        6
1999-12-31 23:59:59.008860        7
1999-12-31 23:59:59.009257        8
1999-12-31 23:59:59.010012        9
1999-12-31 23:59:59.011451       10
1999-12-31 23:59:59.013177       12
1999-12-31 23:59:59.014138       13
1999-12-31 23:59:59.015795       14
1999-12-31 23:59:59.015865       14
1999-12-31 23:59:59.016069       15
1999-12-31 23:59:59.016666       15
1999-12-31 23:59:59.016718       15
1999-12-31 23:59:59.019058       18
1999-12-31 23:59:59.019675       18
1999-12-31 23:59:59.020747       19
1999-12-31 23:59:59.021856       20
1999-12-31 23:59:59.022959       22
1999-12-31 23:59:59.023812       22
1999-12-31 23:59:59.023938       23
1999-12-31 23:59:59.024122       23
1999-12-31 23:59:59.025332       24
1999-12-31 23:59:59.025397       24
...                             ...
1999-12-31 23:59:59.959725      958
1999-12-31 23:59:59.959742      958
1999-12-31 23:59:59.959892      959
1999-12-31 23:59:59.960345      959
1999-12-31 23:59:59.960800      959
1999-12-31 23:59:59.961054      960
1999-12-31 23:59:59.962749      961
1999-12-31 23:59:59.965681      964
1999-12-31 23:59:59.966409      965
1999-12-31 23:59:59.966558      965
1999-12-31 23:59:59.967357      966
1999-12-31 23:59:59.967842      966
1999-12-31 23:59:59.970465      969
1999-12-31 23:59:59.974022      973
1999-12-31 23:59:59.974734      973
1999-12-31 23:59:59.975879      974
1999-12-31 23:59:59.978291      977
1999-12-31 23:59:59.980483      979
1999-12-31 23:59:59.980868      979
1999-12-31 23:59:59.981417      980
1999-12-31 23:59:59.984208      983
1999-12-31 23:59:59.984639      983
1999-12-31 23:59:59.985533      984
1999-12-31 23:59:59.986785      985
1999-12-31 23:59:59.987502      986
1999-12-31 23:59:59.987914      987
1999-12-31 23:59:59.988406      987
1999-12-31 23:59:59.989436      988
1999-12-31 23:59:59.994449      993
1999-12-31 23:59:59.996657      995

Thanks for the answer Jeff. Unfortunately this is not what I am looking for, because it only groups on fixed millisecond buckets. What I am after is millisecond buckets measured from the first item of each group. For example, I need:

df = pd.DataFrame({'groupid': None},
                  index=[datetime(1999, 12, 31, 23, 59, 59, 1900),
                         datetime(1999, 12, 31, 23, 59, 59, 2100)])
assert len(df.groupby(pd.Grouper(freq='ms'))) == 1
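
For completeness, a minimal sketch (not from the answers above) of the "within 1 ms of the group's first entry" rule described in this comment. The rule is inherently sequential, so it is done as a single pass over the raw integer timestamps, which avoids the per-row DataFrame slicing that made the iterrows version slow; the function name anchor_groups is made up for illustration:

import numpy as np
import pandas as pd
from datetime import datetime

def anchor_groups(index, window_us=1000):
    """Start a new group when a row falls more than window_us microseconds
    after the first row of the current group."""
    ts = index.values.astype(np.int64) // 1000   # timestamps as integer microseconds
    ids = np.empty(len(ts), dtype=np.int64)
    gid, start = 1, ts[0]
    for i, t in enumerate(ts):
        if t - start > window_us:                # past 1 ms from the group's anchor row
            gid += 1
            start = t
        ids[i] = gid
    return ids

# The two rows from the comment are 200 us apart but straddle a millisecond
# boundary, so pd.Grouper(freq='ms') splits them while this rule keeps them together.
df = pd.DataFrame({'groupid': None},
                  index=[datetime(1999, 12, 31, 23, 59, 59, 1900),
                         datetime(1999, 12, 31, 23, 59, 59, 2100)])
df['groupid'] = anchor_groups(df.index)
assert df['groupid'].nunique() == 1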