Python: filtering a numpy array of datetimes by frequency of occurrence

python datetime numpy pandas

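I have an array of over 2 million records. Each record has a timestamp at 10-minute resolution in datetime.datetime format, plus several other values in other columns.

I only want to keep the records whose timestamp occurs 20 or more times in the array. What is the fastest way to do this? I have plenty of RAM, so I'm looking for processing speed.

I've tried [].count() in a list comprehension, but started to lose the will to live waiting for it to finish. I've also tried numpy.bincount(), but unfortunately it doesn't like datetime.datetime.

Any suggestions would be much appreciated. Thanks!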

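Sort your array. Count contiguous occurrences by going through it once, and filter for frequency >= 20. The running time is dominated by the sort, which is O(n log n), whereas your list comprehension was probably O(n**2)... that makes quite a difference on 2 million entries.

Depending on how your data is structured, you might be able to sort only the axis and data you need from the numpy array that holds it.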
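A minimal sketch of the sort-then-count idea, assuming the timestamps live in a 1-D numpy array (datetime64 here; an object array of datetime.datetime should also sort, only more slowly). The function name `filter_by_run_length` and the sample data are made up for illustration:

```python
import numpy as np

def filter_by_run_length(timestamps, min_count=20):
    """Boolean row mask: True where the row's timestamp occurs >= min_count times."""
    timestamps = np.asarray(timestamps)
    n = len(timestamps)
    if n == 0:
        return np.zeros(0, dtype=bool)
    # Sort once (O(n log n)); remember the permutation so we can map back to rows.
    order = np.argsort(timestamps, kind="stable")
    sorted_ts = timestamps[order]
    # A new run starts at position 0 and wherever the value changes.
    is_start = np.concatenate(([True], sorted_ts[1:] != sorted_ts[:-1]))
    starts = np.flatnonzero(is_start)
    run_lengths = np.diff(np.append(starts, n))
    # Expand the per-run decision back to one flag per sorted element.
    keep_sorted = np.repeat(run_lengths >= min_count, run_lengths)
    mask = np.empty(n, dtype=bool)
    mask[order] = keep_sorted  # undo the sort
    return mask

# Tiny illustrative data set; the question would use min_count=20.
ts = np.array(
    ["2012-01-01T00:00", "2012-01-01T00:10", "2012-01-01T00:00",
     "2012-01-01T00:20", "2012-01-01T00:00"],
    dtype="datetime64[m]",
)
mask = filter_by_run_length(ts, min_count=3)
kept = ts[mask]
```

The same mask can be applied to the other columns, so whole records are kept or dropped together.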

I'm editing this to include timings using np.unique, based on the suggestion below. It is by far the best solution so far.

In [10]: import pandas as pd
         import numpy as np
         from collections import Counter

         #create a fake data set 
         dates = pd.date_range("2012-01-01", "2015-01-01", freq="10min")
         dates = np.random.choice(dates, 2000000, replace=True)

Per the suggestion below, the following is by far the fastest:

In [32]: %%timeit
         values, counts = np.unique(dates, return_counts=True)
         filtered_dates = values[counts>20]
         10 loops, best of 3: 150 ms per loop

Using Counter, you can build a dict of the count of each item and then convert it into a pd.Series to do the filtering:

In [11]: %%timeit
         foo = pd.Series(Counter(dates))
         filtered_dates = np.array(foo[foo > 20].index)
         1 loop, best of 3: 12.3 s per loop

That isn't too bad for an array of 2 million items, compared to the following:

In [12]: dates = list(dates)
         filtered_dates = [e for e in set(dates) if dates.count(e) > 20]

I'm not going to wait for the list comprehension version to finish…

You could actually try np.unique. In numpy v1.9+, unique can return some extra values such as unique_indices, unique_inverse and unique_counts.
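As a rough sketch of how those extra return values could be used to keep whole records rather than only the unique timestamps: `return_inverse` gives, for every row, the position of its timestamp in the unique array, so indexing the counts with it yields one count per row. The data below is random stand-in data, and `values_col` is a hypothetical extra column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 2000 rows drawn from the 144 ten-minute slots of one day.
slots = np.datetime64("2012-01-01T00:00") + np.arange(0, 1440, 10).astype("timedelta64[m]")
timestamps = slots[rng.integers(0, len(slots), size=2000)]
values_col = rng.random(2000)  # hypothetical extra column

unique_ts, inverse, counts = np.unique(
    timestamps, return_inverse=True, return_counts=True
)

# counts[inverse] is the occurrence count of each row's own timestamp.
row_mask = counts[inverse] >= 20  # ">= 20" matches "20 or more" in the question

kept_ts = timestamps[row_mask]
kept_values = values_col[row_mask]
```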

If you want to use pandas, it would be pretty simple and probably quite fast. You could use a groupby filter. Something like:

out = df.groupby('timestamp').filter(lambda x: len(x) > 20)
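For comparison, here is a small self-contained sketch of the same filter written with `transform("size")`, which avoids calling a Python lambda once per group and is usually quicker when there are many groups (an expectation, not a measurement from this thread). The frame `df`, its `value` column and the threshold of 3 are illustrative assumptions; the question itself would use 20.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame: three rows at 00:00, two at 00:10, one at 00:20.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2012-01-01 00:00"] * 3 + ["2012-01-01 00:10"] * 2 + ["2012-01-01 00:20"]
    ),
    "value": np.arange(6.0),
})
min_count = 3  # the question would use 20

# One group size per row, aligned with df's index.
sizes = df.groupby("timestamp")["value"].transform("size")
out = df[sizes >= min_count]
```

Here `out` keeps only the three rows stamped 00:00, in their original order.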

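Numpy is slower than pandas on these types of operations, because np.unique sorts, while the machinery in pandas does not need to. Furthermore, this is much more idiomatic.

pandas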

In [22]: %%timeit
   ....: i = Index(dates)
   ....: i[i.value_counts()>20]
   ....: 
10 loops, best of 3: 78.2 ms per loop

In [23]: i = Index(dates)

In [24]: i[i.value_counts()>20]
Out[24]: 
DatetimeIndex(['2013-06-16 20:40:00', '2013-05-28 03:00:00', '2013-10-31 19:50:00', '2014-06-20 13:00:00', '2013-07-08 21:40:00', '2012-02-26 17:00:00', '2013-01-02 15:40:00', '2012-08-24 02:00:00',
               '2014-10-17 08:20:00', '2012-07-27 20:10:00',
               ...
               '2014-08-07 05:10:00', '2014-05-21 08:10:00', '2014-03-09 12:50:00', '2013-05-10 02:30:00', '2013-04-15 20:20:00', '2012-06-23 05:20:00', '2012-07-06 16:10:00', '2013-02-14 12:20:00',
               '2014-10-27 03:10:00', '2013-09-04 12:00:00'],
              dtype='datetime64[ns]', length=2978, freq=None)

In [25]: len(i[i.value_counts()>20])
Out[25]: 2978

Numpy, from the other solution

In [26]: %%timeit
         values, counts = np.unique(dates, return_counts=True)
         filtered_dates = values[counts>20]
   ....: 
10 loops, best of 3: 145 ms per loop

In [27]: filtered_dates = values[counts>20]

In [28]: len(filtered_dates)
Out[28]: 2978
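One thing to watch with the value_counts approach: the mask it produces has one entry per unique timestamp, not one per row, so on recent pandas versions it may be safer to filter the counts themselves and then map back to rows. A hedged sketch on random stand-in data, using `>= 20` to match "20 or more" in the question:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in data: 2000 rows drawn from 144 ten-minute slots.
slots = pd.date_range("2012-01-01", periods=144, freq="10min")
dates = slots[rng.integers(0, len(slots), size=2000)]  # a DatetimeIndex

vc = dates.value_counts()          # Series: unique timestamp -> count
frequent = vc[vc >= 20].index      # timestamps that occur often enough
row_mask = dates.isin(frequent)    # boolean array, one entry per row

kept = dates[row_mask]
```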

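Thanks for all the suggestions.

In the end I did something completely different with dictionaries, and found it to be much faster for the processing I needed.

I created a dictionary with the unique set of timestamps as keys and empty lists as values, then looped once through the unordered list (or array) and filled the value lists with the values I wanted to count.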
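The exact code was not posted, so the following is only a guess at what that approach might look like. It uses `collections.defaultdict` instead of pre-creating the keys, and the `records` list of (timestamp, value) pairs is hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical records: (timestamp, value) pairs in no particular order.
base = datetime(2012, 1, 1)
records = [(base + timedelta(minutes=10 * (i % 3)), float(i)) for i in range(7)]
min_count = 3  # the question would use 20

# Single pass: collect the values seen for each timestamp.
groups = defaultdict(list)
for ts, value in records:
    groups[ts].append(value)

# Keep only timestamps that occur often enough.
kept = {ts: vals for ts, vals in groups.items() if len(vals) >= min_count}
```

Dictionary lookups are constant time on average, so the whole pass is roughly linear in the number of records, with no sort needed.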

Thanks again!

Is there a fast way to count occurrences without looping over the data?

No. You have to go through every entry to filter it; however, it is very fast if you sort the data first.

pandas might be able to do this in good time, so I've added the tag. Can you give a very small example of what your array looks like? 3-4 elements should give us a good enough idea.

Use pandas and a groupby on the timestamp.

Agreed, this is by far the best solution, provided pandas is available.