How to create 5-minute time windows to count word occurrences using Python 3
I have a CSV file with two columns: milliseconds and topics. My CSV file looks like this:
milliseconds, topics
1.4998308E+12,today is warm
1.4998309E+12,today is warm
1.4998310E+12,today is warm
1.4998314E+12,today is cold
1.4998315E+12,today is cold
1.4998317E+12,today is cold
1.4998318E+12,today is cold
1.4998320E+12,today is cold
1.4998322E+12,today is cold
1.4998323E+12,today is cold
1.4998324E+12,today is cold
1.4998326E+12,today is warm
1.4998328E+12,today is warm
1.4998331E+12,today is cold
1.4998333E+12,today is warm
1.4998336E+12,today is warm
1.4998336E+12,today is warm
1.4998337E+12,today is warm
1.4998338E+12,today is snow
1.4998339E+12,today is snow
1.4998340E+12,today is snow
1.4998341E+12,today is snow
1.4998342E+12,today is warm
1.4998343E+12,today is warm
How can I count the words in windows of 5 minutes each? The timestamps run from 6:40:00 to 7:38:20 on 12 July 2017.
window(1) start from 6:40:00 to 6:44:00
window(2) start from 6:45:00 to 6:49:00
window(3) start from 6:49:00 to 6:53:00
window(4) start from 6:54:00 to 6:58:00
window(5) start from 6:59:00 to 7:03:00
window(6) start from 7:04:00 to 7:08:00
etc
I want to use Python 3 to count the occurrences of snow, warm and cold in each 5-minute interval. The result should look like this:
warm 3 0 0 0 0 0 2 0 1 3 0 2 total 11
cold 0 0 2 2 2 2 0 1 0 0 0 0 total 09
snow 0 0 0 0 0 0 0 0 0 0 3 1 total 4
where window(1) contains warm 3 times, cold 0 times and snow 0 times, and so on.

pandas groupby is what you need:
import pandas as pd
df = pd.read_csv(<filename>)
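The step that turns `df` into the `topics` object used in the groupby call is not shown here. A minimal sketch of one way to prepare it, assuming the word to count is the last word of each phrase and that the numbers are milliseconds since the Unix epoch. The inline `csv_text` sample and the helper `count` column are illustrative choices, not part of the original answer:

```python
import io
import pandas as pd

# A few rows from the question, inlined so the sketch runs on its own.
csv_text = """milliseconds, topics
1.4998308E+12,today is warm
1.4998309E+12,today is warm
1.4998314E+12,today is cold
"""

# skipinitialspace=True drops the space after the comma in the header,
# so the second column is named "topics" rather than " topics".
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Interpret the numbers as milliseconds since the Unix epoch (UTC).
df["milliseconds"] = pd.to_datetime(df["milliseconds"], unit="ms")

# Assumption: the word of interest is the last word of each phrase.
df["topic"] = df["topics"].str.split().str[-1]

# One row per CSV line, indexed by timestamp, plus a helper column to count.
topics = df.set_index("milliseconds")[["topic"]].assign(count=1)
print(topics)
```

Because the index is named `milliseconds`, this frame can be passed to a groupby that uses `pd.Grouper(level='milliseconds', ...)`.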
Then we group the topics into 5-minute intervals:
counts = topics.groupby([pd.Grouper(level='milliseconds', freq='5min'), 'topic']).count()
milliseconds         topic  count
2017-07-12 03:40:00  warm       3
2017-07-12 03:50:00  cold       2
2017-07-12 03:55:00  cold       2
2017-07-12 04:00:00  cold       2
2017-07-12 04:05:00  cold       2
2017-07-12 04:10:00  warm       2
2017-07-12 04:15:00  cold       1
2017-07-12 04:20:00  warm       1
2017-07-12 04:25:00  warm       3
2017-07-12 04:30:00  snow       3
2017-07-12 04:35:00  snow       1
2017-07-12 04:35:00  warm       2
If needed, you can use unstack to get one column per window:
results = counts.unstack('milliseconds').fillna(0).astype(int)
results.columns = range(len(results.columns))
results['total'] = results.sum(axis=1)
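The grouped result only contains windows in which at least one row falls, whereas the expected output in the question also lists empty windows. A minimal sketch of one way to include them, swapping in `pd.crosstab` and `reindex` for the groupby/unstack approach above. The names `stamps`, `windows`, `table` and `all_windows` are illustrative, and only a handful of rows from the question are used:

```python
import pandas as pd

# Timestamps (ms since epoch) and words for the first rows of the question.
ms = [1.4998308e12, 1.4998309e12, 1.4998310e12, 1.4998314e12, 1.4998315e12]
words = ["warm", "warm", "warm", "cold", "cold"]

stamps = pd.to_datetime(pd.Series(ms), unit="ms")
windows = stamps.dt.floor("5min")  # start of each row's 5-minute window

# Rows are words, columns are window start times.
table = pd.crosstab(pd.Series(words, name="topic"), windows.rename("window"))

# Add zero-filled columns for windows in which nothing occurred.
all_windows = pd.date_range(windows.min(), windows.max(), freq="5min")
table = table.reindex(columns=all_windows, fill_value=0)

# Number the windows 1, 2, 3, ... and append the row totals.
table.columns = range(1, len(table.columns) + 1)
table["total"] = table.sum(axis=1)
print(table)
```

With these five rows the window starting at 03:45 is empty, so it appears as a column of zeros between the two populated windows.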
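Can you show us what you have tried?
I got this error: File "pandas\_libs\hashtable_class_helper.pxi", line 1218, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'topics'
That is because your CSV has a space in the header. Rename the column or change the key.
Now I get this: File "C:/Users/admin/readFile/window.py", line 10, in <module> results = counts.unstack('ms').fillna(0) AttributeError: 'int' object has no attribute 'unstack'
Thank you very much, Maarten.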
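The `KeyError: 'topics'` discussed above can be reproduced with the header line from the question. A minimal sketch, assuming the file really starts with `milliseconds, topics` (with a space after the comma); the variable names are illustrative:

```python
import io
import pandas as pd

csv_text = "milliseconds, topics\n1.4998308E+12,today is warm\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))  # the second name keeps its leading space

# Option 1: rename the columns by stripping surrounding whitespace.
renamed = raw.rename(columns=str.strip)
print(list(renamed.columns))

# Option 2: leave the names alone and use the exact key, space included.
same_column = raw[" topics"]
```

Either option avoids the lookup failure. Renaming is usually the less surprising choice when the column is referenced in several places.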
Before grouping, the data contains one row per CSV line:
milliseconds         topic  count
2017-07-12 03:40:00  warm       1
2017-07-12 03:41:40  warm       1
2017-07-12 03:43:20  warm       1
2017-07-12 03:50:00  cold       1
2017-07-12 03:51:40  cold       1
2017-07-12 03:55:00  cold       1
2017-07-12 03:56:40  cold       1
2017-07-12 04:00:00  cold       1
2017-07-12 04:03:20  cold       1
2017-07-12 04:05:00  cold       1
2017-07-12 04:06:40  cold       1
2017-07-12 04:10:00  warm       1
2017-07-12 04:13:20  warm       1
2017-07-12 04:18:20  cold       1
2017-07-12 04:21:40  warm       1
2017-07-12 04:26:40  warm       1
2017-07-12 04:26:40  warm       1
2017-07-12 04:28:20  warm       1
2017-07-12 04:30:00  snow       1
2017-07-12 04:31:40  snow       1
2017-07-12 04:33:20  snow       1
2017-07-12 04:35:00  snow       1
2017-07-12 04:36:40  warm       1
2017-07-12 04:38:20  warm       1
print(results)
topic  0  1  2  3  4  5  6  7  8  9  10  total
cold   0  2  2  2  2  0  1  0  0  0   0      9
snow   0  0  0  0  0  0  0  0  0  3   1      4
warm   3  0  0  0  0  2  0  1  3  0   2     11