Pandas dask - CSV time series operations

pandas csv matplotlib anaconda dask

I have a CSV of about 5 GB with the following structure and data types:

              datetime      product name      serial number
0  2017-06-24 14:30:15            orange             123456
1  2017-07-04 21:33:50             apple             123456
2  2017-07-06 06:38:52            orange             123456
3  2017-07-10 15:52:07            banana             123456
4  2017-07-10 15:52:51            banana             123456
5  2017-07-10 15:53:18            banana             123456
6  2017-07-11 11:50:40         pineapple             123456
7  2017-07-11 00:53:43             apple              54321
8  2017-07-11 06:23:52             apple              54321
9  2017-07-11 06:23:52             apple              12454
10 2017-07-11 06:23:52             apple              12454
11 2017-07-11 06:23:52             apple              12454
12 2017-07-11 06:23:52             apple              15039
13 2017-07-11 06:23:52             apple              15037
14 2017-07-11 06:23:52             apple              15039
15 2017-07-11 06:23:52             apple              15190
16 2017-07-11 06:23:52             apple              15039
17 2017-07-11 06:23:52             apple              15037
18 2017-07-11 06:23:52             apple              15037
19 2017-07-11 06:23:52             apple              15037
....
a few million more lines

df.dtypes
Out[134]: 
datetime           datetime64[ns]
name                       object
events                      int64
dtype: object

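Question 1: How do I group by product name and then count serial number occurrences for only the top 10 products (the 10 that occur most often)?

Question 2: How do I plot the occurrences of each serial number for one product name over time?

Question 3: What I really want is to plot the occurrences of each "serial number" for one product name over time. So far, I can select a "product name" from the data frame with the following: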

df_orange = df[df['product name'] == 'orange']
# how do I plot it?

My two cents would be to use .cut or .resample for binning, but here is a simple solution that you can run for each product name:

import pandas as pd
import matplotlib.pyplot as plt

# groupby twice
apple = (df.groupby('product name')    # groupby 'product name'
           .get_group('apple')         # get 'apple' group
           .groupby('datetime'))       # groupby 'datetime'


apple1 = (apple['serial number']       # select 'serial number'
          .value_counts()              # count each 'serial number' per 'datetime'
          .unstack(1))                 # this makes 'serial number's go across columns

apple1.plot(kind='bar')                # plot it
plt.xticks(rotation=0)                 # because your 'datetime' is long and un-formatted
plt.yticks([i for i in range(5)])      # set yticks to int
plt.show()
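
As an alternative to grouping on the raw timestamps, the binning idea mentioned at the start of this answer might look roughly like the sketch below. It uses pd.Grouper (the groupby counterpart of .resample) to count serial numbers per day. The column names come from the question; the helper name count_serials_per_bin, the daily bin width, and the sample rows are assumptions made for illustration only.

```python
import pandas as pd

def count_serials_per_bin(df, product, freq="D"):
    """Count occurrences of each serial number per time bin for one product.

    Hypothetical helper: column names follow the question, `freq` is an
    assumed bin width (daily by default).
    """
    subset = df[df["product name"] == product]
    counts = (subset
              .groupby([pd.Grouper(key="datetime", freq=freq), "serial number"])
              .size()                   # rows per (time bin, serial number)
              .unstack(fill_value=0))   # serial numbers become columns
    return counts

# Tiny made-up frame, only to show the shape of the result
sample = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-07-11 00:53:43", "2017-07-11 06:23:52",
        "2017-07-11 06:23:52", "2017-07-12 08:00:00",
    ]),
    "product name": ["apple", "apple", "apple", "apple"],
    "serial number": [54321, 54321, 12454, 12454],
})
daily = count_serials_per_bin(sample, "apple", freq="D")
```

Calling daily.plot(kind='bar') on the result should then give one bar group per day instead of one per distinct timestamp, which tends to be easier to read when timestamps are dense.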



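Repeat this for each product name, plot the results as subplots in one figure, and then set ...

It would be really helpful if you had an example of the desired output for questions 1 and 2, so we could get a visual idea of what you need.

Yes, I ended up using pandas; although it is very slow, it is easier to use than DASK or vesk (for plotting).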
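
The suggestion to repeat the plot for each product name and collect the results as subplots could be sketched as follows. This is only one possible layout: the function name plot_products, the stacked single-column arrangement, the figure size, and the sample rows are all assumptions, and the Agg backend line is there only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive windows
import matplotlib.pyplot as plt
import pandas as pd

def plot_products(df, products):
    """Draw one bar-chart subplot per product name and return the figure."""
    fig, axes = plt.subplots(len(products), 1, squeeze=False,
                             figsize=(8, 3 * len(products)))
    for ax, product in zip(axes[:, 0], products):
        counts = (df[df["product name"] == product]
                  .groupby(["datetime", "serial number"])
                  .size()                   # rows per (datetime, serial number)
                  .unstack(fill_value=0))   # serial numbers become columns
        counts.plot(kind="bar", ax=ax, title=product, rot=0)
    fig.tight_layout()
    return fig

# Made-up rows in the shape shown in the question
sample = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-07-11 00:53:43", "2017-07-11 06:23:52",
        "2017-07-10 15:52:07", "2017-07-10 15:52:51",
    ]),
    "product name": ["apple", "apple", "banana", "banana"],
    "serial number": [54321, 12454, 123456, 123456],
})
fig = plot_products(sample, ["apple", "banana"])
```

In an interactive session, plt.show() after the call displays the figure. A product with no rows would make the plot call fail, so the list passed in should contain only names that occur in the data.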
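
Question 1 is not covered by the answer above. One way it might be approached in plain pandas, without loading the whole 5 GB file at once, is to read the CSV in chunks and accumulate counts per (product name, serial number) pair. The sketch assumes the CSV header uses the column names shown in the question; the function name top_product_serial_counts, the chunk size, and the inline sample data are hypothetical.

```python
import io
from collections import Counter

import pandas as pd

def top_product_serial_counts(csv_source, n_top=10, chunksize=1_000_000):
    """Count serial numbers for the n_top most frequent products.

    Reads the CSV in chunks so the full file never has to fit in memory.
    """
    pair_counts = Counter()
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        sizes = chunk.groupby(["product name", "serial number"]).size()
        pair_counts.update(sizes.to_dict())  # keys are (product, serial) tuples
    index = pd.MultiIndex.from_tuples(
        list(pair_counts.keys()), names=["product name", "serial number"])
    counts = pd.Series(list(pair_counts.values()), index=index)
    # Total rows per product, then keep only the n_top largest products
    per_product = counts.groupby(level="product name").sum()
    top = per_product.nlargest(n_top).index
    keep = counts.index.get_level_values("product name").isin(top)
    return counts[keep].sort_values(ascending=False)

# Made-up sample standing in for the real file
sample_csv = io.StringIO(
    "datetime,product name,serial number\n"
    "2017-06-24 14:30:15,orange,123456\n"
    "2017-07-04 21:33:50,apple,123456\n"
    "2017-07-06 06:38:52,orange,123456\n"
    "2017-07-11 00:53:43,apple,54321\n"
    "2017-07-11 06:23:52,apple,54321\n"
    "2017-07-10 15:52:07,banana,123456\n"
)
result = top_product_serial_counts(sample_csv, n_top=2, chunksize=2)
```

With the real file, a path would be passed in place of sample_csv. Only the running counts are held in memory, so this should scale with the number of distinct (product, serial) pairs rather than with the number of rows.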