Python: How long is a pandas groupby object remembered?


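I have the following sample Python 3.4 script. It does the following:

  • Creates a dataframe
  • Converts the date variable to datetime64 format
  • Creates a groupby object based on two categorical variables
  • Produces a dataframe containing the number of items in each group
  • Merges the count dataframe with the original dataframe to create a column containing the number of rows in each group
  • Creates a column containing the date difference between consecutive rows

Here is the script: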

    import numpy as np
    import pandas as pd
    
    # Create dataframe consisting of id, date and two categories (gender and age)
    tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                            'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                            'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                            'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
    
    # Convert date to datetime
    tempDF['date'] = pd.to_datetime(tempDF['date'])
    
    # Create groupby object based on two categorical variables
    tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
    
    # Count number in each group and merge with original dataframe to create 'count' column
    tempCountsDF = tempGroupby['id'].count().reset_index(drop=False)
    tempCountsDF = tempCountsDF.rename(columns={'id': 'count'})
    tempDF = tempDF.merge(tempCountsDF, on=['gender','age'])
    
    # Calculate difference between consecutive rows in each group. (First row in each
    # group should have date difference = NaT)
    tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
    tempDF['diff'] = tempGroupby['date'].diff()
    print(tempDF)
    
This script produces the following output:

          age                date  gender  id  count                diff
    0   young 2015-02-04 02:34:00    male   1      2                 NaT
    1   young 2015-10-05 08:52:00    male  10      2   243 days 06:18:00
    2     old 2015-06-04 12:34:00  female   2      3                 NaT
    3     old 2015-09-04 23:03:00  female   3      3    92 days 10:29:00
    4     old 2015-04-21 12:59:00  female   6      3 -137 days +13:56:00
    5     old 2015-12-04 01:00:00    male   4      6                 NaT
    6     old 2015-04-15 07:12:00    male   5      6 -233 days +06:12:00
    7     old 2015-06-05 11:12:00    male   9      6    51 days 04:00:00
    8     old 2015-05-19 19:22:00    male  12      6  -17 days +08:10:00
    9     old 2015-04-06 12:57:00    male  15      6  -44 days +17:35:00
    10    old 2015-06-15 03:23:00    male  17      6    69 days 14:26:00
    11  young 2015-12-05 14:19:00  female  11      4                 NaT
    12  young 2015-05-27 22:31:00  female  13      4 -192 days +08:12:00
    13  young 2015-01-06 11:09:00  female  14      4 -142 days +12:38:00
    14  young 2015-06-19 05:37:00  female  18      4   163 days 18:28:00
    
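This is exactly what I expected. However, it seems to rely on creating the groupby object twice (in exactly the same way). If the second groupby definition is commented out, the diff column appears to contain very different output: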

    import numpy as np
    import pandas as pd
    
    # Create dataframe consisting of id, date and two categories (gender and age)
    tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                            'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                            'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                            'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
    
    # Convert date to datetime
    tempDF['date'] = pd.to_datetime(tempDF['date'])
    
    # Create groupby object based on two categorical variables
    tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
    
    # Count number in each group and merge with original dataframe to create 'count' column
    tempCountsDF = tempGroupby['id'].count().reset_index(drop=False)
    tempCountsDF = tempCountsDF.rename(columns={'id': 'count'})
    tempDF = tempDF.merge(tempCountsDF, on=['gender','age'])
    
    # Calculate difference between consecutive rows in each group. (First row in each
    # group should have date difference = NaT)
    # ****** THIS TIME THE FOLLOWING GROUPBY DEFINITION IS COMMENTED OUT ***** 
    # tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
    tempDF['diff'] = tempGroupby['date'].diff()
    print(tempDF)
    
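This time, the output is very different (and not at all what I wanted).

(In my real script the results seem somewhat erratic: sometimes it works and sometimes it doesn't. In the script above, though, the different output appears consistently.)

Why is it necessary to recreate the groupby object on what is essentially the same dataframe (albeit with an extra column added) before using the .diff() function? This seems dangerous to me.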

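They are not the same: the index has changed. For example:

    tempDF.loc[1].id  # before
    
    10
    
    tempDF.loc[1].id  # after
    
    2

So if you compute tempGroupby from the old tempDF and then change the index in tempDF, when you do this:

    tempDF['diff'] = tempGroupby['date'].diff()

the index does not match what you expect. Column assignment aligns on index labels, so you are assigning to each row the difference that corresponds to the row with that index in the old tempDF.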
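The alignment effect described in this answer can be reproduced on a much smaller frame. The following is a minimal sketch, not the asker's script: the frame, the column names key and val, and the variable names stale and fresh are all made up for illustration, and a sort followed by reset_index stands in for any operation that hands back the same rows under new index labels.

```python
import pandas as pd

# Hypothetical toy data (not from the original script).
df = pd.DataFrame({"key": ["b", "a", "b", "a"], "val": [10, 20, 30, 40]})

# This groupby object keeps a reference to the frame as it is right now,
# so its results are labelled with the current index (0, 1, 2, 3).
grouped = df.groupby("key")

# Rebind df to a new frame: same rows, different order, fresh 0..3 index.
# mergesort is used only to make the row order deterministic.
df = df.sort_values("key", kind="mergesort").reset_index(drop=True)

# Diff from the stale groupby: labelled by the OLD index.
stale = grouped["val"].diff()

# Diff from a groupby built on the current frame: labelled by the NEW index.
fresh = df.groupby("key")["val"].diff()

# Assignment aligns on index labels, so the stale values land on the wrong rows.
df["stale_diff"] = stale
df["fresh_diff"] = fresh
print(df)
```

Only fresh_diff has a missing value on the first row of each group. stale_diff is the correct answer for the old row labels, pasted onto whichever rows now carry those labels.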

You have changed tempDF, and the second groupby is computed on the changed dataframe. I suspect that is why the two results differ.

Thanks, Goyo. That certainly pointed me in the right direction. When I used .merge(), I assumed the index would be preserved, but when joining columns on columns (as I did with on=[] in the merge function), the index is ignored (I just found this in the documentation). The index is only preserved when joining index on index or index on column.
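To make the point about merge concrete, here is a small sketch on made-up data (the names df, counts, merged, key and val are illustrative only). It first shows that a column-on-column merge returns a frame with a fresh integer index. It then shows one possible alternative that was not part of the original thread: groupby(...).transform("count"), which returns a result aligned to the existing index, so no merge is needed to attach per-group counts.

```python
import pandas as pd

# Hypothetical toy frame with a non-default index so the change is visible.
df = pd.DataFrame(
    {"key": ["b", "a", "b", "a", "b"], "val": [1, 2, 3, 4, 5]},
    index=[10, 11, 12, 13, 14],
)

# Per-group counts as a separate frame, as in the question's approach.
counts = (
    df.groupby("key")["val"]
    .count()
    .reset_index()
    .rename(columns={"val": "count"})
)

# Column-on-column merge: the original index labels are discarded.
merged = df.merge(counts, on="key")
print(merged.index.tolist())

# Alternative sketch: transform broadcasts the group count back onto each row
# and keeps the original index, so any earlier index-based results still align.
df["count"] = df.groupby("key")["val"].transform("count")
print(df)
```

With the transform approach the frame is never rebuilt, so a groupby object created earlier would still line up with it. Recreating the groupby after any reshaping step remains the safer habit.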