Python: How do I use groupby to cut a column into two bins and aggregate the data in each bin?


python pandas jupyter-notebook


This is my DataFrame:

session_id  question_difficulty     attempt_updated_at
5c822af21c1fba22            2   1557470128000
5c822af21c1fba22            3   1557469685000
5c822af21c1fba22            4   1557470079000
5c822af21c1fba22            5   1557472999000
5c822af21c1fba22            3   1557474145000
5c822af21c1fba22            3   1557474441000
5c822af21c1fba22            4   1557474299000
5c822af21c1fba22            4   1557474738000
5c822af21c1fba22            3   1557475430000
5c822af21c1fba22            4   1557476960000
5c822af21c1fba22            5   1557477458000
5c822af21c1fba22            2   1557478118000
5c822af21c1fba22            5   1557482556000
5c822af21c1fba22            4   1557482809000
5c822af21c1fba22            5   1557482886000
5c822af21c1fba22            5   1557484232000

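I want to cut the attempt_updated_at field (an epoch timestamp) into two equal bins and find the mean of question_difficulty within each bin, for each session.

I want to store the mean of the first bin and the mean of the second bin separately.

I tried to do this with pd.cut, but I don't know how to use it.

I would like my output to look like this, for example: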

session_id         mean1_difficulty       mean2_difficulty
5c822af21c1fba22            5.0                3.0

Any ideas are appreciated, thanks.

I believe you need to bin the timestamps with pd.qcut and then aggregate with mean:

df1 = (df.groupby(['session_id', pd.qcut(df['attempt_updated_at'], 2, labels=False)])
         ['question_difficulty'].mean()
         .unstack()
         .rename(columns=lambda x: f'mean{x+1}_difficulty'))
print(df1)
attempt_updated_at  mean1_difficulty  mean2_difficulty
session_id                                            
5c822af21c1fba22                 3.5             4.125
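To make the one-liner above easier to follow, here is a minimal self-contained sketch that rebuilds the sample rows from the question and runs the same pipeline one step at a time. The intermediate names time_bin and means are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# Sample rows copied from the question (one session, 16 attempts).
df = pd.DataFrame({
    "session_id": ["5c822af21c1fba22"] * 16,
    "question_difficulty": [2, 3, 4, 5, 3, 3, 4, 4, 3, 4, 5, 2, 5, 4, 5, 5],
    "attempt_updated_at": [
        1557470128000, 1557469685000, 1557470079000, 1557472999000,
        1557474145000, 1557474441000, 1557474299000, 1557474738000,
        1557475430000, 1557476960000, 1557477458000, 1557478118000,
        1557482556000, 1557482809000, 1557482886000, 1557484232000,
    ],
})

# Step 1: label each row 0 (earlier half) or 1 (later half).
# qcut places the single inner edge at the median timestamp.
time_bin = pd.qcut(df["attempt_updated_at"], 2, labels=False)

# Step 2: mean difficulty per (session, bin); the result is a Series
# indexed by session_id and the bin label.
means = df.groupby(["session_id", time_bin])["question_difficulty"].mean()

# Step 3: move the bin level into columns and give them readable names.
df1 = means.unstack().rename(columns=lambda x: f"mean{x + 1}_difficulty")
print(df1)
```

With labels=False the bins come back as the integers 0 and 1, which is what lets the rename step build the column names mean1_difficulty and mean2_difficulty.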

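Alternatively, pd.cut can be used in place of pd.qcut. The two functions choose bin edges differently: pd.cut splits the range between the minimum and maximum value into intervals of equal width, while pd.qcut places the edges at quantiles so that each bin holds roughly the same number of rows.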
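As a rough illustration of that difference, the sketch below applies both functions to the timestamps from the question and counts the rows that land in each bin. The variable names are only illustrative.

```python
import pandas as pd

# Timestamps copied from the question's sample data.
ts = pd.Series([
    1557470128000, 1557469685000, 1557470079000, 1557472999000,
    1557474145000, 1557474441000, 1557474299000, 1557474738000,
    1557475430000, 1557476960000, 1557477458000, 1557478118000,
    1557482556000, 1557482809000, 1557482886000, 1557484232000,
], name="attempt_updated_at")

# qcut: the inner edge sits at the median, so both bins get the same row count.
by_quantile = pd.qcut(ts, 2, labels=False)

# cut: the inner edge sits halfway between min and max, so row counts can differ.
by_width = pd.cut(ts, 2, labels=False)

qcut_counts = by_quantile.value_counts().sort_index().tolist()
cut_counts = by_width.value_counts().sort_index().tolist()
print("qcut rows per bin:", qcut_counts)
print("cut rows per bin: ", cut_counts)
```

On these 16 timestamps pd.qcut puts 8 rows in each bin, while pd.cut puts 9 rows in the first bin and 7 in the second, so the per-bin means differ between the two approaches.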

I think this should do it:

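import pandas as pd

# Sort by epoch time, earliest first, and reset the index so that
# positional slicing below follows the time order.
pdf = pdf.sort_values('attempt_updated_at', ascending=True).reset_index(drop=True)

# Split the sorted rows into an earlier half and a later half.
first = pdf.iloc[:pdf.shape[0] // 2]
second = pdf.iloc[pdf.shape[0] // 2:]

# Mean difficulty per session in each half, joined side by side.
res = pd.DataFrame(first.groupby('session_id')['question_difficulty'].agg('mean')) \
    .rename(columns={'question_difficulty': 'mean1_difficulty'}) \
    .join(second.groupby('session_id')['question_difficulty'].agg('mean')) \
    .rename(columns={'question_difficulty': 'mean2_difficulty'})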
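Both answers bin the timestamps across the whole DataFrame, which matches the single-session sample. If the data holds several sessions and each session should be split at its own median time, one possible variation is to compute the bin labels inside a groupby. This is only a sketch under that assumption, and the two-session data below is made up for illustration.

```python
import pandas as pd

# Hypothetical data: session "b" starts long after session "a" has ended.
df = pd.DataFrame({
    "session_id": ["a"] * 4 + ["b"] * 4,
    "question_difficulty": [1, 2, 3, 4, 5, 5, 1, 1],
    "attempt_updated_at": [10, 20, 30, 40, 1000, 1010, 1020, 1030],
})

# Compute the 0/1 bin label within each session instead of globally.
time_bin = (df.groupby("session_id")["attempt_updated_at"]
              .transform(lambda s: pd.qcut(s, 2, labels=False)))

# Same aggregation as before, now using the per-session labels.
res = (df.groupby(["session_id", time_bin])["question_difficulty"].mean()
         .unstack()
         .rename(columns=lambda x: f"mean{x + 1}_difficulty"))
print(res)
```

With a single global pd.qcut, every row of session "a" would fall into the first bin and every row of session "b" into the second, leaving one of the two means missing for each session.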

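Thank you, this works. With the pd.cut approach, is there a way to filter out the rows where mean1_difficulty or mean2_difficulty is 0?

@RedDragon - What kind of filtering do you have in mind?

@RedDragon - Use:

df = df[pd.cut(df['attempt_updated_at'], 2, labels=False) == 0]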
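The filter suggested in this comment compares the bin labels with 0 to build a boolean mask, then keeps only the rows of the first bin. A minimal sketch on the question's sample data follows; the name first_bin is only illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "session_id": ["5c822af21c1fba22"] * 16,
    "question_difficulty": [2, 3, 4, 5, 3, 3, 4, 4, 3, 4, 5, 2, 5, 4, 5, 5],
    "attempt_updated_at": [
        1557470128000, 1557469685000, 1557470079000, 1557472999000,
        1557474145000, 1557474441000, 1557474299000, 1557474738000,
        1557475430000, 1557476960000, 1557477458000, 1557478118000,
        1557482556000, 1557482809000, 1557482886000, 1557484232000,
    ],
})

# True for rows whose timestamp falls in the first of two equal-width bins.
in_first_bin = pd.cut(df["attempt_updated_at"], 2, labels=False) == 0

# Keep only those rows, then aggregate as usual.
first_bin = df[in_first_bin]
first_mean = first_bin.groupby("session_id")["question_difficulty"].mean()
print(len(first_bin), "rows kept")
print(first_mean)
```

Comparing with 1 instead of 0 would keep the rows of the second bin.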

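first = pdf.iloc[:pdf.shape[0] // 2] and second = pdf.iloc[pdf.shape[0] // 2:] split the DataFrame by row count, but I want to split it based on the epoch time.

I misunderstood; I only followed the requirement to "cut it into two equal bins". Either way, sorting the DataFrame by the epoch column is still the way to go. I have edited the code so that it cuts the data by epoch time value (ascending=True is only there to make sure the index is reset in the right order; you can remove it).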
