Python: creating a new column (pandas, cumsum)


I'm facing the following challenge. I have a DataFrame named defined_conversions:

user_id    pageviews    conversion    timestamp
1          3            True          08:01:12
1          4            False         07:02:14
1          7            False         08:02:14
2          2            True          10:12:15
2          2            False         05:12:18

What I want to achieve is to add an additional column named sum_pageviews that holds the cumulative sum of pageviews for each user.

I built this function to do that:

def pageviews_per_user(defined_conversions):
    defined_conversions['sum_pageviews'] = defined_conversions.groupby(['user_id'])['pageviews'].cumsum
    return defined_conversions

My concern is that the DataFrame will end up looking like this:

   user_id    pageviews    conversion    timestamp    sum_pageviews
    1          3            True          08:01:12    14
    1          4            False         07:02:14    14
    1          7            False         08:02:14    14
    2          2            True          10:12:15    4
    2          2            False         05:12:18    4

I would like it to look like this:

  user_id    pageviews    conversion    timestamp    sum_pageviews
    1          3            True          08:01:12    3
    1          4            False         07:02:14    7
    1          7            False         08:02:14    14
    2          2            True          10:12:15    2
    2          2            False         05:12:18    4

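So, essentially, the pageviews should accumulate in timestamp order. Should I sort the data by timestamp before running cumsum? Or should I do something else?

PS: I am a beginner with Python/pandas.

Thanks in advance!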

You're close: you just need to call `cumsum()`, parentheses included.
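
To see why the line in the question misbehaves: without the parentheses, `.cumsum` is only a reference to the method, so nothing is computed. A minimal sketch, with the DataFrame rebuilt from the sample rows in the question (the names `grouped`, `method_only` and `running_total` are illustrative only):

```python
import pandas as pd

defined_conversions = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "pageviews": [3, 4, 7, 2, 2],
    "conversion": [True, False, False, True, False],
    "timestamp": ["08:01:12", "07:02:14", "08:02:14", "10:12:15", "05:12:18"],
})

grouped = defined_conversions.groupby("user_id")["pageviews"]

# Without parentheses this is just the bound method, not a result.
method_only = grouped.cumsum

# With parentheses the running total is computed per user, in row order.
running_total = grouped.cumsum()
print(running_total.tolist())  # [3, 7, 14, 2, 4]
```

No sorting happens here, so the totals follow the current row order rather than the timestamps.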

As a function:

def pageviews_per_user(df, by='user_id', aggcol='pageviews', **kwargs):
    df.sort_values([by, 'timestamp'], inplace=True)
    df['sum_pageviews'] = df.groupby(by=by, sort=False, **kwargs)[aggcol].cumsum()
    return df

Note that this not only returns the DataFrame but also modifies it in place.

Here's how you would use the function:

>>> df
   user_id  pageviews  conversion timestamp
0        1          3        True  08:01:12
1        1          4       False  07:02:14
2        1          7       False  08:02:14
3        2          2        True  10:12:15
4        2          2       False  05:12:18
>>> def pageviews_per_user(df, by='user_id', aggcol='pageviews', **kwargs):
...     df.sort_values([by, 'timestamp'], inplace=True)
...     df['sum_pageviews'] = df.groupby(by=by, **kwargs)[aggcol].cumsum()
...     return df
... 
>>> pageviews_per_user(df)
   user_id  pageviews  conversion timestamp  sum_pageviews
1        1          4       False  07:02:14              4
0        1          3        True  08:01:12              7
2        1          7       False  08:02:14             14
4        2          2       False  05:12:18              2
3        2          2        True  10:12:15              4
>>> df
   user_id  pageviews  conversion timestamp  sum_pageviews
1        1          4       False  07:02:14              4
0        1          3        True  08:01:12              7
2        1          7       False  08:02:14             14
4        2          2       False  05:12:18              2
3        2          2        True  10:12:15              4
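
If the in-place modification is not wanted, one possible variant leaves the caller's DataFrame alone and returns a new one. This is only a sketch: the name `pageviews_per_user_copy` is hypothetical, and the sample frame is rebuilt from the rows in the question.

```python
import pandas as pd

def pageviews_per_user_copy(df, by="user_id", aggcol="pageviews", **kwargs):
    # sort_values without inplace=True returns a new, sorted DataFrame.
    out = df.sort_values([by, "timestamp"])
    # assign() returns another new DataFrame with the extra column added.
    return out.assign(
        sum_pageviews=out.groupby(by=by, sort=False, **kwargs)[aggcol].cumsum()
    )

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "pageviews": [3, 4, 7, 2, 2],
    "conversion": [True, False, False, True, False],
    "timestamp": ["08:01:12", "07:02:14", "08:02:14", "10:12:15", "05:12:18"],
})

result = pageviews_per_user_copy(df)
print(result["sum_pageviews"].tolist())  # [4, 7, 14, 2, 4]
print("sum_pageviews" in df.columns)     # False: the original is unchanged
```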

Although `timestamp` is not a datetime column (as far as Pandas is concerned, it is just strings), it can still be sorted lexicographically.
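
Lexicographic order matches chronological order here only because every value is zero-padded `HH:MM:SS`. A quick sketch of where that breaks, and one way to sort on a parsed duration instead using `pd.to_timedelta` (the sample values below are made up for illustration):

```python
import pandas as pd

times = pd.Series(["9:05:00", "10:12:15", "08:01:12"])

# As plain strings, "9:05:00" sorts last because "9" > "1" > "0".
string_order = times.sort_values().tolist()
print(string_order)  # ['08:01:12', '10:12:15', '9:05:00']

# Parsed as durations since midnight, the order is chronological.
as_delta = pd.to_timedelta(times)
time_order = times.loc[as_delta.sort_values().index].tolist()
print(time_order)  # ['08:01:12', '9:05:00', '10:12:15']
```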

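If you want to group on other column names, using `by`, `aggcol`, and `**kwargs` makes your function more general. If not, you could also hard-code these into the function body, as in your question.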

`**kwargs` lets you pass any other keyword arguments through to `groupby()`.
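
As a sketch of that forwarding, the variant of the function that does not hard-code `sort=False` can receive it from the caller instead. The sample frame is rebuilt from the rows in the question. Passing a keyword that the function body already sets would raise a `TypeError`, which is ordinary Python behaviour rather than anything specific to pandas.

```python
import pandas as pd

def pageviews_per_user(df, by="user_id", aggcol="pageviews", **kwargs):
    df.sort_values([by, "timestamp"], inplace=True)
    # Extra keyword arguments are handed to groupby() unchanged.
    df["sum_pageviews"] = df.groupby(by=by, **kwargs)[aggcol].cumsum()
    return df

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "pageviews": [3, 4, 7, 2, 2],
    "conversion": [True, False, False, True, False],
    "timestamp": ["08:01:12", "07:02:14", "08:02:14", "10:12:15", "05:12:18"],
})

# sort=False reaches groupby(); cumsum() results do not depend on group order.
result = pageviews_per_user(df, sort=False)
print(result["sum_pageviews"].tolist())  # [4, 7, 14, 2, 4]
```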

That's a nice one, Brad. +1

Thanks for your answer, Brad. The code works, but it looks quite advanced; given my Python level, I'd prefer to start with a somewhat simpler solution. Could I tweak my own code slightly to make it work? Essentially: the timestamps should be sorted from earliest to latest, and a user can view a page every minute. Only the latest timestamp of each user gets the value conversion=True, and after that there are no further sessions (timestamps) for that user. So I'd like the pageviews to be summed cumulatively in timestamp order.

@julien1337 I've added more detail and explanation. Beyond that, good luck!

Thanks a lot, Brad!
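
For the simpler version asked about in the comments, a minimal sketch that stays close to the function in the question: sort first, then call `cumsum()` with parentheses. The column names are hard-coded as in the question, and the sample frame is rebuilt from its rows. Unlike the answer's function, this one rebinds the local name to a sorted copy, so the caller's DataFrame is left as it was.

```python
import pandas as pd

def pageviews_per_user(defined_conversions):
    # Order rows per user from earliest to latest timestamp.
    defined_conversions = defined_conversions.sort_values(["user_id", "timestamp"])
    # Note the parentheses: cumsum() is called, not just referenced.
    defined_conversions["sum_pageviews"] = (
        defined_conversions.groupby("user_id")["pageviews"].cumsum()
    )
    return defined_conversions

defined_conversions = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "pageviews": [3, 4, 7, 2, 2],
    "conversion": [True, False, False, True, False],
    "timestamp": ["08:01:12", "07:02:14", "08:02:14", "10:12:15", "05:12:18"],
})

result = pageviews_per_user(defined_conversions)
print(result["sum_pageviews"].tolist())  # [4, 7, 14, 2, 4]
```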