Python: In a chronologically ordered page-view dataframe, count the previous page visited before a given page

python pandas

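I have created the following dataframe, which lists the pages each user visited in ascending order of visit date. There are 5 pages in total: BLQ2_1 to BLQ2_5.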

user_id  created_at  PAGE  
72672    2017-02-20  BLQ2_1
72672    2017-03-03  BLQ2_5
72672    2017-03-03  BLQ2_3
72672    2017-03-05  BLQ2_4
12370    2017-03-06  BLQ2_4
12370    2017-03-06  BLQ2_5
12370    2017-03-06  BLQ2_3
94822    2017-03-06  BLQ2_2
94822    2017-03-10  BLQ2_4
94822    2017-03-10  BLQ2_5
94822    2017-02-24  BLQ2_4

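For each page, I want statistics on the page visited immediately before it, taken across all users. That is, I need to compute per-page statistics such as:

Paths to BLQ2_5: 2 times from BLQ2_4, 1 time from BLQ2_1

Paths to BLQ2_3: 2 times from BLQ2_5, 1 time from BLQ2_4

Paths to BLQ2_4: 1 time from BLQ2_5, 1 time from BLQ2_3, 1 time from BLQ2_2, 1 time from nowhere (first page visited)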

Do I have to use a loop for this, or is there a way to take advantage of pandas' groupby functionality? Any suggestions?

Here is my solution using a for loop:

pg_BLQ2_5 = pd.DataFrame()
pg_BLQ2_4 = pd.DataFrame()
pg_BLQ2_3 = pd.DataFrame()
pg_BLQ2_2 = pd.DataFrame()
pg_BLQ2_1 = pd.DataFrame()
first_pages = pd.DataFrame()

for user_id in df['user_id'].unique():
    #get only current user's records, and reset index
    _pg = df[df['user_id'] == user_id].reset_index()
    _pg.drop('index', axis=1, inplace=True)
    
    #if this is the first page visited, treat differently
    first_page = _pg.iloc[0]
    first_pages = first_pages.append(first_page)

    #exclude the first page visited from the dataframe
    _pg = _pg.loc[1:].reset_index()
    _pg.drop('index', axis=1, inplace=True)

    #for each page, get the record from its previous index, and build the dataframe.
    pg_BLQ2_5 = pg_BLQ2_5.append(_pg.iloc[_pg[_pg['PAGE'] == 'BLQ2_5'].index -1])
    pg_BLQ2_4 = pg_BLQ2_4.append(_pg.iloc[_pg[_pg['PAGE'] == 'BLQ2_4'].index -1])
    pg_BLQ2_3 = pg_BLQ2_3.append(_pg.iloc[_pg[_pg['PAGE'] == 'BLQ2_3'].index -1])
    pg_BLQ2_2 = pg_BLQ2_2.append(_pg.iloc[_pg[_pg['PAGE'] == 'BLQ2_2'].index -1])
    pg_BLQ2_1 = pg_BLQ2_1.append(_pg.iloc[_pg[_pg['PAGE'] == 'BLQ2_1'].index -1])

First create a column that holds the previous page (assuming the dataframe is sorted by user, then by date):
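The snippet for this step is not preserved in this copy of the answer. One way to build such a column, offered as a sketch and not necessarily the answerer's exact code, is to shift PAGE down by one row within each user_id group. The column name prev is taken from the groupby call in the answer; the dataframe below is simply the sample data from the question.

```python
import pandas as pd

# Sample data from the question, in the row order given there.
df = pd.DataFrame({
    "user_id": [72672] * 4 + [12370] * 3 + [94822] * 4,
    "created_at": [
        "2017-02-20", "2017-03-03", "2017-03-03", "2017-03-05",
        "2017-03-06", "2017-03-06", "2017-03-06",
        "2017-03-06", "2017-03-10", "2017-03-10", "2017-02-24",
    ],
    "PAGE": [
        "BLQ2_1", "BLQ2_5", "BLQ2_3", "BLQ2_4",
        "BLQ2_4", "BLQ2_5", "BLQ2_3",
        "BLQ2_2", "BLQ2_4", "BLQ2_5", "BLQ2_4",
    ],
})

# Assumed approach: within each user, the previous page is the PAGE value
# of the preceding row. A user's first row has no predecessor and gets NaN.
df["prev"] = df.groupby("user_id")["PAGE"].shift()
```

Because the shift happens per user, the last page of one user never leaks into the first row of the next user.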

Then simply groupby and count the values:

df.groupby('PAGE')['prev'].value_counts()

PAGE    prev  
BLQ2_3  BLQ2_5    2
BLQ2_4  BLQ2_2    1
        BLQ2_3    1
        BLQ2_5    1
BLQ2_5  BLQ2_4    2
        BLQ2_1    1

You can also use unstack, for example, to reshape the result.
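As an illustration of that reshaping, the sketch below rebuilds the page sequence from the question, derives prev with a per-user shift (an assumption, since the answer's own snippet for that step is not shown here), and unstacks the counts into a table with one row per PAGE and one column per previous page. The fill_value=0 argument is a choice made for this example so that transitions that never occur show as 0.

```python
import pandas as pd

# Page sequence from the question's sample data (dates omitted for brevity).
df = pd.DataFrame({
    "user_id": [72672] * 4 + [12370] * 3 + [94822] * 4,
    "PAGE": [
        "BLQ2_1", "BLQ2_5", "BLQ2_3", "BLQ2_4",
        "BLQ2_4", "BLQ2_5", "BLQ2_3",
        "BLQ2_2", "BLQ2_4", "BLQ2_5", "BLQ2_4",
    ],
})

# Assumed way of building the prev column: shift within each user.
df["prev"] = df.groupby("user_id")["PAGE"].shift()

# Count previous pages per page; rows whose prev is NaN are not counted.
counts = df.groupby("PAGE")["prev"].value_counts()

# Move the prev level into columns: rows are PAGE, columns are prev.
table = counts.unstack(fill_value=0)
print(table)
```

Pages that only ever appear as a user's first visit (BLQ2_1 and BLQ2_2 here) have no row in the table, because they have no recorded previous page.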

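Thank you very much for your answer. In the OP above, I shared my approach using a for loop. However, for some reason the results I get from it are completely different from the ones I get with your solution. I can't see anything wrong with either solution, but I can't resolve the discrepancy. Do you have any idea? With your code, BLQ2_2 -> BLQ2_1 = 151; my code gives a different count.

What about the small example you posted? I get different results from what you proposed, but I think you did not show the complete dataframe. Do you also get different results on that one?

Yes sir, I tried computing it with the sample data I shared. I guess there is something wrong with my code.

I found the problem in your code. You should not remove the first page from _pg. Suppose BLQ2_4 is the second page (now the first one after the removal). Then index is 0, index-1 is -1, and that selects... the last page!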
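The effect described in that last comment can be reproduced on the pages of user 94822 from the sample data. This is a minimal sketch of the failure mode, with variable names (pg, rest, pos) chosen for illustration; it mirrors the loop's logic but is not the asker's exact code.

```python
import pandas as pd

# Pages of user 94822 in the order given in the question.
pg = pd.DataFrame({"PAGE": ["BLQ2_2", "BLQ2_4", "BLQ2_5", "BLQ2_4"]})

# As in the loop: drop the first page, then renumber the rows from 0.
rest = pg.loc[1:].reset_index(drop=True)

# Positions of BLQ2_4 in the shortened frame are 0 and 2,
# so "index - 1" becomes -1 and 1.
pos = rest[rest["PAGE"] == "BLQ2_4"].index - 1

# iloc treats -1 as "last row", so the first lookup silently returns the
# user's final page instead of the dropped first page.
picked = rest.iloc[pos]["PAGE"].tolist()
print(picked)   # ['BLQ2_4', 'BLQ2_5']

# Shifting the full sequence gives the actual previous pages.
actual = pg["PAGE"].shift()[pg["PAGE"] == "BLQ2_4"].tolist()
print(actual)   # ['BLQ2_2', 'BLQ2_5']
```

No error is raised in the buggy version, which is why the two approaches disagree without any visible failure.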
