Python - Counting consecutive frequencies by group


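I have a series of emails sorted by timestamp and user id.

I want to find out how often email i is followed by email j. I will show these frequencies in a heatmap to reveal the most common paths users take.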

a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject2
2016-07-01 02:01:02,a@gmail.com,welcome
2016-07-01 14:45:04,a@gmail.com,subject3
2016-07-01 08:14:02,a@gmail.com,subject1
2016-07-01 16:26:35,a@gmail.com,subject4
2016-07-01 10:17:00,b@gmail.com,subject1
2016-07-01 02:01:02,b@gmail.com,welcome
2016-07-01 14:45:04,b@gmail.com,subject3
2016-07-01 08:14:02,b@gmail.com,subject2
2016-07-01 16:26:35,b@gmail.com,subject4
2016-07-01 18:00:00,c@gmail.com,welcome
2016-07-01 19:00:02,c@gmail.com,subject1
2016-07-01 20:00:04,c@gmail.com,subject3
2016-07-01 21:14:02,c@gmail.com,subject4
2016-07-01 21:26:35,c@gmail.com,subject2
"""

import pandas as pd
from io import StringIO  # use the standard-library StringIO
df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
df1 = df1.sort_values(['email', 'timestamp'])
Sorted df1:

        timestamp        email   subject
 1  2016-07-01 02:01:02  a@gmail.com   welcome
 3  2016-07-01 08:14:02  a@gmail.com  subject1
 0  2016-07-01 10:17:00  a@gmail.com  subject2
 2  2016-07-01 14:45:04  a@gmail.com  subject3
 4  2016-07-01 16:26:35  a@gmail.com  subject4
 6  2016-07-01 02:01:02  b@gmail.com   welcome
 8  2016-07-01 08:14:02  b@gmail.com  subject2
 5  2016-07-01 10:17:00  b@gmail.com  subject1
 7  2016-07-01 14:45:04  b@gmail.com  subject3
 9  2016-07-01 16:26:35  b@gmail.com  subject4
 10 2016-07-01 18:00:00  c@gmail.com   welcome
 11 2016-07-01 19:00:02  c@gmail.com  subject1
 12 2016-07-01 20:00:04  c@gmail.com  subject3
 13 2016-07-01 21:14:02  c@gmail.com  subject4
 14 2016-07-01 21:26:35  c@gmail.com  subject2
The output should look like this:

          welcome   subject1    subject2    subject3    subject4
welcome      0              
subject1     2         0                    
subject2     1         1          0     
subject3     0         2          1           0 
subject4     0         0          0           3             0
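In other words, subject1 came right after the welcome email twice, subject2 came right after the welcome email once, and so on.

What is the best way to do this?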

A two-liner (which could be squeezed into a one-liner):
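
The two lines themselves are not present in this copy of the answer. The following is only a sketch of one way to build a `res` with the same layout as the table printed further down (columns named `subject`, rows named `next_subject`). It assumes a per-user `shift(-1)` followed by `pd.crosstab`; the column name `next_subject` is taken from that printed output, and the approach is an assumption, not necessarily what the answerer wrote.

```python
from io import StringIO

import pandas as pd

a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject2
2016-07-01 02:01:02,a@gmail.com,welcome
2016-07-01 14:45:04,a@gmail.com,subject3
2016-07-01 08:14:02,a@gmail.com,subject1
2016-07-01 16:26:35,a@gmail.com,subject4
2016-07-01 10:17:00,b@gmail.com,subject1
2016-07-01 02:01:02,b@gmail.com,welcome
2016-07-01 14:45:04,b@gmail.com,subject3
2016-07-01 08:14:02,b@gmail.com,subject2
2016-07-01 16:26:35,b@gmail.com,subject4
2016-07-01 18:00:00,c@gmail.com,welcome
2016-07-01 19:00:02,c@gmail.com,subject1
2016-07-01 20:00:04,c@gmail.com,subject3
2016-07-01 21:14:02,c@gmail.com,subject4
2016-07-01 21:26:35,c@gmail.com,subject2
"""

df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
df1 = df1.sort_values(['email', 'timestamp'])

# Line 1 (assumed): within each user, look one row ahead to get the subject
# that follows; the last email of every user gets NaN.
df1['next_subject'] = df1.groupby('email')['subject'].shift(-1)

# Line 2 (assumed): count (next_subject, subject) pairs. Rows whose
# next_subject is NaN are left out of the counts.
res = pd.crosstab(df1['next_subject'], df1['subject'])
print(res)
```

Because no email is ever followed by `welcome`, this sketch produces no `welcome` row; that row only appears after the table is aligned to the full subject list.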

You can massage it a little so that it matches what the OP quoted exactly:

subjects = ['welcome'] + ['subject{}'.format(i) for i in range(1, 5)]
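# reindex also adds labels absent from res (as NaN, then filled with 0);
# .loc with a missing label raises KeyError on pandas >= 1.0
res = res.reindex(index=subjects, columns=subjects).fillna(0).astype(int)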
print(res)

# subject       welcome  subject1  subject2  subject3  subject4
# next_subject                                                 
# welcome             0         0         0         0         0
# subject1            2         0         1         0         0
# subject2            1         1         0         0         1
# subject3            0         2         1         0         0
# subject4            0         0         0         3         0
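
To see where each count in the table comes from, the transitions can also be tallied by hand. This is a small cross-check sketch in plain Python, not part of the original answer; the `paths` dictionary is a hypothetical helper that simply copies each user's subjects in the order shown in the sorted df1.

```python
from collections import Counter

# Each user's subjects in timestamp order, copied from the sorted df1.
paths = {
    'a@gmail.com': ['welcome', 'subject1', 'subject2', 'subject3', 'subject4'],
    'b@gmail.com': ['welcome', 'subject2', 'subject1', 'subject3', 'subject4'],
    'c@gmail.com': ['welcome', 'subject1', 'subject3', 'subject4', 'subject2'],
}

counts = Counter()
for subjects_in_order in paths.values():
    # Pair every email with the one that follows it for the same user.
    counts.update(zip(subjects_in_order, subjects_in_order[1:]))

for (subject, next_subject), n in sorted(counts.items()):
    print(f'{subject} -> {next_subject}: {n}')
```

Each `(subject, next_subject)` pair here corresponds to the cell in column `subject` and row `next_subject` of the table; for example, the single `subject4 -> subject2` transition comes from c@gmail.com.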

Can you explain your output? It looks like subject4 and welcome should have a 1.
I edited the table. Hopefully it is clearer now.