Python - Calculating consecutive frequencies by group
Tags: python, pandas, sequence, frequency, itertools
I have a series of emails sorted by timestamp and user id. I want to find out how often email i is followed by email j. I will then show these frequencies in a heatmap to display the most common paths users take.
a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject2
2016-07-01 02:01:02,a@gmail.com,welcome
2016-07-01 14:45:04,a@gmail.com,subject3
2016-07-01 08:14:02,a@gmail.com,subject1
2016-07-01 16:26:35,a@gmail.com,subject4
2016-07-01 10:17:00,b@gmail.com,subject1
2016-07-01 02:01:02,b@gmail.com,welcome
2016-07-01 14:45:04,b@gmail.com,subject3
2016-07-01 08:14:02,b@gmail.com,subject2
2016-07-01 16:26:35,b@gmail.com,subject4
2016-07-01 18:00:00,c@gmail.com,welcome
2016-07-01 19:00:02,c@gmail.com,subject1
2016-07-01 20:00:04,c@gmail.com,subject3
2016-07-01 21:14:02,c@gmail.com,subject4
2016-07-01 21:26:35,c@gmail.com,subject2
"""
from io import StringIO  # pandas.io.parsers no longer exposes StringIO
import pandas as pd
df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
# Order each user's emails chronologically
df1 = df1.sort_values(['email', 'timestamp'])
The sorted df1:
timestamp email subject
1 2016-07-01 02:01:02 a@gmail.com welcome
3 2016-07-01 08:14:02 a@gmail.com subject1
0 2016-07-01 10:17:00 a@gmail.com subject2
2 2016-07-01 14:45:04 a@gmail.com subject3
4 2016-07-01 16:26:35 a@gmail.com subject4
6 2016-07-01 02:01:02 b@gmail.com welcome
8 2016-07-01 08:14:02 b@gmail.com subject2
5 2016-07-01 10:17:00 b@gmail.com subject1
7 2016-07-01 14:45:04 b@gmail.com subject3
9 2016-07-01 16:26:35 b@gmail.com subject4
10 2016-07-01 18:00:00 c@gmail.com welcome
11 2016-07-01 19:00:02 c@gmail.com subject1
12 2016-07-01 20:00:04 c@gmail.com subject3
13 2016-07-01 21:14:02 c@gmail.com subject4
14 2016-07-01 21:26:35 c@gmail.com subject2
The output should look like this:
welcome subject1 subject2 subject3 subject4
welcome 0
subject1 2 0
subject2 1 1 0
subject3 0 2 1 0
subject4 0 0 0 3 0
In other words, subject1 came directly after the welcome email twice, subject2 came directly after the welcome email once, and so on.
What is the best way to do this?
A two-liner (which can be compressed into a one-liner):
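The two-liner itself is not reproduced here. What follows is a minimal sketch of one approach that yields a `res` with the same axis names (`next_subject` as rows, `subject` as columns) as the printed result; it is a guess at the idea, not necessarily the answerer's exact code. The `sequences` dict is a hypothetical stand-in for the sorted `df1`, listing each user's subjects in timestamp order.

```python
import pandas as pd

# Hypothetical stand-in for the sorted df1: each user's subjects in time order
sequences = {
    "a@gmail.com": ["welcome", "subject1", "subject2", "subject3", "subject4"],
    "b@gmail.com": ["welcome", "subject2", "subject1", "subject3", "subject4"],
    "c@gmail.com": ["welcome", "subject1", "subject3", "subject4", "subject2"],
}
df = pd.DataFrame(
    [(email, subj) for email, subjects in sequences.items() for subj in subjects],
    columns=["email", "subject"],
)

# Line 1: within each user, pair every email with the one that follows it
# (the last email of each user gets NaN and drops out of the count)
df["next_subject"] = df.groupby("email")["subject"].shift(-1)

# Line 2: count each (next_subject, subject) pair and pivot into a matrix
res = df.groupby(["next_subject", "subject"]).size().unstack()
print(res)
```

Pairs that never occur show up as NaN, and `welcome` is absent from the rows because it never follows another email, which is why the result still needs some tidying afterwards.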
You can massage it a little so that it matches what the OP showed exactly:
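subjects = ['welcome'] + ['subject{}'.format(i) for i in range(1, 5)]
# reindex (unlike .loc in recent pandas) tolerates labels missing from res,
# e.g. 'welcome' never occurs as a next_subject; missing cells become NaN, then 0
res = res.reindex(index=subjects, columns=subjects).fillna(0).astype(int)
print(res)
# subject welcome subject1 subject2 subject3 subject4
# next_subject
# welcome 0 0 0 0 0
# subject1 2 0 1 0 0
# subject2 1 1 0 0 1
# subject3 0 2 1 0 0
# subject4 0 0 0 3 0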
Can you explain your output? It looks like subject4 and welcome should have a 1.
I edited the table. Hopefully it is clearer now.