Python: How to get value counts based on datetime
I have written the following code, which creates two dataframes, nq and cmnt. nq contains UserId and date, the time at which the user earned a badge. cmnt contains OwnerUserId and CreationDate, the time at which the user posted a comment.
I want to count the comments made on each day from one week before to one week after the badge was earned, so that I can draw a time-series line plot from the counts. The code below attempts this but raises a KeyError. Please provide code that does this for all users.
nq
UserId | date
1 2009-10-17 17:38:32.590
2 2009-10-19 00:37:23.067
3 2009-10-20 08:37:14.143
4 2009-10-21 18:07:51.247
5 2009-10-22 21:25:24.483
cmnt
OwnerUserId | CreationDate
1 2009-10-16 17:38:32.590
1 2009-10-18 17:38:32.590
2 2009-10-18 00:37:23.067
2 2009-10-17 00:37:23.067
2 2009-10-20 00:37:23.067
3 2009-10-19 08:37:14.143
4 2009-10-20 18:07:51.247
5 2009-10-21 21:25:24.483
Code
import pandas as pd
from datetime import timedelta

nq.date = pd.to_datetime(nq.date)
cmnt.CreationDate = pd.to_datetime(cmnt.CreationDate)
count = []
for j in range(len(nq)):
    for i in range(-7, 8):
        check_date = nq.date.iloc[j] + timedelta(days=i)
        count = cmnt.loc[(cmnt.OwnerUserId == nq.UserId.iloc[j]) & (cmnt.CreationDate == check_date)].shape[0]
        # the KeyError is raised here: nq[i] looks up a column named i
        nq.iloc[j].append({nq[i]:count})
Expected output
UserId | date |-7|-6|-5|-4|-3|-2|-1|0 |1 |2 |3 |4 |5 |6 |7
1 2009-10-17 17:38:32.590 |0 |0 |0 |0 |0 |0 |1 |0 |1 |0 |0 |0 |0 |0 |0
2 2009-10-19 00:37:23.067 |0 |0 |0 |0 |0 |1 |1 |0 |1 |0 |0 |0 |0 |0 |0
3 2009-10-20 08:37:14.143 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0
4 2009-10-21 18:07:51.247 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0
5 2009-10-22 21:25:24.483 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0
Here, the -1 column counts the comments made one day before the badge was earned, the 1 column counts the comments made one day after, and so on.
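Note
There may be a completely different way to do this. My main goal is to draw a time-series line plot showing the number of comments users made before and after earning a badge.
You probably need to merge the two dataframes, filter, and then build a crosstab: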
import pandas as pd

# merge the two dataframes on the user id
merged = nq.merge(cmnt, left_on='UserId', right_on='OwnerUserId', how='left')

# day offset of each comment relative to the badge date
# (negative = comment made before the badge, as in the question)
merged['date_diff'] = merged['CreationDate'].dt.normalize() - merged['date'].dt.normalize()
merged['date_diff'] = (merged['date_diff'] / pd.to_timedelta('1D')).astype(int)

# keep only the comments within one week of the badge
merged = merged[merged['date_diff'].between(-7, 7)]

# crosstab: one row per user, one column per day offset
pd.crosstab([merged['UserId'], merged['date']], merged['date_diff'])
Output:
date_diff                       -2  -1   1
UserId date
1      2009-10-17 17:38:32.590   0   1   1
2      2009-10-19 00:37:23.067   1   1   1
3      2009-10-20 08:37:14.143   0   1   0
4      2009-10-21 18:07:51.247   0   1   0
5      2009-10-22 21:25:24.483   0   1   0
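The crosstab only contains columns for offsets that actually occur in the data, whereas the expected output in the question has one column for every day from -7 to 7. Below is a minimal sketch of one way to fill in the missing days. It rebuilds the sample data from the question and uses the question's sign convention (negative means before the badge). The names `counts` and `per_day` are illustrative, and the inner merge is an assumption made here to keep the example free of missing values.

```python
import pandas as pd

nq = pd.DataFrame({
    "UserId": [1, 2, 3, 4, 5],
    "date": pd.to_datetime([
        "2009-10-17 17:38:32.590", "2009-10-19 00:37:23.067",
        "2009-10-20 08:37:14.143", "2009-10-21 18:07:51.247",
        "2009-10-22 21:25:24.483",
    ]),
})
cmnt = pd.DataFrame({
    "OwnerUserId": [1, 1, 2, 2, 2, 3, 4, 5],
    "CreationDate": pd.to_datetime([
        "2009-10-16 17:38:32.590", "2009-10-18 17:38:32.590",
        "2009-10-18 00:37:23.067", "2009-10-17 00:37:23.067",
        "2009-10-20 00:37:23.067", "2009-10-19 08:37:14.143",
        "2009-10-20 18:07:51.247", "2009-10-21 21:25:24.483",
    ]),
})

# An inner merge keeps only users who have at least one comment.
merged = nq.merge(cmnt, left_on="UserId", right_on="OwnerUserId", how="inner")

# Whole-day offset of each comment relative to the badge date.
merged["date_diff"] = (
    merged["CreationDate"].dt.normalize() - merged["date"].dt.normalize()
).dt.days
merged = merged[merged["date_diff"].between(-7, 7)]

counts = pd.crosstab([merged["UserId"], merged["date"]], merged["date_diff"])

# Add the offsets that never occur, filled with zero counts.
counts = counts.reindex(columns=list(range(-7, 8)), fill_value=0)

# Total comments per offset across all users, e.g. as input for a line plot.
per_day = counts.sum(axis=0)
```

If matplotlib is installed, `per_day.plot()` should draw the totals as a line. Users whose comments all fall outside the window do not get a row with this approach.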
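This gives the correct output. Could you add to the solution how to convert this crosstab into a dataframe?
The crosstab command already returns a dataframe. Just assign it to a name, e.g. out=pd.crosstab(…).
It does, but I want it to have the columns ['UserId', 'date', -7, -6, …, 0, …, 6, 7] so that I can access them like normal dataframe columns. Right now the columns are Int64Index([-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7], dtype='int64', name='date_diff'), so df['UserId'] raises an error because 'UserId' is not a column of df (where df=pd.crosstab(…)).
Chain reset_index() onto that pd.crosstab() call.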
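A small sketch of the `reset_index()` suggestion: chaining it onto the crosstab moves `UserId` and `date` out of the index and into ordinary columns. The three-row `merged` table is made-up illustrative data, and clearing the column-index name with `rename_axis` is an optional extra step that does not come from the thread.

```python
import pandas as pd

# Made-up stand-in for the merged and filtered table.
merged = pd.DataFrame({
    "UserId": [1, 1, 2],
    "date": pd.to_datetime(["2009-10-17", "2009-10-17", "2009-10-19"]),
    "date_diff": [-1, 1, -1],
})

df = (
    pd.crosstab([merged["UserId"], merged["date"]], merged["date_diff"])
    .reset_index()               # UserId and date become regular columns
    .rename_axis(columns=None)   # drop the "date_diff" label on the column index
)

user_ids = df["UserId"]   # now works like any other column
day_before = df[-1]       # offset columns are keyed by integers, not strings
```

Note that the offset columns keep their integer labels, so they are accessed as `df[-1]` rather than `df['-1']`.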
For some inputs I also get this error: ValueError: Cannot convert non-finite values (NA or inf) to integer. How can I solve this?
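One likely cause, assuming the error comes from the `astype(int)` step: with `how='left'`, a user who has no comments still gets a row, `CreationDate` is NaT on that row, and the resulting day difference is NaN, which cannot be cast to an integer. A minimal sketch of one possible workaround is to drop those rows before the cast. The two-user data below is made up to reproduce the situation.

```python
import pandas as pd

nq = pd.DataFrame({
    "UserId": [1, 2],
    "date": pd.to_datetime(["2009-10-17 17:38:32.590", "2009-10-19 00:37:23.067"]),
})
# User 2 has no comments at all (made-up data for illustration).
cmnt = pd.DataFrame({
    "OwnerUserId": [1],
    "CreationDate": pd.to_datetime(["2009-10-16 17:38:32.590"]),
})

merged = nq.merge(cmnt, left_on="UserId", right_on="OwnerUserId", how="left")

# Drop users without any comment before computing the integer offset.
merged = merged.dropna(subset=["CreationDate"]).copy()

diff = merged["CreationDate"].dt.normalize() - merged["date"].dt.normalize()
merged["date_diff"] = diff.dt.days.astype(int)
```

Passing `how='inner'` to the merge should have the same effect, since it never creates the rows with a missing `CreationDate`. Either way, users without comments no longer appear in the crosstab.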