Python: using cumcount() with duplicates
I have a df that looks like this:
ID Component IDDate EmployeeID CreateUserID
24 1 2017-09-11 00:00:00.000 0907036 Afior
24 2 2017-09-11 00:00:00.000 0907036 Afior
24 3 2017-09-11 00:00:00.000 0907036 Afior
25 1 2017-09-12 00:00:00.000 0907036 Afior
25 3 2017-09-12 00:00:00.000 0907036 Afior
26 8 2017-09-16 00:00:00.000 1013842 JHyde
26 11 2017-09-16 00:00:00.000 1013842 JHyde
26 12 2017-09-16 00:00:00.000 1013842 JHyde
26 23 2017-09-16 00:00:00.000 1013842 JHyde
27 21 2017-09-16 00:00:00.000 0907036 Afior
27 22 2017-09-16 00:00:00.000 0907036 Afior
27 23 2017-09-16 00:00:00.000 0907036 Afior
28 15 2017-10-16 00:00:00.000 1013842 JHyde
28 16 2017-10-16 00:00:00.000 1013842 JHyde
28 19 2017-10-16 00:00:00.000 1013842 JHyde
28 25 2017-10-16 00:00:00.000 1013842 JHyde
28 26 2017-10-16 00:00:00.000 1013842 JHyde
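I am trying to use cumcount to create a variable that holds the order of observations for each ID/EmployeeID combination. I have tried a few variations on cumcount(), but none of them gives me the count I want, for example: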
df['seq'] = df.groupby(['EmployeeID', 'ID', 'Date']).cumcount().add(1)
df['seq'] = df.groupby(['EmployeeID', 'Date']).cumcount().add(1)
df['seq'] = df.groupby(['EmployeeID', 'ID']).cumcount().add(1)
Ideally, my output would look like this:
ID Component IDDate EmployeeID CreateUserID seq
24 1 2017-09-11 00:00:00.000 0907036 Afior 1
24 2 2017-09-11 00:00:00.000 0907036 Afior 1
24 3 2017-09-11 00:00:00.000 0907036 Afior 1
25 1 2017-09-12 00:00:00.000 0907036 Afior 2
25 3 2017-09-12 00:00:00.000 0907036 Afior 2
26 8 2017-09-16 00:00:00.000 1013842 JHyde 1
26 11 2017-09-16 00:00:00.000 1013842 JHyde 1
26 12 2017-09-16 00:00:00.000 1013842 JHyde 1
26 23 2017-09-16 00:00:00.000 1013842 JHyde 1
27 21 2017-09-16 00:00:00.000 0907036 Afior 3
27 22 2017-09-16 00:00:00.000 0907036 Afior 3
27 23 2017-09-16 00:00:00.000 0907036 Afior 3
28 15 2017-10-16 00:00:00.000 1013842 JHyde 2
28 16 2017-10-16 00:00:00.000 1013842 JHyde 2
28 19 2017-10-16 00:00:00.000 1013842 JHyde 2
28 25 2017-10-16 00:00:00.000 1013842 JHyde 2
28 26 2017-10-16 00:00:00.000 1013842 JHyde 2
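Is there a way of handling the duplicates that gives me this output? Would it be better to reshape the df to wide first and then apply cumcount()?
If I understand correctly, this converts the values to categorical data and takes the codes: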
df[['IDDate','EmployeeID']].apply(tuple,1).groupby(df['CreateUserID']).apply(lambda x : x.astype('category').cat.codes+1)
Out[8]:
0 1
1 1
2 1
3 2
4 2
5 1
6 1
7 1
8 1
9 3
10 3
11 3
12 2
13 2
14 2
15 2
16 2
dtype: int8
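The same categorical-codes idea can be sketched with transform, which returns a result aligned to the original rows (depending on the pandas version, apply may prepend the group keys to the index). This is only an illustration under assumptions: the frame below is a reduced copy of the question's data, the grouping key is EmployeeID rather than CreateUserID, and IDDate is kept as an ISO-formatted string so that category order matches date order.

```python
import pandas as pd

# Reduced copy of the question's data (assumption: only the columns
# needed to number the observations are kept).
ids = [24] * 3 + [25] * 2 + [26] * 4 + [27] * 3 + [28] * 5
dates = (["2017-09-11"] * 3 + ["2017-09-12"] * 2 + ["2017-09-16"] * 4
         + ["2017-09-16"] * 3 + ["2017-10-16"] * 5)
employees = ["0907036"] * 5 + ["1013842"] * 4 + ["0907036"] * 3 + ["1013842"] * 5
df = pd.DataFrame({"ID": ids, "IDDate": dates, "EmployeeID": employees})

# Within each employee, turn the dates into categories and use the
# (sorted) category codes, shifted to start at 1, as the sequence number.
df["seq"] = (
    df.groupby("EmployeeID")["IDDate"]
      .transform(lambda s: s.astype("category").cat.codes + 1)
)

print(df["seq"].tolist())
```

Because category codes follow the sorted order of the values, this numbers each employee's dates chronologically, not in the order they first appear.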
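Here is one way: group by EmployeeID only, then check whether ID changes from one row to the next, and return the cumsum of that check (this is based on your attempts and your desired output).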
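The code for this approach did not survive in the text, so the following is a minimal sketch of what the description suggests, not necessarily the answerer's exact code. It assumes a reduced copy of the question's data and that all rows sharing an ID are contiguous within each employee.

```python
import pandas as pd

# Reduced copy of the question's data (assumption: only ID and
# EmployeeID are needed for this approach).
ids = [24] * 3 + [25] * 2 + [26] * 4 + [27] * 3 + [28] * 5
employees = ["0907036"] * 5 + ["1013842"] * 4 + ["0907036"] * 3 + ["1013842"] * 5
df = pd.DataFrame({"ID": ids, "EmployeeID": employees})

# Within each employee, flag the rows where ID differs from the previous
# row (the first row always counts as a change), then take the running
# total of those flags to get one number per block of equal IDs.
df["seq"] = (
    df.groupby("EmployeeID")["ID"]
      .transform(lambda s: s.ne(s.shift()).cumsum())
)

print(df["seq"].tolist())
```

If the same ID could reappear later for the same employee, it would get a new number here, so sorting by EmployeeID and ID first may be needed in that case.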
Another way is to group by EmployeeID and then take a dense rank of the dates:
In [187]: df.groupby("EmployeeID")["Date"].apply(lambda x: x.rank(method='dense')).astype(int)
Out[187]:
0 1
1 1
2 1
3 2
4 2
5 1
6 1
7 1
8 1
9 3
10 3
11 3
12 2
13 2
14 2
15 2
16 2
Name: Date, dtype: int64
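This ranks by value rather than by order of first appearance, although that makes no difference if the dates are sorted as they are in the example.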
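The difference between ranking by value and numbering by first appearance only shows up when the dates are out of order. A small sketch with hypothetical dates for a single employee (these values are not from the question) compares a dense rank with pd.factorize, which numbers values in the order they are first seen:

```python
import pandas as pd

# Hypothetical dates for one employee, deliberately out of order.
dates = pd.Series(
    pd.to_datetime(["2017-09-16", "2017-09-11", "2017-09-16", "2017-09-12"])
)

# Dense rank: numbered by sorted value, so the earliest date gets 1.
by_value = dates.rank(method="dense").astype(int)

# factorize: numbered by first appearance, so the first row always gets 1.
by_first_seen = pd.Series(pd.factorize(dates)[0] + 1, index=dates.index)

print(by_value.tolist())       # [3, 1, 3, 2]
print(by_first_seen.tolist())  # [1, 2, 1, 3]
```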