Python: Using cumcount() with duplicates


I have a df that looks like this:

ID Component IDDate                   EmployeeID CreateUserID
24 1         2017-09-11 00:00:00.000  0907036    Afior
24 2         2017-09-11 00:00:00.000  0907036    Afior
24 3         2017-09-11 00:00:00.000  0907036    Afior
25 1         2017-09-12 00:00:00.000  0907036    Afior
25 3         2017-09-12 00:00:00.000  0907036    Afior
26 8         2017-09-16 00:00:00.000  1013842    JHyde
26 11        2017-09-16 00:00:00.000  1013842    JHyde
26 12        2017-09-16 00:00:00.000  1013842    JHyde
26 23        2017-09-16 00:00:00.000  1013842    JHyde
27 21        2017-09-16 00:00:00.000  0907036    Afior
27 22        2017-09-16 00:00:00.000  0907036    Afior
27 23        2017-09-16 00:00:00.000  0907036    Afior
28 15        2017-10-16 00:00:00.000  1013842    JHyde
28 16        2017-10-16 00:00:00.000  1013842    JHyde
28 19        2017-10-16 00:00:00.000  1013842    JHyde
28 25        2017-10-16 00:00:00.000  1013842    JHyde
28 26        2017-10-16 00:00:00.000  1013842    JHyde
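I'm trying to use cumcount() to create a variable that holds the order in which each ID/EmployeeID combination is observed. I tried a few variations on cumcount(), but none of them produced the numbering I want, for example: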

df['seq'] = df.groupby(['EmployeeID', 'ID', 'Date']).cumcount().add(1)

df['seq'] = df.groupby(['EmployeeID', 'Date']).cumcount().add(1)

df['seq'] = df.groupby(['EmployeeID', 'ID']).cumcount().add(1)
Ideally, my output would look like this:

ID Component IDDate                   EmployeeID CreateUserID seq
24 1         2017-09-11 00:00:00.000  0907036    Afior        1
24 2         2017-09-11 00:00:00.000  0907036    Afior        1
24 3         2017-09-11 00:00:00.000  0907036    Afior        1
25 1         2017-09-12 00:00:00.000  0907036    Afior        2
25 3         2017-09-12 00:00:00.000  0907036    Afior        2
26 8         2017-09-16 00:00:00.000  1013842    JHyde        1
26 11        2017-09-16 00:00:00.000  1013842    JHyde        1
26 12        2017-09-16 00:00:00.000  1013842    JHyde        1
26 23        2017-09-16 00:00:00.000  1013842    JHyde        1
27 21        2017-09-16 00:00:00.000  0907036    Afior        3
27 22        2017-09-16 00:00:00.000  0907036    Afior        3
27 23        2017-09-16 00:00:00.000  0907036    Afior        3
28 15        2017-10-16 00:00:00.000  1013842    JHyde        2
28 16        2017-10-16 00:00:00.000  1013842    JHyde        2
28 19        2017-10-16 00:00:00.000  1013842    JHyde        2
28 25        2017-10-16 00:00:00.000  1013842    JHyde        2
28 26        2017-10-16 00:00:00.000  1013842    JHyde        2

Is there a way to handle the duplicates that gives me this output? Would it be better to reshape the df to wide first and then apply cumcount()?

If I understand correctly, you can convert to categorical data and take the codes:

df[['IDDate','EmployeeID']].apply(tuple,1).groupby(df['CreateUserID']).apply(lambda x : x.astype('category').cat.codes+1)
Out[8]: 
0     1
1     1
2     1
3     2
4     2
5     1
6     1
7     1
8     1
9     3
10    3
11    3
12    2
13    2
14    2
15    2
16    2
dtype: int8
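As a small illustration of why category codes work here, the sketch below applies the same idea to the dates of a single employee. The series `s` is made-up sample data, not taken from the question. Categories are sorted by default for sortable values, so the codes follow value order rather than order of first appearance.

```python
import pandas as pd

# Hypothetical dates for one employee, with duplicates.
s = pd.Series(pd.to_datetime(
    ["2017-09-11", "2017-09-11", "2017-09-12", "2017-09-16"]
))

# Each distinct value gets one integer code; duplicates share a code.
# Codes start at 0, so add 1 to get a 1-based sequence number.
codes = s.astype("category").cat.codes + 1

print(codes.tolist())
```

Repeating this within each group is what the one-liner above does through groupby(...).apply(...).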

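Here is one approach: group by EmployeeID only, check whether ID changes from one row to the next within each group, and take the cumsum of those changes (this is based on your attempts and your desired output).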
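A minimal sketch of that idea, assuming the frame from the question is rebuilt by hand as below (the CreateUserID column is left out for brevity, and the names `changed` and `seq` are chosen here for illustration). It may differ from the code originally posted with this answer.

```python
import pandas as pd

# Rebuild the question's frame (CreateUserID omitted).
df = pd.DataFrame({
    "ID": [24, 24, 24, 25, 25, 26, 26, 26, 26, 27, 27, 27, 28, 28, 28, 28, 28],
    "Component": [1, 2, 3, 1, 3, 8, 11, 12, 23, 21, 22, 23, 15, 16, 19, 25, 26],
    "IDDate": pd.to_datetime(
        ["2017-09-11"] * 3 + ["2017-09-12"] * 2
        + ["2017-09-16"] * 7 + ["2017-10-16"] * 5
    ),
    "EmployeeID": ["0907036"] * 5 + ["1013842"] * 4
                  + ["0907036"] * 3 + ["1013842"] * 5,
})

# True wherever ID differs from the previous row of the same employee.
# The first row of each employee compares against NaN, so it is True.
changed = df["ID"].ne(df.groupby("EmployeeID")["ID"].shift())

# A running sum of the change flags per employee numbers the ID blocks.
df["seq"] = changed.astype(int).groupby(df["EmployeeID"]).cumsum()

print(df[["ID", "EmployeeID", "seq"]])
```

This relies on the rows of each ID being contiguous within an employee. If the same ID could reappear later for that employee, it would be counted as a new block.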


Another approach is to group by EmployeeID and then take a dense rank of the dates:

In [187]: df.groupby("EmployeeID")["Date"].apply(lambda x: x.rank(method='dense')).astype(int)
Out[187]: 
0     1
1     1
2     1
3     2
4     2
5     1
6     1
7     1
8     1
9     3
10    3
11    3
12    2
13    2
14    2
15    2
16    2
Name: Date, dtype: int64
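This ranks by value rather than by order of first appearance, although that makes no difference if the dates are already sorted, as they are in the example.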
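To make that distinction concrete, the sketch below uses a small made-up frame whose dates are deliberately out of order. It compares the dense rank with a first-seen numbering built from drop_duplicates(), cumcount() and a merge. The frame and the names `keys`, `by_value` and `by_first_seen` are illustrative assumptions, not part of the original answers.

```python
import pandas as pd

# Hypothetical data: employee A's dates are not in sorted order.
df = pd.DataFrame({
    "EmployeeID": ["A", "A", "A", "A", "B"],
    "Date": pd.to_datetime(
        ["2017-09-16", "2017-09-16", "2017-09-11", "2017-09-12", "2017-09-11"]
    ),
})

# Dense rank: numbers the dates by value within each employee.
by_value = df.groupby("EmployeeID")["Date"].rank(method="dense").astype(int)

# First-seen order: keep one row per (EmployeeID, Date) pair, number
# those rows with cumcount(), then merge the numbers back onto every row.
keys = df[["EmployeeID", "Date"]].drop_duplicates()
keys = keys.assign(seq=keys.groupby("EmployeeID").cumcount() + 1)
by_first_seen = df.merge(keys, on=["EmployeeID", "Date"], how="left")["seq"]

print(by_value.tolist())
print(by_first_seen.tolist())
```

The second variant is one way to use cumcount() despite the duplicates: deduplicate first so that each group key appears once, count, then broadcast the count back to the duplicated rows.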