Python: How do I label the nth occurrence of a specific value in a DataFrame?

python python-3.x pandas dataframe


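I have a DataFrame with more than 80,000 rows. One of my columns contains values that may repeat, and I want to create a "counter" column that labels each occurrence of a value as its nth occurrence. If I see value == v1 in row 10 and it is the third time I have seen v1, I want df.counter == 3 for that row. This is what I have so far: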

d = pd.DataFrame() # create empty df to append results to
for val in df.val_id.unique(): # loop through the unique val_id values
    f = pd.DataFrame(df.val_id[df.val_id == val]) # isolate all instances of specific val_id
    f['counter'] = range(1,len(f) + 1) # create counter column that labels each instance as the nth value
    d = pd.concat([d,f]) # append the result to my output df

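I think this code works (I haven't let the loop finish yet), but the problem is that it takes forever. Timing it on one row takes 0.25 seconds, so I estimate it would take more than two hours to finish on my DataFrame.

There must be a more Pythonic / pandas-y / faster way to do this. Please help.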

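Here's one way to do it with a loop.

Starting from a sample DataFrame df and using a loop (it should be many times faster than your current loop), your new df will look like this: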

  val_id  counter
0     v1      1.0
1     v3      1.0
2     v2      1.0
3     v2      2.0
4     v1      2.0
5     v3      2.0
6     v3      3.0
7     v2      3.0
8     v3      4.0
9     v2      4.0
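
The table above can be reproduced with a loop along these lines. This is only a sketch and not necessarily the loop this answer originally used: the val_id values are read off the table, and pre-filling counter with NaN is an assumption that would explain why the counters show up as floats.

```python
import numpy as np
import pandas as pd

# Sample data read off the table above
df = pd.DataFrame(
    {"val_id": ["v1", "v3", "v2", "v2", "v1", "v3", "v3", "v2", "v3", "v2"]}
)

# Assumption: start with a NaN column, which makes the result float-typed
df["counter"] = np.nan

# One pass per unique value, writing into df in place instead of concatenating
for val in df["val_id"].unique():
    mask = df["val_id"] == val
    df.loc[mask, "counter"] = np.arange(1, mask.sum() + 1)

print(df)
```

Unlike the loop in the question, this never builds intermediate DataFrames or calls pd.concat, so the original row order is preserved and no copies accumulate.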

That said, the groupby approach that @AlexRiley posted in the comments is better and faster...
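
To check that the two approaches agree on data of roughly the size mentioned in the question, something like the following could be used. The data here is made up (80,000 rows drawn from 500 hypothetical ids), and the helper names counter_loop and counter_groupby are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 80,000 rows drawn from 500 made-up ids
rng = np.random.default_rng(0)
df = pd.DataFrame({"val_id": rng.integers(0, 500, size=80_000).astype(str)})


def counter_loop(frame):
    """Number occurrences with one boolean mask per unique value."""
    out = pd.Series(np.nan, index=frame.index)
    for val in frame["val_id"].unique():
        mask = frame["val_id"] == val
        out.loc[mask] = np.arange(1, mask.sum() + 1)
    return out.astype("int64")


def counter_groupby(frame):
    """Number occurrences with groupby + cumcount (0-based, hence the + 1)."""
    return frame.groupby("val_id").cumcount() + 1


loop_result = counter_loop(df)
groupby_result = counter_groupby(df)

# Both approaches should produce identical counters
same = bool((loop_result.to_numpy() == groupby_result.to_numpy()).all())
print(same)
```

Wrapping each call in time.perf_counter() is one way to compare their speed on your own data.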

You can try using groupby and cumcount(). Starting from this DataFrame:

 Col1
0    a
1    b
2    c
3    a
4    b
5    a

using cumcount() gives this result:

  Col1  Counter
0    a        1
1    b        1
2    c        1
3    a        2
4    b        2
5    a        3
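
A minimal sketch of code that produces the two tables above, assuming the column names Col1 and Counter shown there. cumcount() numbers the rows within each group starting at 0, so 1 is added to get a 1-based counter.

```python
import pandas as pd

# Sample data matching the first table
df = pd.DataFrame({"Col1": ["a", "b", "c", "a", "b", "a"]})
print(df)

# Number each row within its Col1 group; cumcount() is 0-based, so add 1
df["Counter"] = df.groupby("Col1").cumcount() + 1
print(df)
```

For the DataFrame in the question, the same idea applies with val_id as the grouping column.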

Here's one way:
In [49]: df
Out[49]: 
   D
0  a
1  b
2  a
3  c
4  b
5  a
6  c
7  c
8  b
9  b

In [50]: counters = df.groupby('D').apply(lambda x: np.arange(len(x)) + 1)

In [51]: df['counters'] = 0

In [52]: for label in counters.index:
    ...:     df.loc[df.D == label, 'counters'] = counters.loc[label]
    ...:     

In [53]: df
Out[53]: 
   D  counters
0  a         1
1  b         1
2  a         2
3  c         1
4  b         2
5  a         3
6  c         2
7  c         3
8  b         3
9  b         4

Do you mean something like df['counter'] = df.groupby('val_id').cumcount() + 1?