Python: updating a DataFrame column based on duplicated values
Here is a quick example of the problem. I have the following DataFrame:
data = {'name': ["name_1", "name_2" , "name_3" , "name_2" , "name_1" , "name_2" , "name_2" ], 'col_B': ["a", "a" , "a" , "b" , "a" , "c" , "a" ] , 'col_C' : [1 , 1 , 1 , 1 , 5 , 6 , 1]}
df = pd.DataFrame(data=data)
df # Would give the following df below :
name col_B col_C
0 name_1 a 1
1 name_2 a 1
2 name_3 a 1
3 name_2 b 1
4 name_1 a 5
5 name_2 c 6
6 name_2 a 1
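What I need is to check the combination name + col_B and, for every repeat of a combination already seen, set col_C to 0. For example: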
name col_B col_C
0 name_1 a 1
1 name_2 a 1
2 name_3 a 1
3 name_2 b 1
4 name_1 a 0
5 name_2 c 6
6 name_2 a 0
To do this, I wrote the following:
list_tst = []
for index, row in df.iterrows():
    if row['name'] + row['col_B'] in list_tst:
        # Already in the list, so it's a duplicate: set the value to zero.
        # Write through df.at, because `row` is a copy and assigning to it would not change df.
        df.at[index, 'col_C'] = 0
    list_tst.append(row['name'] + row['col_B'])  # record the key; could be inside an 'else'
But, as expected, this takes far too long to run on millions of rows.
Can anyone suggest a vectorized approach?
Thanks.
Full code:
import pandas as pd
data = {'name': ["name_1", "name_2" , "name_3" , "name_2" , "name_1" , "name_2" , "name_2" ], 'col_B': ["a", "a" , "a" , "b" , "a" , "c" , "a" ] , 'col_C' : [1 , 1 , 1 , 1 , 5 , 6 , 1]}
df = pd.DataFrame(data=data)
list_tst = []
for index, row in df.iterrows():
    if row['name'] + row['col_B'] in list_tst:
        df.at[index, 'col_C'] = 0  # `row` is a copy, so write to df directly
    list_tst.append(row['name'] + row['col_B'])
Do it with mask on duplicated:
df['col_C'].mask(df.duplicated(['name','col_B']),0)
Output:
0 1
1 1
2 1
3 1
4 0
5 6
6 0
Name: col_C, dtype: int64
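Note that mask returns a new Series rather than changing df. A minimal, self-contained sketch of writing the result back, using the sample data from the question (the variable name is_repeat is only illustrative):

```python
import pandas as pd

data = {
    'name': ["name_1", "name_2", "name_3", "name_2", "name_1", "name_2", "name_2"],
    'col_B': ["a", "a", "a", "b", "a", "c", "a"],
    'col_C': [1, 1, 1, 1, 5, 6, 1],
}
df = pd.DataFrame(data)

# duplicated() marks every repeat of a (name, col_B) pair after its first
# occurrence; keep='first' is the default, so the first row of each pair stays False.
is_repeat = df.duplicated(subset=['name', 'col_B'])

# mask() replaces values where the condition is True and returns a new Series,
# so assign it back to update the column.
df['col_C'] = df['col_C'].mask(is_repeat, 0)

print(df)
```

If the last occurrence should be kept instead of the first, duplicated also accepts keep='last'.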
This should be fast: cumcount with groupby
df['col_C'] *= df.groupby(['name', 'col_B']).cumcount().eq(0)
df
name col_B col_C
0 name_1 a 1
1 name_2 a 1
2 name_3 a 1
3 name_2 b 1
4 name_1 a 0
5 name_2 c 6
6 name_2 a 0
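As a rough cross-check, the two answers should flag the same rows on this sample. The sketch below compares the flags and then zeroes the repeats in place with .loc; the names keys, via_cumcount, via_duplicated and same_flags are illustrative, and it assumes the key columns contain no missing values (groupby drops NaN keys by default, which could make the two approaches differ).

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["name_1", "name_2", "name_3", "name_2", "name_1", "name_2", "name_2"],
    'col_B': ["a", "a", "a", "b", "a", "c", "a"],
    'col_C': [1, 1, 1, 1, 5, 6, 1],
})
keys = ['name', 'col_B']

# cumcount() numbers the rows inside each (name, col_B) group as 0, 1, 2, ...
# so any value greater than 0 is a repeat of an earlier row.
via_cumcount = df.groupby(keys).cumcount().gt(0)

# duplicated() flags the same rows directly.
via_duplicated = df.duplicated(keys)

same_flags = via_cumcount.equals(via_duplicated)
print(same_flags)

# Boolean indexing with .loc overwrites the flagged rows in place.
df.loc[via_duplicated, 'col_C'] = 0
print(df)
```

Unlike the multiplication trick, the .loc assignment does not rely on col_C being numeric.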