Python: Need to update a DataFrame column based on duplicated values

Tags: python, python-3.x, pandas

Here is a quick example of the problem:

I have the following DataFrame:

data = {'name': ["name_1", "name_2" , "name_3" , "name_2" , "name_1" , "name_2" , "name_2" ], 'col_B': ["a", "a" , "a" , "b" , "a" , "c" , "a" ] , 'col_C' : [1 , 1 , 1 , 1 , 5 , 6 , 1]}
df = pd.DataFrame(data=data)

df # Would give the following df below :
     name col_B  col_C
0  name_1     a      1
1  name_2     a      1
2  name_3     a      1
3  name_2     b      1
4  name_1     a      5
5  name_2     c      6
6  name_2     a      1
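What I need is to check the combination name + col_B and, for every row that repeats an earlier combination, set col_C to 0. For example: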

     name col_B  col_C
0  name_1     a      1
1  name_2     a      1
2  name_3     a      1
3  name_2     b      1
4  name_1     a      0
5  name_2     c      6
6  name_2     a      0
To do this, I wrote the following:

list_tst = []

for index, row in df.iterrows():

   if (row['name']+row['col_B'] in list_tst): 
      df.at[index, 'col_C'] = 0  # Already in the list, so it's a duplicate: write zero to df (row is only a copy)
   list_tst.append(row['name']+row['col_B'])  # Record the combination as seen
But as expected, this takes far too long to run on millions of rows. Can anyone suggest a vectorized approach?

Thanks

Full code:

import pandas as pd


data = {'name': ["name_1", "name_2" , "name_3" , "name_2" , "name_1" , "name_2" , "name_2" ], 'col_B': ["a", "a" , "a" , "b" , "a" , "c" , "a" ] , 'col_C' : [1 , 1 , 1 , 1 , 5 , 6 , 1]}
df = pd.DataFrame(data=data)

list_tst = []

for index, row in df.iterrows():

   if (row['name']+row['col_B'] in list_tst):
      df.at[index, 'col_C'] = 0  # write to df itself; row is only a copy
   list_tst.append(row['name']+row['col_B'])

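Use mask on the result of duplicated. The expression returns a new Series; assign it back to df['col_C'] to update the frame: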

df['col_C'].mask(df.duplicated(['name','col_B']),0)
Output:

0    1
1    1
2    1
3    1
4    0
5    6
6    0
Name: col_C, dtype: int64
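
For completeness, here is a minimal sketch of the same idea written back into the DataFrame, using the sample data from the question. The variable name is_repeat is only illustrative.

```python
import pandas as pd

data = {'name': ["name_1", "name_2", "name_3", "name_2", "name_1", "name_2", "name_2"],
        'col_B': ["a", "a", "a", "b", "a", "c", "a"],
        'col_C': [1, 1, 1, 1, 5, 6, 1]}
df = pd.DataFrame(data)

# duplicated() flags every row whose (name, col_B) pair already appeared
# in an earlier row; the first occurrence of each pair stays False.
is_repeat = df.duplicated(['name', 'col_B'])

# mask() replaces values where the condition is True and returns a new
# Series, so the result is assigned back to the column.
df['col_C'] = df['col_C'].mask(is_repeat, 0)

print(df)
```

By default duplicated keeps the first occurrence unflagged, which matches the desired output in the question.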

A groupby with cumcount should be fast:

df.col_C*=df.groupby(['name','col_B']).cumcount().eq(0)
df
     name col_B  col_C
0  name_1     a      1
1  name_2     a      1
2  name_3     a      1
3  name_2     b      1
4  name_1     a      0
5  name_2     c      6
6  name_2     a      0
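
As a cross-check, the sketch below runs the cumcount approach next to a plain boolean-indexed assignment with .loc and compares the results. The .loc variant is an alternative that does not appear in the answers above, and the names df_a, df_b and first_in_group are illustrative.

```python
import pandas as pd

data = {'name': ["name_1", "name_2", "name_3", "name_2", "name_1", "name_2", "name_2"],
        'col_B': ["a", "a", "a", "b", "a", "c", "a"],
        'col_C': [1, 1, 1, 1, 5, 6, 1]}
df_a = pd.DataFrame(data)
df_b = df_a.copy()

# cumcount() numbers the rows inside each (name, col_B) group from 0,
# so eq(0) is True only for the first row of every group.
first_in_group = df_a.groupby(['name', 'col_B']).cumcount().eq(0)

# Multiplying by a boolean keeps the value for True and zeroes it for False.
df_a['col_C'] *= first_in_group

# Alternative: select the repeated rows directly and assign 0 in place.
df_b.loc[df_b.duplicated(['name', 'col_B']), 'col_C'] = 0

print(df_a['col_C'].tolist())
print(df_b['col_C'].tolist())
```

The multiplication trick relies on col_C being numeric and on the replacement value being 0; the .loc form works for any replacement value.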