Python: Remove duplicate rows based on two columns

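I have a DataFrame that contains duplicated values based on four columns (SFDC_ID and left_side, right_SFDC_ID and right_side):

Currently SFDC_ID and right_SFDC_ID are duplicated in the following way:

SFDC_ID             left_side             right_SFDC_ID       right_side            similairity
0013s00000vEVuwAAG  Hague Quality Water   0013s00000vEW72AAG  Hague Quality Waters  0.99023304
0013s00000vEW72AAG  Hague Quality Waters  0013s00000vEVuwAAG  Hague Quality Water   0.99023304

If you look closely, the SFDC_ID in row 1 is the same as the right_SFDC_ID in row 2, and vice versa: the second row is just the first pair mirrored.

How can I drop the second row using pandas?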

Here is one way:

# Boolean Series: True where SFDC_ID sorts before right_SFDC_ID (lexicographic string comparison)
mask = df['SFDC_ID'] < df['right_SFDC_ID']

# col1: take SFDC_ID where mask is True, otherwise right_SFDC_ID (the smaller ID of the pair)
df['col1'] = df['SFDC_ID'].where(mask, df['right_SFDC_ID'])

# col2: the reverse selection, so it holds the larger ID of the pair
df['col2'] = df['right_SFDC_ID'].where(mask, df['SFDC_ID'])

# Mirrored rows now share (col1, col2); drop_duplicates keeps the first and drops the later ones
df = df.drop_duplicates(subset=['col1', 'col2'])
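
To try this approach on the two rows from the question, here is a minimal self-contained sketch. The DataFrame construction is an assumption reconstructed from the sample in the question, and dropping the helper columns at the end is an extra step that the answer's snippet does not include.

```python
import pandas as pd

# Sample data reconstructed from the question (an assumption about the real frame)
df = pd.DataFrame({
    "SFDC_ID": ["0013s00000vEVuwAAG", "0013s00000vEW72AAG"],
    "left_side": ["Hague Quality Water", "Hague Quality Waters"],
    "right_SFDC_ID": ["0013s00000vEW72AAG", "0013s00000vEVuwAAG"],
    "right_side": ["Hague Quality Waters", "Hague Quality Water"],
    "similairity": [0.99023304, 0.99023304],
})

# True where SFDC_ID sorts before right_SFDC_ID
mask = df["SFDC_ID"] < df["right_SFDC_ID"]

# col1 holds the smaller ID of each pair, col2 the larger one
df["col1"] = df["SFDC_ID"].where(mask, df["right_SFDC_ID"])
df["col2"] = df["right_SFDC_ID"].where(mask, df["SFDC_ID"])

# Keep the first row of each unordered pair, then drop the helper columns
result = df.drop_duplicates(subset=["col1", "col2"]).drop(columns=["col1", "col2"])
print(result)
```

Because the pair is normalised before deduplicating, this works no matter where the mirrored row sits in the frame; the two rows do not have to be adjacent.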

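You can iterate over the rows and drop a row when its right_SFDC_ID matches the SFDC_ID of the previous row:

# Collect labels first; dropping inside the loop shifts positions and breaks positional lookups
to_drop = []
for pos in range(1, len(df)):
    # compare this row's right_SFDC_ID with the previous row's SFDC_ID
    if df['right_SFDC_ID'].iloc[pos] == df['SFDC_ID'].iloc[pos - 1]:
        to_drop.append(df.index[pos])
df = df.drop(index=to_drop)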
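
The same previous-row check can also be written without a loop. This is only a sketch and not the answerer's code: it assumes the mirrored row always comes directly after its partner, and it uses `Series.shift` to line each row up with the SFDC_ID of the row before it. The third row (`ID_A`, `ID_B`) is hypothetical and is there only to show that unrelated rows are kept.

```python
import pandas as pd

# Two ID columns from the question's sample; the third row is a hypothetical, unrelated pair
df = pd.DataFrame({
    "SFDC_ID": ["0013s00000vEVuwAAG", "0013s00000vEW72AAG", "ID_A"],
    "right_SFDC_ID": ["0013s00000vEW72AAG", "0013s00000vEVuwAAG", "ID_B"],
})

# SFDC_ID of the previous row (missing for the first row)
prev_id = df["SFDC_ID"].shift(1)

# Keep rows whose right_SFDC_ID differs from the previous row's SFDC_ID
result = df[df["right_SFDC_ID"] != prev_id]
print(result)
```

If mirrored rows can appear anywhere in the frame rather than right after their partner, a previous-row comparison will miss them, and normalising the pair before calling `drop_duplicates` is the safer choice.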

I suggest you format your data a bit better, since at the moment it's impossible to tell whether Hague Quality Waters is a separate column or combined with 0013…
Formatted better on Stackoverflow? I believe I updated this.
Can you explain what exactly it does? matches_df['SFDC_ID']
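
The `<` comparison asked about in the comments works row by row, using ordinary lexicographic string ordering. A small sketch with the two IDs from the question follows; it illustrates standard Python and pandas behaviour and is not part of the original answer. The names `left`, `right` and `smaller` are chosen here for illustration.

```python
import pandas as pd

left = pd.Series(["0013s00000vEVuwAAG", "0013s00000vEW72AAG"])
right = pd.Series(["0013s00000vEW72AAG", "0013s00000vEVuwAAG"])

# Element-wise string comparison: the IDs first differ at 'V' vs 'W'
mask = left < right
print(mask.tolist())

# where() keeps left where mask is True and takes right otherwise,
# so both rows end up with the same (smaller) ID
smaller = left.where(mask, right)
print(smaller.tolist())
```

Since both mirrored rows map to the same smaller and larger ID, they become exact duplicates on those two helper values, which is what lets `drop_duplicates` remove one of them.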