
Python DataFrame: drop duplicates based on a specific condition
Tags: python, pandas, dataframe, drop-duplicates


I have a DataFrame with duplicate shop IDs, where some Shop IDs appear twice and some three times.
I want to keep each Shop ID only once, keeping the occurrence with the shortest Shop Distance among the areas it is assigned to.

    Area  Shop Name  Shop Distance  Shop ID   

0   AAA   Ly         86             5d87790c46a77300
1   AAA   Hi         230            5ce5522012138400
2   BBB   Hi         780            5ce5522012138400
3   CCC   Ly         450            5d87790c46a77300
...
91  MMM   Ju         43             4f76d0c0e4b01af7
92  MMM   Hi         1150           5ce5522012138400
...
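
For reference, a minimal sketch that rebuilds the abridged sample above (only the rows shown in the question; column names as given there):

import pandas as pd

shops_df = pd.DataFrame({
    'Area':          ['AAA', 'AAA', 'BBB', 'CCC', 'MMM', 'MMM'],
    'Shop Name':     ['Ly', 'Hi', 'Hi', 'Ly', 'Ju', 'Hi'],
    'Shop Distance': [86, 230, 780, 450, 43, 1150],
    'Shop ID':       ['5d87790c46a77300', '5ce5522012138400',
                      '5ce5522012138400', '5d87790c46a77300',
                      '4f76d0c0e4b01af7', '5ce5522012138400'],
})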
Using pandas drop_duplicates removes the duplicate rows, but its keep condition is based on the first/last occurrence of the Shop ID, which does not let me pick by distance:

shops_df = shops_df.drop_duplicates(subset='Shop ID', keep='first')
I also tried grouping by Shop ID and then sorting, but the sort raised an error:

bbtshops_new['C'] = bbtshops_new.groupby('Shop ID')['Shop ID'].cumcount()
bbtshops_new.sort_values(by=['C'], axis=1)  # raises: axis=1 sorts columns by a row label, not rows
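
(The error comes from axis=1, which sorts columns rather than rows. A working variant of the same groupby idea, sketched here assuming bbtshops_new is the shops DataFrame, is to sort by distance and take the first row per Shop ID:)

# sort by distance so the nearest occurrence comes first in each group,
# then keep that first row per Shop ID
nearest = (bbtshops_new
           .sort_values('Shop Distance')
           .groupby('Shop ID', as_index=False)
           .first())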
So far I have managed to do:

# filter all the duplicates into a new df
df_toclean = shops_df[shops_df['Shop ID'].duplicated(keep=False)]

# create a mask for all unique Shop ID
mask = df_toclean['Shop ID'].value_counts()

# create a mask for the Shop ID that occurred 2 times
shop_2 = mask[mask==2].index

# create a mask for the Shop ID that occurred 3 times
shop_3 = mask[mask==3].index

# create a mask for the Shops that are under radius 750 
dist_1 = df_toclean['Shop Distance']<=750

# returns results for all the Shop IDs that appeared twice and under radius 750
bbtshops_2 = df_toclean[dist_1 & df_toclean['Shop ID'].isin(shop_2)]

* If I use df_toclean['Shop Distance'].min() instead of dist_1, it returns 0 results.
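
(One way to finish this mask-based approach without a fixed radius is to compare each row against its group minimum via transform; a sketch, not the asker's code. Comparing against df_toclean['Shop Distance'].min() fails because that is the single global minimum, which at most one shop can match:)

# keep the rows whose distance equals the minimum for their Shop ID
group_min = df_toclean.groupby('Shop ID')['Shop Distance'].transform('min')
bbtshops_min = df_toclean[df_toclean['Shop Distance'] == group_min]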

Try sorting the DataFrame by distance first, then dropping the duplicate shops:

df = shops_df.sort_values('Shop Distance')
df = df[~df['Shop ID'].duplicated()]  # the tilde (~) inverts the boolean mask
Or as a single chained expression (per @chmielcode's comment):
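
(The chained snippet itself did not survive in this copy; it would presumably combine the two steps above, roughly:)

df = (shops_df
      .sort_values('Shop Distance')
      .drop_duplicates(subset='Shop ID', keep='first'))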

You can use idxmin:

df.loc[df.groupby('Area')['Shop Distance'].idxmin()]

  Area Shop Name  Shop Distance           Shop ID
0  AAA        Ly             86  5d87790c46a77300
2  BBB        Hi            780  5ce5522012138400
3  CCC        Ly            450  5d87790c46a77300
4  MMM        Ju             43  4f76d0c0e4b01af7
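
Note that grouping by Area keeps the nearest shop per area; to keep exactly one row per shop instead, the same idxmin pattern applies grouped on Shop ID (a sketch):

# one row per Shop ID, keeping the occurrence with the smallest distance
df.loc[df.groupby('Shop ID')['Shop Distance'].idxmin()]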

Comments:

Try sort_values by Shop ID and Distance first (the default is ascending=True), then drop_duplicates on the subset of Shop ID and Distance.

This should work, but the duplicated(), ~ and mask selection looks overly complicated compared to chaining sort_values and drop_duplicates. @Alexander

@chmielcode This is very helpful!! Thanks. It also works after applying the shop_2 and shop_3 masks, since idxmin filters on the first occurrence of the minimum shop distance, but it takes more steps.