How to remove duplicate values in a dataset: Python


I want to remove duplicates from a dataset by keeping the item with the highest value. Right now I am using pandas:

c_maxes = hospProfiling.groupby(['Hospital_ID', 'District_ID'], group_keys=False)\
                .apply(lambda x: x.ix[x['Hospital_employees'].idxmax()])
print c_maxes

c_maxes.to_csv('data/external/HospitalProfilingMaxes.csv')
Doing this turns the columns of the initial dataset:
Hospital_ID,District_ID,Hospital_employees
into
Hospital_ID,District_ID,Hospital_ID,Hospital_employees

The columns used for grouping are being duplicated. What is the mistake here?

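Edit:

When I use the groupby() function, an extra column is added at the start of the data. The column has no name and is just a sequence number for every row. It is visible in the output of the second answer to the question here. I want to remove this extra column because I do not need it. I tried this: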

hosprofiling.drop(hosprofiling.columns[0],axis=1)

This code does not remove the column. How can I remove it?

I think you need:

hospProfiling.loc[hospProfiling.groupby(['Hospital_ID', 'District_ID'])['Hospital_employees']
                               .idxmax()]
I was very surprised by the other answer, so I did some research to check whether the function is really useless:

Sample:

import pandas as pd
hospProfiling = pd.DataFrame({'Hospital_ID': {0: 'A', 1: 'A', 2: 'B', 3: 'A', 4: 'A', 5: 'B', 6: 'A', 7: 'A', 8: 'B', 9: 'B', 10: 'A', 11: 'B', 12: 'A'}, 'Name': {0: 'Sam', 1: 'Annie', 2: 'Fred', 3: 'Sam', 4: 'Annie', 5: 'Fred', 6: 'Sam', 7: 'Annie', 8: 'Fred', 9: 'James', 10: 'Alan', 11: 'Julie', 12: 'Greg'}, 'District_ID': {0: 'M', 1: 'F', 2: 'M', 3: 'M', 4: 'F', 5: 'M', 6: 'M', 7: 'F', 8: 'M', 9: 'M', 10: 'M', 11: 'F', 12: 'M'}, 'Hospital_employees': {0: 25, 1: 41, 2: 70, 3: 44, 4: 12, 5: 14, 6: 20, 7: 10, 8: 30, 9: 18, 10: 56, 11: 28, 12: 33}, 'Val': {0: 100, 1: 7, 2: 14, 3: 200, 4: 5, 5: 20, 6: 1, 7: 0, 8: 7, 9: 9, 10: 6, 11: 9, 12: 47}})
hospProfiling = hospProfiling[['Hospital_ID','District_ID','Hospital_employees','Val','Name']]
hospProfiling.sort_values(by=['Hospital_ID','District_ID'], inplace=True)
print (hospProfiling)
   Hospital_ID District_ID  Hospital_employees  Val   Name
1            A           F                  41    7  Annie
4            A           F                  12    5  Annie
7            A           F                  10    0  Annie
0            A           M                  25  100    Sam
3            A           M                  44  200    Sam
6            A           M                  20    1    Sam
10           A           M                  56    6   Alan
12           A           M                  33   47   Greg
11           B           F                  28    9  Julie
2            B           M                  70   14   Fred
5            B           M                  14   20   Fred
8            B           M                  30    7   Fred
9            B           M                  18    9  James
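The main difference is how the other columns are handled. If you use max, it returns the maximum of every column independently (here Hospital_employees and Val, and also Name), so the values in one output row can come from different input rows: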

c_maxes = hospProfiling.groupby(['Hospital_ID','District_ID'],as_index = False).max()
print (c_maxes)
  Hospital_ID District_ID  Hospital_employees   Name  Val
0           A           F                  41  Annie    7
1           A           M                  56    Sam  200
2           B           F                  28  Julie    9
3           B           M                  70  James   20
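
To make the point about max concrete, here is a minimal sketch on a hypothetical two-row frame (the data is made up for illustration, reusing the column names from the sample). It checks whether the row produced by max exists anywhere in the input:

```python
import pandas as pd

# Hypothetical data: two rows that fall into the same (Hospital_ID, District_ID) group.
df = pd.DataFrame({
    'Hospital_ID': ['A', 'A'],
    'District_ID': ['M', 'M'],
    'Hospital_employees': [44, 56],
    'Val': [200, 6],
    'Name': ['Sam', 'Alan'],
})

# max() is computed column by column, so the single result row
# combines values taken from different source rows.
per_column_max = df.groupby(['Hospital_ID', 'District_ID'], as_index=False).max()

# Inner merge on all shared columns: keeps only the rows of df that are
# identical to the aggregated row.
matches = df.merge(per_column_max, how='inner')

print(per_column_max)
print('rows of df identical to the aggregated row:', len(matches))
```

In this sketch the aggregated row pairs 56 employees (from Alan's row) with Val 200 and the name Sam (both from Sam's row), so no original row matches it. That is fine for a pure summary, but it is not the same as keeping the best row of each group.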

# Parentheses allow the method chain to continue on the next line.
c_maxes = (hospProfiling.groupby(['Hospital_ID','District_ID'], as_index=False)
                        .agg({'Hospital_employees': 'max'}))
print (c_maxes)
  Hospital_ID District_ID  Hospital_employees
0           A           F                  41
1           A           M                  56
2           B           F                  28
3           B           M                  70
The idxmax function returns, for each group, the index label of the row that holds the maximum of the chosen column:

print (hospProfiling.groupby(['Hospital_ID', 'District_ID'])['Hospital_employees'].idxmax())
A            F               1
             M              10
B            F              11
             M               2
Name: Hospital_employees, dtype: int64
You can then select those rows of the DataFrame with loc, as in the snippet at the top of this answer.
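
Putting the pieces together, the following is a small sketch rather than a definitive implementation. It uses a hypothetical miniature of the data and assumes that the unnamed leading column mentioned in the question's edit is the DataFrame index, which is why dropping columns[0] does not affect it. The CSV is written to an in-memory buffer here instead of the question's file path:

```python
import io
import pandas as pd

# Hypothetical miniature of the hospProfiling data.
hospProfiling = pd.DataFrame({
    'Hospital_ID': ['A', 'A', 'A', 'B', 'B'],
    'District_ID': ['F', 'F', 'M', 'M', 'M'],
    'Hospital_employees': [41, 12, 56, 70, 14],
    'Name': ['Annie', 'Annie', 'Alan', 'Fred', 'Fred'],
})

# Index label of the row with the most employees in each group.
idx = hospProfiling.groupby(['Hospital_ID', 'District_ID'])['Hospital_employees'].idxmax()

# Select those complete rows; the grouping columns appear only once.
c_maxes = hospProfiling.loc[idx]

# The unnamed leading column is the index, not a regular column.
# index=False leaves it out of the CSV output.
buffer = io.StringIO()
c_maxes.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

If the index should be renumbered in memory rather than just omitted from the file, c_maxes.reset_index(drop=True) does that.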

Why not use the groupby max method?

hospProfiling.groupby(['Hospital_ID','District_ID'],as_index = False).max()
If there happen to be more than three columns, replace max with agg:

hospProfiling.groupby(['Hospital_ID','District_ID'],as_index = False).agg({'Hospital_employees': 'max'})
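
Since agg returns only the grouping columns and the aggregated column, a different technique is needed when the whole winning row should be kept. One option, which does not appear in either answer, is to sort and then call drop_duplicates. This is a sketch on hypothetical data with no ties within a group:

```python
import pandas as pd

# Hypothetical miniature of the hospProfiling data.
hospProfiling = pd.DataFrame({
    'Hospital_ID': ['A', 'A', 'A', 'B', 'B'],
    'District_ID': ['F', 'F', 'M', 'M', 'M'],
    'Hospital_employees': [41, 12, 56, 70, 14],
    'Name': ['Annie', 'Annie', 'Alan', 'Fred', 'Fred'],
})

deduped = (
    hospProfiling
    # Put the largest employee counts first ...
    .sort_values('Hospital_employees', ascending=False)
    # ... so that the first row seen for each key pair is the one to keep.
    .drop_duplicates(subset=['Hospital_ID', 'District_ID'], keep='first')
    # Restore a readable order and a fresh 0..n-1 index.
    .sort_values(['Hospital_ID', 'District_ID'])
    .reset_index(drop=True)
)
print(deduped)
```

Every column of the surviving row, including Name, comes from the same input row, which is the behavior the question asks for.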
