How to remove duplicates in a DataFrame using Python

python python-3.x pandas

So, the DataFrame is:

Product    Price  Weight  Range   Count
   A        40      20      1-3     20
   A        40      20      4-7     23
   B        20      73      1-3     54
   B        20      73      4-7     43
   B        20      73      8-15    34
   B        20      73      >=16    12
   C        10      20      4-7     22

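Basically, each product has a price and a weight. Range is the number of consecutive days the product was sold, and Count is the number of units sold within that range.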
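
To experiment with the answers, the table above can be rebuilt as a DataFrame. This is only a sketch: it assumes Range is stored as plain strings and the other value columns as integers, which the question does not state.

```python
import pandas as pd

# Sample data copied from the question's table.
# Assumption: Range holds plain strings, not categoricals.
df = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "B", "B", "C"],
    "Price": [40, 40, 20, 20, 20, 20, 10],
    "Weight": [20, 20, 73, 73, 73, 73, 20],
    "Range": ["1-3", "4-7", "1-3", "4-7", "8-15", ">=16", "4-7"],
    "Count": [20, 23, 54, 43, 34, 12, 22],
})
print(df)
```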

Expected output:

Product    Price  Weight  Range   Count
   A        40      20      1-3     20
                            4-7     23
   B        20      73      1-3     54
                            4-7     43
                            8-15    34
                            >=16    12
   C        10      20      4-7     22

or, alternatively, a wide layout with one column per Range value.

Producing the second output makes more sense than the first. Use set_index, then unstack:

(df.set_index(['Product', 'Price', 'Weight', 'Range'])
  .Count
  .unstack(fill_value=0)
  .reset_index()
)

Range Product  Price  Weight  1-3  4-7  8-15  >=16
0           A     40      20   20   23     0     0
1           B     20      73   54   43    34    12
2           C     10      20    0   22     0     0
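
As a rough illustration of what fill_value changes, the sketch below runs the same reshape with and without it on a rebuilt copy of the sample data (the string dtype of Range is an assumption). Without fill_value, missing Product/Range combinations come out as NaN and the count columns become floats.

```python
import pandas as pd

# Sample data rebuilt from the question; Range as plain strings is an assumption.
df = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "B", "B", "C"],
    "Price": [40, 40, 20, 20, 20, 20, 10],
    "Weight": [20, 20, 73, 73, 73, 73, 20],
    "Range": ["1-3", "4-7", "1-3", "4-7", "8-15", ">=16", "4-7"],
    "Count": [20, 23, 54, 43, 34, 12, 22],
})

# Series of counts indexed by the four key columns.
counts = df.set_index(["Product", "Price", "Weight", "Range"])["Count"]

# Default unstack: missing combinations are filled with NaN.
wide_nan = counts.unstack().reset_index()

# With fill_value=0: missing combinations become 0.
wide_zero = counts.unstack(fill_value=0).reset_index()

print(wide_nan)
print(wide_zero)
```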

Try this:

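# flag every row whose Product has already appeared earlier
mask = df.duplicated(subset=['Product'])
# blank the repeated identifying columns on those rows
df.loc[mask, ['Product', 'Price', 'Weight']] = ''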

Output:

  Product Price Weight Range  Count
0       A    40     20   1-3     20
1                        4-7     23
2       B    20     73   1-3     54
3                        4-7     43
4                       8-15     34
5                       >=16     12
6       C    10     20   4-7     22
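
A self-contained variant of the masking approach might look like the sketch below. It assumes the sample data from the question and works on a copy cast to object dtype first, because recent pandas versions may warn or raise when an empty string is written into an integer column. The names cols and out are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question; Range as plain strings is an assumption.
df = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "B", "B", "C"],
    "Price": [40, 40, 20, 20, 20, 20, 10],
    "Weight": [20, 20, 73, 73, 73, 73, 20],
    "Range": ["1-3", "4-7", "1-3", "4-7", "8-15", ">=16", "4-7"],
    "Count": [20, 23, 54, 43, 34, 12, 22],
})

cols = ["Product", "Price", "Weight"]

# Work on a copy whose identifying columns are object dtype,
# so they can hold both the original values and empty strings.
out = df.astype({c: object for c in cols})

# True for every row whose Product already appeared in an earlier row.
mask = out.duplicated(subset=["Product"])

# Blank the repeated values; the original df is left untouched.
out.loc[mask, cols] = ""
print(out)
```

The result is meant for display only: the blanked columns no longer hold usable numbers, so any further calculation should start from the untouched df.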

The wide layout, with missing Range values left as NaN:

Range Product  Price  Weight   1-3   4-7  8-15  >=16
0           A     40      20  20.0  23.0   NaN   NaN
1           B     20      73  54.0  43.0  34.0  12.0
2           C     10      20   NaN  22.0   NaN   NaN

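In my opinion, the first solution is not recommended if you need to process the DataFrame later.

The second solution is much better. If the real data contains duplicates, aggregate the values, for example by sum: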
# convert categoricals to strings
df['Range'] = df['Range'].astype(str)

# sum Count per key so the index is unique, then spread Range into columns
df = (df.groupby(['Product', 'Price', 'Weight', 'Range'])['Count']
        .sum()
        .unstack(fill_value=0)
        .reset_index())
print(df)
Range Product  Price  Weight  1-3  4-7  8-15  >=16
0           A     40      20   20   23     0     0
1           B     20      73   54   43    34    12
2           C     10      20    0   22     0     0
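
To see why the aggregation matters, here is a small sketch on hypothetical data (not from the question) in which Range is categorical and the A / 1-3 combination appears twice. A plain set_index plus unstack cannot reshape an index with duplicate entries and raises a ValueError, while summing per key first makes the index unique.

```python
import pandas as pd

# Hypothetical data: Range is categorical and the A / 1-3 key appears twice.
df = pd.DataFrame({
    "Product": ["A", "A", "A", "B"],
    "Price": [40, 40, 40, 20],
    "Weight": [20, 20, 20, 73],
    "Range": pd.Categorical(["1-3", "1-3", "4-7", "1-3"]),
    "Count": [20, 5, 23, 54],
})
keys = ["Product", "Price", "Weight", "Range"]

# Convert the categorical column to plain strings before reshaping.
df["Range"] = df["Range"].astype(str)

# A plain unstack cannot reshape an index with duplicate entries.
try:
    df.set_index(keys)["Count"].unstack(fill_value=0)
    unstack_failed = False
except ValueError:
    unstack_failed = True

# Summing per key first makes the index unique, so unstack succeeds.
wide = df.groupby(keys)["Count"].sum().unstack(fill_value=0).reset_index()
print(unstack_failed)
print(wide)
```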

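Have you tried my solution? It is almost identical to the answer you just accepted.

Something tells me this post is biased. Two nearly identical solutions, one posted an hour earlier. I bet set_index is also faster than groupby, but they don't seem to care.

Oh, you are a lucky man. Enjoy.

It gives an error: cannot insert an item into a CategoricalIndex that is not already an existing category. I just saw the following code: df['Range'] = df['Range'].astype(str), and it worked perfectly.

@san Thanks for not telling me about this. If you don't tell me, I can't know there was an error, because I can't read your mind.

Won't happen again, my friend. No offense.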