Pandas: remove duplicates from a DataFrame within groups under a specific condition


Hi everyone, I have a DataFrame with the following contents:

name,mv_str
abc,Exorsist part1
abc,doc str 2D
abc,doc str 3D
abc,doc str QA
abc,doc flash
def,plastic
def,plastic income
def,doc str 2D   ###i added this row for better clarity

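My expected output is shown below. Within each group I want only unique rows: for each mail ID (the name column), the mv_str values should not be similar, i.e. the first two words of one mv_str should not appear again in a second (or any other) row for that same name.

Note: the comparison should be done per name.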

name,mv_str
abc,Exorsist part1
abc,doc str 2D   ###3D and QA are removed because the first 2 words "doc str" matched
abc,doc flash   ###only the 1st word matches, the 2nd word does not
def,plastic
def,plastic income  #It should be present as only one word is matching
def,doc str 2D   ###this row should be there as this is for another User

Could anyone help me work out the logic? A code sample would be a huge help. Thanks.

I think you need to split the strings in the mv_str column on whitespace with str.split, creating a new DataFrame df1:

df1 = df.mv_str.str.split(expand=True)
print (df1)
          0       1     2
0  Exorsist   part1  None
1       doc     str    2D
2       doc     str    3D
3       doc     str    QA
4       doc   flash  None
5   plastic    None  None
6   plastic  income  None
7       doc     str    2D

Add df1 to the original DataFrame column-wise, for example with df = pd.concat([df, df1], axis=1).

Then use drop_duplicates on the columns name, 0 and 1; the first value is kept:

print (df.drop_duplicates(['name',0,1]))
  name          mv_str         0       1     2
0  abc  Exorsist part1  Exorsist   part1  None
1  abc      doc str 2D       doc     str    2D
4  abc       doc flash       doc   flash  None
5  def         plastic   plastic    None  None
6  def  plastic income   plastic  income  None
7  def      doc str 2D       doc     str    2D
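The steps so far can be put together as a self-contained sketch. The sample data is taken from the question; the pd.concat call is an assumption about how the word columns were joined to the original frame, and the variable name result is only for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "name": ["abc", "abc", "abc", "abc", "abc", "def", "def", "def"],
    "mv_str": [
        "Exorsist part1", "doc str 2D", "doc str 3D", "doc str QA",
        "doc flash", "plastic", "plastic income", "doc str 2D",
    ],
})

# Split on whitespace; each word lands in its own column (0, 1, 2).
df1 = df["mv_str"].str.split(expand=True)

# Assumed joining step: put the word columns next to the original ones.
df = pd.concat([df, df1], axis=1)

# Keep the first row for every (name, first word, second word) combination.
result = df.drop_duplicates(subset=["name", 0, 1])
print(result)
```

Because name is part of the subset, rows are only compared with other rows of the same name.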

Remove the helper columns 0, 1 and 2 with drop, or better, just select the name and mv_str columns:

print (df.drop_duplicates(['name',0,1])[['name','mv_str']])
  name          mv_str
0  abc  Exorsist part1
1  abc      doc str 2D
4  abc       doc flash
5  def         plastic
6  def  plastic income
7  def      doc str 2D
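If the helper columns are not wanted at all, the same idea can be sketched with a temporary key and a boolean mask. This is an alternative formulation, not the code from the answer above; the names key, mask and result are hypothetical.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "name": ["abc", "abc", "abc", "abc", "abc", "def", "def", "def"],
    "mv_str": [
        "Exorsist part1", "doc str 2D", "doc str 3D", "doc str QA",
        "doc flash", "plastic", "plastic income", "doc str 2D",
    ],
})

# Build a key from at most the first two words of each mv_str value.
key = df["mv_str"].str.split().str[:2].str.join(" ")

# Mark rows whose (name, key) pair has already appeared earlier.
mask = df.assign(key=key).duplicated(subset=["name", "key"])

# Keep only the first occurrence of each pair; df itself is left unchanged.
result = df[~mask]
print(result)
```

A one-word value such as plastic produces the key plastic, which differs from plastic income, so both rows survive.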

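@jezrael, could you explain what you did above? I'm a beginner, so it is hard for me to follow.
Yes, I'm looking for another solution. Give me a moment and I'll explain.
Please check my explanation; I'm not yet 100% sure I understand what you mean.
@jezrael: suppose I have another row for def, namely def,doc str 2D. Then I lose that row, which is not correct.
Yes, but why is the def,plastic income row not removed as well? I don't understand this case.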

The variant that removes the helper columns with drop gives the same result:

print (df.drop_duplicates(['name',0,1]).drop([0,1,2], axis=1))
  name          mv_str
0  abc  Exorsist part1
1  abc      doc str 2D
4  abc       doc flash
5  def         plastic
6  def  plastic income
7  def      doc str 2D
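Regarding the questions in the comments about which rows count as duplicates, one way to check is to print the comparison keys next to the flag that duplicated returns. This is only an inspection sketch; the names words, keys and flags are hypothetical.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "name": ["abc", "abc", "abc", "abc", "abc", "def", "def", "def"],
    "mv_str": [
        "Exorsist part1", "doc str 2D", "doc str 3D", "doc str QA",
        "doc flash", "plastic", "plastic income", "doc str 2D",
    ],
})

# The comparison key is (name, first word, second word).
words = df["mv_str"].str.split(expand=True)
keys = pd.concat([df["name"], words[[0, 1]]], axis=1)

# True marks a row whose key already occurred in an earlier row.
flags = keys.duplicated()
print(keys.assign(duplicate=flags))
```

Only the doc str 3D and doc str QA rows of abc are flagged. The def,plastic income row differs from def,plastic in the second word, and def,doc str 2D differs from the abc rows in name, so neither is treated as a duplicate.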
