Multi-condition duplicate filter for a Python DataFrame (possibly with drop)


python pandas dataframe

First off, I think these two questions are on the right track, but neither quite matches what I need.

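I have a very large DataFrame made up of tickets. Each ticket has several types of text field. In some tickets, two different types of text field contain the same text. When that happens, I only want to use the DESCRIPTION type. A sample DataFrame looks like this: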

TICKETID    TYPE    TEXT
123 PROBLEMCODE I want to use description for this item because it is a duplicate
123 DESCRIPTION I want to use description for this item because it is a duplicate
123 CODE1       Other field
124 PROBLEMCODE I need both here
124 DESCRIPTION Because there are not duplicated
124 CODE1       Other field
125 PROBLEMCODE I need both here
125 DESCRIPTION I do not want to delete the above problem code because TICKETID is different
125 CODE1       This field is not super important but matches data and never know where problems arise

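Basically, I want to treat each TICKETID as its own entity and compare the PROBLEMCODE and DESCRIPTION text; if they are equal, filter out the PROBLEMCODE row and keep the DESCRIPTION row.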

In my head, the pseudocode is:

For a given ticketID:
    if Type = PROBLEMCODE or DESCRIPTION
        if TEXT = TEXT
            DROP PROBLEMCODE

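Obviously, looping over the DataFrame this way is inefficient. pandas has many ways to do this, some of which are mentioned in the previously posted questions. I am just having trouble figuring out which combination of methods and operations would accomplish it. I tried: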

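# Create a dup column
data['Dup'] = data.duplicated(subset=['TEXT'])
# Then group by ticket?
data.groupby(['TICKETID'])
# Somehow compare True with False, but I can only compare in index order (down the frame).
# From looking at the other questions, I am 99% sure there should be a one- or two-liner
# that does it, something like:
dataTest = data.loc[data.groupby(['TICKETID']) & (data['TYPE'] == 'PROBLEMCODE' | 'DESCRIPTION').duplicated(subset=['TEXT'])]
# Then filter based on True/False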

The expected output for the sample case drops only the TICKETID=123 PROBLEMCODE row, as shown below:

TICKETID    TYPE    TEXT
123 DESCRIPTION I want to use description for this item because it is a duplicate
123 CODE1       Other field
124 PROBLEMCODE I need both here
124 DESCRIPTION Because there are not duplicated
124 CODE1       Other field
125 PROBLEMCODE I need both here
125 DESCRIPTION I do not want to delete the above problem code because TICKETID is different
125 CODE1       This field is not super important but matches data and never know where problems arise

Let me know if you need more information.

    import pandas as pd

    df = pd.DataFrame(
        {
            'ticket':[123,123,123,124,124,124],
            'type':['PROBLEMCODE','DESCRIPTION','code1','PROBLEMCODE','DESCRIPTION','code1'],
            'text':[' I want to use description fo',' I want to use description fo','other',
             'another str','second one','other'],
    
        }
    )
    print(df)
       ticket         type                           text
    0     123  PROBLEMCODE   I want to use description fo
    1     123  DESCRIPTION   I want to use description fo
    2     123        code1                          other
    3     124  PROBLEMCODE                    another str
    4     124  DESCRIPTION                     second one
    5     124        code1                          other
    
    # duplicates holds every DESCRIPTION or PROBLEMCODE row whose (ticket, text) pair occurs more than once
    duplicates = df[
        (df.type.isin(['DESCRIPTION','PROBLEMCODE'])) &
        (df.duplicated(subset=['ticket','text'],keep=False))
    ]
    
    print(duplicates)
       ticket         type                           text
    0     123  PROBLEMCODE   I want to use description fo
    1     123  DESCRIPTION   I want to use description fo
    
    # remove the duplicated rows from the main df (dropping by index to save time)
    df = df.drop(duplicates.index)
    print(df)

    # now concat the DESCRIPTION rows from duplicates back onto df
    # (df no longer contains the duplicated DESCRIPTION and PROBLEMCODE rows)
    result = pd.concat([
        duplicates[duplicates.type=='DESCRIPTION'],df
    ]).sort_values(by='ticket').reset_index(drop=True)
    print(result)
       ticket         type                           text
    0     123  DESCRIPTION   I want to use description fo
    1     123        code1                          other
    2     124  PROBLEMCODE                    another str
    3     124  DESCRIPTION                     second one
    4     124        code1                          other

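With the solution above, when the ticket and text are the same for DESCRIPTION and PROBLEMCODE, you get output without the duplicate.

Hi, please show us an example of your expected output.

See the edit. It only drops the 123 PROBLEMCODE row.

Check my answer. But you only want to check for duplicates between PROBLEMCODE and DESCRIPTION, and keep the DESCRIPTION?

Yes, if a given TICKETID has two duplicated text fields, keep DESCRIPTION. I will test the answer on the main data.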
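
For comparison, here is a minimal sketch of a mask-based variant using the question's column names (TICKETID, TYPE, TEXT). It restricts the duplicate check to the PROBLEMCODE and DESCRIPTION rows before calling `duplicated`, so a CODE1 row that happens to share its text with a PROBLEMCODE row cannot cause a drop. It assumes each ticket has at most one row per TYPE; the variable names `pair`, `dup_in_pair` and `to_drop` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
data = pd.DataFrame({
    "TICKETID": [123, 123, 123, 124, 124, 124, 125, 125, 125],
    "TYPE": ["PROBLEMCODE", "DESCRIPTION", "CODE1"] * 3,
    "TEXT": [
        "I want to use description for this item because it is a duplicate",
        "I want to use description for this item because it is a duplicate",
        "Other field",
        "I need both here",
        "Because there are not duplicated",
        "Other field",
        "I need both here",
        "I do not want to delete the above problem code because TICKETID is different",
        "This field is not super important but matches data and never know where problems arise",
    ],
})

# Look only at the two types that should be compared with each other.
pair = data[data["TYPE"].isin(["PROBLEMCODE", "DESCRIPTION"])]

# Within that subset, flag every row whose (TICKETID, TEXT) pair occurs more than once.
# Assumption: at most one row per TYPE per ticket, so a flagged PROBLEMCODE row
# can only be matching the DESCRIPTION row of the same ticket.
dup_in_pair = pair.duplicated(subset=["TICKETID", "TEXT"], keep=False)

# Index labels of the PROBLEMCODE rows that duplicate their ticket's DESCRIPTION.
to_drop = pair[dup_in_pair & (pair["TYPE"] == "PROBLEMCODE")].index

# Drop those rows from the full frame; all other rows keep their order and index.
result = data.drop(index=to_drop)
print(result)
```

Because nothing is concatenated or re-sorted, the remaining rows keep their original order and index labels; call `reset_index(drop=True)` afterwards if a fresh index is wanted.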