Python DataFrame multi-condition duplicate filter (with possible drop)
First, I believe these two questions are on the right track, but neither quite fits what I need.

I have a very large dataframe made up of tickets. Each ticket has several types of text fields. In some tickets, two different types of text field contain the same text. When that happens, I only want to use the DESCRIPTION type. A sample dataframe looks like this:
TICKETID  TYPE         TEXT
123       PROBLEMCODE  I want to use description for this item because it is a duplicate
123       DESCRIPTION  I want to use description for this item because it is a duplicate
123       CODE1        Other field
124       PROBLEMCODE  I need both here
124       DESCRIPTION  Because there are not duplicated
124       CODE1        Other field
125       PROBLEMCODE  I need both here
125       DESCRIPTION  I do not want to delete the above problem code because TICKETID is different
125       CODE1        This field is not super important but matches data and never know where problems arise
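Basically, I want to treat each TICKETID as its own entity and compare the PROBLEMCODE and DESCRIPTION text within it. If the two are equal, filter out the PROBLEMCODE row and keep the DESCRIPTION row.

In my head, the pseudocode is: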
For a given TICKETID:
    if TYPE is PROBLEMCODE or DESCRIPTION:
        if PROBLEMCODE TEXT == DESCRIPTION TEXT:
            DROP the PROBLEMCODE row
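For reference, the pseudocode can be written out as a literal per-ticket loop. This is only a minimal sketch, assuming the columns are named TICKETID, TYPE and TEXT as in the sample; the helper name `drop_duplicate_problemcode` is hypothetical, and the data is the first two tickets from the sample.

```python
import pandas as pd

def drop_duplicate_problemcode(data: pd.DataFrame) -> pd.DataFrame:
    """Per-ticket loop mirroring the pseudocode (hypothetical helper name)."""
    to_drop = []
    for _, group in data.groupby("TICKETID"):
        problem = group[group["TYPE"] == "PROBLEMCODE"]
        desc_texts = group.loc[group["TYPE"] == "DESCRIPTION", "TEXT"].tolist()
        # Drop a PROBLEMCODE row when its text matches a DESCRIPTION in the same ticket
        to_drop.extend(problem[problem["TEXT"].isin(desc_texts)].index)
    return data.drop(index=to_drop)

dup = "I want to use description for this item because it is a duplicate"
data = pd.DataFrame({
    "TICKETID": [123, 123, 123, 124, 124, 124],
    "TYPE": ["PROBLEMCODE", "DESCRIPTION", "CODE1"] * 2,
    "TEXT": [dup, dup, "Other field",
             "I need both here", "Because there are not duplicated", "Other field"],
})
result = drop_duplicate_problemcode(data)
print(result)
```

Only the PROBLEMCODE row of ticket 123 is removed; the original index labels are kept.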
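Obviously, looping over the dataframe this way is not efficient. pandas has plenty of ways to do this, as mentioned in the previously posted questions. I am just having a hard time working out which combination of methods and operations gets it done. I have tried: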
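# Create a duplicate-flag column
data['Dup'] = data.duplicated(subset=['TEXT'])
# Then group by ticket?
data.groupby(['TICKETID'])
# Somehow compare True vs. False, but I can only compare in index order (down the frame).
# From looking at other questions, I'm 99% sure there should be a one- or two-liner
# that does this, something like the following (which does not work):
dataTest = data.loc[data.groupby(['TICKETID']) & (data['TYPE'] == 'PROBLEMCODE' | 'DESCRIPTION').duplicated(subset=['TEXT'])]
# Then filter on True/False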
The expected output for the sample case drops only the PROBLEMCODE row for TICKETID 123:
TICKETID  TYPE         TEXT
123       DESCRIPTION  I want to use description for this item because it is a duplicate
123       CODE1        Other field
124       PROBLEMCODE  I need both here
124       DESCRIPTION  Because there are not duplicated
124       CODE1        Other field
125       PROBLEMCODE  I need both here
125       DESCRIPTION  I do not want to delete the above problem code because TICKETID is different
125       CODE1        This field is not super important but matches data and never know where problems arise
Let me know if you need more information.
import pandas as pd

df = pd.DataFrame(
    {
        'ticket': [123, 123, 123, 124, 124, 124],
        'type': ['PROBLEMCODE', 'DESCRIPTION', 'code1', 'PROBLEMCODE', 'DESCRIPTION', 'code1'],
        'text': [' I want to use description fo', ' I want to use description fo', 'other',
                 'another str', 'second one', 'other'],
    }
)
print(df)
   ticket         type                           text
0     123  PROBLEMCODE   I want to use description fo
1     123  DESCRIPTION   I want to use description fo
2     123        code1                          other
3     124  PROBLEMCODE                    another str
4     124  DESCRIPTION                     second one
5     124        code1                          other
# duplicates holds every row of type DESCRIPTION or PROBLEMCODE whose
# (ticket, text) pair occurs more than once
duplicates = df[
    (df.type.isin(['DESCRIPTION', 'PROBLEMCODE'])) &
    (df.duplicated(subset=['ticket', 'text'], keep=False))
]
print(duplicates)
   ticket         type                           text
0     123  PROBLEMCODE   I want to use description fo
1     123  DESCRIPTION   I want to use description fo
# remove duplicates from main df (using index to improve time)
df = df.drop(duplicates.index.tolist())
print(df)
# now concat the DESCRIPTION duplicates back onto df
# (df no longer contains the duplicated DESCRIPTION and PROBLEMCODE rows)
result = pd.concat([
    duplicates[duplicates.type == 'DESCRIPTION'], df
]).sort_values(by='ticket').reset_index(drop=True)
print(result)
   ticket         type                           text
0     123  DESCRIPTION   I want to use description fo
1     123        code1                          other
2     124  PROBLEMCODE                    another str
3     124  DESCRIPTION                     second one
4     124        code1                          other
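With the solution above, when ticket and text are the same for a DESCRIPTION row and a PROBLEMCODE row, only the DESCRIPTION row is kept in the output.

Hi, please show us an example of your expected output.

See the edit. It only drops the PROBLEMCODE row for 123.

See my answer. But you only want to check for duplicates between PROBLEMCODE and DESCRIPTION, and keep the DESCRIPTION?

Yes, if a given TICKETID has 2 duplicated text fields, keep the DESCRIPTION. I will test the answer on the main data.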
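One edge case may be worth checking before running this on the main data. As far as I can tell, `df.duplicated(...)` in the answer is evaluated over all rows, so a PROBLEMCODE whose text matches some other type in the same ticket (for example code1) but not the DESCRIPTION would also be flagged, dropped, and never added back. A possible variation, sketched below, runs the duplicate check only on the PROBLEMCODE and DESCRIPTION rows and drops by index, which also keeps the original row order. It assumes each ticket has at most one row per type; ticket 126 is a made-up row set added purely to illustrate the edge case.

```python
import pandas as pd

# Tickets 123 and 124 follow the answer's example; ticket 126 is a made-up
# edge case where PROBLEMCODE matches code1 but not DESCRIPTION.
df = pd.DataFrame({
    "ticket": [123, 123, 123, 124, 124, 124, 126, 126, 126],
    "type": ["PROBLEMCODE", "DESCRIPTION", "code1"] * 3,
    "text": [" I want to use description fo", " I want to use description fo", "other",
             "another str", "second one", "other",
             "other", "third one", "other"],
})

# Only PROBLEMCODE and DESCRIPTION rows take part in the duplicate check
pair = df[df["type"].isin(["PROBLEMCODE", "DESCRIPTION"])]
# Within that subset, flag rows whose (ticket, text) pair occurs more than once
dup_in_pair = pair.duplicated(subset=["ticket", "text"], keep=False)
# Drop only the flagged PROBLEMCODE rows from the full frame
drop_idx = pair[dup_in_pair & (pair["type"] == "PROBLEMCODE")].index
filtered = df.drop(index=drop_idx)
print(filtered)
```

Here only the PROBLEMCODE row of ticket 123 is removed, and the PROBLEMCODE row of ticket 126 survives. Call `reset_index(drop=True)` on the result if a fresh index is wanted.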