Python: is there an efficient way to fill in correct values for inconsistent data from a second table when the data is large?
Tags: python, pandas, data-science, data-analysis, fuzzywuzzy

I have a table with inconsistent data, as shown below:

Table 1:

Flight ID        Engine number  Aircraft tail  Year  Month
000000_20180121  000000         G-RHBZ         2018  01
258741_20171021  258741         H-RZBE         2017  10
_20150214                       V-RDER         2015  02
_20110287        NO-NUMBER      G-EHRK         2011  12
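I think you can use merge. Filtering Table 1 down to the rows with a bad engine number and merging them with the second table on aircraft tail, year, and month gives:

flight_id        engine_number_bad  aircraft_tail  year  month  engine_number_good
000000_20180121  000000             G-RHBZ         2018  01     589745
_20150214                           V-RDER         2015  02     348741
_20110287        NO-NUMBER          G-EHRK         2011  12     587981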
Yes, it works. I will test it on the full data set and check the execution time. Thank you very much.
import pandas as pd

# Table 1: flight records, some with missing or placeholder engine numbers
df1 = pd.DataFrame(data=[
    {"flight_id": "000000_20180121", "engine_number": "000000",
     "aircraft_tail": "G-RHBZ", "year": "2018", "month": "01"},
    {"flight_id": "258741_20171021", "engine_number": "258741",
     "aircraft_tail": "H-RZBE", "year": "2017", "month": "10"},
    {"flight_id": "_20150214", "engine_number": "",
     "aircraft_tail": "V-RDER", "year": "2015", "month": "02"},
    {"flight_id": "_20110287", "engine_number": "NO-NUMBER",
     "aircraft_tail": "G-EHRK", "year": "2011", "month": "12"},
])

# Table 2: reference table with the correct engine numbers
df2 = pd.DataFrame(data=[
    {"engine_number": "258741", "aircraft_tail": "H-RZBE", "year": "2017", "month": "10"},
    {"engine_number": "348741", "aircraft_tail": "V-RDER", "year": "2015", "month": "02"},
    {"engine_number": "348741", "aircraft_tail": "V-RDER", "year": "2015", "month": "03"},
    {"engine_number": "589745", "aircraft_tail": "G-RHBZ", "year": "2018", "month": "01"},
    {"engine_number": "587981", "aircraft_tail": "G-EHRK", "year": "2011", "month": "12"},
])


# Validator function: True when the engine number is empty or a known placeholder
def bad_engine_number_detector(engine_number):
    lst_invalid_engine_number = ["000000", "NO-NUMBER"]
    return engine_number == "" or engine_number in lst_invalid_engine_number


# Identify invalid entries on df1
mask = df1["engine_number"].apply(bad_engine_number_detector)

# Merge both tables (df1 filtered only with bad entries)
result = df1.loc[mask].merge(df2,
                             on=["aircraft_tail", "year", "month"],
                             suffixes=["_bad", "_good"])
print(result)
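
Since the question is about large data and about filling the corrected values back into the first table, here is a minimal sketch of one way that could look. It is not part of the original answer. It swaps the row-by-row `apply` for the vectorized `Series.isin`, and uses a left merge so every row of `df1` is kept. The names `invalid_values`, `lookup`, and `fixed` are made up for illustration, and the sketch assumes that the three placeholder values shown are the only invalid ones and that `df2` should contribute at most one engine number per (aircraft_tail, year, month) key.

```python
import numpy as np
import pandas as pd

# Small stand-ins for the two tables used in the answer above.
df1 = pd.DataFrame({
    "flight_id": ["000000_20180121", "258741_20171021", "_20150214", "_20110287"],
    "engine_number": ["000000", "258741", "", "NO-NUMBER"],
    "aircraft_tail": ["G-RHBZ", "H-RZBE", "V-RDER", "G-EHRK"],
    "year": ["2018", "2017", "2015", "2011"],
    "month": ["01", "10", "02", "12"],
})
df2 = pd.DataFrame({
    "engine_number": ["258741", "348741", "348741", "589745", "587981"],
    "aircraft_tail": ["H-RZBE", "V-RDER", "V-RDER", "G-RHBZ", "G-EHRK"],
    "year": ["2017", "2015", "2015", "2018", "2011"],
    "month": ["10", "02", "03", "01", "12"],
})

keys = ["aircraft_tail", "year", "month"]

# Assumption: these placeholders are the only invalid engine numbers.
invalid_values = ["", "000000", "NO-NUMBER"]

# Vectorized check, avoiding a Python-level function call per row.
mask = df1["engine_number"].isin(invalid_values)

# Assumption: one engine number per key. Dropping duplicate keys keeps the
# left merge from multiplying rows of df1.
lookup = df2.drop_duplicates(subset=keys)

# Left merge keeps every row of df1, in its original order.
merged = df1.merge(lookup, on=keys, how="left",
                   suffixes=("", "_good"), validate="many_to_one")

# Replace only the bad values for which the reference table has a match.
good = merged["engine_number_good"].to_numpy()
replace = mask.to_numpy() & pd.notna(good)

fixed = df1.copy()
fixed["engine_number"] = np.where(replace, good, df1["engine_number"].to_numpy())

print(fixed)
```

Rows whose engine number was already valid are left untouched, and bad rows with no match in `df2` keep their original placeholder, so they can still be found and handled separately afterwards.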