Data cleaning in Python pandas: replacing null values with specific strings if those strings are contained in another column
Tags: python, pandas, data-cleaning

I'm currently working on a car emissions data set in which I would like to clean/standardise the model names. The data set is fairly large, but here are the first 10 rows:
    import pandas as pd

    cars_em_df = pd.DataFrame({
        'manufacturer_name_mapped': ['FIAT', 'FIAT', 'FIAT', 'FIAT', 'FIAT', 'BMW AG', 'BMW AG', 'BMW AG', 'BMW AG', 'BMW AG'],
        'commercial_name': ['124 gt multiair auto', '500l wagon pop star t-jet', 'doblo combi 1.4 95',
                            'panda 0.9t sge 85 natural power', 'punto 1.4 77 lpg', 'x4 xdrive20d se auto',
                            '216d active tourer b37 f45', '220d gran tourer b47 f46', 'x1 xdrive18d sport',
                            '320i xdrive m sport gt auto'],
        'fuel_type_mapped': ['Petrol', 'Petrol', 'Petrol', 'NG-Biomethane', 'LPG', 'Diesel', 'Diesel', 'Diesel', 'Diesel', 'Petrol'],
        'file_year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018],
        'emissions': [153, 158, 165, 86, 114, 131, 166, 200, 151, 149],
        'commercial_name_cleaned': ['124', '500', None, 'panda', 'punto', 'x4', None, None, 'x1', None],
    })
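The 'commercial_name_cleaned' column on the right is the result of my first round of cleaning, in which I matched the names in the 'commercial_name' column against a list of standardised names from a different source. As you can see, these are very simple, short names. Whenever I couldn't match a model name, my function returned None.

As a second step, I now want to do the following: where the value is None, search the adjacent 'commercial_name' column for specific strings and replace the None with a model name I specify. I tried this: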
    def str_ops(commercial_name_cleaned,commercial_name):
        if commercial_name_cleaned == None:
            if '216' in commercial_name:
                return '2-series'
            elif '220' in commercial_name:
                return '2-series'
            elif '320' in commercial_name:
                return '3-series'
I would then apply this function to the data frame:
cars_em_df['commercial_name_cleaned'] = cars_em_df.apply(lambda x: str_ops(str(x.commercial_name_cleaned), str(x.commercial_name)), axis=1)
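Note that if '320', '220', etc. are not found in 'commercial_name', the function should change nothing and simply return the value already in 'commercial_name_cleaned'. However, when I apply the function, the entire 'commercial_name_cleaned' column turns into None values. So something is definitely wrong with the function. Does anyone know how to fix this?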
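Many thanks for your help!

You are getting None values in the commercial_name_cleaned column because str_ops does not return anything on most of its code paths; when a Python function does not explicitly return a value, it implicitly returns None. In addition, the apply call passes str(x.commercial_name_cleaned), so a missing value reaches the function as the string 'None', and the comparison == None is never true.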
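As a small illustration of both effects, here is a minimal sketch; the function name no_fallback is hypothetical and only serves the demonstration:

```python
def no_fallback(value):
    # Only one branch returns something; every other path falls off the end
    # of the function, so Python returns None implicitly.
    if value == "x":
        return "matched"

print(no_fallback("x"))  # matched
print(no_fallback("y"))  # None

# str() turns a missing value into the four-character string 'None',
# which is not equal to the None object.
print(str(None) == None)    # False
print(str(None) == "None")  # True
```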
Replace:

    def str_ops(commercial_name_cleaned,commercial_name):
        if commercial_name_cleaned == None:
            if '216' in commercial_name:
                return '2-series'
            elif '220' in commercial_name:
                return '2-series'
            elif '320' in commercial_name:
                return '3-series'
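with:

    def str_ops(commercial_name_cleaned,commercial_name):
        # The apply call wraps the value in str(), so a missing value is the string 'None'
        if commercial_name_cleaned == 'None':
            if '216' in commercial_name:
                return '2-series'
            elif '220' in commercial_name:
                return '2-series'
            elif '320' in commercial_name:
                return '3-series'
        else:
            # Keep the value that was already matched in the first cleaning step
            return commercial_name_cleaned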
Output:

       manufacturer_name_mapped  commercial_name                  ...  emissions  commercial_name_cleaned
    0  FIAT                      124 gt multiair auto             ...  153        124
    1  FIAT                      500l wagon pop star t-jet        ...  158        500
    2  FIAT                      doblo combi 1.4 95               ...  165        None
    3  FIAT                      panda 0.9t sge 85 natural power  ...  86         panda
    4  FIAT                      punto 1.4 77 lpg                 ...  114        punto
    5  BMW AG                    x4 xdrive20d se auto             ...  131        x4
    6  BMW AG                    216d active tourer b37 f45       ...  166        2-series
    7  BMW AG                    220d gran tourer b47 f46         ...  200        2-series
    8  BMW AG                    x1 xdrive18d sport               ...  151        x1
    9  BMW AG                    320i xdrive m sport gt auto      ...  149        3-series
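An alternative that avoids the str() cast altogether is to pass the whole row to the function and test for missing values with pd.isna. The following is only a sketch under that assumption: the function name fill_model and the five-row sample frame are made up for illustration, and the substring rules are the ones from the question.

```python
import pandas as pd

def fill_model(row):
    """Return a model name for rows whose cleaned name is missing."""
    cleaned = row["commercial_name_cleaned"]
    if pd.isna(cleaned):  # true for None and NaN alike
        name = row["commercial_name"]
        if "216" in name or "220" in name:
            return "2-series"
        if "320" in name:
            return "3-series"
    # Either a name was already matched, or no rule applied:
    # hand back the existing value unchanged.
    return cleaned

# Small sample taken from the rows shown in the question
df = pd.DataFrame({
    "commercial_name": [
        "124 gt multiair auto",
        "doblo combi 1.4 95",
        "216d active tourer b37 f45",
        "320i xdrive m sport gt auto",
        "x1 xdrive18d sport",
    ],
    "commercial_name_cleaned": ["124", None, None, None, "x1"],
})

df["commercial_name_cleaned"] = df.apply(fill_model, axis=1)
print(df)
```

Because the function ends with a single return cleaned, every path returns a value, and rows that match no rule simply stay missing.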
How many of these conditions do you expect to use, and how complex is their logic? For example, will it always be a simple substring test, or something more complex?
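If the conditions stay simple substring tests, one possible way to scale to many of them is to keep the rules in a dictionary and apply them with vectorised string matching. This is a sketch, not something from the thread: the names RULES and fill_from_rules are hypothetical, and it assumes that the first matching rule should win.

```python
import pandas as pd

# Hypothetical rule table: substring to look for -> model name to assign.
RULES = {"216": "2-series", "220": "2-series", "320": "3-series"}

def fill_from_rules(df, rules):
    """Fill missing cleaned names using substring rules, first match wins."""
    out = df["commercial_name_cleaned"].copy()
    for needle, model in rules.items():
        # Rows that are still missing and whose name contains the substring.
        # regex=False makes this a plain substring test; na=False treats
        # missing commercial names as non-matching.
        hit = out.isna() & df["commercial_name"].str.contains(
            needle, regex=False, na=False
        )
        out.loc[hit] = model
    return out

df = pd.DataFrame({
    "commercial_name": [
        "124 gt multiair auto",
        "doblo combi 1.4 95",
        "220d gran tourer b47 f46",
        "320i xdrive m sport gt auto",
    ],
    "commercial_name_cleaned": ["124", None, None, None],
})

df["commercial_name_cleaned"] = fill_from_rules(df, RULES)
print(df)
```

Adding a new condition then only means adding an entry to the dictionary; more complex tests would need a different structure, for example a list of regular expressions.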