Python: trying to pull zipcodes from one dataframe into another dataframe of addresses
I have a dataframe of addresses without zipcodes:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'address1':['1 o\'toole st','2 main st','3 high street','5 foo street','10 foo street'],
'address2':['town1',np.nan,np.nan,'Bartown',np.nan],
'address3':[np.nan,'village','city','county2','county3']})
df1['zipcode']=''
df1
        address1  address2  address3  zipcode
0   1 o'toole st     town1       NaN
1      2 main st       NaN   village
2  3 high street       NaN      city
3   5 foo street   Bartown   county2
4  10 foo street       NaN   county3
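I also have a second dataframe that has both addresses and zipcodes. Note that here it is in the same order as df1, but that is not the case in the real data I am working with: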
df2 = pd.DataFrame({'address1':['1 o\'toole st','2 main st','7 mill street','5 foo street','10 foo street'],
'address2':['town1','village','city','Bartown','county3'],
'address3':[np.nan,np.nan,np.nan,'county2','USA'],
'zipcode': ['er45','qw23','rt67','yu89','yu83']})
df2
        address1  address2  address3  zipcode
0   1 o'toole st     town1       NaN     er45
1      2 main st   village       NaN     qw23
2  7 mill street      city       NaN     rt67
3   5 foo street   Bartown   county2     yu89
4  10 foo street   county3       USA     yu83
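I want to check whether each address in df1 is in df2 and, if it is, pull the zipcode into df1.
This is where I am having trouble, and I am not sure this is the best approach.
What I have done so far is create a key for both dataframes from the first two address lines, address1 and address2, stripping all whitespace and non-alphanumeric characters and converting to lowercase: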
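# \W matches every non-word character, spaces and punctuation included;
# regex=True is required because recent pandas treats the pattern as a literal by default
df1['key'] = (df1['address1'] + df1['address2']).str.lower().str.replace(r'\W', '', regex=True)
df2['key'] = (df2['address1'] + df2['address2']).str.lower().str.replace(r'\W', '', regex=True)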
print(df1)
        address1  address2  address3  zipcode                key
0   1 o'toole st     town1       NaN              1otoolesttown1
1      2 main st       NaN   village                         NaN
2  3 high street       NaN      city                         NaN
3   5 foo street   Bartown   county2           5foostreetbartown
4  10 foo street       NaN   county3                         NaN
print(df2)
        address1  address2  address3  zipcode                 key
0   1 o'toole st     town1       NaN     er45      1otoolesttown1
1      2 main st   village       NaN     qw23      2mainstvillage
2  7 mill street      city       NaN     rt67     7millstreetcity
3   5 foo street   Bartown   county2     yu89   5foostreetbartown
4  10 foo street   county3       USA     yu83  10foostreetcounty3
Now I use np.where to pull the information into the empty zipcode column in df1, returning no_match when no matching address is found:
df1['zipcode'] = np.where(df1['key'].isin(df2['key']), df2['zipcode'], 'no_match')
print(df1)
        address1  address2  address3   zipcode                key
0   1 o'toole st     town1       NaN      er45     1otoolesttown1
1      2 main st       NaN   village  no_match                NaN
2  3 high street       NaN      city  no_match                NaN
3   5 foo street   Bartown   county2      yu89  5foostreetbartown
4  10 foo street       NaN   county3  no_match                NaN
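My problem is the key created for df1. As you can see, some of the keys are NaN. This happens because the addresses in df1 are formatted differently from those in df2, and this is the dataset I currently have to work with.
I tried to get around this by skipping any NaN and adding the next address line instead, but I get a ValueError: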
# add address1 + address2 if it's not null, otherwise use address3
df1['key'] = (df1['address1'] + (df1['address2'] if pd.notnull(df1['address2']) else df1['address3']))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
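Any feedback or suggestions on how to solve this would be much appreciated. If there is an easier way, I would love to know.

I would first replace the NaN values with empty strings, then concatenate the three address columns to get the address in a single column, somewhat like what you did: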
# filling NaN values
df1.fillna('', inplace=True)
df2.fillna('', inplace=True)
# concatenate the address columns
df1['address'] = df1['address1']+df1['address2']+df1['address3']
df2['address'] = df2['address1']+df2['address2']+df2['address3']
Then set the new address column as the index in both dataframes:
df1.set_index('address', inplace=True)
df2.set_index('address', inplace=True)
Finally, add the zipcodes to df1:
df1['zipcode'] = df2['zipcode']
The result looks like this:
                                 address1  address2  address3  zipcode
address
1 o'toole sttown1            1 o'toole st     town1               er45
2 main stvillage                2 main st             village     qw23
3 high streetcity           3 high street                city      NaN
5 foo streetBartowncounty2   5 foo street   Bartown   county2     yu89
10 foo streetcounty3        10 foo street             county3     yu83

Your problem is this line:
df1['key'] = (df1['address1'] + (df1['address2'] if pd.notnull(df1['address2']) else df1['address3']))
The if used here causes the error because pd.notnull returns a boolean Series, whereas an if expression needs a single boolean value.
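To see the failure in isolation, here is a minimal sketch. The small Series named address2 is made up for illustration; it only mirrors the shape of the column in the question.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for df1['address2']
address2 = pd.Series(['town1', np.nan, 'Bartown'])

# pd.notnull works element-wise, so the result has one boolean per row
mask = pd.notnull(address2)
print(mask.tolist())

try:
    # `if` asks for a single truth value; pandas refuses to guess one
    if mask:
        pass
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

A row-by-row choice therefore has to be expressed with a vectorized operation rather than a Python conditional.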
You can solve this using where:
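A minimal sketch of that fix, assuming it is built on Series.where as the comments further down describe. The intermediate name second_part is introduced here for readability and is not from the original post.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'address1': ["1 o'toole st", '2 main st', '3 high street', '5 foo street', '10 foo street'],
    'address2': ['town1', np.nan, np.nan, 'Bartown', np.nan],
    'address3': [np.nan, 'village', 'city', 'county2', 'county3'],
})

# Keep address2 where it is present; otherwise take address3 from the same row
second_part = df1['address2'].where(df1['address2'].notnull(), df1['address3'])

# Same normalisation as in the question: lowercase, drop non-word characters
df1['key'] = (df1['address1'] + second_part).str.lower().str.replace(r'\W', '', regex=True)
print(df1)
```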
This produces a df1 with the key you are looking for:
        address1  address2  address3                 key
0   1 o'toole st     town1       NaN      1otoolesttown1
1      2 main st       NaN   village      2mainstvillage
2  3 high street       NaN      city     3highstreetcity
3   5 foo street   Bartown   county2   5foostreetbartown
4  10 foo street       NaN   county3  10foostreetcounty3
Now you can merge the zipcodes.

Use fillna to replace the missing values with df1['address3']:
df1['key'] = df1['address1'] + df1['address2'].fillna(df1['address3'])
instead of:
df1['key'] = (df1['address1'] + (df1['address2'] if
pd.notnull(df1['address2']) else df1['address3']))
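Once both dataframes carry a key, the zipcodes can be pulled across by key instead of by row position. The sketch below is one possible way to do that, not necessarily what either answer intended: make_key and zip_by_key are hypothetical names, df2 is reversed only to imitate real data that is not in the same order as df1, and duplicate keys in df2 are assumed to be resolved by keeping the first occurrence.

```python
import numpy as np
import pandas as pd

def make_key(df):
    """Hypothetical helper: address1 plus address2, falling back to address3."""
    second = df['address2'].fillna(df['address3'])
    return (df['address1'] + second).str.lower().str.replace(r'\W', '', regex=True)

df1 = pd.DataFrame({
    'address1': ["1 o'toole st", '2 main st', '3 high street', '5 foo street', '10 foo street'],
    'address2': ['town1', np.nan, np.nan, 'Bartown', np.nan],
    'address3': [np.nan, 'village', 'city', 'county2', 'county3'],
})
df2 = pd.DataFrame({
    'address1': ["1 o'toole st", '2 main st', '7 mill street', '5 foo street', '10 foo street'],
    'address2': ['town1', 'village', 'city', 'Bartown', 'county3'],
    'address3': [np.nan, np.nan, np.nan, 'county2', 'USA'],
    'zipcode': ['er45', 'qw23', 'rt67', 'yu89', 'yu83'],
})

# Reverse df2 to imitate data that is not aligned with df1 row by row
df2 = df2.iloc[::-1].reset_index(drop=True)

df1['key'] = make_key(df1)
df2['key'] = make_key(df2)

# Build a key -> zipcode lookup, then map it onto df1; unmatched keys become no_match
zip_by_key = df2.drop_duplicates('key').set_index('key')['zipcode']
df1['zipcode'] = df1['key'].map(zip_by_key).fillna('no_match')
print(df1[['address1', 'zipcode']])
```

Because the lookup goes through the key, the row order of df2 has no effect on the result, unlike np.where combined with isin, which pairs rows by position.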
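
The problem is that I only need the first two address entries. You are adding df1['address1']+df1['address2']+df1['address3'], which is fine for records with NaN, but for records with a full address the key ends up containing more address lines than I wanted. I have updated my question and added 'USA' to a record in df2, which shows that your approach does not work 100%. If you rerun your code with my new df2, a NaN will show up in your final result. By the way, thank you for taking the time to read my lengthy question.
OK, I see your problem. I think jezrael's answer is better than mine and should work for your case.
Do you ever sleep? Thanks :)
What does the second part do?
That is the y in where(condition, y), which works somewhat like an else. From the documentation: where the condition is True, the original value is kept; where it is False, it is replaced with the corresponding value from other. Here, other is your y.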
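A tiny sketch of that behaviour on made-up numbers, with s and other as illustrative names:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
other = pd.Series([10, 20, 30, 40])

# Rows where the condition is True keep the value from s;
# rows where it is False take the value at the same position in other
result = s.where(s % 2 == 0, other)
print(result.tolist())  # [10, 2, 30, 4]
```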