Python: pulling zipcodes from one DataFrame into another DataFrame of addresses


I have a DataFrame of addresses that has no zipcodes:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'address1':['1 o\'toole st','2 main st','3 high street','5 foo street','10 foo street'],
                   'address2':['town1',np.nan,np.nan,'Bartown',np.nan],
                   'address3':[np.nan,'village','city','county2','county3']})
df1['zipcode']=''
df1

        address1 address2 address3 zipcode
0   1 o'toole st    town1      NaN        
1      2 main st      NaN  village        
2  3 high street      NaN     city        
3   5 foo street  Bartown  county2        
4  10 foo street      NaN  county3 
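I also have a second DataFrame that has both addresses and zipcodes. Note that here it is in the same order as df1, but that is not the case in the real data I am working with: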

df2 = pd.DataFrame({'address1':['1 o\'toole st','2 main st','7 mill street','5 foo street','10 foo street'],
                   'address2':['town1','village','city','Bartown','county3'],
                   'address3':[np.nan,np.nan,np.nan,'county2','USA'],
                   'zipcode': ['er45','qw23','rt67','yu89','yu83']})
df2

        address1 address2 address3 zipcode
0   1 o'toole st    town1      NaN    er45
1      2 main st  village      NaN    qw23
2  7 mill street     city      NaN    rt67
3   5 foo street  Bartown  county2    yu89
4  10 foo street  county3      USA    yu83
I want to check whether the addresses in df1 are in df2 and, if they are, pull the zipcodes into df1.

This is where I am running into trouble, and I am not sure my approach is the best one.

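What I have done so far is create a key for both DataFrames from the first two address lines, address1 and address2, stripping all whitespace and non-alphanumeric characters and converting to lowercase: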

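# regex=True is required on pandas 2.x, where str.replace treats the pattern literally by default
df1['key'] = (df1['address1'] + df1['address2']).str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True)

df2['key'] = (df2['address1'] + df2['address2']).str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True)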


print(df1)

        address1 address2 address3 zipcode                key
0   1 o'toole st    town1      NaN             1otoolesttown1
1      2 main st      NaN  village                        NaN
2  3 high street      NaN     city                        NaN
3   5 foo street  Bartown  county2          5foostreetbartown
4  10 foo street      NaN  county3                        NaN

print(df2)

        address1 address2 address3 zipcode                 key
0   1 o'toole st    town1      NaN    er45      1otoolesttown1
1      2 main st  village      NaN    qw23      2mainstvillage
2  7 mill street     city      NaN    rt67     7millstreetcity
3   5 foo street  Bartown  county2    yu89   5foostreetbartown
4  10 foo street  county3      USA    yu83  10foostreetcounty3
Now I use np.where to pull the information into the empty zipcode column in df1, returning no_match where no matching address is found:

df1['zipcode'] = np.where(df1['key'].isin(df2['key']), df2['zipcode'], 'no_match')

print(df1)

        address1 address2 address3   zipcode                key
0   1 o'toole st    town1      NaN      er45     1otoolesttown1
1      2 main st      NaN  village  no_match                NaN
2  3 high street      NaN     city  no_match                NaN
3   5 foo street  Bartown  county2      yu89  5foostreetbartown
4  10 foo street      NaN  county3  no_match                NaN
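
One thing to watch with this np.where call: it takes df2['zipcode'] from the same row position, so it only gives the right zipcodes here because both frames happen to be in the same order. A minimal sketch of a lookup by key value instead, using a small made-up pair of frames (the data below is hypothetical, with df2 deliberately shuffled):

```python
import pandas as pd

# Hypothetical miniature of the two frames; df2 is in a different order than df1.
df1 = pd.DataFrame({'key': ['1otoolesttown1', None, '5foostreetbartown']})
df2 = pd.DataFrame({'key': ['5foostreetbartown', '7millstreetcity', '1otoolesttown1'],
                    'zipcode': ['yu89', 'rt67', 'er45']})

# Build a key -> zipcode lookup, then map each df1 key through it.
# drop_duplicates keeps the lookup index unique, which map needs.
lookup = df2.drop_duplicates('key').set_index('key')['zipcode']
df1['zipcode'] = df1['key'].map(lookup).fillna('no_match')
print(df1)
```

Keys that are missing or not present in df2 come back as NaN from map and are then filled with no_match.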
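My problem is with the key created for df1. As you can see, some of the keys are NaN. This is because the address format differs from that of df2. This is what the dataset I am currently working with looks like.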

I tried to get around this by skipping any NaN and adding the next address line instead, but I get a ValueError:

# add address1 + address2 if it's not null, otherwise use address3

df1['key'] = (df1['address1'] + (df1['address2'] if pd.notnull(df1['address2']) else df1['address3']))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any feedback or suggestions on how to solve this would be much appreciated. If there is a simpler way, I would love to know.

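I would first replace the NaN values with empty strings, then concatenate the three address columns to get the whole address in a single column, much like you did: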

# filling NaN values
df1.fillna('', inplace=True)
df2.fillna('', inplace=True)

# concatenate the address columns
df1['address'] = df1['address1']+df1['address2']+df1['address3']
df2['address'] = df2['address1']+df2['address2']+df2['address3']
Then set the new 'address' column as the index in both DataFrames:

df1.set_index('address', inplace=True)
df2.set_index('address', inplace=True)
Finally, add the zipcodes to df1:

df1['zipcode'] = df2['zipcode']
The result is:

                                 address1 address2 address3 zipcode
address
1 o'toole sttown1            1 o'toole st    town1             er45
2 main stvillage                2 main st           village    qw23
3 high streetcity           3 high street              city     NaN
5 foo streetBartowncounty2   5 foo street  Bartown  county2    yu89
10 foo streetcounty3        10 foo street           county3    yu83

Your problem is this line:

df1['key'] = (df1['address1'] + (df1['address2'] if pd.notnull(df1['address2']) else df1['address3']))
The if used here causes the error: pd.notnull returns a boolean Series, but an if expression needs a single boolean value.
You can fix this as follows:
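
The snippet that belonged here is missing; judging by the later comment about .where(condition, y), it was probably along these lines. A minimal sketch, assuming Series.where for the element-wise choice and the same cleanup the question uses for its key:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'address1': ["1 o'toole st", '2 main st', '3 high street', '5 foo street', '10 foo street'],
    'address2': ['town1', np.nan, np.nan, 'Bartown', np.nan],
    'address3': [np.nan, 'village', 'city', 'county2', 'county3'],
})

# Element-wise choice: keep address2 where it is not null, otherwise take address3.
second_line = df1['address2'].where(df1['address2'].notnull(), df1['address3'])

# Same cleanup as in the question: lowercase, then drop every non-word character
# (the \W pattern also removes the spaces).
df1['key'] = (df1['address1'] + second_line).str.lower().str.replace(r'\W', '', regex=True)
print(df1[['address1', 'key']])
```

Unlike if ... else, where evaluates the condition row by row, so no single truth value is ever needed for the whole Series.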

This produces a df1 with the key you are looking for:

        address1 address2 address3                 key
0   1 o'toole st    town1      NaN      1otoolesttown1
1      2 main st      NaN  village      2mainstvillage
2  3 high street      NaN     city     3highstreetcity
3   5 foo street  Bartown  county2   5foostreetbartown
4  10 foo street      NaN  county3  10foostreetcounty3
Now you can merge in the zipcodes.
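
The answer does not show the merge itself. One way to sketch it, assuming a left join on key and the no_match placeholder from the question (the frames below carry only the columns needed for the illustration):

```python
import pandas as pd

# Keys as shown above for df1 and in the question for df2.
df1 = pd.DataFrame({'key': ['1otoolesttown1', '2mainstvillage', '3highstreetcity',
                            '5foostreetbartown', '10foostreetcounty3']})
df2 = pd.DataFrame({'key': ['1otoolesttown1', '2mainstvillage', '7millstreetcity',
                            '5foostreetbartown', '10foostreetcounty3'],
                    'zipcode': ['er45', 'qw23', 'rt67', 'yu89', 'yu83']})

# A left join keeps every df1 row in df1's order; keys absent from df2 get NaN.
merged = df1.merge(df2[['key', 'zipcode']], on='key', how='left')
merged['zipcode'] = merged['zipcode'].fillna('no_match')
print(merged)
```

With the full frames from the question, the empty zipcode column in df1 would need to be dropped before merging to avoid zipcode_x / zipcode_y columns, and duplicate keys in df2 would duplicate the matching df1 rows.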

Use fillna to replace the missing values with df1['address3']:

df1['key'] = df1['address1'] + df1['address2'].fillna(df1['address3'])
instead of:

df1['key'] = (df1['address1'] + (df1['address2'] if 
                                   pd.notnull(df1['address2']) else df1['address3']))

For more details about the error, see the linked reference.

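The problem is that I only need the first two address entries. You are adding df1['address1']+df1['address2']+df1['address3'], which is fine for the records with NaN, but for records with a full address I end up with more address lines than I wanted. I have updated my question and added 'USA' to one record in df2, which shows that your approach does not work 100%. If you rerun your code with my new df2, you will get a NaN in your final result. By the way, thanks for taking the time to read my lengthy question.

OK, I see your problem. I think jezrael's answer is better than mine and should work for your case.

Do you ever sleep? Thanks :)

What does the second part do?

The y in .where(condition, y) is a bit like an else. From the documentation: where the condition is True, keep the original value; where it is False, replace it with the corresponding value from other. other is your y.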
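
A small illustration of that behaviour, using made-up values (the two Series below are hypothetical, not taken from the question):

```python
import numpy as np
import pandas as pd

s = pd.Series(['town1', np.nan, 'Bartown'])
other = pd.Series(['x', 'village', 'county2'])

# Where the condition is True the value from s is kept;
# where it is False the value at the same position in other is used.
picked = s.where(s.notnull(), other)
print(picked.tolist())   # ['town1', 'village', 'Bartown']

# For this particular condition, fillna(other) gives the same result.
same_as_fillna = picked.equals(s.fillna(other))
print(same_as_fillna)
```

So where with a notnull condition and fillna are interchangeable here; where is the more general tool when the condition is something other than a null check.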