Python: creating a new column with coordinates using geopy and pandas
I have a df:
import pandas as pd
import numpy as np
import datetime as DT
import hmac
from geopy.geocoders import Nominatim
from geopy.distance import vincenty
df
city_name state_name county_name
0 WASHINGTON DC DIST OF COLUMBIA
1 WASHINGTON DC DIST OF COLUMBIA
2 WASHINGTON DC DIST OF COLUMBIA
3 WASHINGTON DC DIST OF COLUMBIA
4 WASHINGTON DC DIST OF COLUMBIA
5 WASHINGTON DC DIST OF COLUMBIA
6 WASHINGTON DC DIST OF COLUMBIA
7 WASHINGTON DC DIST OF COLUMBIA
8 WASHINGTON DC DIST OF COLUMBIA
9 WASHINGTON DC DIST OF COLUMBIA
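I would like to get the latitude and longitude coordinates for any one of the columns in this data frame. The geopy documentation is straightforward when dealing with individual locations: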
>>> from geopy.geocoders import Nominatim
>>> geolocator = Nominatim()
>>> location = geolocator.geocode("175 5th Avenue NYC")
>>> print(location.address)
Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York, ...
>>> print((location.latitude, location.longitude))
(40.7410861, -73.9896297241625)
>>> print(location.raw)
{'place_id': '9167009604', 'type': 'attraction', ...}
However, I would like to apply that function to every row of the df and create a new column. I tried the following:
df['city_coord'] = geolocator.geocode(lambda row: 'state_name' (row))
But I think something is missing from my code, because this is what I get:
city_name state_name county_name coordinates
0 WASHINGTON DC DIST OF COLUMBIA None
1 WASHINGTON DC DIST OF COLUMBIA None
2 WASHINGTON DC DIST OF COLUMBIA None
3 WASHINGTON DC DIST OF COLUMBIA None
4 WASHINGTON DC DIST OF COLUMBIA None
5 WASHINGTON DC DIST OF COLUMBIA None
6 WASHINGTON DC DIST OF COLUMBIA None
7 WASHINGTON DC DIST OF COLUMBIA None
8 WASHINGTON DC DIST OF COLUMBIA None
9 WASHINGTON DC DIST OF COLUMBIA None
I would like to get something like this, using a lambda function:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
1 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
2 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
3 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
4 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
5 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
6 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
7 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
8 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
9 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
10 GLYNCO GA GLYNN 31.2224512, -81.5101023
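Thanks for your help. Once I have the coordinates, I would like to plot them on a map. Recommendations for resources on mapping coordinates would also be much appreciated. Thanks!

You can call apply and pass the function you want to run on each row, like this: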
In [9]:
geolocator = Nominatim()
df['city_coord'] = df['state_name'].apply(geolocator.geocode)
df
Out[9]:
city_name state_name county_name \
0 WASHINGTON DC DIST OF COLUMBIA
1 WASHINGTON DC DIST OF COLUMBIA
city_coord
0 (District of Columbia, United States of Americ...
1 (District of Columbia, United States of Americ...
You can then access the latitude and longitude attributes:
In [16]:
df['city_coord'] = df['city_coord'].apply(lambda x: (x.latitude, x.longitude))
df
Out[16]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
Or do it as a one-liner by calling apply twice:
In [17]:
df['city_coord'] = df['state_name'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df
Out[17]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
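One caveat about the lambda above: geopy's geocode returns None when it finds no match, so x.latitude would raise an AttributeError on such a row. The following is a minimal sketch of a None-safe replacement. The helper name to_coords is made up for illustration, and SimpleNamespace only stands in for a real geopy Location so the snippet runs without a network call.

```python
from types import SimpleNamespace


def to_coords(location):
    """Return (latitude, longitude), or None when geocoding found nothing."""
    if location is None:
        return None
    return (location.latitude, location.longitude)


# Stand-in for a geopy Location (assumption: only these two attributes are used).
found = SimpleNamespace(latitude=38.8937154, longitude=-76.9877934586326)

print(to_coords(found))
print(to_coords(None))
```

With the data frame from this answer it could be used as df['city_coord'].apply(to_coords) in place of the lambda.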
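Also, your attempt geolocator.geocode(lambda row: 'state_name' (row)) didn't do what you intended: the lambda itself is passed to geocode as the query and is never called on any row, which is why your column is full of None values.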
Edit
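@leb makes an interesting point here: if you have many duplicate values, it is more efficient to geocode each unique value once and then add the results back, as follows: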
In [38]:
states = df['state_name'].unique()
d = dict(zip(states, pd.Series(states).apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))))
d
Out[38]:
{'DC': (38.8937154, -76.9877934586326)}
In [40]:
df['city_coord'] = df['state_name'].map(d)
df
Out[40]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
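So the code above uses unique to get all the unique values, builds a dict from them, and then calls map to perform the lookup and add the coordinates. This will be more efficient than geocoding row by row.

Upvoted and accepted @EdChum's answer; I just wanted to add to it. His approach works perfectly, but from personal experience I'd like to share the following: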
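When geocoding, if you have many repeated city/state combinations, it is much faster to send only one of each to be geocoded and then copy the result to the other rows.
This is very useful for large data sets and can be done in two ways:
1. drop_duplicates: drop the repeated rows and geocode only the ones that remain.
2. groupby: group by the city/state combination, apply geocoding to the first row of each group by calling head(1), then copy the result to the remaining rows.
Again, this all comes from personal experience. If it doesn't help you now, keep it in mind for later.

That's an interesting point: it would be better to take only the unique values, geocode them, and merge them back. I'll update my answer.
Thanks for your answers. Very useful information! When I ran it on the first rows ([:5]) I got a good data frame, but when I applied the function to all 200,000 records I received a timeout error. I'll have to group first and then apply. Thanks very much.
I'm still receiving this error: GeocoderTimedOut: Service timed out. Is it something I'm doing?
Are you getting this error with my original code or with the optimized version? If geocoding is timing out, you may have to process the data in chunks.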
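To make the last two suggestions concrete, here is a rough sketch that geocodes each distinct value only once and retries when a request times out. It is not code from the thread. The function name geocode_unique and its parameters are made up for illustration, and the geocoder is passed in as a plain callable so the snippet runs without geopy or a network connection.

```python
import time


def geocode_unique(values, geocode, retry_on=(Exception,), retries=3, pause=1.0):
    """Geocode each distinct value once, retrying when the service fails."""
    results = {}
    for value in dict.fromkeys(values):  # distinct values, original order kept
        for attempt in range(retries):
            try:
                results[value] = geocode(value)
                break
            except retry_on:
                if attempt == retries - 1:
                    results[value] = None  # give up on this value, keep going
                else:
                    time.sleep(pause * (attempt + 1))  # wait longer each time
    return results


# Hypothetical stand-in for geolocator.geocode: times out once for "GA".
calls = []

def flaky(query):
    calls.append(query)
    if query == "GA" and calls.count("GA") == 1:
        raise TimeoutError("simulated timeout")
    return (1.0, 2.0) if query == "DC" else (3.0, 4.0)


lookup = geocode_unique(["DC", "DC", "GA", "DC"], flaky,
                        retry_on=(TimeoutError,), pause=0)
print(lookup)
```

With geopy you would presumably pass geolocator.geocode as the callable and geopy.exc.GeocoderTimedOut in retry_on, then attach the results with df['state_name'].map(lookup). Recent geopy versions also provide geopy.extra.rate_limiter.RateLimiter, which spaces out requests and may be a simpler fit if the timeouts come from sending 200,000 queries too quickly.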