Python: How can I do point-in-polygon with Magellan in PySpark?
I have a dataframe of 30 million latitude/longitude points in New York City. I want to map each coordinate to a census tract, ideally adding the tracts as another column of the dataframe. Currently I do this with Shapely and PySpark, using PySpark's map function, and it takes a very long time: about 0.2 seconds to map each coordinate to a census tract. I would like to see whether I can do this faster, for example with Magellan.
import shapefile
from shapely.geometry import Polygon, Point
from time import time

Census_Tracts_Shapefile_Path = 'Data/NYC Census Tracts/NYC/nyct2010wi.shp'
CensusTract_Shapefile = shapefile.Reader(Census_Tracts_Shapefile_Path)
CensusTract_Shapes = CensusTract_Shapefile.shapes()
Polygons = [shape.points for shape in CensusTract_Shapes]

def Census_Tract_Finder(x, y, Polygons):
    try:
        x = float(x); y = float(y)
    except ValueError:
        return []
    point = Point(x, y)
    Tract = []
    for counter in range(len(Polygons)):
        Poly = Polygon(Polygons[counter])  # note: rebuilt for every point
        if Poly.contains(point):
            Tract.append(counter)
    return Tract

# In this section, I filter the census tracts
# to find the ones that are in Manhattan.
Manhattan_CT = []
CT_Records = CensusTract_Shapefile.shapeRecords()
for counter in range(len(CT_Records)):
    if int(CT_Records[counter].record[1]) == 1:
        Manhattan_CT.append(counter)
CT_Records_Manhattan = [CT_Records[index] for index in Manhattan_CT]
Polygons_Manhattan = [Polygons[index] for index in Manhattan_CT]

# An example of how I look for the census tract of each point:
# print(Census_Tract_Finder('-73.986191', '40.760681', Polygons_Manhattan))

Start = time(); N = 1000  # For testing purposes, I focus on the first N rows.
dd = df_Spark  # This is the Spark dataframe that I have already loaded.
Output = dd.rdd.map(lambda x: Census_Tract_Finder(x['longitude'], x['latitude'], Polygons_Manhattan)).take(N)
Duration = time() - Start
print('Output Calculations:', Duration)
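Independent of Spark or Magellan, one common way to speed up repeated point-in-polygon lookups is a spatial index, which avoids scanning every polygon per point. A minimal sketch using Shapely's STRtree (Shapely 2.x API; the two toy polygons below are hypothetical stand-ins for the real tract shapes):

```python
from shapely.geometry import Polygon, Point
from shapely.strtree import STRtree

# Two toy "tracts" standing in for the real census-tract polygons.
tracts = [
    Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
    Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
]

# Build the index once, up front; reuse it for every lookup.
tree = STRtree(tracts)

def find_tract(x, y):
    """Return the indices of the tracts containing the point (x, y)."""
    point = Point(float(x), float(y))
    # Shapely 2.x: query with a predicate returns indices into `tracts`.
    return sorted(tree.query(point, predicate="within").tolist())

print(find_tract(0.5, 0.5))  # → [0]
```

Inside a Spark job, the tree (or the prebuilt polygon list) would typically be broadcast to the workers once rather than reconstructed per record.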
It is not clear to me what you are trying to achieve. I would suggest providing an implementation in plain Python, without Spark.
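Along those lines, a minimal no-Spark sketch that prebuilds the Shapely Polygon objects once: in the question's Census_Tract_Finder, Polygon(...) is reconstructed for every point, which accounts for much of the per-point cost. The coordinates below are made up for illustration; in the question they would come from the shapefile.

```python
from shapely.geometry import Polygon, Point

# Hypothetical tract outlines; in practice, the shapefile's shape.points lists.
polygon_coords = [
    [(0, 0), (1, 0), (1, 1), (0, 1)],
    [(1, 0), (2, 0), (2, 1), (1, 1)],
]

# Build each Polygon once, not once per lookup.
prebuilt = [Polygon(coords) for coords in polygon_coords]

def census_tract_finder(x, y):
    """Return the indices of all prebuilt polygons containing (x, y)."""
    try:
        point = Point(float(x), float(y))
    except (TypeError, ValueError):
        return []
    return [i for i, poly in enumerate(prebuilt) if poly.contains(point)]

print(census_tract_finder(0.5, 0.5))  # → [0]
```

Timing this pure-Python loop on a sample of points would show how much of the 0.2 s per coordinate is polygon construction rather than Spark overhead.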