Python: How do I implement point-in-polygon with Magellan in PySpark?

I have a DataFrame of about 30 million latitude/longitude pairs in New York City. I want to map each coordinate to a census tract, ideally adding the tract as another column of the DataFrame.

Currently I am doing this with Shapely and PySpark. I use PySpark's map over the RDD, and it takes a very long time: mapping a single coordinate to a census tract takes roughly 0.2 seconds. I would like to see whether I can do this faster, for example with Magellan.

import shapefile
from shapely.geometry import Polygon, Point

# Read the NYC census-tract shapefile and extract the raw point list of each shape.
Census_Tracts_Shapefile_Path = 'Data/NYC Census Tracts/NYC/nyct2010wi.shp'
CensusTract_Shapefile = shapefile.Reader(Census_Tracts_Shapefile_Path)
CensusTract_Shapes = CensusTract_Shapefile.shapes()
Polygons = [shape.points for shape in CensusTract_Shapes]

def Census_Tract_Finder(x, y, Polygons):
    # Return the indices of all polygons that contain the point (x, y).
    try:
        x = float(x)
        y = float(y)
    except ValueError:
        return []
    point = Point(x, y)
    Tract = []
    for counter in range(len(Polygons)):
        Poly = Polygon(Polygons[counter])  # rebuilt for every point
        if Poly.contains(point):
            Tract.append(counter)
    return Tract
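
One thing that stands out is that Census_Tract_Finder rebuilds every Polygon object for each point it checks. A possible variant, sketched here under the assumption that the same Polygons list is available, builds the Shapely polygons once and wraps them in prepared geometries (shapely.prepared.prep), which makes repeated contains checks much cheaper:

from shapely.geometry import Polygon, Point
from shapely.prepared import prep

# Build the Shapely polygons once, then prepare them for fast repeated queries.
Polygon_Objects = [Polygon(points) for points in Polygons]
Prepared_Polygons = [prep(poly) for poly in Polygon_Objects]

def Census_Tract_Finder_Prepared(x, y, prepared_polygons):
    # Same contract as Census_Tract_Finder, but against pre-built prepared geometries.
    try:
        point = Point(float(x), float(y))
    except ValueError:
        return []
    return [i for i, poly in enumerate(prepared_polygons) if poly.contains(point)]

If scanning every polygon per point is still too slow, a spatial index such as shapely.strtree.STRtree could narrow each lookup down to a handful of candidate polygons.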

# In this section, I filter the census tracts 
# to find the ones that are in Manhattan
Manhattan_CT = []
CT_Records = CensusTract_Shapefile.shapeRecords()
for counter in range(len(CT_Records)):
    if int(CT_Records[counter].record[1]) == 1:
        Manhattan_CT.append(counter)

CT_Records_Manhattan = [CT_Records[index] for index in Manhattan_CT]
Polygons_Manhattan = [Polygons[index] for index in Manhattan_CT]

# An example of how I look up the census tract of a single point:
# print(Census_Tract_Finder('-73.986191', '40.760681', Polygons_Manhattan))

from time import time

Start = time(); N = 1000  # For testing purposes, I focus on the first N rows.
dd = df_Spark  # This is the Spark DataFrame that I have already loaded.
Output = dd.rdd.map(lambda x: Census_Tract_Finder(x['longitude'], x['latitude'], Polygons_Manhattan)).take(N)
Duration = time() - Start
print('Output Calculations:', Duration)
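
If the goal is to attach the tract index as a DataFrame column rather than collect results from the RDD, one option is to broadcast the pre-built polygons and wrap the lookup in a UDF. This is only a sketch of that idea: df_Spark, the longitude/latitude columns and Polygons_Manhattan come from the code above, while sc (the SparkContext) and the output column name census_tracts are assumptions.

from shapely.geometry import Polygon, Point
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# Build the polygons once on the driver and broadcast them to the executors,
# so they are not re-created for every row.
bc_polygons = sc.broadcast([Polygon(points) for points in Polygons_Manhattan])

def find_tracts(lon, lat):
    # Return the indices of all broadcast polygons that contain the point.
    try:
        point = Point(float(lon), float(lat))
    except (TypeError, ValueError):
        return []
    return [i for i, poly in enumerate(bc_polygons.value) if poly.contains(point)]

find_tracts_udf = udf(find_tracts, ArrayType(IntegerType()))
df_with_tracts = df_Spark.withColumn('census_tracts', find_tracts_udf('longitude', 'latitude'))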

It is not clear to me exactly what you are trying to achieve. I would suggest first providing a working implementation in plain Python, without Spark.
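
For what it is worth, a non-Spark version of the same lookup could look roughly like the sketch below. It assumes a recent GeoPandas (older versions spell the last argument op= instead of predicate=) and a Pandas DataFrame df_pandas with longitude/latitude columns in WGS84; both of those names are hypothetical.

import geopandas as gpd

# Load the census tracts and turn the point table into a GeoDataFrame.
tracts = gpd.read_file('Data/NYC Census Tracts/NYC/nyct2010wi.shp')
points = gpd.GeoDataFrame(
    df_pandas,
    geometry=gpd.points_from_xy(df_pandas['longitude'], df_pandas['latitude']),
    crs='EPSG:4326',          # lon/lat
).to_crs(tracts.crs)          # reproject into the shapefile's coordinate system

# Spatial join: each point receives the attributes of the tract that contains it.
joined = gpd.sjoin(points, tracts, how='left', predicate='within')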