Python: How can I optimize Shapely and Sklearn code?

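I'm working with a dataset of 4.2 million points. My code already takes a while to run, but the code below takes several hours (it was provided in another public question; essentially, it finds the linestring nearest to a point, finds the nearest point on that linestring, and computes the distance).

The code actually works well, but it takes far too long for its purpose. How can I optimize it, or do the same thing in less time?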

import geopandas as gpd
import numpy as np

from shapely.geometry import Point, LineString
from shapely.ops import nearest_points
from sklearn.neighbors import DistanceMetric

EARTH_RADIUS_IN_MILES = 3440.1 #NAUTICAL MILES

panama = gpd.read_file("/Users/Danilo/Documents/Python/panama_coastline/panama_coastline.shp")

# NOTE: data, b and dtc are not defined in the snippet as posted.
for c in range(b):
    #p = Point(-77.65325423107359,9.222038196656131)
    p = Point(data['longitude'][c], data['latitude'][c])

    def closest_line(point, linestrings):
        # index of the linestring with the smallest planar distance to the point
        return np.argmin([point.distance(linestring) for linestring in linestrings])

    closest_linestring = panama.geometry[closest_line(p, panama.geometry)]
    closest_point = nearest_points(p, closest_linestring)

    dist = DistanceMetric.get_metric('haversine')
    # the haversine metric expects [latitude, longitude], i.e. [y, x]
    points_as_floats = [np.array([q.y, q.x]) for q in closest_point]

    haversine_distances = dist.pairwise(np.radians(points_as_floats), np.radians(points_as_floats))
    haversine_distances *= EARTH_RADIUS_IN_MILES

    dtc1 = haversine_distances[0][1]
    dtc.append(dtc1)

Edit: simplified to a single computation using BallTree

Imports

import pandas as pd
import geopandas as gpd
import numpy as np

from shapely.geometry import Point, LineString
from shapely.ops import nearest_points

Read the Panama coastline

panama = gpd.read_file("panama_coastline/panama_coastline.shp")

Get all coastline points, in (long, lat) format:

def get_points_as_numpy(geom):
    work_list = []
    for g in geom:
        work_list.append( np.array(g.coords) )
        
    return np.concatenate(work_list)
        
all_coastline_points = get_points_as_numpy(panama.geometry)

Create the BallTree

from sklearn.neighbors import BallTree
import numpy as np

panama_radians =  np.radians(np.flip(all_coastline_points,axis=1))

tree = BallTree(panama_radians, leaf_size=12, metric='haversine')
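
sklearn's haversine metric expects each row as [latitude, longitude] in radians, whereas shapely coordinates come as (x, y) = (longitude, latitude). That is why the columns are flipped before the tree is built. A minimal sketch of that step, using two made-up coordinates rather than the real shapefile:

```python
import numpy as np

# Two made-up coastline vertices in shapely order: (longitude, latitude), in degrees.
lon_lat_deg = np.array([[-79.5, 8.9],
                        [-80.0, 9.3]])

# Reverse the column order so each row becomes (latitude, longitude) ...
lat_lon_deg = np.flip(lon_lat_deg, axis=1)

# ... then convert to radians, as the haversine metric requires.
lat_lon_rad = np.radians(lat_lon_deg)

print(lat_lon_deg)
```

Query points passed to the tree later need the same column order and units.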

Create 1M random points:

mean = [8.5,-80]
cov = [[1,0],[0,5]] # diagonal covariance, points lie on x or y-axis


random_gps = np.random.multivariate_normal(mean,cov,(10**6))
random_points = pd.DataFrame( {'lat' : random_gps[:,0], 'long' : random_gps[:,1]})
random_points.head()

Compute the nearest coastline point:
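
The code for this step is cut off in this copy. What follows is only a sketch of what such a query might look like, not the answerer's original code. It assumes a tree built from (lat, lon) radians as above and uses a few made-up coordinates in place of the shapefile and the random points; the names coast_lat_lon, query_lat_lon and EARTH_RADIUS_NM are illustrative.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_NM = 3440.1  # nautical miles, same constant as in the question

# Stand-in coastline vertices as (lat, lon) in degrees; made up for illustration.
coast_lat_lon = np.array([[9.0, -79.5],
                          [9.0, -80.0],
                          [8.0, -80.0]])
tree = BallTree(np.radians(coast_lat_lon), metric='haversine')

# Query points as (lat, lon) in degrees,
# e.g. random_points[['lat', 'long']].to_numpy() for the DataFrame above.
query_lat_lon = np.array([[9.0, -79.0],
                          [8.1, -80.0]])

# One call answers every query; distances come back in radians on the unit sphere.
dist_rad, idx = tree.query(np.radians(query_lat_lon), k=1)

dist_nm = dist_rad[:, 0] * EARTH_RADIUS_NM   # great-circle distance in nautical miles
nearest_vertex = coast_lat_lon[idx[:, 0]]    # coordinates of the matched coastline vertex

print(dist_nm)
```

Note that this finds the nearest coastline vertex, not the nearest point along a segment as nearest_points does, so accuracy depends on how densely the coastline is sampled.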

Hello and welcome :) Have you tried optimizing it yourself and run into problems while doing so? Have you measured the performance of this approach? Do you have specific time/performance requirements for it? Where did you get this code?

Don't put def nearest_line() (closest_line in the question's code) inside the loop. dist=DistanceMetric.get_metric('haversine') never changes, so move it outside the loop too. Can you post a sample of the data? When dealing with this many points you'll want to use a tree for the computation.
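
Following the point about loop-invariant work: once each point's nearest coastline point is known, the distances can be computed for all rows at once instead of calling dist.pairwise on two points per iteration. A minimal sketch using the standard haversine formula in NumPy; haversine_nm and the sample coordinates are made up for illustration.

```python
import numpy as np

EARTH_RADIUS_NM = 3440.1  # nautical miles, as in the question

def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles; inputs in degrees, scalars or arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_NM * np.arcsin(np.sqrt(a))

# Made-up points and their (already determined) nearest coastline points.
pt_lat = np.array([9.0, 8.1])
pt_lon = np.array([-79.0, -80.0])
coast_lat = np.array([9.0, 8.0])
coast_lon = np.array([-79.5, -80.0])

# A single vectorized call replaces one pairwise() call per row.
distances = haversine_nm(pt_lat, pt_lon, coast_lat, coast_lon)
print(distances)
```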