Python: split lines based on name

python python-3.x pandas geolocation

I have a GeoPandas dataframe that was created from a shapefile object. However, certain lines have the same name but are located in very different places.

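I want every line to have a unique name, so I need to somehow split the lines that are geometrically separated and rename them.
One could try computing the distances between all the street segments and regrouping the ones that are close to each other.
Computing the distances can be done easily in GeoPandas.
A set of lines to try: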

    from shapely.geometry import Point, LineString
    import geopandas as gpd

    line1 = LineString([
        Point(0, 0), Point(0, 1), Point(1, 1),
        Point(1, 2), Point(3, 3), Point(5, 6),
    ])
    line2 = LineString([
        Point(5, 3), Point(5, 5), Point(9, 5),
        Point(10, 7), Point(11, 8), Point(12, 12),
    ])
    line3 = LineString([
        Point(9, 10), Point(10, 14), Point(11, 12), Point(12, 15),
    ])

    df = gpd.GeoDataFrame(
        data={'name': ['A', 'A', 'A']},
        geometry=[line1, line2, line3]
    )
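The distance idea mentioned in the question can be sketched roughly as follows. This is only an illustration on the three sample lines: it relies on `GeoSeries.distance`, and the name `dist_matrix` is an assumption introduced here, not something from the original post.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import LineString

# The three sample lines from the question, all sharing the name 'A'
line1 = LineString([(0, 0), (0, 1), (1, 1), (1, 2), (3, 3), (5, 6)])
line2 = LineString([(5, 3), (5, 5), (9, 5), (10, 7), (11, 8), (12, 12)])
line3 = LineString([(9, 10), (10, 14), (11, 12), (12, 15)])
df = gpd.GeoDataFrame(data={"name": ["A", "A", "A"]}, geometry=[line1, line2, line3])

# GeoSeries.distance(geom) gives the distance from every row to `geom`;
# collecting one column per row yields a symmetric distance matrix.
dist_matrix = pd.DataFrame(
    {idx: df.geometry.distance(geom) for idx, geom in df.geometry.items()}
)
print(dist_matrix.round(2))
```

Segments whose mutual distance falls below some chosen threshold could then be treated as the same street. Picking that threshold is left open here.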

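One possible approach is to spatially cluster every data point. The code below uses DBSCAN, but other clustering algorithms may be a better fit.
Each row of df consists of several points. We want to pull all of them out in order to cluster them: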

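    import numpy as np

    ids = []
    coords = []
    for row in df.itertuples():
        # One (x, y) row per vertex of the line; assumes df has a unique "id" column
        geom = np.asarray(row.geometry.coords)
        coords.extend(geom)
        ids.extend([row.id] * geom.shape[0])
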
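We need the ids here so that the clusters can be mapped back to df after the computation. Next comes the clustering of every point (the data is also standardized first for better quality).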
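The clustering call itself is not shown at this point in the post. A minimal sketch of what it might look like, assuming scikit-learn's `DBSCAN` and `StandardScaler` (the `clust = DBSCAN(eps=0.5)` line is quoted in the comments; the toy `coords` list and the `scaled` name are assumptions made here so the snippet runs on its own):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the `coords` list: two well-separated groups of vertices
coords = [
    (0, 0), (0, 1), (1, 1), (1, 2), (2, 2),
    (100, 100), (100, 101), (101, 101), (101, 102), (102, 102),
]

# Standardize so that x and y contribute on the same scale
scaled = StandardScaler().fit_transform(np.asarray(coords, dtype=float))

# Note: eps is measured in standardized units here, not in map units
clust = DBSCAN(eps=0.5)
clusters = clust.fit_predict(scaled)  # one label per point; -1 marks noise
print(clusters)
```

With the default `min_samples=5`, a group needs at least five mutually close points to form a cluster, which is why each toy group has five vertices.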
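The next part is a bit messy, but we want to make sure that each id gets exactly one cluster. For each id we pick the cluster that is most frequent among its points: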

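    import pandas as pd

    points_clusters = pd.DataFrame({"id": ids, "cluster": clusters})
    # Number of points of each id that fall into each cluster
    points_clusters["count"] = points_clusters.groupby(["id", "cluster"])["id"].transform("size")
    # Keep, for each id, only the rows that belong to its most frequent cluster
    max_inds = points_clusters.groupby("id")["count"].transform("max") == points_clusters["count"]
    # subset="id" breaks ties when two clusters share the maximum count
    id_to_cluster = points_clusters[max_inds].drop_duplicates(subset="id").set_index("id")["cluster"]
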
Then we map the cluster number back onto the dataframe, so that we can use it to number our streets:

    df["cluster"] = df["id"].map(id_to_cluster)

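Once every row carries a cluster number, unique street names can be built by combining the name with that number, as suggested in the comments. A small sketch on a hypothetical frame (a plain pandas DataFrame is used for brevity, and the `"_"` separator is an arbitrary choice made here):

```python
import pandas as pd

# Hypothetical frame: three segments named 'A' and one named 'B',
# each already labelled with a cluster number
df = pd.DataFrame({"name": ["A", "A", "A", "B"], "cluster": [0, 0, 1, 1]})

# Same name + different cluster -> different street name
df["STREET"] = df["name"] + "_" + df["cluster"].astype(str)
print(df["STREET"].tolist())
```

Segments that share both a name and a cluster keep a common street name, while same-named segments in different clusters are told apart.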
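For the data, with DBSCAN and eps=0.5 (you can play with this parameter: it is the maximum distance between two points for them to land in the same cluster, so the larger eps is, the fewer clusters you get), we get this picture:

    import matplotlib.pyplot as plt

    plt.scatter(np.array(coords)[:, 0], np.array(coords)[:, 1], c=clusters, cmap="autumn")
    plt.show()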

The number of separate streets is 8:

    print(len(df["cluster"].drop_duplicates()))

If we use a lower eps, e.g. clust = DBSCAN(eps=0.15), we get more clusters (12 in this case), which separates the data better.
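The effect of eps can be seen on a toy example. The points and the helper `n_clusters` below are assumptions made for illustration only; `min_samples=2` is set explicitly so that short runs of points still form clusters, and no standardization is applied.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy points on a line: two runs with 0.1 spacing, separated by a 0.6 gap
points = np.array(
    [[x, 0.0] for x in (0.0, 0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.2, 1.3, 1.4)]
)

def n_clusters(eps):
    """Count the DBSCAN clusters for a given eps, ignoring the noise label -1."""
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(points)
    return len(set(labels) - {-1})

for eps in (0.15, 0.7):
    print(eps, n_clusters(eps))
```

With eps=0.15 the 0.6 gap keeps the two runs apart, so two clusters are found. With eps=0.7 the gap is bridged and everything merges into a single cluster.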

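About the messy part of the code: the source dataframe has 170 rows, and each row is a separate LINESTRING object. Each linestring consists of 2D points, and the number of points differs from one linestring to the next. So we first collect all the points (the "coords" list in the code) and predict a cluster for each point. It is quite likely that the points of a single linestring will end up in different clusters. To resolve this, we count the points per cluster for each linestring and keep the cluster with the maximum count.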
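The "most frequent cluster per linestring" step can also be checked on a small hypothetical table. The sketch below uses `value_counts().idxmax()` as a more compact alternative to the count-and-filter code in the answer; it is not the author's exact method, and the data is made up.

```python
import pandas as pd

# Hypothetical per-point labels: line 1 has 3 points, line 2 has 4
points_clusters = pd.DataFrame({
    "id":      [1, 1, 1, 2, 2, 2, 2],
    "cluster": [0, 0, 1, 1, 1, 1, 0],
})

# For each id, count how often each cluster occurs and keep the most frequent one
id_to_cluster = points_clusters.groupby("id")["cluster"].agg(
    lambda s: s.value_counts().idxmax()
)
print(id_to_cluster.to_dict())
```

Line 1 has two points in cluster 0 and one in cluster 1, so it is assigned cluster 0. Line 2 has three points in cluster 1, so it is assigned cluster 1.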
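Comments:

DBSCAN clustering from sklearn on the coordinates is one option.

Please also share all the files that are needed. The .shp file alone is not enough to load the data.

Yes, everything works once the data is loaded. I'll look into clustering approaches.

I added a basic approach using the scikit-learn library. You can use this line: "clust = DBSCAN(eps=0.5)" (change eps, or even use a different clustering algorithm) to get the result you want.

@james, df["STREET"] = df["name"] + " " + df["cluster"].astype(str) to get the street names. I'll write up the explanation in the answer soon.

@james, I updated the post. Also, eps=1.5 separates the data better.

@james please add subset="id" to drop_duplicates in this line: id_to_cluster = points_clusters[max_inds].drop_duplicates().set_index("id")["cluster"]. It is needed for the case where two clusters have the same maximum count.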