Join condition between tables in PySpark: if a point lies within a polygon
I have 2 PySpark DataFrames: one with points, df_pnt, and another with polygons, df_poly. Because I am not very familiar with PySpark, I am struggling to join these DataFrames correctly on the condition that a point lies within a polygon.

I started from code built from the following material:

If we want to plot the first polygon, we can run

wkt.loads(df_poly.take(1)[0].wkt)

If we want to check whether a Polygon object contains a Point object, we need the line

Polygon.contains(Point)

The question is how to handle this custom condition during the join. df_poly is much smaller than the points DataFrame, so I would also like to take advantage of broadcasting.
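To make the join condition concrete, here is a minimal, dependency-free sketch of the predicate that Polygon.contains(Point) evaluates, using ray casting (the hypothetical helper `point_in_polygon` is for illustration only; shapely's implementation is the robust one to use in practice):

```python
def point_in_polygon(x, y, vertices):
    """Ray casting: count how many polygon edges a rightward horizontal
    ray from (x, y) crosses; an odd count means the point is inside."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # The edge crosses the ray's height only if y lies between y1 and y2
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Polygon 0 from the example data, an arbitrary interior point,
# and point 0 from df_pnt (which the geopandas output marks unmatched)
poly0 = [(0.0, 0.0), (0.5, 0.0), (0.3, 0.2), (0.0, 0.2)]
print(point_in_polygon(0.1, 0.1, poly0))          # True
print(point_in_polygon(0.08834, 0.23203, poly0))  # False
```

This is exactly the per-pair test a join condition has to run, which is why a cross join filtered by such a predicate is the usual pattern when no equi-join key exists.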
UPD:
If I had to implement this in geopandas, it would look like this:
df_pnt
id geometry
0 0 POINT (0.08834 0.23203)
1 1 POINT (0.67457 0.19285)
2 2 POINT (0.71186 0.25128)
3 3 POINT (0.55621 0.35016)
4 4 POINT (0.79637 0.24668)
5 5 POINT (0.40932 0.37155)
6 6 POINT (0.36124 0.68229)
7 7 POINT (0.13476 0.58242)
8 8 POINT (0.41659 0.46298)
9 9 POINT (0.74878 0.78191)
10 10 POINT (0.82088 0.58064)
11 11 POINT (0.28797 0.24399)
12 12 POINT (0.40502 0.99233)
13 13 POINT (0.68928 0.73251)
14 14 POINT (0.37765 0.71518)
df_poly
id geometry
0 0 POLYGON ((0.00000 0.00000, 0.50000 0.00000, 0....
1 1 POLYGON ((0.60000 0.00000, 0.60000 0.30000, 0....
2 2 POLYGON ((0.60000 0.50000, 0.50000 0.50000, 0....
3 3 POLYGON ((0.00000 0.50000, 0.20000 0.40000, 0....
gpd.sjoin(df_pnt, df_poly, how="left", op='intersects')
id_left geometry index_right id_right
0 0 POINT (0.08834 0.23203) NaN NaN
1 1 POINT (0.67457 0.19285) 1.0 1.0
2 2 POINT (0.71186 0.25128) NaN NaN
3 3 POINT (0.55621 0.35016) NaN NaN
4 4 POINT (0.79637 0.24668) NaN NaN
5 5 POINT (0.40932 0.37155) NaN NaN
6 6 POINT (0.36124 0.68229) 2.0 2.0
7 7 POINT (0.13476 0.58242) NaN NaN
8 8 POINT (0.41659 0.46298) NaN NaN
9 9 POINT (0.74878 0.78191) NaN NaN
10 10 POINT (0.82088 0.58064) NaN NaN
11 11 POINT (0.28797 0.24399) NaN NaN
12 12 POINT (0.40502 0.99233) NaN NaN
13 13 POINT (0.68928 0.73251) NaN NaN
14 14 POINT (0.37765 0.71518) 2.0 2.0
Hm... I don't know all of PySpark's features, and no clear representation pattern fits my case... but I have added some information to the question to explain what the result would be with geopandas.

When you create the Spark DataFrames df_pnt and df_poly, could you print the schema (df.printSchema()) and show some values (df.show(truncate=False))? Not every pyspark user is familiar with pandas, so it is hard to answer your question.
printSchema for the Spark DataFrames gives

root
 |-- id: long (nullable = true)
 |-- wkt: string (nullable = true)

df.show for df_pnt looks like this:

+---+----------------------------------------------+
| id|wkt                                           |
+---+----------------------------------------------+
|  0|POINT (0.2921357376... 0.6871580673326519)    |
|  1|POINT (0.628691318334046 0.13527466...)       |
|  2|POINT (0.8953860983142878 0.585111896234707)  |
|  3|POINT (0.3906532809342733 0.774248079342560)  |
|  4|POINT (0.26806206335805934 0.1676353319933286)|
+---+----------------------------------------------+

df.show for df_poly looks like this:

+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id|wkt                                                                                                                                                                                 |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|  0|POLYGON ((0.0000000000000000 0.0000000000000000, 0.5000000000000000 0.0000000000000000, 0.3000000000000000 0.2000000000000000, 0.0000000000000000 0.2000000000000000, 0.0000000000000000 0.0000000000000000))|
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
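Given that schema, the first step on the Spark side is turning those `wkt` strings back into coordinates. A minimal regex-based sketch of the parsing that shapely's `wkt.loads` would otherwise provide (the helper `wkt_to_coords` is hypothetical, for illustration):

```python
import re

def wkt_to_coords(s):
    """Pull the numeric 'x y' pairs out of a WKT POINT or POLYGON string."""
    nums = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", s)]
    return list(zip(nums[0::2], nums[1::2]))

print(wkt_to_coords("POINT (0.67457 0.19285)"))
# [(0.67457, 0.19285)]
print(wkt_to_coords(
    "POLYGON ((0.0 0.0, 0.5 0.0, 0.3 0.2, 0.0 0.2, 0.0 0.0))"))
# five (x, y) vertex tuples, the last closing the ring
```

In a real pipeline this parsing would run inside a UDF (or be replaced entirely by a geospatial Spark extension such as Apache Sedona, which ships ST_Contains and avoids hand-rolled geometry code).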