Join table condition in PySpark: if a point is within a polygon


I have 2 PySpark dataframes: one with points, df_pnt, and another with polygons, df_poly. Since I am not very familiar with PySpark, I am struggling to join these dataframes correctly on the condition that a point lies within a polygon. I started with code built from the following material:

If we want to plot the first polygon, we should run

wkt.loads(df_poly.take(1)[0].wkt)

If we want to check whether a Polygon object contains a Point object, we need the following line:

Polygon.contains(Point)

The question is how to handle this custom condition during the join. df_poly is much smaller than the point dataframe, so I would also like to take advantage of broadcasting.
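
For reference, here is a minimal sketch of one way to express such a join, assuming both dataframes keep their geometries as WKT strings in a column named wkt (as in the schemas shown in the comments further down); the point_in_poly UDF is my own name, not an established API. The small polygon table is broadcast, cross joined against the points, and the pairs are filtered with a shapely-based UDF:

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType
from shapely import wkt as shapely_wkt

# Boolean UDF: parse both WKT strings with shapely and test containment.
# Parsing inside the UDF keeps everything serializable, at the cost of
# re-parsing the polygon for every (point, polygon) pair.
@F.udf(returnType=BooleanType())
def point_in_poly(point_wkt, poly_wkt):
    return shapely_wkt.loads(poly_wkt).contains(shapely_wkt.loads(point_wkt))

# Broadcast the small polygon table, form all (point, polygon) pairs,
# then keep only the pairs where the polygon contains the point.
matched = (
    df_pnt.alias("p")
    .crossJoin(F.broadcast(df_poly.alias("g")))
    .where(point_in_poly(F.col("p.wkt"), F.col("g.wkt")))
    .select(F.col("p.id").alias("pnt_id"), F.col("g.id").alias("poly_id"))
)
matched.show()

Putting the UDF directly into the condition of an outer join is more restricted (in my experience Spark rejects Python UDFs in some non-inner join conditions), which is why the sketch uses an explicit crossJoin plus filter.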

UPD: If I needed to implement this in geopandas, it would look like this:

df_pnt
    id  geometry
0   0   POINT (0.08834 0.23203)
1   1   POINT (0.67457 0.19285)
2   2   POINT (0.71186 0.25128)
3   3   POINT (0.55621 0.35016)
4   4   POINT (0.79637 0.24668)
5   5   POINT (0.40932 0.37155)
6   6   POINT (0.36124 0.68229)
7   7   POINT (0.13476 0.58242)
8   8   POINT (0.41659 0.46298)
9   9   POINT (0.74878 0.78191)
10  10  POINT (0.82088 0.58064)
11  11  POINT (0.28797 0.24399)
12  12  POINT (0.40502 0.99233)
13  13  POINT (0.68928 0.73251)
14  14  POINT (0.37765 0.71518)

df_poly

        id  geometry
0   0   POLYGON ((0.00000 0.00000, 0.50000 0.00000, 0....
1   1   POLYGON ((0.60000 0.00000, 0.60000 0.30000, 0....
2   2   POLYGON ((0.60000 0.50000, 0.50000 0.50000, 0....
3   3   POLYGON ((0.00000 0.50000, 0.20000 0.40000, 0....

gpd.sjoin(df_pnt, df_poly, how="left", op='intersects')

    id_left     geometry    index_right     id_right
0   0   POINT (0.08834 0.23203)     NaN     NaN
1   1   POINT (0.67457 0.19285)     1.0     1.0
2   2   POINT (0.71186 0.25128)     NaN     NaN
3   3   POINT (0.55621 0.35016)     NaN     NaN
4   4   POINT (0.79637 0.24668)     NaN     NaN
5   5   POINT (0.40932 0.37155)     NaN     NaN
6   6   POINT (0.36124 0.68229)     2.0     2.0
7   7   POINT (0.13476 0.58242)     NaN     NaN
8   8   POINT (0.41659 0.46298)     NaN     NaN
9   9   POINT (0.74878 0.78191)     NaN     NaN
10  10  POINT (0.82088 0.58064)     NaN     NaN
11  11  POINT (0.28797 0.24399)     NaN     NaN
12  12  POINT (0.40502 0.99233)     NaN     NaN
13  13  POINT (0.68928 0.73251)     NaN     NaN
14  14  POINT (0.37765 0.71518)     2.0     2.0
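
To mimic gpd.sjoin(..., how="left") in PySpark, one option (a sketch reusing the hypothetical point_in_poly UDF and imports from above) is to compute the inner matches first and then left-join them back onto df_pnt by the point id, so unmatched points survive with nulls:

# Inner matches via broadcast cross join + containment filter, as above.
matches = (
    df_pnt.alias("p")
    .crossJoin(F.broadcast(df_poly.alias("g")))
    .where(point_in_poly(F.col("p.wkt"), F.col("g.wkt")))
    .select(F.col("p.id").alias("id"), F.col("g.id").alias("id_right"))
)

# Left join back onto the full point table: points with no containing
# polygon keep a null id_right, matching the NaN rows in the geopandas output.
result = df_pnt.join(matches, on="id", how="left")
result.orderBy("id").show(truncate=False)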

Hm... I don't know all of PySpark's features and don't have a clear idea of which schema fits my case... but I added some information to the question to explain what the result would be with geopandas.

When you create the spark dataframes df_pnt and df_poly, could you print the schema (df.printSchema()) and show some values (df.show(truncate=False))? Not every pyspark user is familiar with pandas, so it is hard to answer your question.

printSchema of the Spark dfs:

root
 |-- id: long (nullable = true)
 |-- wkt: string (nullable = true)

df.show for df_pnt looks like this:

+---+-----------------------------------------------------------+
|id |wkt                                                          |
+---+-----------------------------------------------------------+
|0  |POINT (0.29213573767676767696959469469 0.6871580673326519)  |
|1  |POINT (0.628691318334046 0.13527466)                        |
|2  |POINT (0.8953860983142878 0.585111896234707)                |
|3  |POINT (0.3906532809342733 0.774248079342560)                |
|4  |POINT (0.26806206335805934 0.1676353319933286)              |
+---+-----------------------------------------------------------+

and df.show for df_poly looks like this:

+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |wkt                                                                                                                                                                                                            |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0  |POLYGON ((0.0000000000000000 0.0000000000000000, 0.5000000000000000 0.0000000000000000, 0.3000000000000000 0.2000000000000000, 0.0000000000000000 0.2000000000000000, 0.0000000000000000 0.0000000000000000))|
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
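
For completeness, a small sketch (my own reconstruction, reusing a few values from the geopandas sample above for illustration) of how Spark dataframes with this id/wkt schema could be created for experimenting:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny test frames with the same schema as printed above: id long, wkt string.
df_pnt = spark.createDataFrame(
    [(1, "POINT (0.67457 0.19285)"), (6, "POINT (0.36124 0.68229)")],
    schema="id long, wkt string",
)
df_poly = spark.createDataFrame(
    [(0, "POLYGON ((0.0 0.0, 0.5 0.0, 0.3 0.2, 0.0 0.2, 0.0 0.0))")],
    schema="id long, wkt string",
)

df_pnt.printSchema()
df_pnt.show(truncate=False)
df_poly.show(truncate=False)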