Python PySpark: Why does the ST_Intersects function return duplicate rows?
I am using GeoSpark's ST_Intersects function to intersect points with polygons:
queryOverlap = """
SELECT p.ID, z.COUNTYNS as zone, p.date, timestamp, p.point
FROM gpsPingTable as p, zoneShapes as z
WHERE ST_Intersects(p.point, z.geometry)
"""
pingsDay = spark.sql(queryOverlap)
pingsDay.show()
Why is every row in the result returned twice?
+--------------------+--------+----------+-------------------+--------------------+
| ID| zone| date| timestamp| point|
+--------------------+--------+----------+-------------------+--------------------+
|45cdaabc-a804-46b...|01529224|2020-03-17|2020-03-17 12:29:24|POINT (-122.38825...|
|45cdaabc-a804-46b...|01529224|2020-03-17|2020-03-17 12:29:24|POINT (-122.38825...|
|45cdaabc-a804-46b...|01529224|2020-03-18|2020-03-18 11:21:27|POINT (-122.38851...|
|45cdaabc-a804-46b...|01529224|2020-03-18|2020-03-18 11:21:27|POINT (-122.38851...|
|aae0bb4e-4899-4ce...|01531402|2020-03-18|2020-03-18 06:58:03|POINT (-122.23097...|
|aae0bb4e-4899-4ce...|01531402|2020-03-18|2020-03-18 06:58:03|POINT (-122.23097...|
|f9b58c70-0665-4f5...|01531928|2020-03-17|2020-03-17 17:32:46|POINT (-119.43811...|
|f9b58c70-0665-4f5...|01531928|2020-03-17|2020-03-17 17:32:46|POINT (-119.43811...|
|f9b58c70-0665-4f5...|01531928|2020-03-18|2020-03-18 08:21:34|POINT (-119.41080...|
|f9b58c70-0665-4f5...|01531928|2020-03-18|2020-03-18 08:21:34|POINT (-119.41080...|
|f9b58c70-0665-4f5...|01531928|2020-03-19|2020-03-19 00:26:43|POINT (-119.43623...|
|f9b58c70-0665-4f5...|01531928|2020-03-19|2020-03-19 00:26:43|POINT (-119.43623...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 06:30:43|POINT (-122.22106...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 06:30:43|POINT (-122.22106...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 07:57:47|POINT (-122.22102...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 07:57:47|POINT (-122.22102...|
|a32f727d-566b-4ad...|01529224|2020-03-18|2020-03-18 14:38:13|POINT (-122.59499...|
|a32f727d-566b-4ad...|01529224|2020-03-18|2020-03-18 14:38:13|POINT (-122.59499...|
|ad7e4d7e-f8e5-45b...|01529224|2020-03-18|2020-03-18 07:58:51|POINT (-122.14959...|
|ad7e4d7e-f8e5-45b...|01529224|2020-03-18|2020-03-18 07:58:51|POINT (-122.14959...|
+--------------------+--------+----------+-------------------+--------------------+
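The most obvious cause is that the points or the zones in the source tables are not unique. The query is a join, so every matching point/zone pair produces one output row; if either table contains duplicate rows, the result contains duplicates as well. Check the source tables for uniqueness: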
SELECT p.ID, p.date, COUNT(*) AS c
FROM gpsPingTable AS p
GROUP BY p.ID, p.date HAVING c > 1
This reports duplicate points. The following reports duplicate zones:
SELECT z.COUNTYNS as zone, COUNT(*) c
FROM zoneShapes as z
GROUP BY zone HAVING c > 1
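The effect described above can be reproduced without a cluster. The following is a minimal sketch, not GeoSpark code: it assumes an in-memory SQLite database as a stand-in for Spark SQL, a bounding-box comparison as a stand-in for ST_Intersects, and made-up rows (the ID `ping-1`, the coordinates and the box columns `xmin`..`ymax` are hypothetical). It shows that one point joined against a zone that was loaded twice yields two identical output rows, and that the uniqueness check flags that zone.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; this is not GeoSpark.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gpsPingTable (ID TEXT, x REAL, y REAL)")
conn.execute(
    "CREATE TABLE zoneShapes "
    "(COUNTYNS TEXT, xmin REAL, ymin REAL, xmax REAL, ymax REAL)"
)

# One hypothetical ping.
conn.execute("INSERT INTO gpsPingTable VALUES ('ping-1', -122.4, 37.8)")

# The same hypothetical zone inserted twice, as if the shapes were loaded twice.
zone = ("01529224", -123.0, 37.0, -122.0, 38.0)
conn.executemany("INSERT INTO zoneShapes VALUES (?, ?, ?, ?, ?)", [zone, zone])

# A bounding-box test plays the role of ST_Intersects(p.point, z.geometry).
joined = conn.execute(
    """
    SELECT p.ID, z.COUNTYNS AS zone
    FROM gpsPingTable AS p, zoneShapes AS z
    WHERE p.x BETWEEN z.xmin AND z.xmax
      AND p.y BETWEEN z.ymin AND z.ymax
    """
).fetchall()

# Uniqueness check on the zone table, same idea as the query above.
dup_zones = conn.execute(
    """
    SELECT COUNTYNS AS zone, COUNT(*) AS c
    FROM zoneShapes
    GROUP BY COUNTYNS
    HAVING COUNT(*) > 1
    """
).fetchall()

print(joined)     # the single ping appears once per copy of the zone
print(dup_zones)  # the zone that occurs more than once, with its count
```

If the equivalent check on the real `zoneShapes` or `gpsPingTable` returns rows, the duplication comes from the input data rather than from ST_Intersects.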
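Have you tried adding a simple DISTINCT clause?
@ManuelCarrero Thanks, I tried, but I am not sure where to put the DISTINCT clause in my query. Can you help?
SELECT DISTINCT p.ID, z.COUNTYNS as zone, p.date, timestamp, p.point FROM gpsPingTable as p, zoneShapes as z WHERE ST_Intersects(p.point, z.geometry)
This should work.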
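To illustrate what DISTINCT does to a result like the one in the question, here is a small sketch under stated assumptions: SQLite again stands in for Spark SQL, and the table `matches` and its rows are hypothetical. DISTINCT collapses rows only when every selected column is identical, and it hides duplicates in the output without removing them from the source tables.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; `matches` is a hypothetical
# table holding an already-joined result with one duplicated row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (ID TEXT, zone TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?)",
    [
        ("ping-1", "01529224", "2020-03-17"),
        ("ping-1", "01529224", "2020-03-17"),  # exact duplicate
        ("ping-2", "01531402", "2020-03-18"),
    ],
)

# Without DISTINCT every stored row comes back, duplicates included.
plain = conn.execute(
    "SELECT ID, zone, date FROM matches ORDER BY ID"
).fetchall()

# With DISTINCT, rows whose selected columns all match are returned once.
distinct = conn.execute(
    "SELECT DISTINCT ID, zone, date FROM matches ORDER BY ID"
).fetchall()

print(len(plain), len(distinct))
```

On a PySpark DataFrame such as `pingsDay`, calling `dropDuplicates()` has the same effect as SELECT DISTINCT. Either way, it is still worth checking the source tables, because deduplicating the output does not explain where the duplicates came from.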