
Scala: Speeding up a GeoMesa query

Tags: scala, apache-spark, postgis, jts, geomesa

I have been testing GeoMesa with simple spatial queries and comparing it to PostGIS. For example, this SQL query runs in 30 seconds in PostGIS:

with series as (
    select generate_series(0, 5000) as i
),
points as (
    select ST_Point(i, i*2) as geom from series
)
select st_distance(a.geom, b.geom) from points as a, points as b
Now, the following GeoMesa version takes 5 minutes (with -Xmx10g):

import org.apache.spark.sql.SparkSession
import org.locationtech.geomesa.spark.jts._
import org.locationtech.jts.geom._

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .config("spark.sql.crossJoin.enabled", "true")
      .config("spark.executor.memory", "12g")
      .config("spark.driver.memory", "12g")
      .config("spark.cores.max", "4")
      .master("local")
      .appName("Geomesa")
      .getOrCreate()
    spark.withJTS
    import spark.implicits._

    val x = 0 until 5000
    val y = for (i <- x) yield i*2
    val coords = for ((i, n) <- x.zipWithIndex) yield (i, y(n))
    val points = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
    val points2 = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
    val all_points = for {
      i <- points
      j <- points2} yield (i, j)
    val df = all_points.toDF("point", "point2")
    val df2 = df.withColumn("dist", st_distance($"point", $"point2"))
    df2.show()
  }
}

GeoMesa is designed for distributed NoSQL databases. If your dataset fits in PostGIS, you should probably just use PostGIS; once you reach the limits of PostGIS, then you should consider GeoMesa. GeoMesa does provide integration with arbitrary GeoTools data stores (including PostGIS), which can make some GeoMesa functionality available on top of PostGIS.
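As a hedged sketch of what that integration can look like (assumptions on my part: the geomesa-spark-sql module is on the classpath, the generic GeoTools provider is selected with the geotools flag, and every connection value below is a placeholder):

// Read a feature type from a standard GeoTools PostGIS data store through
// GeoMesa's Spark SQL data source (sketch; parameter values are placeholders).
val postgisDf = spark.read
  .format("geomesa")
  .option("geotools", "true")          // route through the generic GeoTools provider
  .option("dbtype", "postgis")         // standard GeoTools PostGIS store parameters
  .option("host", "localhost")
  .option("port", "5432")
  .option("database", "mydb")
  .option("user", "user")
  .option("passwd", "secret")
  .option("geomesa.feature", "points") // name of the feature type to load
  .load()
postgisDf.show()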


For your particular snippet, I suspect most of the time is spent spinning up the RDD and running the loops. There is not really a "query" here, since you are just running a pairwise computation. If you were querying data stored in a table, GeoMesa would have a chance to optimize the scan. However, GeoMesa is not a SQL database and has no native support for joins; in general, joins are done in memory by Spark, although there are things you can do to speed them up (e.g. … or …), one of which is sketched below. If you want to do complex spatial joins, you may want to look at GeoSpark and/or Magellan, which specialize in spatial Spark operations.
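One such Spark-side technique (my example, not necessarily what the elided links above pointed to) is a broadcast join hint, which ships the smaller side of a join to every executor and avoids a shuffle; this sketch reuses the spark session from the question:

import org.apache.spark.sql.functions.broadcast

// When one side of a join fits in executor memory, broadcasting it turns the
// join into a map-side operation instead of a full shuffle.
val small  = spark.range(100).toDF("id")
val large  = spark.range(10000000L).toDF("id")
val joined = large.join(broadcast(small), "id")
joined.show()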

Could you try the same thing, but with a constant column value instead of st_distance? I.e. df.withColumn("dist", lit(1)). I am curious how much of the time is spent computing the distances versus just creating the DataFrame.
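A quick way to run that experiment (a sketch reusing the question's spark and df; SparkSession.time prints the wall-clock time of the enclosed action):

import org.apache.spark.sql.functions.lit

// Baseline: force the full cross product with a constant column. count()
// evaluates every row, unlike show(), which only computes a handful.
spark.time { df.withColumn("dist", lit(1)).count() }

// Same, but with the actual distance computation.
spark.time { df.withColumn("dist", st_distance($"point", $"point2")).count() }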
For comparison, the same pairwise distance computation in plain Scala/JTS, with no Spark involved:

import org.locationtech.jts.geom._

object HelloWorld {
  def main(args: Array[String]): Unit = {

    val x = 0 until 5000
    val y = for (i <- x) yield i*2
    val coords = for ((i, n) <- x.zipWithIndex) yield (i, y(n))
    val points = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
    val points2 = for {
      i <- points
      j <- points} yield i.distance(j)

    println(points2.slice(0,30))
  }
}
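If the goal is to keep the computation in Spark, one variant worth trying (my sketch, under the same imports, SparkSession, and JTS encoders as the question) is to create only the 5,000 points on the driver and let Spark form the 25,000,000 pairs with crossJoin, instead of materializing every pair in driver memory before calling toDF:

// Reuse a single GeometryFactory instead of allocating one per point.
val factory = new GeometryFactory()
val pts = (0 until 5000)
  .map(i => factory.createPoint(new Coordinate(i, i * 2)))
  .toDF("point")

// Spark builds the 5000 x 5000 pairs itself; crossJoin.enabled is already set.
val pairs = pts.crossJoin(pts.withColumnRenamed("point", "point2"))
val dists = pairs.withColumn("dist", st_distance($"point", $"point2"))
dists.show()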