Apache Spark: get the value of a key in a Spark SQL query


I have the following DF schema:

scala> hotelsDF.printSchema()
root
 |-- id: long (nullable = true)
 |-- version: integer (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- changeset: long (nullable = true)
 |-- uid: integer (nullable = true)
 |-- user_sid: binary (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: binary (nullable = true)
 |    |    |-- value: binary (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
I need to filter records that have a tag with key equal to tourism and value equal to hotel. I do this with the following SQL query:

sqlContext.sql("select * from nodes where array_contains(tags.key, binary('tourism')) and array_contains(tags.value, binary('hotel'))").show()
So far, so good.

Now, my question is: how do I select the value for a given tag key? A pseudo-query would look something like this:

sqlContext.sql("select tags.tourism from nodes where array_contains(tags.key, binary('tourism')) and array_contains(tags.value, binary('hotel'))").show()

which would then return hotel for all matching entries.

You can explode the array and then filter:

import org.apache.spark.sql.functions.{col, explode}

hotelsDF.withColumn(
    "tags1",
    explode(col("tags"))          // one row per tag element
).drop(
    "tags"
).filter(
    // key/value are binary, so compare them as strings;
    // === and && are the Column operators in Scala
    (col("tags1.key").cast("string") === "tourism") &&
    (col("tags1.value").cast("string") === "hotel")
).show()
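
To then pull out the value for a given key (which is what the question actually asks for), you can keep only the key filter and project the value column. A minimal sketch, assuming the exploded column is named tags1 as above and the binary key/value columns hold UTF-8 text:

import org.apache.spark.sql.functions.{col, explode}

hotelsDF
  .withColumn("tags1", explode(col("tags")))
  .filter(col("tags1.key").cast("string") === "tourism")
  .select(
    col("id"),
    // decode the binary value; "hotel" for the rows in question
    col("tags1.value").cast("string").as("tourism")
  )
  .show()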

I solved this with a different approach. I added the following case classes:

case class Entry(
                  id: Long,
                  version: Int,
                  timestamp: Long,
                  changeset: Long,
                  uid: Int,
                  user_sid: Array[Byte],
                  tags: Array[Tag],
                  latitude: Double,
                  longitude: Double
                )

case class Tag(key: Array[Byte], value: Array[Byte])

case class Hotel(
                  id: Long,
                  stars: Option[String],
                  latitude: Double,
                  longitude: Double,
                  name: String,
                  rooms: Option[String]
                )
Interestingly (and this gave me some trouble), the Scala equivalent of Spark's binary type is simply Array[Byte],

and processed the DF as follows:

def process(country: String) = {
    val dir = "/whatever/dir"
    val df = spark.read.parquet(s"$dir/$country/latest.node.parquet")

    df
      .as[Entry]
      .filter(e => e.tags != null && e.tags.nonEmpty)
      .filter(e =>
        e.tags.exists(t => new String(t.key).equalsIgnoreCase("tourism") && new String(t.value).equalsIgnoreCase("hotel"))
      )
      .map(e => Hotel(
        e.id,
        e.tags.find(findTag("stars")).map(t => new String(t.value)),
        e.latitude,
        e.longitude,
        e.tags.find(findTag("name")).map(t => new String(t.value)).orNull,
        e.tags.find(findTag("rooms")).map(t => new String(t.value))
      ))
      .repartition(1)
      .write
      .format("csv")
      .option("nullValue", null)
      .option("header", value = true)
      .option("delimiter", ",")
      .save(s"$dir/$country/csv")
  }
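
The snippet above references a findTag helper that is not shown in the answer; presumably it builds a predicate that matches a tag by its key, something along these lines (an assumption, decoding the binary key as text):

// hypothetical reconstruction of the findTag helper used above
def findTag(name: String): Tag => Boolean =
  tag => new String(tag.key).equalsIgnoreCase(name)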

Looks cool. As far as I understand, if a row has more than one tag (element in the array), I would get duplicate rows and would need to group them again afterwards? I ended up doing it in a type-safe way, but +1 for a helpful answer.
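
For reference, a sketch of how the rows produced by explode could be grouped back together, so that each node becomes a single row again with its tags collected into a key-to-value map (assumes Spark 2.4+ for map_from_entries; column names follow the schema above):

import org.apache.spark.sql.functions._

hotelsDF
  .withColumn("tag", explode(col("tags")))
  .groupBy("id")
  .agg(
    map_from_entries(
      collect_list(
        struct(
          col("tag.key").cast("string").as("key"),
          col("tag.value").cast("string").as("value")
        )
      )
    ).as("tags")
  )
  .show()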