Apache Spark: get the value of a key in a Spark SQL query


I have the following DF schema:

scala> hotelsDF.printSchema()
root
 |-- id: long (nullable = true)
 |-- version: integer (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- changeset: long (nullable = true)
 |-- uid: integer (nullable = true)
 |-- user_sid: binary (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: binary (nullable = true)
 |    |    |-- value: binary (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
I need to filter records that have a tag with key equal to tourism and value equal to hotel. I do this with the following SQL query:

sqlContext.sql("select * from nodes where array_contains(tags.key, binary('tourism')) and array_contains(tags.value, binary('hotel'))").show()
So far, so good.

Now, my question is: how do I select the value for a given tag key? A pseudo-query would look something like this:

sqlContext.sql("select tags.tourism from nodes where array_contains(tags.key, binary('tourism')) and array_contains(tags.value, binary('hotel'))").show()

which would then return hotel for all matching entries.

You can explode the array and then filter:

import org.apache.spark.sql.functions.{col, explode}

hotelsDF.withColumn(
    "tags1",
    explode(col("tags"))          // one row per tag element
).drop(
    "tags"
).filter(
    // key/value are binary, so compare them as strings;
    // === and && are the Column operators in Scala
    (col("tags1.key").cast("string") === "tourism") &&
    (col("tags1.value").cast("string") === "hotel")
).show()
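
To then pull out the value for a given key (which is what the question actually asks for), you can keep only the key filter and project the value column. A minimal sketch, assuming the exploded column is named tags1 as above and the binary key/value columns hold UTF-8 text:

import org.apache.spark.sql.functions.{col, explode}

hotelsDF
  .withColumn("tags1", explode(col("tags")))
  .filter(col("tags1.key").cast("string") === "tourism")
  .select(
    col("id"),
    // decode the binary value; "hotel" for the rows in question
    col("tags1.value").cast("string").as("tourism")
  )
  .show()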

I solved this with a different approach. I added the following case classes:

case class Entry(
                  id: Long,
                  version: Int,
                  timestamp: Long,
                  changeset: Long,
                  uid: Int,
                  user_sid: Array[Byte],
                  tags: Array[Tag],
                  latitude: Double,
                  longitude: Double
                )

case class Tag(key: Array[Byte], value: Array[Byte])

case class Hotel(
                  id: Long,
                  stars: Option[String],
                  latitude: Double,
                  longitude: Double,
                  name: String,
                  rooms: Option[String]
                )
Interestingly (and this gave me some trouble), the Scala equivalent of Spark's binary type is simply Array[Byte],

and processed the DF as follows:

def process(country: String) = {
    val dir = "/whatever/dir"
    val df = spark.read.parquet(s"$dir/$country/latest.node.parquet")

    df
      .as[Entry]
      .filter(e => e.tags != null && e.tags.nonEmpty)
      .filter(e =>
        e.tags.exists(t => new String(t.key).equalsIgnoreCase("tourism") && new String(t.value).equalsIgnoreCase("hotel"))
      )
      .map(e => Hotel(
        e.id,
        e.tags.find(findTag("stars")).map(t => new String(t.value)),
        e.latitude,
        e.longitude,
        e.tags.find(findTag("name")).map(t => new String(t.value)).orNull,
        e.tags.find(findTag("rooms")).map(t => new String(t.value))
      ))
      .repartition(1)
      .write
      .format("csv")
      .option("nullValue", null)
      .option("header", value = true)
      .option("delimiter", ",")
      .save(s"$dir/$country/csv")
  }
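
The snippet above references a findTag helper that is not shown in the answer; presumably it builds a predicate that matches a tag by its key, something along these lines (an assumption, decoding the binary key as text):

// hypothetical reconstruction of the findTag helper used above
def findTag(name: String): Tag => Boolean =
  tag => new String(tag.key).equalsIgnoreCase(name)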

Looks cool. As far as I understand, if a row has more than one tag (element in the array), I would get duplicate rows and would need to group them again afterwards? I ended up doing it in a type-safe way, but +1 for a helpful answer.
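
For reference, a sketch of how the rows produced by explode could be grouped back together, so that each node becomes a single row again with its tags collected into a key-to-value map (assumes Spark 2.4+ for map_from_entries; column names follow the schema above):

import org.apache.spark.sql.functions._

hotelsDF
  .withColumn("tag", explode(col("tags")))
  .groupBy("id")
  .agg(
    map_from_entries(
      collect_list(
        struct(
          col("tag.key").cast("string").as("key"),
          col("tag.value").cast("string").as("value")
        )
      )
    ).as("tags")
  )
  .show()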