Apache Spark: getting the value of a key in a Spark SQL query
I have the following DF schema:
scala> hotelsDF.printSchema()
root
|-- id: long (nullable = true)
|-- version: integer (nullable = true)
|-- timestamp: long (nullable = true)
|-- changeset: long (nullable = true)
|-- uid: integer (nullable = true)
|-- user_sid: binary (nullable = true)
|-- tags: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: binary (nullable = true)
| | |-- value: binary (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
I need to filter records whose key equals tourism and whose value equals hotel. I do this with the following SQL query:
sqlContext.sql("select * from nodes where array_contains(tags.key, binary('tourism')) and array_contains(tags.value, binary('hotel'))").show()
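One caveat worth noting (illustrated below with hypothetical data, not from the original post): the two array_contains calls test the key array and the value array independently, so a row can match even when "tourism" and "hotel" come from two different tags. A plain-Scala sketch of the difference between the independent check and requiring a single tag element to carry both:

```scala
// Sketch: independent membership checks (what the two array_contains
// calls do) vs. requiring one element to match both key and value.
object ContainsPitfall {
  case class Tag(key: String, value: String)

  def main(args: Array[String]): Unit = {
    val tags = Seq(Tag("tourism", "museum"), Tag("building", "hotel"))

    // analogue of: array_contains(tags.key, 'tourism')
    //          AND array_contains(tags.value, 'hotel')
    val independent =
      tags.map(_.key).contains("tourism") && tags.map(_.value).contains("hotel")

    // stricter: a single tag must have both the key and the value
    val sameElement = tags.exists(t => t.key == "tourism" && t.value == "hotel")

    println(independent) // true: matched, although no single tag has both
    println(sameElement) // false
  }
}
```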
So far, so good.
Now my question is: how do I select the value for a given tag key? A pseudo-query would look something like:
sqlContext.sql("select tags.tourism from nodes where array_contains(tags.key, binary('tourism')) and array_contains(tags.value, binary('hotel'))").show()
which would then return hotel for every matching entry.

You can explode the array and then filter:
import org.apache.spark.sql.functions.{col, explode}

hotelsDF.withColumn(
  "tags1",
  explode(col("tags"))
).drop(
  "tags"
).filter(
  // Scala API: use === and && (not == and &); cast the binary
  // columns to string so they compare against the string literals
  col("tags1.key").cast("string") === "tourism" &&
    col("tags1.value").cast("string") === "hotel"
).show()
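To make the explode-then-filter step concrete, here is a plain-Scala model of it on ordinary collections (hypothetical sample data, and simplified String tags instead of binary):

```scala
// Model of explode + filter without Spark: flattening the tags array
// produces one row per element, and the surviving row's value column
// directly answers "what is the value for this key?".
object ExplodeFilterModel {
  case class Tag(key: String, value: String)
  case class Node(id: Long, tags: Seq[Tag])

  def main(args: Array[String]): Unit = {
    val nodes = Seq(
      Node(1L, Seq(Tag("tourism", "hotel"), Tag("name", "Grand"))),
      Node(2L, Seq(Tag("amenity", "cafe")))
    )

    // explode: one (id, tag) row per array element
    val exploded = for { n <- nodes; t <- n.tags } yield (n.id, t)

    // filter on the key, keep the value
    val hotels = exploded.collect {
      case (id, t) if t.key == "tourism" => (id, t.value)
    }
    println(hotels) // List((1,hotel))
  }
}
```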
I solved this with a different approach. I added the following case classes:
case class Entry(
  id: Long,
  version: Int,
  timestamp: Long,
  changeset: Long,
  uid: Int,
  user_sid: Array[Byte],
  tags: Array[Tag],
  latitude: Double,
  longitude: Double
)

case class Tag(key: Array[Byte], value: Array[Byte])

case class Hotel(
  id: Long,
  stars: Option[String],
  latitude: Double,
  longitude: Double,
  name: String,
  rooms: Option[String]
)
Interestingly (and this caused me some problems), the Spark equivalent of binary is simply Array[Byte].
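The Array[Byte] point bites because JVM arrays compare by reference, not by content. A quick sketch of why converting to String (or using sameElements) is needed when matching tag keys:

```scala
// Arrays on the JVM use reference equality, so two identical byte
// arrays are not ==; compare their contents instead.
object BinaryEquality {
  def main(args: Array[String]): Unit = {
    val a = "tourism".getBytes("UTF-8")
    val b = "tourism".getBytes("UTF-8")
    println(a == b)                                           // false
    println(a.sameElements(b))                                // true
    println(new String(a, "UTF-8") == new String(b, "UTF-8")) // true
  }
}
```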
I then process the DF as follows:
def process(country: String) = {
  val dir = "/whatever/dir"
  val df = spark.read.parquet(s"$dir/$country/latest.node.parquet")

  import spark.implicits._ // needed for .as[Entry]

  // helper reconstructed from its usage below: matches a tag by key,
  // comparing via String since the raw columns are Array[Byte]
  def findTag(key: String)(t: Tag): Boolean =
    new String(t.key).equalsIgnoreCase(key)

  df
    .as[Entry]
    .filter(e => e.tags != null && e.tags.nonEmpty)
    .filter(e =>
      e.tags.exists(t =>
        new String(t.key).equalsIgnoreCase("tourism") &&
          new String(t.value).equalsIgnoreCase("hotel"))
    )
    .map(e => Hotel(
      e.id,
      e.tags.find(findTag("stars")).map(t => new String(t.value)),
      e.latitude,
      e.longitude,
      e.tags.find(findTag("name")).map(t => new String(t.value)).orNull,
      e.tags.find(findTag("rooms")).map(t => new String(t.value))
    ))
    .repartition(1)
    .write
    .format("csv")
    .option("nullValue", null)
    .option("header", value = true)
    .option("delimiter", ",")
    .save(s"$dir/$country/csv")
}
Looks cool. As far as I can tell, if a row has multiple matching tags (elements in the array), I'll get duplicate rows and will need to group them later? I've since managed to do that grouping in a type-safe way, but +1 for the help.
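On the duplicates question: yes, exploding produces one row per matching tag, and those rows can be collapsed by grouping on the node id. A plain-Scala sketch (hypothetical data) of that grouping step:

```scala
// After explode, node 1 contributed two rows; groupBy the id to get
// back to one entry per node, collecting the tag values.
object GroupExploded {
  def main(args: Array[String]): Unit = {
    val rows = Seq((1L, "hotel"), (1L, "hostel"), (2L, "hotel"))

    val grouped: Map[Long, Seq[String]] =
      rows.groupBy(_._1).map { case (id, rs) => id -> rs.map(_._2) }

    println(grouped(1L)) // List(hotel, hostel)
    println(grouped(2L)) // List(hotel)
  }
}
```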