Filter array of structs in a Scala Spark DataFrame


I have a JSON file that I am reading with Scala 2.10 and

val df = sqlContext.read.json("file_path")
The JSON looks like this:

{ "data": [{ "id":"20180218","parent": [{"name": "Market"}]}, { "id":"20180219","parent": [{"name": "Client"},{"name": "Market" }]}, { "id":"20180220","parent": [{"name": "Client"}]},{ "id":"20180221","parent": []}]}
data is an array of structs. Each struct has a parent key. parent is also an array of structs and can hold zero or more values.

I need to filter the parent array so that it contains only structs named "Market", or nothing at all. My output should look like this:

{ "data": [{ "id":"20180218","parent": [{"name": "Market"}]}, { "id":"20180219","parent": [{"name": "Market" }]}, { "id":"20180220","parent": []},{ "id":"20180221","parent": []}]}
So, essentially, filter out every struct whose name is anything other than "Market", and keep empty parent arrays (whether produced by the operation or already empty to begin with).

Can anyone help?


Thanks

We need to use the explode function to query this kind of nested JSON structure with arrays. explode generates a new output row for each element of an array column, which lets us reach the fields nested inside.
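As a quick illustration with made-up data, explode turns each element of an array column into its own row; note that it silently drops rows whose array is empty:

// Toy example (invented data) showing explode's behaviour;
// assumes a spark-shell style session with `spark` in scope
import org.apache.spark.sql.functions.{col, explode}
import spark.implicits._

val demo = Seq((1, Seq("a", "b")), (2, Seq.empty[String])).toDF("id", "letters")
demo.select(col("id"), explode(col("letters"))).show()
// +---+---+
// | id|col|
// +---+---+
// |  1|  a|
// |  1|  b|
// +---+---+
// the row with the empty array (id = 2) is dropped entirely

With that in mind, let's walk through the actual data: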

scala> val df = spark.read.json("/Users/pavithranrao/Desktop/test.json")

scala> df.printSchema
root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- parent: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- name: string (nullable = true)

scala> import org.apache.spark.sql.functions.{col, explode}

scala> val oneDF = df.select(col("data"), explode(col("data"))).toDF("data", "element").select(col("data"), col("element.parent"))
scala> oneDF.show
"""
+--------------------+--------------------+
|                data|              parent|
+--------------------+--------------------+
|[[20180218,Wrappe...|          [[Market]]|
|[[20180218,Wrappe...|[[Client], [Market]]|
|[[20180218,Wrappe...|          [[Client]]|
|[[20180218,Wrappe...|                  []|
+--------------------+--------------------+
"""

scala> val twoDF = oneDF.select(col("data"), explode(col("parent"))).toDF("data", "names")
scala> twoDF.printSchema
root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- parent: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- name: string (nullable = true)
 |-- names: struct (nullable = true)
 |    |-- name: string (nullable = true)

scala> twoDF.show
"""
+--------------------+--------+
|                data|   names|
+--------------------+--------+
|[[20180218,Wrappe...|[Market]|
|[[20180218,Wrappe...|[Client]|
|[[20180218,Wrappe...|[Market]|
|[[20180218,Wrappe...|[Client]|
+--------------------+--------+
"""

scala> import org.apache.spark.sql.functions.length

// Find names structs whose name is empty
scala> twoDF.select(length(col("names.name")) === 0).show
+------------------------+
|(length(names.name) = 0)|
+------------------------+
|                   false|
|                   false|
|                   false|
|                   false|
+------------------------+

// Find names structs whose name does not contain "Market"
scala> twoDF.select(!col("names.name").contains("Market")).show()
+----------------------------------+
|(NOT contains(names.name, Market))|
+----------------------------------+
|                             false|
|                              true|
|                             false|
|                              true|
+----------------------------------+

// Combining these two

scala> val ansDF = twoDF.select("data").filter(!col("names.name").contains("Market") || length(col("names.name")) === 0)
scala> ansDF.printSchema

// Schema same as input df
root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- parent: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- name: string (nullable = true)

scala> ansDF.show(false)
+----------------------------------------------------------------------------------------------------------------------------------------------+
|data                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|[[20180218,WrappedArray([Market])], [20180219,WrappedArray([Client], [Market])], [20180220,WrappedArray([Client])], [20180221,WrappedArray()]]|
|[[20180218,WrappedArray([Market])], [20180219,WrappedArray([Client], [Market])], [20180220,WrappedArray([Client])], [20180221,WrappedArray()]]|
+----------------------------------------------------------------------------------------------------------------------------------------------+
The final ansDF contains the filtered records that satisfy the condition that name does not contain "Market" or is empty.

PS: If I have missed the exact filter scenario, please adjust the filter functions in the code above.


Hope this helps!

What have you tried so far? Can you share some code samples of what you have attempted? On a side note, if you are using Spark v2.0+, this is much easier to achieve by using a Dataset with proper case classes for the nested structures instead of a DataFrame. Datasets let us use RDD-style operations such as filter, and we don't need explode to peek into the structs or arrays (see the sketch after these comments).

This doesn't solve the problem. The filter condition doesn't work because of incorrect dereferencing. I tried

twoDF.where(length(col("names.name")) === 0 || !col("names.name").contains("Market")).show

and it gives this result:

+--------------------+--------+
|                data|   names|
+--------------------+--------+
|[[20180218,Wrappe...|[Client]|
|[[20180218,Wrappe...|[Client]|
+--------------------+--------+

Did twoDF.where(length(col("names.name")) === 0 || !col("names.name").contains("Market")) give the correct result? Please edit the answer if needed.

No, it didn't. It returns only the rows whose name is Client and filters out the null/empty parent rows.
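For reference, here is a minimal sketch of the Dataset-with-case-classes approach suggested in the first comment. It assumes Spark 2.0+ (so not the Scala 2.10 setup from the question), and the case class names Parent, Record and Payload are invented for this example:

// Minimal sketch, assuming Spark 2.0+ and a spark-shell style session
// with a SparkSession named `spark`; the class names are made up
import spark.implicits._

case class Parent(name: String)
case class Record(id: String, parent: Seq[Parent])
case class Payload(data: Seq[Record])

val ds = spark.read.json("file_path").as[Payload]

// Keep only the "Market" entries inside each parent array; parent arrays
// that are already empty, or become empty after filtering, stay as []
val filtered = ds.map { p =>
  Payload(p.data.map(r => r.copy(parent = r.parent.filter(_.name == "Market"))))
}

filtered.toJSON.show(false)

On Spark 2.4+ the same transformation can also be written without case classes, using the transform and filter higher-order functions, e.g. df.selectExpr("transform(data, x -> named_struct('id', x.id, 'parent', filter(x.parent, p -> p.name = 'Market'))) AS data").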