Scala filtering based on nested structs/arrays

Here is the schema:

root
 |-- target_column: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- sub_column: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- sub_id: string (nullable = true)
 |    |    |    |    |-- title: string (nullable = true)
 |    |    |    |    |-- scopes: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
I want to do the following:

  • Add a column that returns the title if scopes contains the string "dog", and "" if it does not.
  • Here is an example:

    target_column

    [{a, [{a1, title1, []}]}, {b, [{b1, title2, [cat, dog]}]}]
    [{a, [{a1, title1, []}]}, {c, [{c1, title3, [cat, rabbit]}]}, {d, [{d1, title4, [cat]}, {d2, title5, [kitten]}]}]
    
    This would return:

    final_columns
    title2
    []
    

    I'm new to Scala, so any help is greatly appreciated.

    Try the following code snippet to get the desired output:

    import org.apache.spark.sql.SparkSession

    // SQLContext is deprecated since Spark 2.0; SparkSession is the modern entry point.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first Program")
      .getOrCreate()
    import spark.implicits._

    val jsondata =
      """
        |{
        |  "target_column": [
        |    {
        |      "id": "a",
        |      "sub_column": [
        |        { "sub_id": "a1", "title": "title1", "scopes": [] }
        |      ]
        |    },
        |    {
        |      "id": "b",
        |      "sub_column": [
        |        { "sub_id": "b1", "title": "title2", "scopes": ["cat", "dog"] }
        |      ]
        |    }
        |  ]
        |}
      """.stripMargin

    val df = spark.read.json(Seq(jsondata).toDS())
    df.show(false)
    df.printSchema()
    df.createOrReplaceTempView("targetInfo")

    // Exploding target_column.sub_column yields one row per target_column element;
    // each resulting `col` is still an array of sub_column structs.
    val df2 = spark.sql("select explode(target_column.sub_column) from targetInfo")
    df2.show(false)
    df2.printSchema()
    df2.createOrReplaceTempView("scopesInfo")

    // col.scopes is an array of arrays here, so flatten it before array_contains.
    spark.sql("select col.title from scopesInfo where array_contains(flatten(col.scopes), 'dog')").show(false)
    spark.sql("SELECT *, CASE WHEN array_contains(flatten(col.scopes), 'dog') THEN concat_ws(',', col.title) ELSE '' END AS final_columns FROM scopesInfo").show(false)
    
    This produces the following output:

    +--------------------------+-------------+
    |col                       |final_columns|
    +--------------------------+-------------+
    |[[[], a1, title1]]        |             |
    |[[[cat, dog], b1, title2]]|title2       |
    +--------------------------+-------------+
    

    Hope I haven't misunderstood your question.
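
    For reference, the same logic can also be written with the DataFrame API instead of SQL. This is just a sketch equivalent to the last query above (it assumes the same df and Spark 2.4+ for flatten; dfApi is an illustrative name):

    import org.apache.spark.sql.functions._

    // One row per target_column element; flag rows whose flattened scopes
    // contain "dog", mirroring the CASE WHEN in the SQL version.
    val dfApi = df
      .select(explode($"target_column.sub_column").as("col"))
      .withColumn("final_columns",
        when(array_contains(flatten($"col.scopes"), "dog"), concat_ws(",", $"col.title"))
          .otherwise(lit("")))
    dfApi.show(false)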

    Here's a slightly different approach, though Rakhi's may be the one you want.

    This gives you one record per array element in target_column, which makes more sense to me, but if you need just one record per original record you can apply flatten(collect_list($"titles")) afterwards (shown below).

    First, the data (edited to show the flattening effect):
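
    The original data file isn't reproduced here; the following reconstruction is inferred from the result tables below (the edit mentioned above is that the first record's a1 entry now has a "dog" scope). Save it as example.json, one JSON object per line:

    {"target_column": [{"id": "a", "sub_column": [{"sub_id": "a1", "title": "title1", "scopes": ["dog"]}]}, {"id": "b", "sub_column": [{"sub_id": "b1", "title": "title2", "scopes": ["cat", "dog"]}]}]}
    {"target_column": [{"id": "a", "sub_column": [{"sub_id": "a1", "title": "title1", "scopes": []}]}, {"id": "c", "sub_column": [{"sub_id": "c1", "title": "title3", "scopes": ["cat", "rabbit"]}]}, {"id": "d", "sub_column": [{"sub_id": "d1", "title": "title4", "scopes": ["cat"]}, {"sub_id": "d2", "title": "title5", "scopes": ["kitten"]}]}]}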

    Now load the data and add some array operations (note that these are just lazy transformations until show is called):

    import org.apache.spark.sql.functions._
    
    // "spark" is your SparkSession
    // Note all variables with "df" in them are of DataFrame type
    // and titleCond is an org.apache.spark.sql.Column
    val df = spark.read.json("example.json")
    val dfSub = df.withColumn("sub_column", explode($"target_column.sub_column"))
    val titleCond = expr("array_contains(flatten(sub_column.scopes), 'dog')")
    val dfResult = dfSub.withColumn("titles", when(titleCond, $"sub_column.title").otherwise(array().cast("array<string>")))
    
    dfResult.drop($"target_column").show(false)
    
    +---------------------------------------------+--------+
    |sub_column                                   |titles  |
    +---------------------------------------------+--------+
    |[[[dog], a1, title1]]                        |[title1]|
    |[[[cat, dog], b1, title2]]                   |[title2]|
    |[[[], a1, title1]]                           |[]      |
    |[[[cat, rabbit], c1, title3]]                |[]      |
    |[[[cat], d1, title4], [[kitten], d2, title5]]|[]      |
    +---------------------------------------------+--------+
    
    If you prefer to group by the original source record:

    val dfGrouped = dfResult.groupBy($"target_column").agg(flatten(collect_list($"titles")).alias("titles"))
    dfGrouped.show(false)

    Result (grouped):
    
    +-----------------------------------------------------------------------------------------------------------------+----------------+
    |target_column                                                                                                    |titles          |
    +-----------------------------------------------------------------------------------------------------------------+----------------+
    |[[a, [[[dog], a1, title1]]], [b, [[[cat, dog], b1, title2]]]]                                                    |[title1, title2]|
    |[[a, [[[], a1, title1]]], [c, [[[cat, rabbit], c1, title3]]], [d, [[[cat], d1, title4], [[kitten], d2, title5]]]]|[]              |
    +-----------------------------------------------------------------------------------------------------------------+----------------+
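
    As a footnote: on Spark 2.4+ the same result can be computed without explode at all, using SQL higher-order functions (transform and filter). This is only a sketch against the schema at the top, and dfHof is an illustrative name:

    import org.apache.spark.sql.functions._

    // For each target_column element, keep only the sub_column entries whose
    // scopes contain "dog", project their titles, then flatten the nested
    // arrays and join the surviving titles with commas.
    val dfHof = df.withColumn("final_columns", expr("""
      concat_ws(',',
        flatten(
          transform(target_column, t ->
            transform(
              filter(t.sub_column, s -> array_contains(s.scopes, 'dog')),
              s -> s.title))))
    """))
    dfHof.show(false)

    On the question's original example this yields "title2" for the first record and "" for the second, which matches the requested final_columns.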