Scala filtering based on nested structs/arrays

Here is the schema:

root
 |-- target_column: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- sub_column: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- sub_id: string (nullable = true)
 |    |    |    |    |-- title: string (nullable = true)
 |    |    |    |    |-- scopes: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
I want to do the following:

  • Add a column that returns the title if scopes contains the string "dog", and "" if it does not.
  • Here is an example:

    target_column

    [{a, [{a1, title1, []}]}, {b, [{b1, title2, [cat, dog]}]}]
    [{a, [{a1, title1, []}]}, {c, [{c1, title3, [cat, rabbit]}]}, {d, [{d1, title4, [cat]}, {d2, title5, [kitten]}]}]
    
    This would return:

    final_columns
    title2
    []
    

    I'm new to Scala, so any help is greatly appreciated.

    Try the following code snippet to get the desired output:

    import org.apache.spark.sql.SparkSession

    // SQLContext is deprecated since Spark 2.0; SparkSession is the modern entry point.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first Program")
      .getOrCreate()
    import spark.implicits._

    val jsondata =
      """
        |{
        |  "target_column": [
        |    {
        |      "id": "a",
        |      "sub_column": [
        |        { "sub_id": "a1", "title": "title1", "scopes": [] }
        |      ]
        |    },
        |    {
        |      "id": "b",
        |      "sub_column": [
        |        { "sub_id": "b1", "title": "title2", "scopes": ["cat", "dog"] }
        |      ]
        |    }
        |  ]
        |}
      """.stripMargin

    val df = spark.read.json(Seq(jsondata).toDS())
    df.show(false)
    df.printSchema()
    df.createOrReplaceTempView("targetInfo")

    // Exploding target_column.sub_column yields one row per target_column element;
    // each resulting `col` is still an array of sub_column structs.
    val df2 = spark.sql("select explode(target_column.sub_column) from targetInfo")
    df2.show(false)
    df2.printSchema()
    df2.createOrReplaceTempView("scopesInfo")

    // col.scopes is an array of arrays here, so flatten it before array_contains.
    spark.sql("select col.title from scopesInfo where array_contains(flatten(col.scopes), 'dog')").show(false)
    spark.sql("SELECT *, CASE WHEN array_contains(flatten(col.scopes), 'dog') THEN concat_ws(',', col.title) ELSE '' END AS final_columns FROM scopesInfo").show(false)
    
    This produces the following output:

    +--------------------------+-------------+
    |col                       |final_columns|
    +--------------------------+-------------+
    |[[[], a1, title1]]        |             |
    |[[[cat, dog], b1, title2]]|title2       |
    +--------------------------+-------------+
    

    Hope I haven't misunderstood your question.
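
    For reference, the same logic can also be written with the DataFrame API instead of SQL. This is just a sketch equivalent to the last query above (it assumes the same df and Spark 2.4+ for flatten; dfApi is an illustrative name):

    import org.apache.spark.sql.functions._

    // One row per target_column element; flag rows whose flattened scopes
    // contain "dog", mirroring the CASE WHEN in the SQL version.
    val dfApi = df
      .select(explode($"target_column.sub_column").as("col"))
      .withColumn("final_columns",
        when(array_contains(flatten($"col.scopes"), "dog"), concat_ws(",", $"col.title"))
          .otherwise(lit("")))
    dfApi.show(false)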

    Here's a slightly different approach, though Rakhi's may be the one you want.

    This gives you one record per array element in target_column, which makes more sense to me, but if you need just one record per original record you can apply flatten(collect_list($"titles")) afterwards (shown below).

    First, the data (edited to show the flattening effect):
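
    The original data file isn't reproduced here; the following reconstruction is inferred from the result tables below (the edit mentioned above is that the first record's a1 entry now has a "dog" scope). Save it as example.json, one JSON object per line:

    {"target_column": [{"id": "a", "sub_column": [{"sub_id": "a1", "title": "title1", "scopes": ["dog"]}]}, {"id": "b", "sub_column": [{"sub_id": "b1", "title": "title2", "scopes": ["cat", "dog"]}]}]}
    {"target_column": [{"id": "a", "sub_column": [{"sub_id": "a1", "title": "title1", "scopes": []}]}, {"id": "c", "sub_column": [{"sub_id": "c1", "title": "title3", "scopes": ["cat", "rabbit"]}]}, {"id": "d", "sub_column": [{"sub_id": "d1", "title": "title4", "scopes": ["cat"]}, {"sub_id": "d2", "title": "title5", "scopes": ["kitten"]}]}]}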

    Now load the data and add some array operations (note that these are just lazy transformations until show is called):

    import org.apache.spark.sql.functions._
    
    // "spark" is your SparkSession
    // Note all variables with "df" in them are of DataFrame type
    // and titleCond is an org.apache.spark.sql.Column
    val df = spark.read.json("example.json")
    val dfSub = df.withColumn("sub_column", explode($"target_column.sub_column"))
    val titleCond = expr("array_contains(flatten(sub_column.scopes), 'dog')")
    val dfResult = dfSub.withColumn("titles", when(titleCond, $"sub_column.title").otherwise(array().cast("array<string>")))
    
    dfResult.drop($"target_column").show(false)
    
    +---------------------------------------------+--------+
    |sub_column                                   |titles  |
    +---------------------------------------------+--------+
    |[[[dog], a1, title1]]                        |[title1]|
    |[[[cat, dog], b1, title2]]                   |[title2]|
    |[[[], a1, title1]]                           |[]      |
    |[[[cat, rabbit], c1, title3]]                |[]      |
    |[[[cat], d1, title4], [[kitten], d2, title5]]|[]      |
    +---------------------------------------------+--------+
    
    If you prefer to group by the original source record:

    val dfGrouped = dfResult.groupBy($"target_column").agg(flatten(collect_list($"titles")).alias("titles"))
    dfGrouped.show(false)

    Result (grouped):
    
    +-----------------------------------------------------------------------------------------------------------------+----------------+
    |target_column                                                                                                    |titles          |
    +-----------------------------------------------------------------------------------------------------------------+----------------+
    |[[a, [[[dog], a1, title1]]], [b, [[[cat, dog], b1, title2]]]]                                                    |[title1, title2]|
    |[[a, [[[], a1, title1]]], [c, [[[cat, rabbit], c1, title3]]], [d, [[[cat], d1, title4], [[kitten], d2, title5]]]]|[]              |
    +-----------------------------------------------------------------------------------------------------------------+----------------+
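
    As a footnote: on Spark 2.4+ the same result can be computed without explode at all, using SQL higher-order functions (transform and filter). This is only a sketch against the schema at the top, and dfHof is an illustrative name:

    import org.apache.spark.sql.functions._

    // For each target_column element, keep only the sub_column entries whose
    // scopes contain "dog", project their titles, then flatten the nested
    // arrays and join the surviving titles with commas.
    val dfHof = df.withColumn("final_columns", expr("""
      concat_ws(',',
        flatten(
          transform(target_column, t ->
            transform(
              filter(t.sub_column, s -> array_contains(s.scopes, 'dog')),
              s -> s.title))))
    """))
    dfHof.show(false)

    On the question's original example this yields "title2" for the first record and "" for the second, which matches the requested final_columns.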