Scala: filtering based on nested structs/arrays
Here is the schema:
root
|-- target_column: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- sub_column: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- sub_id: string (nullable = true)
| | | | |-- title: string (nullable = true)
| | | | |-- scopes: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
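For reference, the nested structure above can also be spelled out with Spark's StructType API; a minimal sketch of what printSchema implies (nullable fields and null-tolerant arrays are the defaults):

import org.apache.spark.sql.types._

// The schema from the tree above: an array of structs, each holding an
// id and an array of sub-structs with sub_id, title, and a scopes array.
val schema = StructType(Seq(
  StructField("target_column", ArrayType(StructType(Seq(
    StructField("id", StringType),
    StructField("sub_column", ArrayType(StructType(Seq(
      StructField("sub_id", StringType),
      StructField("title", StringType),
      StructField("scopes", ArrayType(StringType))
    ))))
  ))))
))

Such a schema can be passed to spark.read.schema(schema).json(...) to skip schema inference.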
I want to do the following: add a column that returns the title when scopes contains the string "dog", and "" when it does not. For example, given these two input rows:
[{a,[{a1, title1, []}}, {b,[{b1, title2, [cat, dog]}]}]
[{a,[{a1, title1, []}}, {c,[{c1, title3, [cat, rabbit]}]}, {d,[{d1, title4, [cat]}, {d2, title5, [kitten]}]}]
this would return:
final_columns
title2
[]
I am new to Scala, so any help is greatly appreciated.

Try the following code snippet to get the desired output:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sac = new SparkContext("local[*]", "first Program")
val sqlc = new SQLContext(sac)
import sqlc.implicits._
val jsondata =
"""
|{
| "target_column": [{
| "id" : "a",
| "sub_column" : [{
| "sub_id" : "a1",
| "title" : "title1",
| "scopes": []
| }]
| },
| {
| "id" : "b",
| "sub_column" : [{
| "sub_id" : "b1",
| "title" : "title2",
| "scopes": ["cat", "dog"]
| }]
| }
| ]
|}
""".stripMargin
val df = sqlc.read.json(Seq(jsondata).toDS())
df.show(false)
df.printSchema()
df.createOrReplaceTempView("targetInfo")
val df2 = sqlc.sql("select explode(target_column.sub_column) from targetInfo")
df2.show(false)
df2.printSchema()
df2.createOrReplaceTempView("scopesInfo")
sqlc.sql("select col.title from scopesInfo where array_contains(flatten(col.scopes), 'dog')").show(false)
sqlc.sql("SELECT *, CASE WHEN array_contains(flatten(col.scopes), 'dog') THEN concat_ws(',',col.title) ELSE '' END AS final_columns FROM scopesInfo").show(false)
This produces the following output:
+--------------------------+-------------+
|col |final_columns|
+--------------------------+-------------+
|[[[], a1, title1]] | |
|[[[cat, dog], b1, title2]]|title2 |
+--------------------------+-------------+
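As a side note, SparkSession replaces SQLContext as the entry point on Spark 2.x and later, and the same filtering can be written with the DataFrame API instead of SQL. A sketch, reusing the jsondata string from above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("first Program").getOrCreate()
import spark.implicits._

val df = spark.read.json(Seq(jsondata).toDS())

// explode yields one row per target_column element; the struct column is named "col"
val exploded = df.select(explode($"target_column.sub_column").as("col"))

// col.scopes is an array of arrays, so flatten it before array_contains
val result = exploded.withColumn(
  "final_columns",
  when(array_contains(flatten($"col.scopes"), "dog"), concat_ws(",", $"col.title"))
    .otherwise(lit("")))

result.show(false)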
Hope I haven't misread your question. Here is a slightly different approach, although Rakhi's may be the one you want. It produces one record per array element of target_column, which makes more sense to me, but if you only want one record per original record, you can apply flatten(collect_list($"titles")) afterwards (shown below).
First, the data (edited to show the effect of the flattening):
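Since the original listing of the data file did not survive on this page, here is a plausible example.json reconstructed from the grouped output at the bottom, one JSON object per line:

{"target_column": [{"id": "a", "sub_column": [{"sub_id": "a1", "title": "title1", "scopes": ["dog"]}]}, {"id": "b", "sub_column": [{"sub_id": "b1", "title": "title2", "scopes": ["cat", "dog"]}]}]}
{"target_column": [{"id": "a", "sub_column": [{"sub_id": "a1", "title": "title1", "scopes": []}]}, {"id": "c", "sub_column": [{"sub_id": "c1", "title": "title3", "scopes": ["cat", "rabbit"]}]}, {"id": "d", "sub_column": [{"sub_id": "d1", "title": "title4", "scopes": ["cat"]}, {"sub_id": "d2", "title": "title5", "scopes": ["kitten"]}]}]}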
Now load the data and add some array operations (note that these are just lazy transformations until show is called):
import org.apache.spark.sql.functions._
// "spark" is your SparkSession
// Note all variables with "df" in them are of DataFrame type
// and titleCond is an org.apache.spark.sql.Column
val df = spark.read.json("example.json")
val dfSub = df.withColumn("sub_column", explode($"target_column.sub_column"))
val titleCond = expr("array_contains(flatten(sub_column.scopes), 'dog')")
val dfResult = dfSub.withColumn("titles", when(titleCond, $"sub_column.title").otherwise(array().cast("array<string>")))
dfResult.drop($"target_column").show(false)
+---------------------------------------------+--------+
|sub_column |titles |
+---------------------------------------------+--------+
|[[[dog], a1, title1]] |[title1]|
|[[[cat, dog], b1, title2]] |[title2]|
|[[[], a1, title1]] |[] |
|[[[cat, rabbit], c1, title3]] |[] |
|[[[cat], d1, title4], [[kitten], d2, title5]]|[] |
+---------------------------------------------+--------+
If you prefer to group back by the original source record:

val dfGrouped = dfResult.groupBy($"target_column").agg(flatten(collect_list($"titles")).alias("titles"))
dfGrouped.show(false)

Result (grouped):
+-----------------------------------------------------------------------------------------------------------------+----------------+
|target_column |titles |
+-----------------------------------------------------------------------------------------------------------------+----------------+
|[[a, [[[dog], a1, title1]]], [b, [[[cat, dog], b1, title2]]]] |[title1, title2]|
|[[a, [[[], a1, title1]]], [c, [[[cat, rabbit], c1, title3]]], [d, [[[cat], d1, title4], [[kitten], d2, title5]]]]|[] |
+-----------------------------------------------------------------------------------------------------------------+----------------+
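If you want the single-string final_columns from the question rather than an array, you can join the array elements afterwards; a small sketch on top of dfGrouped (the column name final_columns is taken from the question):

// concat_ws joins the titles with commas; an empty array yields "",
// which matches the asker's "return ''" case.
val dfFinal = dfGrouped.withColumn("final_columns", concat_ws(",", $"titles"))
dfFinal.select("target_column", "final_columns").show(false)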