Apache Spark returns an array of nulls when selecting sub-entities
I have an entity:
{
  "id": "123",
  "col_1": null,
  "sub_entities": [
    { "sub_entity_id": "s-1", "col_2": null },
    { "sub_entity_id": "s-2", "col_2": null }
  ]
}
I load it into Spark: val entities = spark.read.json("...")
entities.filter(size($"sub_entities.col_2") === 0)
returns nothing. This behavior seems strange, because every col_2 is null, yet the null values are still counted.
I then tried selecting col_2 and noticed that it returns an array of nulls (2 nulls in this case).
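A minimal, Spark-free sketch of why the filter matches nothing: size counts null elements too, so the array of two nulls has size 2, not 0.

```scala
// Stand-in for the values produced by selecting sub_entities.col_2:
// one entry per array element, both null in the example entity.
val col2: Seq[String] = Seq(null, null)

// Spark's size() counts elements regardless of nullness, so the
// size(...) === 0 predicate is false here even though every element is null.
println(col2.size)              // 2
println(col2.size == 0)         // false
println(col2.forall(_ == null)) // true
```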
How can I write a query that returns only the objects in the array whose col_2 is not null?

To query the objects inside the array, we first need to flatten the array with the explode function and then query the DataFrame. Example:
// requires: import spark.implicits._
val df = spark.read.json(Seq("""{"id": "123","col_1": null,"sub_entities": [ { "sub_entity_id": "s-1", "col_2": null }, { "sub_entity_id": "s-2", "col_2": null }]}""").toDS)
df.selectExpr("explode(sub_entities)", "*").select("col.*", "id", "col_1").show()
//+-----+-------------+---+-----+
//|col_2|sub_entity_id| id|col_1|
//+-----+-------------+---+-----+
//| null| s-1|123| null|
//| null| s-2|123| null|
//+-----+-------------+---+-----+
// requires: import org.apache.spark.sql.functions.col
df.selectExpr("explode(sub_entities)", "*").select("col.*", "id", "col_1").filter(col("col_2").isNull).show()
//+-----+-------------+---+-----+
//|col_2|sub_entity_id| id|col_1|
//+-----+-------------+---+-----+
//| null| s-1|123| null|
//| null| s-2|123| null|
//+-----+-------------+---+-----+
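Since the question asks for the objects whose col_2 is *not* null, the same exploded pipeline can end with .filter(col("col_2").isNotNull) instead. The predicate is an ordinary non-null check, sketched here in plain Scala over a hypothetical SubEntity stand-in for the exploded rows (SubEntity and nonNullCol2 are illustration names, not Spark APIs):

```scala
// Hypothetical stand-in for one exploded sub_entities row (illustration only).
case class SubEntity(subEntityId: String, col2: String)

// Keep only rows with a non-null col2 -- the same check that
// col("col_2").isNotNull performs inside Spark.
def nonNullCol2(rows: Seq[SubEntity]): Seq[SubEntity] =
  rows.filter(_.col2 != null)

val rows = Seq(SubEntity("s-1", null), SubEntity("s-2", null))
println(nonNullCol2(rows)) // List() -- both col_2 values are null
```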
If, as you described, you need a different output when running df.select($"col_1", "sub_entities").show (filtering only the col_2 values out of the array), I can update the answer:
val json =
  """
  {
    "id": "123",
    "col_1": null,
    "sub_entities": [
      { "sub_entity_id": "s-1", "col_2": null },
      { "sub_entity_id": "s-2", "col_2": null }
    ]
  }
  """
// requires: import spark.implicits._ and org.apache.spark.sql.functions.udf
val df = spark.read.json(Seq(json).toDS)
val removeNulls = udf((arr: Seq[String]) => arr.filter((x: String) => x != null))
df.select($"col_1", removeNulls($"sub_entities.col_2").as("sub_entities.col_2")).show(false)
+-----+------------------+
|col_1|sub_entities.col_2|
+-----+------------------+
|null |[] |
+-----+------------------+
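The removeNulls UDF body is ordinary Scala: a Seq.filter that drops null elements. A Spark-free sketch of the per-row behavior:

```scala
// Same function body as the removeNulls UDF, without the udf() wrapper.
val removeNulls = (arr: Seq[String]) => arr.filter(_ != null)

println(removeNulls(Seq(null, null)))   // List() -- the empty [] shown above
println(removeNulls(Seq(null, "kept"))) // List(kept)
```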
What does filtering out null values mean in this context? Do you want the array above to come back empty, or do you want sub_entities to return only the objects in the array whose col_2 is not null? Please write out the expected output.
@DusanVasiljevic I edited the question; I want a query that returns the objects in the array whose col_2 is not null.