
Apache Spark: Spark returns an array of nulls when selecting sub-entities


I have an entity:

{
  "id": "123",
  "col_1": null,
  "sub_entities": [
    { "sub_entity_id": "s-1", "col_2": null },
    { "sub_entity_id": "s-2", "col_2": null }
  ]
}
I load it into Spark:

val entities = spark.read.json("...")

entities.filter(size($"sub_entities.col_2") === 0)

returns nothing. This behavior looks odd, because every col_2 is null, yet the null values are still counted by size.

I then tried selecting col_2 and noticed that it returns an array of nulls (2 nulls in this example).


How do I write a query that returns only the objects in the array whose col_2 is not null?

To query objects inside an array, we first need to flatten the array with the explode function, then query the resulting DataFrame.

Example:

import spark.implicits._
import org.apache.spark.sql.functions.col

val df = spark.read.json(Seq("""{"id": "123","col_1": null,"sub_entities": [  { "sub_entity_id": "s-1", "col_2": null },  { "sub_entity_id": "s-2", "col_2": null }]}""").toDS)

df.selectExpr("explode(sub_entities)","*").select("col.*","id","col_1").show()

//+-----+-------------+---+-----+
//|col_2|sub_entity_id| id|col_1|
//+-----+-------------+---+-----+
//| null|          s-1|123| null|
//| null|          s-2|123| null|
//+-----+-------------+---+-----+

df.selectExpr("explode(sub_entities)","*").select("col.*","id","col_1").filter(col("col_2").isNull).show()

//+-----+-------------+---+-----+
//|col_2|sub_entity_id| id|col_1|
//+-----+-------------+---+-----+
//| null|          s-1|123| null|
//| null|          s-2|123| null|
//+-----+-------------+---+-----+
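Since the question asks for the objects whose col_2 is *not* null, the same explode approach can be extended: flip the predicate to isNotNull and rebuild the array per id with collect_list. A sketch, assuming the df defined above (note that ids whose entries are all filtered away disappear from the result entirely):

```scala
// Sketch: keep only sub-entities whose col_2 is not null,
// then regroup the survivors into an array per id.
import org.apache.spark.sql.functions.{col, collect_list, first}

df.selectExpr("id", "col_1", "explode(sub_entities) as sub")
  .filter(col("sub.col_2").isNotNull)
  .groupBy("id")
  .agg(first("col_1").as("col_1"),
       collect_list("sub").as("sub_entities"))
  .show(false)
// With the sample data every col_2 is null, so no rows survive the filter.
```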


As you mentioned, you need a different output, filtering the nulls out of just the col_2 array when doing df.select($"col_1", $"sub_entities").show, so I can update the answer:

val json =  
"""
{
    "id": "123",
    "col_1": null,
    "sub_entities": [
        { "sub_entity_id": "s-1", "col_2": null },
        { "sub_entity_id": "s-2", "col_2": null }
    ]
}
"""
val df = spark.read.json(Seq(json).toDS)

import org.apache.spark.sql.functions.udf
val removeNulls = udf((arr: Seq[String]) => arr.filter((x: String) => x != null))
df.select($"col_1", removeNulls($"sub_entities.col_2").as("sub_entities.col_2")).show(false)

+-----+------------------+
|col_1|sub_entities.col_2|
+-----+------------------+
|null |[]                |
+-----+------------------+
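The UDF can also be avoided on Spark 2.4+ by using the built-in filter higher-order function, which drops array entries in place and keeps the surviving structs whole rather than just their col_2 values. A sketch, assuming the same df:

```scala
// Sketch (assumes Spark 2.4+): filter(array, lambda) removes array
// entries for which the lambda is false, without a UDF.
import org.apache.spark.sql.functions.expr

df.select(
    $"id",
    $"col_1",
    expr("filter(sub_entities, x -> x.col_2 is not null)").as("sub_entities")
  )
  .show(false)
// With the sample data both col_2 values are null, so sub_entities becomes [].
```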


What does filtering out the null values mean in this context? Do you want the array above to come back empty? Or do you want sub_entities to return only the objects in the array whose col_2 is not null? Please write down the expected output. @DusanVasiljevic I edited the question; I want a query that returns the objects in the array whose col_2 is not null.
不为null的对象。在这个上下文中过滤掉
null
值意味着什么?是否要将上面的数组返回为空?是否希望
子实体
仅返回
列2不为null的数组中的对象?请写下预期的输出。@DusanVasiljevic编辑了这个问题,我想要一个查询,从
col_2
不为空的数组返回对象。