
How to read multi-nested JSON data in Spark

Tags: apache-spark, apache-spark-sql

How do I read multi-nested JSON data in Spark? I have a JSON file.

I need to extract this schema into flat line items with the following columns:

trialTherapeuticAreas_ID,trialTherapeuticAreas_name,trialDiseases_id,trialDiseases_name,trialPatientSegments_id,trialPatientSegments_name
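
For reference, here is a minimal sketch of an input that matches the nesting assumed by the answer below; the record and its values are invented for illustration, and it presumes a spark-shell session so that spark and its implicits are in scope:

// Hypothetical one-record input; values are made up for illustration.
import spark.implicits._

val sample = """{"trialTherapeuticAreas":[{"id":"TA1","name":"Oncology","trialDiseases":[{"id":"D1","name":"NSCLC","trialPatientSegments":[{"id":"PS1","name":"First line"}]}]}]}"""

// spark.read.json on a Dataset[String] parses each element as one JSON document
val df = spark.read.json(Seq(sample).toDS)
df.printSchema()
// Schema (abridged): trialTherapeuticAreas: array<struct<id, name,
//   trialDiseases: array<struct<id, name,
//     trialPatientSegments: array<struct<id, name>>>>>>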

You need to explode the arrays one nesting level at a time and select the struct elements into separate columns. For that, use the explode built-in function together with the select API and aliases.

Code to try:

import org.apache.spark.sql.functions._

// Explode one nesting level per select: each explode produces one row per
// array element, and the next select flattens that struct's fields while
// exploding the next nested array.
val finalDF = df
  .withColumn("trialTherapeuticAreas", explode(col("trialTherapeuticAreas")))
  .select(
    col("trialTherapeuticAreas.id").as("trialTherapeuticAreas_ID"),
    col("trialTherapeuticAreas.name").as("trialTherapeuticAreas_name"),
    explode(col("trialTherapeuticAreas.trialDiseases")).as("trialDiseases"))
  .select(
    col("trialTherapeuticAreas_ID"), col("trialTherapeuticAreas_name"),
    col("trialDiseases.id").as("trialDiseases_id"),
    col("trialDiseases.name").as("trialDiseases_name"),
    explode(col("trialDiseases.trialPatientSegments")).as("trialPatientSegments"))
  .select(
    col("trialTherapeuticAreas_ID"), col("trialTherapeuticAreas_name"),
    col("trialDiseases_id"), col("trialDiseases_name"),
    col("trialPatientSegments.id").as("trialPatientSegments_id"),
    col("trialPatientSegments.name").as("trialPatientSegments_name"))
That should give you the required output.
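
With the hypothetical record sketched above, the pipeline yields one flat row per patient segment:

finalDF.show(false)
// Expected output for the made-up sample (one segment, hence one row):
// trialTherapeuticAreas_ID=TA1, trialTherapeuticAreas_name=Oncology,
// trialDiseases_id=D1, trialDiseases_name=NSCLC,
// trialPatientSegments_id=PS1, trialPatientSegments_name=First line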

You can also perform the same transformation with three withColumn calls followed by a single select statement:

import org.apache.spark.sql.functions._

// Explode all three nesting levels first, then flatten in one select.
val finalDF = df
  .withColumn("trialTherapeuticAreas", explode(col("trialTherapeuticAreas")))
  .withColumn("trialDiseases", explode(col("trialTherapeuticAreas.trialDiseases")))
  .withColumn("trialPatientSegments", explode(col("trialDiseases.trialPatientSegments")))
  .select(
    col("trialTherapeuticAreas.id").as("trialTherapeuticAreas_ID"),
    col("trialTherapeuticAreas.name").as("trialTherapeuticAreas_name"),
    col("trialDiseases.id").as("trialDiseases_id"),
    col("trialDiseases.name").as("trialDiseases_name"),
    col("trialPatientSegments.id").as("trialPatientSegments_id"),
    col("trialPatientSegments.name").as("trialPatientSegments_name"))
Chaining withColumn like this is not recommended for large datasets, however, as it can produce unpredictable output: withColumn operations are distributed, and they are not guaranteed to execute in serial order. Either way, this should be a good starting point.
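
If you prefer SQL, the same flattening can be expressed with LATERAL VIEW explode, which states the generator order explicitly. A minimal sketch, assuming the DataFrame is registered under the illustrative view name trials:

// Register the DataFrame so it can be queried with Spark SQL
df.createOrReplaceTempView("trials")

// Each LATERAL VIEW explode unnests one array level and exposes the
// resulting struct under an alias (ta, d, ps) for the SELECT list.
val sqlDF = spark.sql("""
  SELECT ta.id   AS trialTherapeuticAreas_ID,
         ta.name AS trialTherapeuticAreas_name,
         d.id    AS trialDiseases_id,
         d.name  AS trialDiseases_name,
         ps.id   AS trialPatientSegments_id,
         ps.name AS trialPatientSegments_name
  FROM trials
  LATERAL VIEW explode(trialTherapeuticAreas) areas AS ta
  LATERAL VIEW explode(ta.trialDiseases) diseases AS d
  LATERAL VIEW explode(d.trialPatientSegments) segments AS ps
""")
sqlDF.show(false)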