Apache Spark: Explode a nested struct in PySpark and use the struct names as a column


I am reading a JSON file in PySpark in which the name of the struct at the third nesting level varies from row to row.

The schema looks something like this:

 |-- A: string (nullable = true)
 |-- Plugins: struct (nullable = true)
 |    |-- RfS: struct (nullable = true)
 |    |    |-- A: string (nullable = true)
 |    |    |-- B: string (nullable = true)
 |    |-- RtW: struct (nullable = true)
 |    |    |-- A: string (nullable = true)
 |    |    |-- B: string (nullable = true)

which I want to convert to a DataFrame with the following schema:

 |-- A: string (nullable = true)
 |-- Plugins: string (nullable = true)
 |-- A: string (nullable = true)
 |-- B: string (nullable = true)

The Plugins column will contain the struct name (RfS, RtW, etc.) as its value.
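
To make the target shape concrete, here is a minimal plain-Python sketch of the same reshaping on a single parsed JSON record, without Spark. The helper name `plugin_rows` and the inline `record` are illustrative assumptions, and the sketch assumes every struct under Plugins has exactly the fields A and B.

```python
import json

# Illustrative record shaped like the schema above (an assumption for this sketch).
record = json.loads("""{
  "A": "some A",
  "Plugins": {
    "RfS": {"A": "RfSA", "B": "RfSB"},
    "RtW": {"A": "RtWA", "B": "RtWA"}
  }
}""")


def plugin_rows(rec):
    """Return one (A, Plugins, A, B) tuple per struct found under "Plugins".

    The struct name (RfS, RtW, ...) becomes the value of the Plugins column.
    """
    return [
        (rec["A"], name, fields["A"], fields["B"])
        for name, fields in rec["Plugins"].items()
    ]


rows = plugin_rows(record)
for row in rows:
    print(row)
```

Tuples are used rather than dicts because the target layout has two columns named A.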

I read the data and remove the first level of nesting:

jsonData = """{
  "A" : "some A",
  "Plugins": {
    "RfS": {
      "A" : "RfSA",
      "B" : "RfSB"
    },
    "RtW" : {
      "A" : "RtWA",
      "B" : "RtWA"
    }
  }
}"""

# `spark` and `sc` are the session and context provided by the pyspark shell
df = spark.read.json(sc.parallelize([jsonData]))
no_plug_cols = ["A"]  # columns that are not inside Plugins, i.e. A
plug_df = df.select("A", "Plugins.*")

# plug_df.printSchema()
# root
#  |-- A: string (nullable = true)
#  |-- RfS: struct (nullable = true)
#  |    |-- A: string (nullable = true)
#  |    |-- B: string (nullable = true)
#  |-- RtW: struct (nullable = true)
#  |    |-- A: string (nullable = true)
#  |    |-- B: string (nullable = true)
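Based on one of the answers here, I found it easy to get the following:

from itertools import chain
from pyspark.sql.functions import col

plug_cols = [c for c in plug_df.columns if c not in no_plug_cols]  # keeps schema order, unlike a set difference
icols = [(col(f"{c}.A").alias(f"{c}.A"), col(f"{c}.B").alias(f"{c}.B")) for c in plug_cols]

# chain flattens icols, which is a list of tuples
plug_df.select(no_plug_cols + list(chain(*icols))).show()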

# +------+-----+-----+-----+-----+
# |     A|RfS.A|RfS.B|RtW.A|RtW.B|
# +------+-----+-----+-----+-----+
# |some A| RfSA| RfSB| RtWA| RtWA|
# +------+-----+-----+-----+-----+

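Instead of the output above, is there a way to get RfS/RtW as column values under a column with the desired name, so that the output looks like the following? Or is this only possible by first converting to the format above and then pivoting the data?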

# +------+-------+-----+-----+
# |     A|Plugins|    A|    B|
# +------+-------+-----+-----+
# |some A|    RfS| RfSA| RfSB|
# |some A|    RtW| RtWA| RtWA|
# +------+-------+-----+-----+