Arrays: How to prevent PySpark from duplicating data with explode()?


Using the JSON below, I'm trying to create a table that shows one row per customer with their unique price offers:

[{
        "Data": [{
                "Customer": [{
                        "Prices": {
                            "USD": [[86, "2.18"], [172, "1.67"], [344, "1.52"]]
                        },
                        "Seller": {
                            "Name": "Customer1"
                        }
                    }, {
                        "Prices": {
                            "USD": [[1, "1.99"], [100, "1.55"], [500, "1.24"]]
                        },
                        "Seller": {
                            "Name": "Customer2"
                        }
                    }
                ]
            }
        ],
        "PartNumber": "ABC"
    }
]
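
For reference, a multi-line JSON file like the one above has to be read with Spark's multiLine option, since by default spark.read.json expects one record per line. A minimal loading sketch; the file name prices.json is an assumption, not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine lets Spark parse JSON records that span several lines;
# "prices.json" is a placeholder path
dfJsonFile = spark.read.option("multiLine", "true").json("prices.json")
dfJsonFile.printSchema()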
I need to parse out the customers, but when I explode Customer I get duplicated (incorrect) results. What am I doing wrong? The result I'm actually trying to return is the deduplicated table shown further down. Here is my code and its duplicated output:

df5 = df4.withColumn("Customer", explode("Name"))

df5.select("Customer", "PartNumber", "Quantity", "Price").show()
+---------+----------+--------+-----+
| Customer|PartNumber|Quantity|Price|
+---------+----------+--------+-----+
|Customer1|       ABC|      86| 2.18|
|Customer2|       ABC|      86| 2.18|
|Customer1|       ABC|     172| 1.67|
|Customer2|       ABC|     172| 1.67|
|Customer1|       ABC|     344| 1.52|
|Customer2|       ABC|     344| 1.52|
|Customer1|       ABC|       1| 1.99|
|Customer2|       ABC|       1| 1.99|
|Customer1|       ABC|     100| 1.55|
|Customer2|       ABC|     100| 1.55|
|Customer1|       ABC|     500| 1.24|
|Customer2|       ABC|     500| 1.24|
+---------+----------+--------+-----+
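
The pattern in that output (every quantity/price pair repeated for each customer) is the usual sign of two independent explode calls on the same row: each explode multiplies the row count, so the arrays end up crossed. A minimal sketch on toy data, unrelated to the question's JSON, illustrating the effect:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# One row holding two unrelated arrays
demo = spark.createDataFrame([(["a", "b"], [1, 2, 3])], ["letters", "numbers"])

# Each explode multiplies the rows, so chaining two of them
# yields the full 2 x 3 Cartesian product
demo.withColumn("letter", explode("letters")) \
    .withColumn("number", explode("numbers")) \
    .select("letter", "number") \
    .show()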

The problem is that you are adding columns with explode. Instead, first select the columns you don't want duplicated, and then explode the ones you do, which gives you this:

Customer    Quantity    Price
Customer1   86          2.18
Customer1   172         1.67
Customer1   344         1.52
Customer2   1           1.99
Customer2   100         1.55
Customer2   500         1.24

I'm sure you know how to finish this off :-)
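
For completeness, a sketch of one way to follow that advice, assuming the JSON was loaded into dfJsonFile as elsewhere in this thread (the variable names customers and offers are illustrative):

from pyspark.sql.functions import col, explode

# Unwrap one level at a time so each price row stays attached to the
# customer struct it came from; no cross product can occur this way
customers = (dfJsonFile
    .select("PartNumber", explode("Data").alias("d"))
    .select("PartNumber", explode("d.Customer").alias("c")))

offers = customers.select(
    col("c.Seller.Name").alias("Customer"),
    col("PartNumber"),
    explode("c.Prices.USD").alias("offer"))

From here, offer[0] and offer[1] hold the quantity and price for each row.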


df1 = dfJsonFile.withColumn("Customer", explode("Data.Customer"))

will give you trouble, because it assigns every price to both customers. How do I get the prices without exploding Customer? The comment above is incorrect; it is the subsequent code statements that cause the duplication. The following sequence produces no duplicates:
from pyspark.sql.functions import col, explode

df1 = dfJsonFile.withColumn("Customer", explode("Data.Customer"))  # unwrap the Data array: one row per customer list
df2 = df1.select(explode("Customer")).select("col.*")              # one row per customer struct
df3 = df2.select(col("Seller.Name").alias("name"), explode("Prices.USD"))  # one row per [quantity, price] pair
df3.show()
+---------+-----------+
|     name|        col|
+---------+-----------+
|Customer1| [86, 2.18]|
|Customer1|[172, 1.67]|
|Customer1|[344, 1.52]|
|Customer2|  [1, 1.99]|
|Customer2|[100, 1.55]|
|Customer2|[500, 1.24]|
+---------+-----------+
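
To get from df3 to the desired table, the only remaining step is to split the two-element array into its own columns. A small sketch; note that Spark infers the inner arrays as strings because the JSON mixes bare integers with quoted decimals, hence the casts:

from pyspark.sql.functions import col

df4 = df3.select(
    col("name").alias("Customer"),
    col("col").getItem(0).cast("int").alias("Quantity"),
    col("col").getItem(1).cast("double").alias("Price"))
df4.show()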