Arrays: How to prevent PySpark from duplicating data with explode()?


Using the JSON below, I'm trying to create a table that shows one row per customer with their unique price offers:

[{
        "Data": [{
                "Customer": [{
                        "Prices": {
                            "USD": [[86, "2.18"], [172, "1.67"], [344, "1.52"]]
                        },
                        "Seller": {
                            "Name": "Customer1"
                        }
                    }, {
                        "Prices": {
                            "USD": [[1, "1.99"], [100, "1.55"], [500, "1.24"]]
                        },
                        "Seller": {
                            "Name": "Customer2"
                        }
                    }
                ]
            }
        ],
        "PartNumber": "ABC"
    }
]
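
For reference, a multi-line JSON file like the one above has to be read with Spark's multiLine option, since by default spark.read.json expects one record per line. A minimal loading sketch; the file name prices.json is an assumption, not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine lets Spark parse JSON records that span several lines;
# "prices.json" is a placeholder path
dfJsonFile = spark.read.option("multiLine", "true").json("prices.json")
dfJsonFile.printSchema()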
I need to parse out the customers, but when I explode Customer I get duplicated (incorrect) results. What am I doing wrong? The result I'm actually trying to return is the deduplicated table shown further down. Here is my code and its duplicated output:

df5 = df4.withColumn("Customer", explode("Name"))

df5.select("Customer", "PartNumber", "Quantity", "Price").show()
+---------+----------+--------+-----+
| Customer|PartNumber|Quantity|Price|
+---------+----------+--------+-----+
|Customer1|       ABC|      86| 2.18|
|Customer2|       ABC|      86| 2.18|
|Customer1|       ABC|     172| 1.67|
|Customer2|       ABC|     172| 1.67|
|Customer1|       ABC|     344| 1.52|
|Customer2|       ABC|     344| 1.52|
|Customer1|       ABC|       1| 1.99|
|Customer2|       ABC|       1| 1.99|
|Customer1|       ABC|     100| 1.55|
|Customer2|       ABC|     100| 1.55|
|Customer1|       ABC|     500| 1.24|
|Customer2|       ABC|     500| 1.24|
+---------+----------+--------+-----+
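
The pattern in that output (every quantity/price pair repeated for each customer) is the usual sign of two independent explode calls on the same row: each explode multiplies the row count, so the arrays end up crossed. A minimal sketch on toy data, unrelated to the question's JSON, illustrating the effect:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# One row holding two unrelated arrays
demo = spark.createDataFrame([(["a", "b"], [1, 2, 3])], ["letters", "numbers"])

# Each explode multiplies the rows, so chaining two of them
# yields the full 2 x 3 Cartesian product
demo.withColumn("letter", explode("letters")) \
    .withColumn("number", explode("numbers")) \
    .select("letter", "number") \
    .show()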

The problem is that you are adding columns with explode. Instead, first select the columns you don't want duplicated, and then explode the ones you do, which gives you this:

Customer    Quantity    Price
Customer1   86          2.18
Customer1   172         1.67
Customer1   344         1.52
Customer2   1           1.99
Customer2   100         1.55
Customer2   500         1.24

I'm sure you know how to finish this off :-)
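
For completeness, a sketch of one way to follow that advice, assuming the JSON was loaded into dfJsonFile as elsewhere in this thread (the variable names customers and offers are illustrative):

from pyspark.sql.functions import col, explode

# Unwrap one level at a time so each price row stays attached to the
# customer struct it came from; no cross product can occur this way
customers = (dfJsonFile
    .select("PartNumber", explode("Data").alias("d"))
    .select("PartNumber", explode("d.Customer").alias("c")))

offers = customers.select(
    col("c.Seller.Name").alias("Customer"),
    col("PartNumber"),
    explode("c.Prices.USD").alias("offer"))

From here, offer[0] and offer[1] hold the quantity and price for each row.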


df1 = dfJsonFile.withColumn("Customer", explode("Data.Customer"))

will give you trouble, because it assigns every price to both customers. How do I get the prices without exploding Customer? The comment above is incorrect; it is the subsequent code statements that cause the duplication. The following sequence produces no duplicates:
from pyspark.sql.functions import col, explode

df1 = dfJsonFile.withColumn("Customer", explode("Data.Customer"))  # unwrap the Data array: one row per customer list
df2 = df1.select(explode("Customer")).select("col.*")              # one row per customer struct
df3 = df2.select(col("Seller.Name").alias("name"), explode("Prices.USD"))  # one row per [quantity, price] pair
df3.show()
+---------+-----------+
|     name|        col|
+---------+-----------+
|Customer1| [86, 2.18]|
|Customer1|[172, 1.67]|
|Customer1|[344, 1.52]|
|Customer2|  [1, 1.99]|
|Customer2|[100, 1.55]|
|Customer2|[500, 1.24]|
+---------+-----------+
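
To get from df3 to the desired table, the only remaining step is to split the two-element array into its own columns. A small sketch; note that Spark infers the inner arrays as strings because the JSON mixes bare integers with quoted decimals, hence the casts:

from pyspark.sql.functions import col

df4 = df3.select(
    col("name").alias("Customer"),
    col("col").getItem(0).cast("int").alias("Quantity"),
    col("col").getItem(1).cast("double").alias("Price"))
df4.show()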