How to prevent PySpark from duplicating data when using explode()?
Tags: arrays, scala, apache-spark, pyspark, jupyter-notebook
Using the JSON below, I am trying to create a table that shows a row for each customer together with that customer's own price breaks.
[{
    "Data": [{
        "Customer": [{
            "Prices": {
                "USD": [[86, "2.18"], [172, "1.67"], [344, "1.52"]]
            },
            "Seller": {
                "Name": "Customer1"
            }
        }, {
            "Prices": {
                "USD": [[1, "1.99"], [100, "1.55"], [500, "1.24"]]
            },
            "Seller": {
                "Name": "Customer2"
            }
        }]
    }],
    "PartNumber": "ABC"
}]
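I need to parse out the customers, but when I explode the customer names I get duplicated (incorrect) results: every price break is paired with both customers. What am I doing wrong? The table I actually want follows the incorrect output below.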
df5 = df4.withColumn("Customer", explode("Name"))
df5.select("Customer", "PartNumber", "Quantity", "Price").show()
+---------+----------+--------+-----+
| Customer|PartNumber|Quantity|Price|
+---------+----------+--------+-----+
|Customer1|       ABC|      86| 2.18|
|Customer2|       ABC|      86| 2.18|
|Customer1|       ABC|     172| 1.67|
|Customer2|       ABC|     172| 1.67|
|Customer1|       ABC|     344| 1.52|
|Customer2|       ABC|     344| 1.52|
|Customer1|       ABC|       1| 1.99|
|Customer2|       ABC|       1| 1.99|
|Customer1|       ABC|     100| 1.55|
|Customer2|       ABC|     100| 1.55|
|Customer1|       ABC|     500| 1.24|
|Customer2|       ABC|     500| 1.24|
+---------+----------+--------+-----+
Here is the result I am trying to return:
Customer   Quantity  Price
Customer1  86        2.18
Customer1  172       1.67
Customer1  344       1.52
Customer2  1         1.99
Customer2  100       1.55
Customer2  500       1.24
df1 = dfJsonFile.withColumn("Customer", explode("Data.Customer")) will cause your problem, because it assigns every price to both customers. How do you get the prices without exploding the customers?
The comment above is incorrect. It is the subsequent statements that cause the duplication.
The problem is that you are adding columns with explode. What you want is to select the columns you do not want duplicated and, in the same select, explode the ones you do, like this (I am sure you know how to finish it from there :-)):
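from pyspark.sql.functions import col, explode
df1 = dfJsonFile.withColumn("Customer", explode("Data.Customer"))
df2 = df1.select(explode("Customer")).select("col.*")
# explode each customer's own price list next to that customer's name
df3 = df2.select(col("Seller.Name").alias("name"), explode("Prices.USD"))
df3.show()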
+---------+-----------+
|     name|        col|
+---------+-----------+
|Customer1| [86, 2.18]|
|Customer1|[172, 1.67]|
|Customer1|[344, 1.52]|
|Customer2| [1, 1.99]|
|Customer2|[100, 1.55]|
|Customer2|[500, 1.24]|
+---------+-----------+