Apache Spark: exploding array values with PySpark
I am new to PySpark, and I need to explode my array of values so that each value is assigned to a new column. I tried using explode but could not get the desired output. Below is my output:
+---------------+----------+------------------+----------+---------+------------+--------------------+
|account_balance|account_id|credit_Card_Number|first_name|last_name|phone_number| transactions|
+---------------+----------+------------------+----------+---------+------------+--------------------+
| 100000| 12345| 12345| abc| xyz| 1234567890|[1000, 01/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[1100, 02/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[6146, 02/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[253, 03/06/2020,...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[4521, 04/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[955, 05/06/2020,...|
+---------------+----------+------------------+----------+---------+------------+--------------------+
Below is the schema of the DataFrame:
root
|-- account_balance: long (nullable = true)
|-- account_id: long (nullable = true)
|-- credit_Card_Number: long (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- phone_number: long (nullable = true)
|-- transactions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- amount: long (nullable = true)
| | |-- date: string (nullable = true)
| | |-- shop: string (nullable = true)
| | |-- transaction_code: string (nullable = true)
I want an output with additional columns for amount, date, shop, and transaction_code, each holding its respective value:
amount date shop transaction_code
1000 01/06/2020 amazon buy
1100 02/06/2020 amazon sell
6146 02/06/2020 ebay buy
253 03/06/2020 ebay buy
4521 04/06/2020 amazon buy
955 05/06/2020 amazon buy
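Conceptually, the reshaping asked for here is a flatten: each element of the transactions array becomes its own row, with the struct fields promoted to top-level columns. A plain-Python sketch of that idea (illustration only, not Spark code; the sample dict is invented):

```python
# Plain-Python illustration of the desired reshaping (not Spark code):
# each dict in the "transactions" list becomes its own flat row,
# carrying along the non-array columns of the parent row.
row = {
    "account_id": 12345,
    "transactions": [
        {"amount": 1000, "date": "01/06/2020", "shop": "amazon", "transaction_code": "buy"},
        {"amount": 1100, "date": "02/06/2020", "shop": "amazon", "transaction_code": "sell"},
    ],
}

flat_rows = [
    # merge the parent columns (minus the array) with one transaction's fields
    {**{k: v for k, v in row.items() if k != "transactions"}, **txn}
    for txn in row["transactions"]
]
```

This is exactly what explode plus struct expansion does in Spark, just expressed on one in-memory row.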
Use explode, then expand the struct fields, and finally drop both the newly exploded column and the transactions array column.
Example:
from pyspark.sql.functions import *
#schema trimmed to only some of the json columns for brevity
df.printSchema()
#root
# |-- account_balance: long (nullable = true)
# |-- transactions: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- amount: long (nullable = true)
# | | |-- date: string (nullable = true)
df.selectExpr("*", "explode(transactions)") \
  .select("*", "col.*") \
  .drop(*["col", "transactions"]) \
  .show()
#+---------------+------+--------+
#|account_balance|amount| date|
#+---------------+------+--------+
#| 10| 1000|20200202|
#+---------------+------+--------+
Great solution! Thank you very much!