Apache Spark: exploding array values with PySpark
I am new to PySpark, and I need to explode my array of values so that each value is assigned to a new column. I tried using explode but could not get the desired output. Below is my output:
+---------------+----------+------------------+----------+---------+------------+--------------------+
|account_balance|account_id|credit_Card_Number|first_name|last_name|phone_number| transactions|
+---------------+----------+------------------+----------+---------+------------+--------------------+
| 100000| 12345| 12345| abc| xyz| 1234567890|[1000, 01/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[1100, 02/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[6146, 02/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[253, 03/06/2020,...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[4521, 04/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[955, 05/06/2020,...|
+---------------+----------+------------------+----------+---------+------------+--------------------+
Below is the schema of the DataFrame:
root
|-- account_balance: long (nullable = true)
|-- account_id: long (nullable = true)
|-- credit_Card_Number: long (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- phone_number: long (nullable = true)
|-- transactions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- amount: long (nullable = true)
| | |-- date: string (nullable = true)
| | |-- shop: string (nullable = true)
| | |-- transaction_code: string (nullable = true)
I want an output with additional columns for amount, date, shop, and transaction_code, each holding its respective value:
amount date shop transaction_code
1000 01/06/2020 amazon buy
1100 02/06/2020 amazon sell
6146 02/06/2020 ebay buy
253 03/06/2020 ebay buy
4521 04/06/2020 amazon buy
955 05/06/2020 amazon buy
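Conceptually, the reshaping asked for here is a flatten: each element of the transactions array becomes its own row, with the struct fields promoted to top-level columns. A plain-Python sketch of that idea (illustration only, not Spark code; the sample dict is invented):

```python
# Plain-Python illustration of the desired reshaping (not Spark code):
# each dict in the "transactions" list becomes its own flat row,
# carrying along the non-array columns of the parent row.
row = {
    "account_id": 12345,
    "transactions": [
        {"amount": 1000, "date": "01/06/2020", "shop": "amazon", "transaction_code": "buy"},
        {"amount": 1100, "date": "02/06/2020", "shop": "amazon", "transaction_code": "sell"},
    ],
}

flat_rows = [
    # merge the parent columns (minus the array) with one transaction's fields
    {**{k: v for k, v in row.items() if k != "transactions"}, **txn}
    for txn in row["transactions"]
]
```

This is exactly what explode plus struct expansion does in Spark, just expressed on one in-memory row.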
Use explode, then expand the struct fields, and finally drop both the newly exploded column and the transactions array column.
Example:
from pyspark.sql.functions import *
#schema trimmed to only some of the json columns for brevity
df.printSchema()
#root
# |-- account_balance: long (nullable = true)
# |-- transactions: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- amount: long (nullable = true)
# | | |-- date: string (nullable = true)
df.selectExpr("*", "explode(transactions)") \
  .select("*", "col.*") \
  .drop(*["col", "transactions"]) \
  .show()
#+---------------+------+--------+
#|account_balance|amount| date|
#+---------------+------+--------+
#| 10| 1000|20200202|
#+---------------+------+--------+
Great solution! Thank you very much!