Select key columns from the data as null if they do not exist in PySpark

My dataframe df has the following structure:
root
|-- val1: string (nullable = true)
|-- val2: string (nullable = true)
|-- val3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _type: string (nullable = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I have two sample records as follows:

+------+------+-----------------------------------+
| val1 | val2 |               val3                |
+------+------+-----------------------------------+
| A    | a    | {k1: A1, k2: A2, k3: A3}          |
+------+------+-----------------------------------+
| B    | b    | {k3: B3}                          |
+------+------+-----------------------------------+
I am trying to select the following data from it:
df.select("val1", "val2", "val3.k1", "val3.k2", "val3.k3")
I want my output to look like:
+------+------+------+------+------+
| val1 | val2 | k1   | k2   | k3   |
+------+------+------+------+------+
| A    | a    | A1   | A2   | A3   |
+------+------+------+------+------+
| B    | b    | NULL | NULL | B3   |
+------+------+------+------+------+
But since keys k1 and k2 are not present in all records, the select statement throws an error. How can I resolve this? I am fairly new to pyspark.

I think you can use
df.selectExpr('val3.*')
Let me know if this works.

Comments:

Could you show what kind of transformations were applied to this dataframe before it reached this state? val3 being an array of structs does not seem right, and I was unable to reproduce the same schema. Generally, looking at your data, val3 should be a map or a struct.

Sorry, val3 is an array and the data looks like this:

+------+------+--------------------------------+
| val1 | val2 | val3                           |
+------+------+--------------------------------+
| A    | a    | [[k1: A1], [k2: A2], [k3: A3]] |
+------+------+--------------------------------+
| B    | b    | [[k3: B3]]                     |
+------+------+--------------------------------+

I basically want to explode the array into columns in pyspark.