Python 3.x pyspark dataframes: why can I select some nested fields but not others?


I'm trying to write some code in Python 3.9.1 that uses pyspark (3.0.1) to unnest JSON into a dataframe.

I have some dummy data with a schema as follows:

data.printSchema()
root
 |-- recordID: string (nullable = true)
 |-- customerDetails: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- dob: string (nullable = true)
 |-- familyMembers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- relationship: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- contactNumbers: struct (nullable = true)
 |    |    |    |-- work: string (nullable = true)
 |    |    |    |-- home: string (nullable = true)
 |    |    |-- addressDetails: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- addressType: string (nullable = true)
 |    |    |    |    |-- address: string (nullable = true)
When I select fields from familyMembers, I get the results I expect:

data.select('familyMembers.contactNumbers.work').show(truncate=False)
+------------------------------------------------+
|work                                            |
+------------------------------------------------+
|[(07) 4612 3880, (03) 5855 2377, (07) 4979 1871]|
|[(07) 4612 3880, (03) 5855 2377]                |
+------------------------------------------------+

data.select('familyMembers.name').show(truncate=False)
+------------------------------------+
|name                                |
+------------------------------------+
|[Jane Smith, Bob Smith, Simon Smith]|
|[Jackie Sacamano, Simon Sacamano]   |
+------------------------------------+
However, when I try to select fields from the addressDetails array (nested beneath familyMembers), I get an error:

>>> data.select('familyMembers.addressDetails.address').show(truncate=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1421, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/usr/local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/usr/local/lib/python3.9/site-packages/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: cannot resolve '`familyMembers`.`addressDetails`['address']' due to data type mismatch: argument 2 requires integral type, however, ''address'' is of string type.;;
'Project [familyMembers#71.addressDetails[address] AS address#277]
+- LogicalRDD [recordID#69, customerDetails#70, familyMembers#71], false
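A rough mental model for this error (a plain-Python analogy with a hypothetical helper, not Spark's actual resolver): dotted field access descends into structs and maps over one level of array per field, so a field behind a single array resolves fine, but behind an array of arrays the "index" handed to the inner array is still the string field name, which fails:

```python
def get_path(value, path):
    """Toy version of dotted-path resolution: descend into dicts (structs),
    map over lists (arrays) one level deep per field access."""
    for field in path.split("."):
        if isinstance(value, list):
            value = [element[field] for element in value]
        else:
            value = value[field]
    return value

row = {
    "familyMembers": [
        {"name": "Jane Smith",
         "contactNumbers": {"work": "(07) 4612 3880"},
         "addressDetails": [{"addressType": "home", "address": "29 Commonwealth St"}]},
        {"name": "Bob Smith",
         "contactNumbers": {"work": "(03) 5855 2377"},
         "addressDetails": [{"addressType": "home", "address": "20 A Yeo Ave"}]},
    ]
}

print(get_path(row, "familyMembers.name"))                 # ['Jane Smith', 'Bob Smith']
print(get_path(row, "familyMembers.contactNumbers.work"))  # ['(07) 4612 3880', '(03) 5855 2377']

try:
    get_path(row, "familyMembers.addressDetails.address")
except TypeError as err:
    # list indices must be integers -- the same complaint Spark makes
    print("failed:", err)
```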

To understand this, you can print the schema of the intermediate selection:

data.select('familyMembers.addressDetails').printSchema()

#root
# |-- familyMembers.addressDetails: array (nullable = true)
# |    |-- element: array (containsNull = true)
# |    |    |-- element: struct (containsNull = true)
# |    |    |    |-- addressType: string (nullable = true)
# |    |    |    |-- address: string (nullable = true)
As you can see here, you now have an array of arrays of structs, which differs from your initial schema. So you can't access address directly from the root, but you can select the first element of the nested array and then access the struct field address:
data.selectExpr("familyMembers.addressDetails[0].address").show(truncate=False)

#+--------------------------------------------------------------------------+
#|familyMembers.addressDetails AS addressDetails#29[0].address              |
#+--------------------------------------------------------------------------+
#|[29 Commonwealth St, Clifton, QLD 4361 , 20 A Yeo Ave, Highgate, SA 5063 ]|
#+--------------------------------------------------------------------------+
Or:


In addition to the answer provided by @blackishop, you can also use a combination of select and expr to get output like this:

from pyspark.sql.functions import expr

data.select(expr('familyMembers.addressDetails[0].address')).show(truncate=False)

Output:

If needed, you can also use explode to get all the addresses, like this:

from pyspark.sql.functions import explode

data.select(explode('familyMembers.addressDetails')).select("col.address").show(truncate=False)

Output: