Python 3.x pyspark dataframes: why can I select some nested fields but not others?
I'm trying to write some code in Python 3.9.1, using pyspark (3.0.1), to un-nest some JSON into a dataframe. I have some dummy data with the following schema:
data.printSchema()
root
|-- recordID: string (nullable = true)
|-- customerDetails: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- dob: string (nullable = true)
|-- familyMembers: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- relationship: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- contactNumbers: struct (nullable = true)
| | | |-- work: string (nullable = true)
| | | |-- home: string (nullable = true)
| | |-- addressDetails: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- addressType: string (nullable = true)
| | | | |-- address: string (nullable = true)
When I select fields from familyMembers, I get the expected results:
data.select('familyMembers.contactNumbers.work').show(truncate=False)
+------------------------------------------------+
|work |
+------------------------------------------------+
|[(07) 4612 3880, (03) 5855 2377, (07) 4979 1871]|
|[(07) 4612 3880, (03) 5855 2377] |
+------------------------------------------------+
data.select('familyMembers.name').show(truncate=False)
+------------------------------------+
|name |
+------------------------------------+
|[Jane Smith, Bob Smith, Simon Smith]|
|[Jackie Sacamano, Simon Sacamano] |
+------------------------------------+
However, when I try to select fields from the addressDetails array type (which sits beneath familyMembers), I get an error:
>>> data.select('familyMembers.addressDetails.address').show(truncate=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1421, in select
jdf = self._jdf.select(self._jcols(*cols))
File "/usr/local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/usr/local/lib/python3.9/site-packages/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: cannot resolve '`familyMembers`.`addressDetails`['address']' due to data type mismatch: argument 2 requires integral type, however, ''address'' is of string type.;;
'Project [familyMembers#71.addressDetails[address] AS address#277]
+- LogicalRDD [recordID#69, customerDetails#70, familyMembers#71], false
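The "argument 2 requires integral type" message is easier to read with a plain-Python analogy (ordinary lists and dicts standing in for Spark arrays and structs; this is not Spark code):

```python
# familyMembers.addressDetails resolves to an array (of arrays of structs);
# asking an array for the key 'address' is like string-indexing a list.
rows = [[{"addressType": "home", "address": "29 Commonwealth St"}]]

try:
    rows["address"]  # mirrors addressDetails['address'] on an array
except TypeError as exc:
    print(exc)  # list indices must be integers or slices, not str

# Index with an integer first, then access the field:
print(rows[0][0]["address"])  # 29 Commonwealth St
```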
To understand what is happening, print the schema of that selection:
data.select('familyMembers.addressDetails').printSchema()
#root
# |-- familyMembers.addressDetails: array (nullable = true)
# | |-- element: array (containsNull = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- addressType: string (nullable = true)
# | | | |-- address: string (nullable = true)
As you can see, here you have an array of arrays of structs, which differs from your initial schema. So you cannot access address directly from the root, but you can select the first element of the nested array and then access the struct field address:
data.selectExpr("familyMembers.addressDetails[0].address").show(truncate=False)
#+--------------------------------------------------------------------------+
#|familyMembers.addressDetails AS addressDetails#29[0].address |
#+--------------------------------------------------------------------------+
#|[29 Commonwealth St, Clifton, QLD 4361 , 20 A Yeo Ave, Highgate, SA 5063 ]|
#+--------------------------------------------------------------------------+
In addition to the answer provided by @blackishop, you can also use a combination of select and expr to get the output:

from pyspark.sql.functions import expr, explode

data.select(expr('familyMembers.addressDetails[0].address'))

And if you want all of the addresses rather than just the first element, you can use explode, as follows:

data.select(explode('familyMembers.addressDetails')).select("col.address")