How to parse embedded JSON in PySpark


I am new to PySpark.

I have a JSON file with the following schema:

df = spark.read.json(input_file)

df.printSchema()

 |-- UrlsInfo: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- displayUrl: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- type: long (nullable = true)

I want a new result DataFrame that has only two columns: type and UrlsInfo.element.DisplayUrl

This is the code I tried, but it does not give the expected output:

  df.createOrReplaceTempView("the_table")  
  resultDF = spark.sql("SELECT type, UrlsInfo.element.DisplayUrl FROM the_table")
  resultDF.show()

I would like resultDF to look like this:

Type | DisplayUrl
----- ------------
2    | http://example.com

There is a related question, but it does not answer mine.

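As you can see in the schema, UrlsInfo is an array type, not a struct. The "element" schema entry therefore does not refer to a named property (which you are trying to access with .element) but to the elements of the array (which respond to an index such as [0]).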

I reproduced your schema by hand:

from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit")], Type=2)])
df.printSchema()

root
 |-- Type: long (nullable = true)
 |-- UrlsInfo: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- displayUri: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)

Using an index, I can produce a table similar to the one you are looking for:

df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, UrlsInfo[0].DisplayUri FROM temp")
resultDF.show()

+----+----------------------+
|type|UrlsInfo[0].DisplayUri|
+----+----------------------+
|   2|    http://example.com|
+----+----------------------+

However, the second column then contains only the first element of UrlsInfo (if there is one).

EDIT: I had forgotten that you can use EXPLODE here to treat the UrlsInfo elements as a set of rows:

from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit"), Row(displayUri="http://another-example.com", type="narf", url="poit")], Type=2)])
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, EXPLODE(UrlsInfo.displayUri) AS displayUri FROM temp")
resultDF.show()

+----+--------------------+
|type|          displayUri|
+----+--------------------+
|   2|  http://example.com|
|   2|http://another-ex...|
+----+--------------------+
