PySpark explode SQL function


I have this schema:

root
 |-- _id: long (nullable = true)
 |-- _published-at: string (nullable = true)
 |-- _title: string (nullable = true)
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _href: string (nullable = true)
 |    |    |-- _type: string (nullable = true)
 |-- p: array (nullable = true)
 |    |-- element: string (containsNull = true)

from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType)

# Struct describing one link inside the "a" array
link_structure = StructType([
    StructField("_VALUE", StringType(), True),
    StructField("_href", StringType(), True),
    StructField("_type", StringType(), True)
    ])

# Top-level schema of an article row
articles_schema = StructType([
    StructField("_id", LongType(), True),
    StructField("_published-at", StringType(), True),
    StructField("_title", StringType(), True),
    StructField("a", ArrayType(link_structure), True),
    StructField("p", ArrayType(StringType()), True)])

Sample data:

+---+-------------+--------------------+--------------------+--------------------+
|_id|_published-at|              _title|                   a|                   p|
+---+-------------+--------------------+--------------------+--------------------+
| 17|   2004-07-29|SAN FRANCISCO / H...|[[Gwendolyn Tucke...|[Chief juvenile p...|
| 19|   2017-10-05|Nancy Pelosi Lies...|[[so he asked her...|[CNN recently hos...|
| 23|   2017-04-20|University leader...|[[letter, http://...|[Pro-life student...|
| 24|   2011-01-14|What Wine Prices ...|[[A new working p...|[More on:, <a>Fos...|
+---+-------------+--------------------+--------------------+--------------------+
Data:

But I get this result:

+---+----+
|_id| col|
+---+----+
| 17|null|
| 19|null|
| 23|null|
| 24|null|
+---+----+

How can I do this?

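What do the values in links look like? Please update the question with the link data.

I don't think you need to create an array there. Your a._href already looks like an array itself, so a simple explode('a._href') should work.

That is how I had it at first (without the array), but I got an empty set. With the array I started getting the id values, but the links column contains only nulls! Do I need to create a map containing the id for each link instead of an array?

No. The array part just zips the two columns to create pairs of values. I think this happens because the values are not being referenced in the a._href column array.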

The DataFrame with the content and links columns, and the call that returns the nulls:

+---+--------------------+--------------------+
|_id|             content|               links|
+---+--------------------+--------------------+
| 17|[Chief juvenile p...|[[/search/?action...|
| 19|[CNN recently hos...|[[https://www.you...|
| 23|[Pro-life student...|[[http://yourstud...|
| 24|[More on:, <a>Fos...|[[http://www.imf....|
+---+--------------------+--------------------+

import pyspark.sql.functions as fun  # assumed alias; the import is not shown in the original
df2 = articles_df.select("_id", fun.explode(fun.col('links')))