PySpark explode SQL function

I have this schema:

root
 |-- _id: long (nullable = true)
 |-- _published-at: string (nullable = true)
 |-- _title: string (nullable = true)
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _href: string (nullable = true)
 |    |    |-- _type: string (nullable = true)
 |-- p: array (nullable = true)
 |    |-- element: string (containsNull = true)

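from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType)

# Each entry of column "a" is a link: its text, target and type.
link_structure = StructType([
    StructField("_VALUE", StringType(), True),
    StructField("_href", StringType(), True),
    StructField("_type", StringType(), True)
    ])

# One row per article: "a" holds the links, "p" holds the paragraphs.
articles_schema = StructType([
    StructField("_id", LongType(), True),
    StructField("_published-at", StringType(), True),
    StructField("_title", StringType(), True),
    StructField("a", ArrayType(link_structure), True),
    StructField("p", ArrayType(StringType()), True)])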

Sample data:

+---+-------------+--------------------+--------------------+--------------------+
|_id|_published-at|              _title|                   a|                   p|
+---+-------------+--------------------+--------------------+--------------------+
| 17|   2004-07-29|SAN FRANCISCO / H...|[[Gwendolyn Tucke...|[Chief juvenile p...|
| 19|   2017-10-05|Nancy Pelosi Lies...|[[so he asked her...|[CNN recently hos...|
| 23|   2017-04-20|University leader...|[[letter, http://...|[Pro-life student...|
| 24|   2011-01-14|What Wine Prices ...|[[A new working p...|[More on:, <a>Fos...|
+---+-------------+--------------------+--------------------+--------------------+

Data:

But I get this result:

+---+----+
|_id| col|
+---+----+
| 17|null|
| 19|null|
| 23|null|
| 24|null|
+---+----+

How can I do this?

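What do the values in links look like?

Updated with the link data.

I don't think you need to create an array there. Your a._href already looks like an array by itself. A simple explode('a._href') should work.

That is how it was at first (without the array), but I got empty sets. With the array I started getting the _id values, but with nulls in the links column! Do I need to create a map that holds the id for each link, instead of an array?

No. The array part just zips the two columns to create pairs of values. I think it is because the values in the a._href column array are not quoted.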

+---+--------------------+--------------------+
|_id|             content|               links|
+---+--------------------+--------------------+
| 17|[Chief juvenile p...|[[/search/?action...|
| 19|[CNN recently hos...|[[https://www.you...|
| 23|[Pro-life student...|[[http://yourstud...|
| 24|[More on:, <a>Fos...|[[http://www.imf....|
+---+--------------------+--------------------+

df2 = articles_df.select("_id", fun.explode(fun.col('links')))

+---+----+
|_id| col|
+---+----+
| 17|null|
| 19|null|
| 23|null|
| 24|null|
+---+----+