PySpark explode SQL function
I have this schema:
root
|-- _id: long (nullable = true)
|-- _published-at: string (nullable = true)
|-- _title: string (nullable = true)
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _href: string (nullable = true)
| | |-- _type: string (nullable = true)
|-- p: array (nullable = true)
| |-- element: string (containsNull = true)
from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType,
)

# Struct describing one link element inside the "a" array
link_structure = StructType([
    StructField("_VALUE", StringType(), True),
    StructField("_href", StringType(), True),
    StructField("_type", StringType(), True)
])
# Top-level schema of one article row
articles_schema = StructType([
    StructField("_id", LongType(), True),
    StructField("_published-at", StringType(), True),
    StructField("_title", StringType(), True),
    StructField("a", ArrayType(link_structure), True),
    StructField("p", ArrayType(StringType()), True)
])
Sample data:
+---+-------------+--------------------+--------------------+--------------------+
|_id|_published-at| _title| a| p|
+---+-------------+--------------------+--------------------+--------------------+
| 17| 2004-07-29|SAN FRANCISCO / H...|[[Gwendolyn Tucke...|[Chief juvenile p...|
| 19| 2017-10-05|Nancy Pelosi Lies...|[[so he asked her...|[CNN recently hos...|
| 23| 2017-04-20|University leader...|[[letter, http://...|[Pro-life student...|
| 24| 2011-01-14|What Wine Prices ...|[[A new working p...|[More on:, <a>Fos...|
+---+-------------+--------------------+--------------------+--------------------+
Data:
But I get this result:
+---+----+
|_id| col|
+---+----+
| 17|null|
| 19|null|
| 23|null|
| 24|null|
+---+----+
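How can I do this?
What do the values in links look like?
Updated with the links data.
I don't think you need to create an array there. Your a._href already looks like an array itself. A simple explode('a._href') should work.
That is what I tried first (without array), but I got an empty set. With array I started getting the id values, but with nulls in the links column! Do I need to create a map holding the id for each link instead of an array?
No. The array part only zips the two columns to create pairs of values. I think the problem is that the values in the a._href column array are not being referenced.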
+---+--------------------+--------------------+
|_id| content| links|
+---+--------------------+--------------------+
| 17|[Chief juvenile p...|[[/search/?action...|
| 19|[CNN recently hos...|[[https://www.you...|
| 23|[Pro-life student...|[[http://yourstud...|
| 24|[More on:, <a>Fos...|[[http://www.imf....|
+---+--------------------+--------------------+
df2 = articles_df.select("_id", fun.explode(fun.col('links')))
+---+----+
|_id| col|
+---+----+
| 17|null|
| 19|null|
| 23|null|
| 24|null|
+---+----+