Json: Join the array field of one record with all other records - PySpark
In the schema below there is a user with userId 0. I need to concatenate the languageknowList of userId 0 with the languageknowList of every other user. How can I do this? Example input and the desired output follow the schema.

Schema:
root
|-- userId: string (nullable = true)
|-- languageknowList: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- code: string (nullable = false)
| | |-- description: string (nullable = false)
| | |-- name: string (nullable = false)
For example, the input data for the DataFrame:
[{
"userId":1,
"languageknowList": [[10,"Hindi","Hindi"],[11,"Spanish","Spanish"]]
},
{
"userId":2,
"languageknowList": [[11,"Spanish","Spanish"]]
},
{
"userId":0,
"languageknowList": [[1,"English","English"],[2,"German","German"]]
}]
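The output DataFrame should look like the result below. You can cross join the DataFrame with the row where userId = 0 and concat the language arrays: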
[{
"userId":1,
"languageknowList": [[10,"Hindi","Hindi"],[11,"Spanish","Spanish"],[1,"English","English"],[2,"German","German"]]
},
{
"userId":2,
"languageknowList": [[11,"Spanish","Spanish"],[1,"English","English"],[2,"German","German"]]
}]
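Comments:
I got the error "Detected implicit cartesian product for INNER join between logical plans". I used crossJoin instead of join, and now it works fine.
Ah, sorry about that. I have edited my answer.
What is coalesce(1)?
It collects the whole DataFrame into a single partition so that the output is not split across multiple JSON files. Use it only when the DataFrame is small; otherwise it can cause out-of-memory problems.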
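from pyspark.sql import functions as F

# Pair every user other than 0 with the userId = 0 row, then append that
# row's language array to each user's own array.
# F.concat accepts array columns in Spark 2.4 and later.
result = df.filter('userId != 0').crossJoin(
    df.filter('userId = 0').select('languageknowList').toDF('language')
).select(
    'userId',
    F.concat('languageknowList', 'language').alias('languageknowList')
)
result.show(20, truncate=False)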
+------+----------------------------------------------------------------------------------------+
|userId|languageknowList |
+------+----------------------------------------------------------------------------------------+
|1 |[[10, Hindi, Hindi], [11, Spanish, Spanish], [1, English, English], [2, German, German]]|
|2 |[[11, Spanish, Spanish], [1, English, English], [2, German, German]] |
+------+----------------------------------------------------------------------------------------+
result.coalesce(1).write.json('result')
$ cat result/part-00000-b34b3748-71b5-46d4-b011-6b208978cc5a-c000.json
{"userId":1,"languageknowList":[["10","Hindi","Hindi"],["11","Spanish","Spanish"],["1","English","English"],["2","German","German"]]}
{"userId":2,"languageknowList":[["11","Spanish","Spanish"],["1","English","English"],["2","German","German"]]}
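
To check the semantics of the cross join plus concat without a Spark session, the same operation can be modeled in plain Python. This is only a sketch: the function name `append_base_languages` and the dict-based rows are assumptions made for illustration and are not part of the answer above or of the Spark API.

```python
def append_base_languages(rows, base_user_id=0):
    """Append the base user's language list to every other user's list.

    Models filter(userId != 0).crossJoin(filter(userId = 0)) followed by
    concat: each non-base row is paired with each base row.
    """
    # Hypothetical row format: {"userId": ..., "languageknowList": [...]}
    base_lists = [r["languageknowList"] for r in rows if r["userId"] == base_user_id]
    others = [r for r in rows if r["userId"] != base_user_id]
    # Cross join: one output row per (other row, base row) pair.
    return [
        {"userId": r["userId"], "languageknowList": r["languageknowList"] + base}
        for r in others
        for base in base_lists
    ]


rows = [
    {"userId": 1, "languageknowList": [[10, "Hindi", "Hindi"], [11, "Spanish", "Spanish"]]},
    {"userId": 2, "languageknowList": [[11, "Spanish", "Spanish"]]},
    {"userId": 0, "languageknowList": [[1, "English", "English"], [2, "German", "German"]]},
]

result = append_base_languages(rows)
for row in result:
    print(row)
```

The model also makes two edge cases visible that apply to the cross join as well: if no row has userId 0 the result is empty, and if several rows have userId 0 every other user appears once per such row.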