
JSON: Concatenate the array field of one record with all other records - PySpark


In the schema below there is one user with userId 0. I have to concatenate the languageknowList of the userId 0 record with the languageknowList of every other user.

How can I do this?

For example, the input DF has the following schema:

 root
     |-- userId: string (nullable = true)
     |-- languageknowList: array (nullable = true)
     |    |-- element: struct (containsNull = false)
     |    |    |-- code: string (nullable = false)
     |    |    |-- description: string (nullable = false)
     |    |    |-- name: string (nullable = false)
and contains data such as:

[{
  "userId":1,
  "languageknowList": [[10,"Hindi","Hindi"],[11,"Spanish","Spanish"]]
},
{
  "userId":2,
  "languageknowList": [[11,"Spanish","Spanish"]]
},
{
  "userId":0,
  "languageknowList": [[1,"English","English"],[2,"German","German"]]
}]
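For reference, a minimal sketch (sample data assumed from the question above) that builds such a DataFrame:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed sample rows matching the question's schema; code/description/name
# are modeled as strings, as the printed schema declares.
data = [
    ("1", [("10", "Hindi", "Hindi"), ("11", "Spanish", "Spanish")]),
    ("2", [("11", "Spanish", "Spanish")]),
    ("0", [("1", "English", "English"), ("2", "German", "German")]),
]
df = spark.createDataFrame(
    data,
    "userId string, "
    "languageknowList array<struct<code:string,description:string,name:string>>",
)
df.printSchema()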

The output DF should look like:

[{
  "userId":1,
  "languageknowList": [[10,"Hindi","Hindi"],[11,"Spanish","Spanish"],[1,"English","English"],[2,"German","German"]]
},
{
  "userId":2,
  "languageknowList": [[11,"Spanish","Spanish"],[1,"English","English"],[2,"German","German"]]
}]

You can cross-join the DataFrame to the userId = 0 row and concat the language arrays:
import pyspark.sql.functions as F

# Cross-join each non-zero user to the single userId = 0 row,
# then append that row's languages to each user's own list.
result = df.filter('userId != 0').crossJoin(
    df.filter('userId = 0').select('languageknowList').toDF('language')
).select(
    'userId',
    F.concat('languageknowList', 'language').alias('languageknowList')
)

result.show(20,0)
+------+----------------------------------------------------------------------------------------+
|userId|languageknowList                                                                        |
+------+----------------------------------------------------------------------------------------+
|1     |[[10, Hindi, Hindi], [11, Spanish, Spanish], [1, English, English], [2, German, German]]|
|2     |[[11, Spanish, Spanish], [1, English, English], [2, German, German]]                    |
+------+----------------------------------------------------------------------------------------+

result.coalesce(1).write.json('result')
$ cat result/part-00000-b34b3748-71b5-46d4-b011-6b208978cc5a-c000.json
{"userId":1,"languageknowList":[["10","Hindi","Hindi"],["11","Spanish","Spanish"],["1","English","English"],["2","German","German"]]}
{"userId":2,"languageknowList":[["11","Spanish","Spanish"],["1","English","English"],["2","German","German"]]}