Converting multiple list columns in a PySpark DataFrame into a JSON array column
I have a DataFrame with several list columns and want to combine them into a single JSON array column. I used the logic below, but had no luck:
def test(test1, test2):
    d = {'data': [{'marks': a, 'grades': t} for a, t in zip(test1, test2)]}
    return d

arrayToMapUDF = udf(test, ArrayType(StringType()))

df.withColumn("jsonarraycolumn", arrayToMapUDF(col("col"), col("col2")))
I defined the UDF with an array return type as shown above and tried calling it with the two columns, but it did not solve the problem. Sample data:
+-------------------------+------------------------------+
|marks                    |grades                        |
+-------------------------+------------------------------+
|[100, 150, 200, 300, 400]|[0.01, 0.02, 0.03, 0.04, 0.05]|
+-------------------------+------------------------------+
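Independent of Spark, the per-row transformation the UDF needs to perform can be checked in plain Python first (the column names `marks`/`grades` mirror the sample data above; the helper name is just for illustration):

```python
import json

def row_to_json(marks, grades):
    # Pair the two lists element-wise and serialize the result as a JSON array string.
    return json.dumps([{'marks': m, 'grades': g} for m, g in zip(marks, grades)])

result = row_to_json([100, 150, 200, 300, 400], [0.01, 0.02, 0.03, 0.04, 0.05])
print(result)
```

Getting this string per row is exactly what the UDF answer below does; the question's version fails because it declares `ArrayType(StringType())` while returning a dict.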
You should use StringType as the UDF's return type, because it returns a JSON string, not an array of strings. You can use json.dumps to convert the dict into a JSON string:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json
def test(test1, test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return json.dumps(d)
arrayToMapUDF = F.udf(test, StringType())
df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{"amount": 1000, "discount": 0.01}, {"amount": 15000, "discount": 0.02}, {"amount": 2000, "discount": 0.03}, {"amount": 3000, "discount": 0.04}, {"amount": 4000, "discount": 0.05}]|
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
If you don't want the keys and values to be quoted:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json
def test(test1, test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return json.dumps(d).replace('"', '')
arrayToMapUDF = F.udf(test, StringType())
df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{amount: 1000, discount: 0.01}, {amount: 15000, discount: 0.02}, {amount: 2000, discount: 0.03}, {amount: 3000, discount: 0.04}, {amount: 4000, discount: 0.05}]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
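Note that stripping the quotes with `str.replace` produces a string that is no longer valid JSON and cannot be parsed back, which a quick plain-Python check confirms:

```python
import json

# Same trick as the UDF above: dump to JSON, then strip every double quote.
s = json.dumps([{'amount': 1000, 'discount': 0.01}]).replace('"', '')
print(s)  # [{amount: 1000, discount: 0.01}]

# The stripped string is no longer parseable JSON:
try:
    json.loads(s)
    valid = True
except json.JSONDecodeError:
    valid = False
```

So only use this variant for display purposes, not when the column will be consumed as JSON downstream.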
If you want a real structured column (an array of structs rather than a JSON string):
from pyspark.sql.types import ArrayType, StructType, StructField

def test(test1, test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return d

arrayToMapUDF = F.udf(test,
    ArrayType(
        StructType([
            StructField('amount', StringType()),
            StructField('discount', StringType())
        ])
    )
)
df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[[1000, 0.01], [15000, 0.02], [2000, 0.03], [3000, 0.04], [4000, 0.05]]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
df2.printSchema()
root
|-- amount: array (nullable = false)
| |-- element: integer (containsNull = false)
|-- discount: array (nullable = false)
| |-- element: double (containsNull = false)
|-- jsonarraycolumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- amount: string (nullable = true)
| | |-- discount: string (nullable = true)
To avoid using a UDF altogether, you can use built-in functions:
import pyspark.sql.functions as f

transform_expr = "transform(arrays_zip(amount, discount), value -> value)"
df = df.withColumn('jsonarraycolumn', f.to_json(f.expr(transform_expr)))
df.show(truncate=False)
Output:
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{"amount":1000.0,"discount":0.01},{"amount":15000.0,"discount":0.02},{"amount":2000.0,"discount":0.03},{"amount":3000.0,"discount":0.04},{"amount":4000.0,"discount":0.05}]|
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Maybe I confused you, but the solution given works fine. However, when I write it back to the database it comes out as `"attribute": "{\"attribute\": [{\"amount\": \"8000000\", \"discount\": \"0.01\"}...` — an extra backslash character is added and the attribute name is repeated (since, as we mentioned, it is JSON), and that is how the data gets saved to the database. I tried a few things but had no success: `def zip(xs, ys): return [{'amount': a, 'discount': t} for a, t in zip(xs, ys)]` with `arrayToMapUDF = udf(zip, StructType([StructField('marks', IntegerType()), StructField('marks1', DecimalType())]))`. Can we create an array of key-value pairs without dumping to JSON? @mike with that UDF the jsoncolumn comes out as `[{"amount": 1000, "discount": 0.01}, {"amount": 15000, "discount": 0.02}, {"amount": 2000, "discount": 0.03}, {"amount": 3000, "discount": 0.04}, {"amount": 4000, "discount": 0.05}]`, but when I run the code I get an error message.
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.expr: org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '->' expecting {')', ','} (line 1, pos 43)
What is your Spark version? `Spark 2.4.5`. John, check my edit — I found out how to run it on Spark 2.4.5. @Kafels, now I get `TypeError: 'str' object is not callable`; the traceback (most recent call last) points at `withColumn("id", col("supplierName"))` and then `volumediscountDf = volumediscountDf.withColumn('jsonarraycolumn', to_json(expr(expr)))` — TypeError: 'str' object is not callable.
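The extra backslashes described in the comments above are the typical symptom of double JSON encoding — serializing a string that already contains JSON. A minimal plain-Python illustration (the values are made up for the example):

```python
import json

# First encoding: the column value is already a JSON string.
inner = json.dumps({'amount': 8000000, 'discount': 0.01})

# Second encoding (e.g. when the whole record is serialized again on write):
# the quotes inside `inner` get escaped with backslashes.
outer = json.dumps({'attribute': inner})
print(outer)
```

To avoid this, either keep the column as an array of structs until the final write (as in the `ArrayType(StructType(...))` answer above), or make sure the writer does not JSON-encode a value that is already a JSON string.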