Convert multiple list columns in a PySpark DataFrame into a JSON array column

Tags: json, apache-spark, pyspark, apache-spark-sql

I have a DataFrame with several list (array) columns that I need to convert into a single JSON array column.

I defined a UDF with an array return type, as below, and called it with the columns, but it did not work:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

def test(test1, test2):
    d = {'data': [{'marks': a, 'grades': t} for a, t in zip(test1, test2)]}
    return d

arrayToMapUDF = udf(test, ArrayType(StringType()))

df.withColumn("jsonarraycolumn", arrayToMapUDF(col("col"), col("col2")))
+-------------------------+------------------------------+
|marks                    |grades                        |
+-------------------------+------------------------------+
|[100, 150, 200, 300, 400]|[0.01, 0.02, 0.03, 0.04, 0.05]|
+-------------------------+------------------------------+
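For reference, the answer below renames the columns to amount and discount; a minimal DataFrame matching its sample output (my assumption, the original post only shows the values above) could be built like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed sample data mirroring the output tables in the answer below
df = spark.createDataFrame(
    [([1000, 15000, 2000, 3000, 4000], [0.01, 0.02, 0.03, 0.04, 0.05])],
    ['amount', 'discount'],
)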
You can use StringType instead, because the function returns a JSON string rather than an array of strings, and use json.dumps to convert the dictionary into that JSON string:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json

def test(test1,test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return json.dumps(d)

arrayToMapUDF = F.udf(test, StringType())

df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))

df2.show(truncate=False)
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount                         |discount                      |jsonarraycolumn                                                                                                                                                                      |
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{"amount": 1000, "discount": 0.01}, {"amount": 15000, "discount": 0.02}, {"amount": 2000, "discount": 0.03}, {"amount": 3000, "discount": 0.04}, {"amount": 4000, "discount": 0.05}]|
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
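As a quick sanity check (my addition, not part of the original answer), the generated string can be parsed back with from_json, which accepts an array-of-structs schema in Spark 2.4+:

from pyspark.sql.types import ArrayType, StructType, StructField, LongType, DoubleType

# Schema assumed to match the strings produced above
schema = ArrayType(StructType([
    StructField('amount', LongType()),
    StructField('discount', DoubleType()),
]))

df2.select(F.from_json('jsonarraycolumn', schema).alias('parsed')).show(truncate=False)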
If you don't want the quotes (note that the result is then no longer strictly valid JSON):

import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json

def test(test1,test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return json.dumps(d).replace('"', '')

arrayToMapUDF = F.udf(test, StringType())

df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))

df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount                         |discount                      |jsonarraycolumn                                                                                                                                                  |
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{amount: 1000, discount: 0.01}, {amount: 15000, discount: 0.02}, {amount: 2000, discount: 0.03}, {amount: 3000, discount: 0.04}, {amount: 4000, discount: 0.05}]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
If you want a real JSON-type (array of structs) column:

from pyspark.sql.types import ArrayType, StructType, StructField, StringType

def test(test1, test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return d

arrayToMapUDF = F.udf(test,
    ArrayType(
        StructType([
            StructField('amount', StringType()),
            StructField('discount', StringType())
        ])
    )
)

df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))

df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|amount                         |discount                      |jsonarraycolumn                                                        |
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[[1000, 0.01], [15000, 0.02], [2000, 0.03], [3000, 0.04], [4000, 0.05]]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------+

df2.printSchema()
root
 |-- amount: array (nullable = false)
 |    |-- element: integer (containsNull = false)
 |-- discount: array (nullable = false)
 |    |-- element: double (containsNull = false)
 |-- jsonarraycolumn: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount: string (nullable = true)
 |    |    |-- discount: string (nullable = true)
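If you later need the JSON text anyway, the typed column can be serialized with to_json (a small sketch continuing the snippet above; to_json handles arrays of structs in Spark 2.4+):

# Serialize the array-of-structs column into a JSON string
df3 = df2.withColumn('jsonstring', F.to_json('jsonarraycolumn'))
df3.select('jsonstring').show(truncate=False)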

To avoid using a UDF altogether, you can use the built-in functions transform and arrays_zip:

import pyspark.sql.functions as f

transform_expr = "transform(arrays_zip(amount, discount), value -> value)"
df = df.withColumn('jsonarraycolumn', f.to_json(f.expr(transform_expr)))
df.show(truncate=False)
Output:

+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount                         |discount                      |jsonarraycolumn                                                                                                                                                             |
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{"amount":1000.0,"discount":0.01},{"amount":15000.0,"discount":0.02},{"amount":2000.0,"discount":0.03},{"amount":3000.0,"discount":0.04},{"amount":4000.0,"discount":0.05}]|
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
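Since the value -> value lambda only passes each zipped struct through unchanged, a shorter variant (my sketch, not from the original answer) is to feed arrays_zip straight into to_json; note that the struct field names arrays_zip produces can vary between Spark versions, so check the output:

import pyspark.sql.functions as f

# Sketch: zip the two array columns and serialize directly, no SQL lambda needed
df = df.withColumn('jsonarraycolumn', f.to_json(f.arrays_zip('amount', 'discount')))
df.show(truncate=False)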

Comments:

Maybe I confused you, but the given solution works. However, when I write it back to the database as "properties", it comes out as "{\"properties\": [{\"amount\": \"8000000\", \"discount\": \"0.01\"}, ...]}": an extra backslash character gets added and the attribute names are repeated (since it is JSON), and that is the shape I want to save to the database. I tried a few things without success, e.g. def zip(xs, ys): return [{'amount': a, 'discount': t} for a, t in zip(xs, ys)] with arrayToMapUDF = udf(zip, StructType([StructField('marks', IntegerType()), StructField('marks1', DecimalType())])). Can we create an array of key-value pairs without using json.dumps?

@mike with def zip(xs, ys): return [{'amount': a, 'discount': t} for a, t in zip(xs, ys)] and arrayToMapUDF = udf(zip, StructType([StructField('marks', IntegerType()), StructField('marks1', DecimalType())])), jsoncolumn comes out as [{"amount": 1000, "discount": 0.01}, {"amount": 15000, "discount": 0.02}, {"amount": 2000, "discount": 0.03}, {"amount": 3000, "discount": 0.04}, {"amount": 4000, "discount": 0.05}]. When I run the code I get an error message:

Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.expr.
: org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '->' expecting {')', ','} (line 1, pos 43)

What is your Spark version?

Spark 2.4.5

john, check my edit, I found how to make it run on Spark 2.4.5.

@Kafels it now fails with TypeError: 'str' object is not callable, raised at volumediscountDf = volumediscountDf.withColumn('jsonarraycolumn', to_json(expr(expr))).
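For what it's worth, the final TypeError in the comments is consistent with a string variable named expr shadowing the imported pyspark.sql.functions.expr; this is my reading of the traceback, not something confirmed in the thread:

import pyspark.sql.functions as F

# Assumed repro: a string named `expr` shadows the expr() function,
# so a later call like to_json(expr(expr)) tries to call a string.
expr = "transform(arrays_zip(amount, discount), value -> value)"

# Using the module-qualified name avoids the collision
df = df.withColumn('jsonarraycolumn', F.to_json(F.expr(expr)))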