Converting multiple list columns in a PySpark DataFrame into a JSON array column
I have a DataFrame with several list columns and want to combine them into a single JSON array column. I used the logic below, but had no luck:
def test(test1, test2):
    d = {'data': [{'marks': a, 'grades': t} for a, t in zip(test1, test2)]}
    return d

arrayToMapUDF = udf(test, ArrayType(StringType()))

df.withColumn("jsonarraycolumn", arrayToMapUDF(col("col"), col("col2")))
I defined the UDF with an array return type as shown above and tried calling it with the two columns, but it did not solve the problem. Sample data:
+-------------------------+------------------------------+
|marks                    |grades                        |
+-------------------------+------------------------------+
|[100, 150, 200, 300, 400]|[0.01, 0.02, 0.03, 0.04, 0.05]|
+-------------------------+------------------------------+
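Independent of Spark, the per-row transformation the UDF needs to perform can be checked in plain Python first (the column names `marks`/`grades` mirror the sample data above; the helper name is just for illustration):

```python
import json

def row_to_json(marks, grades):
    # Pair the two lists element-wise and serialize the result as a JSON array string.
    return json.dumps([{'marks': m, 'grades': g} for m, g in zip(marks, grades)])

result = row_to_json([100, 150, 200, 300, 400], [0.01, 0.02, 0.03, 0.04, 0.05])
print(result)
```

Getting this string per row is exactly what the UDF answer below does; the question's version fails because it declares `ArrayType(StringType())` while returning a dict.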
You should use StringType as the UDF's return type, because it returns a JSON string, not an array of strings. You can use json.dumps to convert the dict into a JSON string:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json
def test(test1, test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return json.dumps(d)
arrayToMapUDF = F.udf(test, StringType())
df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{"amount": 1000, "discount": 0.01}, {"amount": 15000, "discount": 0.02}, {"amount": 2000, "discount": 0.03}, {"amount": 3000, "discount": 0.04}, {"amount": 4000, "discount": 0.05}]|
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
If you don't want the keys and values to be quoted:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json
def test(test1, test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return json.dumps(d).replace('"', '')
arrayToMapUDF = F.udf(test, StringType())
df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{amount: 1000, discount: 0.01}, {amount: 15000, discount: 0.02}, {amount: 2000, discount: 0.03}, {amount: 3000, discount: 0.04}, {amount: 4000, discount: 0.05}]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
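Note that stripping the quotes with `str.replace` produces a string that is no longer valid JSON and cannot be parsed back, which a quick plain-Python check confirms:

```python
import json

# Same trick as the UDF above: dump to JSON, then strip every double quote.
s = json.dumps([{'amount': 1000, 'discount': 0.01}]).replace('"', '')
print(s)  # [{amount: 1000, discount: 0.01}]

# The stripped string is no longer parseable JSON:
try:
    json.loads(s)
    valid = True
except json.JSONDecodeError:
    valid = False
```

So only use this variant for display purposes, not when the column will be consumed as JSON downstream.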
If you want a real structured column (an array of structs rather than a JSON string):
from pyspark.sql.types import ArrayType, StructType, StructField

def test(test1, test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return d

arrayToMapUDF = F.udf(test,
    ArrayType(
        StructType([
            StructField('amount', StringType()),
            StructField('discount', StringType())
        ])
    )
)
df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[[1000, 0.01], [15000, 0.02], [2000, 0.03], [3000, 0.04], [4000, 0.05]]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
df2.printSchema()
root
|-- amount: array (nullable = false)
| |-- element: integer (containsNull = false)
|-- discount: array (nullable = false)
| |-- element: double (containsNull = false)
|-- jsonarraycolumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- amount: string (nullable = true)
| | |-- discount: string (nullable = true)
To avoid using a UDF altogether, you can use built-in functions:
import pyspark.sql.functions as f

transform_expr = "transform(arrays_zip(amount, discount), value -> value)"
df = df.withColumn('jsonarraycolumn', f.to_json(f.expr(transform_expr)))
df.show(truncate=False)
Output:
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{"amount":1000.0,"discount":0.01},{"amount":15000.0,"discount":0.02},{"amount":2000.0,"discount":0.03},{"amount":3000.0,"discount":0.04},{"amount":4000.0,"discount":0.05}]|
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Maybe I confused you, but the solution given works fine. However, when I write it back to the database it comes out as `"attribute": "{\"attribute\": [{\"amount\": \"8000000\", \"discount\": \"0.01\"}...` — an extra backslash character is added and the attribute name is repeated (since, as we mentioned, it is JSON), and that is how the data gets saved to the database. I tried a few things but had no success: `def zip(xs, ys): return [{'amount': a, 'discount': t} for a, t in zip(xs, ys)]` with `arrayToMapUDF = udf(zip, StructType([StructField('marks', IntegerType()), StructField('marks1', DecimalType())]))`. Can we create an array of key-value pairs without dumping to JSON? @mike with that UDF the jsoncolumn comes out as `[{"amount": 1000, "discount": 0.01}, {"amount": 15000, "discount": 0.02}, {"amount": 2000, "discount": 0.03}, {"amount": 3000, "discount": 0.04}, {"amount": 4000, "discount": 0.05}]`, but when I run the code I get an error message.
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.expr: org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '->' expecting {')', ','} (line 1, pos 43)
What is your Spark version? `Spark 2.4.5`. John, check my edit — I found out how to run it on Spark 2.4.5. @Kafels, now I get `TypeError: 'str' object is not callable`; the traceback (most recent call last) points at `withColumn("id", col("supplierName"))` and then `volumediscountDf = volumediscountDf.withColumn('jsonarraycolumn', to_json(expr(expr)))` — TypeError: 'str' object is not callable.
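The extra backslashes described in the comments above are the typical symptom of double JSON encoding — serializing a string that already contains JSON. A minimal plain-Python illustration (the values are made up for the example):

```python
import json

# First encoding: the column value is already a JSON string.
inner = json.dumps({'amount': 8000000, 'discount': 0.01})

# Second encoding (e.g. when the whole record is serialized again on write):
# the quotes inside `inner` get escaped with backslashes.
outer = json.dumps({'attribute': inner})
print(outer)
```

To avoid this, either keep the column as an array of structs until the final write (as in the `ArrayType(StructType(...))` answer above), or make sure the writer does not JSON-encode a value that is already a JSON string.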