Apache Spark: error when using combineByKey()


I am using combineByKey() on this RDD to produce, for each day, the total number of orders and the total amount for each order status. Here is the schema and a sample of the data:

joindf.printSchema()
root
 |-- order_customer_id: string (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_item_id: string (nullable = true)
 |-- order_item_order_id: string (nullable = true)
 |-- order_item_product_id: string (nullable = true)
 |-- order_item_product_price: string (nullable = true)
 |-- order_item_quantity: string (nullable = true)
 |-- order_item_subtotal: string (nullable = true)



joindf.show(5)
+-----------------+--------------------+--------+------------+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+
|order_customer_id|          order_date|order_id|order_status|order_item_id|order_item_order_id|order_item_product_id|order_item_product_price|order_item_quantity|order_item_subtotal|
+-----------------+--------------------+--------+------------+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+
|            10153|2013-08-17 00:00:...|    4061|    COMPLETE|        10153|               4080|                  365|                   59.99|                  4|             239.96|
|            10153|2014-01-12 00:00:...|   27596|     PENDING|        10153|               4080|                  365|                   59.99|                  4|             239.96|
|            10153|2014-07-18 00:00:...|   56604|      CLOSED|        10153|               4080|                  365|                   59.99|                  4|             239.96|
|            10153|2013-08-14 00:00:...|   58259|    COMPLETE|        10153|               4080|                  365|                   59.99|                  4|             239.96|
|            10153|2013-08-14 00:00:...|   58269|     PENDING|        10153|               4080|                  365|                   59.99|                  4|             239.96|
+-----------------+--------------------+--------+------------+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+
This is the error I get:

TypeError: 'int' object is not iterable


Where am I going wrong? Please help.

You already have a DataFrame, so there is no need to convert it to an RDD to do this.
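
For example, staying entirely in the DataFrame API, a minimal sketch of what you described (distinct orders and total amount per day and status; the column names come from your schema, the aliases are my own):

from pyspark.sql import functions as F

(joindf
    .groupBy(F.split("order_date", " ").getItem(0).alias("day"), "order_status")
    .agg(F.countDistinct("order_id").alias("total_orders"),
         F.sum("order_item_subtotal").alias("total_amount"))
    .show(5))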

If you do want to stick with combineByKey(), the TypeError comes from the functions you pass to it: set(v[1]) fails because v[1] is an int and set() expects an iterable, and set.add() / set.update() mutate in place and return None, so the accumulator would be lost anyway.
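
A quick REPL check of the first problem (4061 is just an example order id from your sample):

>>> set(4061)   # TypeError: 'int' object is not iterable
>>> {4061}      # a one-element set instead
{4061}

With those fixes, the call would look roughly like this: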

# a PySpark DataFrame has no .map(), so go through .rdd first
combined = joindf.rdd.map(lambda x: ((str(x[1]), str(x[3])), (float(x[9]), int(x[2])))) \
    .combineByKey(
        lambda v: (v[0], {v[1]}),                         # createCombiner: (subtotal, {order_id})
        lambda acc, v: (acc[0] + v[0], acc[1] | {v[1]}),  # mergeValue: add subtotal, collect the order id
        lambda a, b: (a[0] + b[0], a[1] | b[1]))          # mergeCombiners: sum amounts, union the id sets
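
The sets are there so the same order id is not counted twice across item rows; to get the final numbers, one more step (combined is the RDD defined above):

per_day_status = combined.mapValues(lambda v: (v[0], len(v[1])))  # (total_amount, distinct_order_count)
for (date, status), (amount, orders) in per_day_status.take(5):
    print(date, status, amount, orders)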

Hope this helps.

Does this help? The code is Scala, but you can convert it to Python:
import org.apache.spark.sql.functions.{split, sum}
import spark.implicits._  // for the $"column" syntax

joindf.groupBy(split($"order_date", " ")(0).as("order_date"))
    .agg(sum($"order_item_quantity"), sum($"order_item_subtotal"))
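
The same thing in PySpark would be roughly:

from pyspark.sql import functions as F

joindf.groupBy(F.split("order_date", " ").getItem(0).alias("order_date")) \
    .agg(F.sum("order_item_quantity"), F.sum("order_item_subtotal")) \
    .show()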