PySpark conditional sum with groupBy

Below is my data. I am grouping by parcel_id and want to sum sqft only where imprv_det_type_cd starts with MA.

Input:

+------------+----+-----+-----------------+
|   parcel_id|year| sqft|imprv_det_type_cd|
+------------+----+-----+-----------------+
|000000100010|2014| 4272|               MA|
|000000100010|2014|  800|              60P|
|000000100010|2014| 3200|              MA2|
|000000100010|2014| 1620|              49R|
|000000100010|2014| 1446|              46R|
|000000100010|2014|40140|              45B|
|000000100010|2014| 1800|              45C|
|000000100010|2014|  864|              49C|
|000000100010|2014|    1|              48S|
+------------+----+-----+-----------------+

In this case, only two of the rows above (MA and MA2) should be counted.

Expected output:

+---------+-----------------+--------------------+----------+
|parcel_id|imprv_det_type_cd|structure_total_sqft|year_built|
+---------+-----------------+--------------------+----------+
|100010   |MA               |7472                |2014      |
+---------+-----------------+--------------------+----------+

Code:

I know that in this code the line .withColumn("structure_total_sqft", F.sum("sqft").over(w_impr)) needs to change, but I'm not sure what the change should be. I tried a few things, but it still doesn't work.
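For context, a runnable sketch of what that attempt looks like; the definition of w_impr is an assumption here (a window partitioned by parcel_id and year), since only the withColumn line is quoted:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assumed window spec (not shown in the question): one partition per parcel and year.
w_impr = Window.partitionBy('parcel_id', 'year')

# Current attempt: sums sqft over the whole partition with no condition on
# imprv_det_type_cd, so every row's sqft ends up in the total.
df_win = df.withColumn('structure_total_sqft', F.sum('sqft').over(w_impr))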


Thanks in advance.

I'm not sure why you describe this as a groupBy, since your code doesn't actually do one. A filter plus groupBy gets the result directly:

from pyspark.sql import functions as f

# Strip leading zeros from parcel_id, keep only MA* rows, then aggregate.
df.withColumn('parcel_id', f.regexp_replace('parcel_id', r'^[0]*', '')) \
  .filter("imprv_det_type_cd like 'MA%'") \
  .groupBy('parcel_id', 'year') \
  .agg(f.sum('sqft').alias('sqft'),
       f.first(f.substring('imprv_det_type_cd', 1, 2)).alias('imprv_det_type_cd')) \
  .show(10, False)

+---------+----+------+-----------------+
|parcel_id|year|sqft  |imprv_det_type_cd|
+---------+----+------+-----------------+
|100010   |2014|7472.0|MA               |
+---------+----+------+-----------------+
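Note that sqft is a string column (see the printSchema output further down), so the implicit cast inside f.sum produces a double (7472.0). A small variant with an explicit cast, if an integral total is preferred (a sketch, not part of the original answer):

from pyspark.sql import functions as f

# Casting sqft to long before summing keeps the total integral (7472, not 7472.0).
df.filter("imprv_det_type_cd like 'MA%'") \
  .groupBy('parcel_id', 'year') \
  .agg(f.sum(f.col('sqft').cast('long')).alias('structure_total_sqft')) \
  .show(10, False)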
Alternatively, using sum(when(...)) (Scala):

df2.show(false)
df2.printSchema()
/**
  * +------------+----+-----+-----------------+
  * |parcel_id   |year|sqft |imprv_det_type_cd|
  * +------------+----+-----+-----------------+
  * |000000100010|2014|4272 |MA               |
  * |000000100010|2014|800  |60P              |
  * |000000100010|2014|3200 |MA2              |
  * |000000100010|2014|1620 |49R              |
  * |000000100010|2014|1446 |46R              |
  * |000000100010|2014|40140|45B              |
  * |000000100010|2014|1800 |45C              |
  * |000000100010|2014|864  |49C              |
  * |000000100010|2014|1    |48S              |
  * +------------+----+-----+-----------------+
  *
  * root
  * |-- parcel_id: string (nullable = true)
  * |-- year: string (nullable = true)
  * |-- sqft: string (nullable = true)
  * |-- imprv_det_type_cd: string (nullable = true)
  */
import org.apache.spark.sql.functions._

// Conditional aggregation: when() is NULL for non-MA rows, and sum() skips NULLs.
val p = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
  .agg(
    sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft")).as("structure_total_sqft"),
    first("imprv_det_type_cd").as("imprv_det_type_cd"),
    first($"year").as("year_built")
  )
p.show(false)
p.explain()
/**
  * +---------+--------------------+-----------------+----------+
  * |parcel_id|structure_total_sqft|imprv_det_type_cd|year_built|
  * +---------+--------------------+-----------------+----------+
  * |100010   |7472.0              |MA               |2014      |
  * +---------+--------------------+-----------------+----------+
  */
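Since the question is tagged pyspark, here is a hedged PySpark translation of the Scala aggregation above (same logic and column names; df2 is the same input DataFrame):

from pyspark.sql import functions as F

# when() yields NULL for non-MA rows and sum() skips NULLs, so only the
# MA/MA2 sqft values contribute to structure_total_sqft.
p = df2.groupBy(F.expr('cast(parcel_id as int) as parcel_id')) \
       .agg(F.sum(F.when(F.col('imprv_det_type_cd').startswith('MA'), F.col('sqft'))).alias('structure_total_sqft'),
            F.first('imprv_det_type_cd').alias('imprv_det_type_cd'),
            F.first('year').alias('year_built'))
p.show(truncate=False)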