PySpark conditional sum with groupBy

Below is my data. I am grouping by parcel_id and want to sum sqft only where imprv_det_type_cd starts with MA.

Input:

+------------+----+-----+-----------------+
|   parcel_id|year| sqft|imprv_det_type_cd|
+------------+----+-----+-----------------+
|000000100010|2014| 4272|               MA|
|000000100010|2014|  800|              60P|
|000000100010|2014| 3200|              MA2|
|000000100010|2014| 1620|              49R|
|000000100010|2014| 1446|              46R|
|000000100010|2014|40140|              45B|
|000000100010|2014| 1800|              45C|
|000000100010|2014|  864|              49C|
|000000100010|2014|    1|              48S|
+------------+----+-----+-----------------+

In this case, only two of the rows above (MA and MA2) should be counted.

Expected output:

+---------+-----------------+--------------------+----------+
|parcel_id|imprv_det_type_cd|structure_total_sqft|year_built|
+---------+-----------------+--------------------+----------+
|100010   |MA               |7472                |2014      |
+---------+-----------------+--------------------+----------+

Code:

I know that in this code the line .withColumn("structure_total_sqft", F.sum("sqft").over(w_impr)) needs to change, but I'm not sure what the change should be. I tried a few things, but it still doesn't work.
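For context, a runnable sketch of what that attempt looks like; the definition of w_impr is an assumption here (a window partitioned by parcel_id and year), since only the withColumn line is quoted:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assumed window spec (not shown in the question): one partition per parcel and year.
w_impr = Window.partitionBy('parcel_id', 'year')

# Current attempt: sums sqft over the whole partition with no condition on
# imprv_det_type_cd, so every row's sqft ends up in the total.
df_win = df.withColumn('structure_total_sqft', F.sum('sqft').over(w_impr))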


Thanks in advance.

I'm not sure why you describe this as a groupBy, since your code doesn't actually do one. A filter plus groupBy gets the result directly:

from pyspark.sql import functions as f

# Strip leading zeros from parcel_id, keep only MA* rows, then aggregate.
df.withColumn('parcel_id', f.regexp_replace('parcel_id', r'^[0]*', '')) \
  .filter("imprv_det_type_cd like 'MA%'") \
  .groupBy('parcel_id', 'year') \
  .agg(f.sum('sqft').alias('sqft'),
       f.first(f.substring('imprv_det_type_cd', 1, 2)).alias('imprv_det_type_cd')) \
  .show(10, False)

+---------+----+------+-----------------+
|parcel_id|year|sqft  |imprv_det_type_cd|
+---------+----+------+-----------------+
|100010   |2014|7472.0|MA               |
+---------+----+------+-----------------+
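Note that sqft is a string column (see the printSchema output further down), so the implicit cast inside f.sum produces a double (7472.0). A small variant with an explicit cast, if an integral total is preferred (a sketch, not part of the original answer):

from pyspark.sql import functions as f

# Casting sqft to long before summing keeps the total integral (7472, not 7472.0).
df.filter("imprv_det_type_cd like 'MA%'") \
  .groupBy('parcel_id', 'year') \
  .agg(f.sum(f.col('sqft').cast('long')).alias('structure_total_sqft')) \
  .show(10, False)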
Alternatively, using sum(when(...)) (Scala):

df2.show(false)
df2.printSchema()
/**
  * +------------+----+-----+-----------------+
  * |parcel_id   |year|sqft |imprv_det_type_cd|
  * +------------+----+-----+-----------------+
  * |000000100010|2014|4272 |MA               |
  * |000000100010|2014|800  |60P              |
  * |000000100010|2014|3200 |MA2              |
  * |000000100010|2014|1620 |49R              |
  * |000000100010|2014|1446 |46R              |
  * |000000100010|2014|40140|45B              |
  * |000000100010|2014|1800 |45C              |
  * |000000100010|2014|864  |49C              |
  * |000000100010|2014|1    |48S              |
  * +------------+----+-----+-----------------+
  *
  * root
  * |-- parcel_id: string (nullable = true)
  * |-- year: string (nullable = true)
  * |-- sqft: string (nullable = true)
  * |-- imprv_det_type_cd: string (nullable = true)
  */
import org.apache.spark.sql.functions._

// Conditional aggregation: when() is NULL for non-MA rows, and sum() skips NULLs.
val p = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
  .agg(
    sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft")).as("structure_total_sqft"),
    first("imprv_det_type_cd").as("imprv_det_type_cd"),
    first($"year").as("year_built")
  )
p.show(false)
p.explain()
/**
  * +---------+--------------------+-----------------+----------+
  * |parcel_id|structure_total_sqft|imprv_det_type_cd|year_built|
  * +---------+--------------------+-----------------+----------+
  * |100010   |7472.0              |MA               |2014      |
  * +---------+--------------------+-----------------+----------+
  */
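Since the question is tagged pyspark, here is a hedged PySpark translation of the Scala aggregation above (same logic and column names; df2 is the same input DataFrame):

from pyspark.sql import functions as F

# when() yields NULL for non-MA rows and sum() skips NULLs, so only the
# MA/MA2 sqft values contribute to structure_total_sqft.
p = df2.groupBy(F.expr('cast(parcel_id as int) as parcel_id')) \
       .agg(F.sum(F.when(F.col('imprv_det_type_cd').startswith('MA'), F.col('sqft'))).alias('structure_total_sqft'),
            F.first('imprv_det_type_cd').alias('imprv_det_type_cd'),
            F.first('year').alias('year_built'))
p.show(truncate=False)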