PySpark: sum with a condition
pyspark, group-by, sum

Below is my data. I am grouping by parcel_id, and I want to sum sqft only when imprv_det_type_cd starts with MA. Input:
+------------+----+-----+-----------------+
| parcel_id|year| sqft|imprv_det_type_cd|
+------------+----+-----+-----------------+
|000000100010|2014| 4272| MA|
|000000100010|2014| 800| 60P|
|000000100010|2014| 3200| MA2|
|000000100010|2014| 1620| 49R|
|000000100010|2014| 1446| 46R|
|000000100010|2014|40140| 45B|
|000000100010|2014| 1800| 45C|
|000000100010|2014| 864| 49C|
|000000100010|2014| 1| 48S|
+------------+----+-----+-----------------+
In this case, only two of the rows above (the MA and MA2 rows) should be counted.
Expected output:
+---------+-----------------+--------------------+----------+
|parcel_id|imprv_det_type_cd|structure_total_sqft|year_built|
+---------+-----------------+--------------------+----------+
|100010 |MA |7472 |2014 |
+---------+-----------------+--------------------+----------+
Code:
I know the change belongs in `.withColumn("structure_total_sqft", F.sum("sqft").over(w_impr))`, but I am not sure what change I have to make. I have tried, but it still does not work.
Thanks in advance.

I don't know why you want to do a groupBy, but you didn't use one; here is a version that does:
from pyspark.sql import functions as f

df.withColumn('parcel_id', f.regexp_replace('parcel_id', r'^[0]*', '')) \
  .filter("imprv_det_type_cd like 'MA%'") \
  .groupBy('parcel_id', 'year') \
  .agg(f.sum('sqft').alias('sqft'), f.first(f.substring('imprv_det_type_cd', 0, 2)).alias('imprv_det_type_cd')) \
  .show(10, False)
+---------+----+------+-----------------+
|parcel_id|year|sqft |imprv_det_type_cd|
+---------+----+------+-----------------+
|100010 |2014|7472.0|MA |
+---------+----+------+-----------------+
Using sum(when(..)):
df2.show(false)
df2.printSchema()
/**
  * +------------+----+-----+-----------------+
  * |parcel_id   |year|sqft |imprv_det_type_cd|
  * +------------+----+-----+-----------------+
  * |000000100010|2014|4272 |MA               |
  * |000000100010|2014|800  |60P              |
  * |000000100010|2014|3200 |MA2              |
  * |000000100010|2014|1620 |49R              |
  * |000000100010|2014|1446 |46R              |
  * |000000100010|2014|40140|45B              |
  * |000000100010|2014|1800 |45C              |
  * |000000100010|2014|864  |49C              |
  * |000000100010|2014|1    |48S              |
  * +------------+----+-----+-----------------+
  *
  * root
  * |-- parcel_id: string (nullable = true)
  * |-- year: string (nullable = true)
  * |-- sqft: string (nullable = true)
  * |-- imprv_det_type_cd: string (nullable = true)
  */
val p = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
  .agg(
    sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft")).as("structure_total_sqft"),
    first("imprv_det_type_cd").as("imprv_det_type_cd"),
    first($"year").as("year_built")
  )
p.show(false)
p.explain()
/**
  * +---------+--------------------+-----------------+----------+
  * |parcel_id|structure_total_sqft|imprv_det_type_cd|year_built|
  * +---------+--------------------+-----------------+----------+
  * |100010   |7472.0              |MA               |2014      |
  * +---------+--------------------+-----------------+----------+
  */