PySpark groupBy with aggregated values rounded to 2 decimal places
I need to apply a groupBy with avg and round the aggregated values to 2 decimal places. My DataFrame:
df = spark.createDataFrame([
    ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 80.65, "abc"),
    ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 100, "abc"),
    ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 65, "def"),
    ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 78.02, "def")
]).toDF("date", "percent", "device")
I am facing an exception with the following schema and aggregation:
from pyspark.sql.types import StructType, StructField, StringType, FloatType

schema = StructType([
    StructField('date', StringType(), True),
    StructField('percent', FloatType(), True),
    StructField('device', StringType(), True)
])
from pyspark.sql.functions import round, mean

# Note: the alias here is applied inside round(), so it never names the
# final column; the answer below moves it outside.
df.groupBy("device").agg(round(mean("percent").alias("y"), 2))
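Based on the comments below, the exception most likely comes from schema inference in createDataFrame rather than from the aggregation itself: 80.65 infers DoubleType while the bare 100 and 65 infer LongType, and PySpark cannot merge the two. A minimal sketch of that failure mode (an assumption, not taken from the original post):

# Assumed repro of the question's exception: mixing Python ints and floats
# in one inferred column typically raises
#   TypeError: ... Can not merge type LongType and DoubleType
spark.createDataFrame([
    (80.65,),   # inferred as DoubleType
    (100,),     # inferred as LongType, so the merge fails
]).toDF("percent")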
The following works once the integer values are written as floats and the alias is moved outside round():
>>> df = sqlContext.createDataFrame([
... ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 80.65,"abc"),
... ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 100.00,"abc"),
... ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 65.00,"def"),
... ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 78.02,"def")
... ]).toDF("date", "percent","device")
>>> schema = StructType([
... StructField('date', StringType(), True),
... StructField('percent', FloatType(), True),
... StructField('device', StringType(), True)
... ])
>>> df.groupBy("device").agg(round(mean("percent"),2).alias("y")).show()
+------+--------+
|device| y|
+------+--------+
| def| 71.51|
| abc| 90.33|
+------+--------+
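As a side note (not part of the original answer): if two decimals are only needed for display, format_number is an alternative to round, though it returns a string column rather than a numeric one. A sketch using the df from the answer above:

from pyspark.sql.functions import format_number, mean

# Aggregate first, then format the result to exactly 2 decimal places.
# format_number yields a StringType column, so use it only for display.
df.groupBy("device") \
  .agg(mean("percent").alias("y")) \
  .select("device", format_number("y", 2).alias("y")) \
  .show()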
Comments:

Can you change 100 to 100.00 and 65 to 65.00 and re-run?
(OP) Here I can change them, but these values are dynamic. How do I change those?
Load them as strings and cast them to float/double.
(OP) Yeah, I cast them with float(val); when the query is applied, round faces the same exception.
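A minimal sketch of the cast-from-string approach suggested in the comments (the data here is illustrative): loading the dynamic values as strings keeps schema inference uniform, and a single column-level cast converts them afterwards, avoiding per-row Python conversion.

from pyspark.sql.functions import col, mean, round

# All values arrive as strings, so inference sees one consistent type.
df_str = spark.createDataFrame([
    ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", "80.65", "abc"),
    ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", "100", "abc")
], ["date", "percent", "device"])

# Cast the whole column to double, then aggregate as in the answer above.
df_cast = df_str.withColumn("percent", col("percent").cast("double"))
df_cast.groupBy("device").agg(round(mean("percent"), 2).alias("y")).show()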
If you want to use a UDF, do the following:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("float", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    # v is a pandas Series; round here must be Python's builtin round,
    # not pyspark.sql.functions.round.
    return round(v.mean(), 2)

spark.udf.register("mean_udf", mean_udf)
dfStackOverflow = spark.createDataFrame([
    ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 80.65, "abc"),
    ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 100.00, "abc"),
    ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 65.00, "def"),
    ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 78.02, "def")
], schema=StructType([
    StructField('date', StringType(), True),
    StructField('percent', FloatType(), True),
    StructField('device', StringType(), True)
]))
dfStackOverflow.groupBy("device").agg({"percent":"mean_udf"}).show()
+------+-----------------+
|device|mean_udf(percent)|
+------+-----------------+
| abc| 90.32|
| def| 71.51|
+------+-----------------+
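Two remarks on this output. First, abc shows 90.32 here versus 90.33 above: Spark's round uses HALF_UP, while the Python-side round in the UDF is subject to binary floating point (round(90.325, 2) commonly gives 90.32). Second, the UDF can also be invoked as a column expression instead of through the string-keyed dict, which lets you alias the result (a sketch):

# Call the grouped-agg pandas UDF directly so the output column can be aliased.
dfStackOverflow.groupBy("device").agg(
    mean_udf(dfStackOverflow["percent"]).alias("y")
).show()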