Pyspark groupBy with aggregation: round values to 2 decimal places


I need to apply a groupBy with avg on

df = spark.createDataFrame([
    ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 80.65,"abc"), 
    ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 100,"abc"),
    ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 65,"def"), 
    ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 78.02,"def")
]).toDF("date", "percent","device")
and I am facing an exception when I run the following:

schema = StructType([
    StructField('date', StringType(), True),
    StructField('percent', FloatType(), True),
    StructField('device', StringType(), True)
])
df.groupBy("device").agg(round(mean("percent").alias("y"),2))
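For reference, the snippets in this post rely on a few imports that the original code omits; this is an assumption about the environment, shown only for completeness:

from pyspark.sql.types import StructType, StructField, StringType, FloatType
from pyspark.sql.functions import mean, round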

With the numeric literals written as floats (100.00 and 65.00 instead of the integers 100 and 65), the aggregation works:

>>> df = sqlContext.createDataFrame([
...     ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 80.65,"abc"), 
...     ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 100.00,"abc"),
...     ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 65.00,"def"), 
...     ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 78.02,"def")
... ]).toDF("date", "percent","device")
>>> schema = StructType([
...     StructField('date', StringType(), True),
...     StructField('percent', FloatType(), True),
...     StructField('device', StringType(), True)
... ]) 

>>> df.groupBy("device").agg(round(mean("percent"),2).alias("y")).show()
+------+--------+         
|device|       y|
+------+--------+
|   def|   71.51|
|   abc|   90.33|
+------+--------+
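As a side note, the question keeps the alias inside round, i.e. round(mean("percent").alias("y"), 2). Once the column types are consistent that expression should also evaluate, but the output column then carries the generated name round(y, 2) rather than y, which is why aliasing the whole expression, as above, is tidier. A hedged sketch of the question's variant:

>>> df.groupBy("device").agg(round(mean("percent").alias("y"), 2)).show()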

Comment exchange that followed:

- Could you change 100 to 100.00 and 65 to 65.00 and re-run? (Without a decimal point those values are presumably inferred as integers, so the percent column ends up with mixed types.)
- Here I can change them, but these values are dynamic. How do I change them in that case?
- Load them as strings and then cast them to float/double.
- Yeah, I cast them to float with float(val), but when I apply the query with round I still face the same exception.
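Following the cast suggestion above, one way to deal with dynamically arriving values is to load the percent column as strings and cast it to double before aggregating. A minimal sketch, assuming the rows arrive as tuples like those in the question and that spark is an existing SparkSession:

from pyspark.sql.functions import col, mean, round

# percent is loaded as a string here because the incoming values are dynamic
rows = [
    ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", "80.65", "abc"),
    ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", "100", "abc"),
    ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", "65", "def"),
    ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", "78.02", "def"),
]
df = spark.createDataFrame(rows, ["date", "percent", "device"])

# cast the whole column once, so every row has the same numeric type
df = df.withColumn("percent", col("percent").cast("double"))
df.groupBy("device").agg(round(mean("percent"), 2).alias("y")).show()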
If you want to use a UDF instead, you can do the following:
from pyspark.sql.functions import pandas_udf, PandasUDFType

# grouped-aggregate pandas UDF: v arrives as a pandas Series holding the group's values
@pandas_udf("float", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    return round(v.mean(), 2)
# register it by name so it can be referenced in the agg() dictionary below
spark.udf.register("mean_udf", mean_udf)

dfStackOverflow = spark.createDataFrame([
     ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 80.65,"abc"), 
     ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 100.00,"abc"),
     ("2017-Dec-08 00:00 - 2017-Dec-09 00:00", 65.00,"def"), 
     ("2017-Dec-09 00:00 - 2017-Dec-10 00:00", 78.02,"def")
 ], 
 schema = StructType([
     StructField('date', StringType(), True),
     StructField('percent', FloatType(), True),
     StructField('device', StringType(), True)
 ]))

dfStackOverflow.groupBy("device").agg({"percent":"mean_udf"}).show()

+------+-----------------+
|device|mean_udf(percent)|
+------+-----------------+
|   abc|            90.32|
|   def|            71.51|
+------+-----------------+
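Registering the UDF is only needed for the dictionary form of agg used above; a grouped-aggregate pandas UDF can also be called directly and aliased. A small sketch under the same setup:

dfStackOverflow.groupBy("device").agg(mean_udf("percent").alias("y")).show()

Note that the two results for abc differ slightly (90.33 from the built-in round versus 90.32 from the UDF), most likely because Spark's round rounds half up on the decimal value, while the Python/pandas round inside the UDF operates on the binary float for 90.325, which is stored just below that value and therefore rounds down.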