Python 2.7: when percentile_approx (with a groupby) returns a single value for one column, how do I select the corresponding value from another column?

Tags: python-2.7, pyspark, pyspark-sql

I am new to pyspark and need a little clarification. I have a PySpark table like this:

+---+-------+-----+-------+
| id| ranges|score|    uom|
+---+-------+-----+-------+
|  1|    low|   20|percent|
|  1|verylow|   10|percent|
|  1|   high|   70|  bytes|
|  1| medium|   40|percent|
|  1|   high|   60|percent|
|  1|verylow|   10|percent|
|  1|   high|   70|percent|
+---+-------+-----+-------+
I want to compute the percentile value of the score column at the 0.95 percentage point, and at the same time I want it to also return the corresponding ranges value. I tried running this query:

results = spark.sql('select percentile_approx(score,0.95) as score, first(ranges)  from subset GROUP BY id')
The result I get is as follows:

+-----+--------------------+
|score|first(ranges, false)|
+-----+--------------------+
|   70|                 low|
+-----+--------------------+
The ranges value it returns is incorrect; it should be "high". If I remove first() from around ranges in the query, I get the following error:

> pyspark.sql.utils.AnalysisException: u"expression 'subset.`ranges`' is
> neither present in the group by, nor is it an aggregate function. Add
> to group by or wrap in first() (or first_value) if you don't care
> which value you get.;;\nAggregate [id#0L],
> [percentile_approx(score#2L, cast(0.95 as double), 10000, 0, 0) AS
> score#353L, ranges#1]\n+- SubqueryAlias subset\n   +- LogicalRDD
> [id#0L, ranges#1, score#2L, uom#3], false\n

This is because you are grouping only by id, so using the first function effectively picks an arbitrary value from the ranges column.

One solution is to create a second dataframe that holds the score-to-ranges mapping, and then join it back onto the results df at the end:

>>> df.registerTempTable("df") # Register first before selecting from 'df'
>>> map = spark.sql('select ranges, score from df')

>>> results = spark.sql('select percentile_approx(score,0.95) as score from subset GROUP BY id')

>>> results .registerTempTable("results ") 
>>> final_result = spark.sql('select r.score, m.ranges from results as r join map as m on r.score = m.score')
You need to use a window function - take a look at that approach.
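A minimal sketch of that window-function approach, assuming the question's table is still registered as subset and that your Spark version accepts percentile_approx as a window aggregate (recent versions do; very old 2.x releases may not): compute the approximate 95th-percentile score per id as a window column, then keep only the rows whose score equals it, so the matching ranges value comes along with it.

>>> # compute the per-id approximate 95th-percentile score as a window column,
>>> # then filter to the rows whose score matches it
>>> results = spark.sql("""
...     select id, ranges, score
...     from (
...         select *,
...                percentile_approx(score, 0.95) over (partition by id) as p95
...         from subset
...     ) t
...     where score = p95
... """)

Because percentile_approx returns an actual value observed in the data, the equality filter is guaranteed to match at least one row per id; if several rows share that score, you will get all of them.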