Python Spark DataFrame: get the average for combinations of two columns


python apache-spark pyspark


How can I get the average price for each combination of two columns?

My DataFrame:

relevantTable = df.select(df['Price'], df['B'], df['A'])

It looks like this:

+-------+------------+------------------+
|  Price|           B|                 A|
+-------+------------+------------------+
| 0.2947|   i3.xlarge|                 x|
|  0.105|    c4.large|                 x|
| 0.2179|   m4.xlarge|                 x|
| 2.2534| m4.10xlarge|                 x|
| 2.1801| m4.10xlarge|                 x|
|  0.108|    r4.large|                 x|
|  0.108|    r4.large|                 x|
| 0.0213|    i3.large|                 y|
| 0.5572|  i2.4xlarge|                 y|
| 0.1542|  c4.4xlarge|                 y|
| 0.3624| m4.10xlarge|                 y|
| 0.3596| m4.10xlarge|                 y|
|   0.11|    m4.large|                 x|
| 0.4436|  m4.2xlarge|                 x|
| 0.1458|  m4.2xlarge|                 y|

... and so on; the real data set is huge.

What is a simple and scalable way to get the average price for every combination of A and B?

How about:

df.groupBy("A", "B").avg("Price")

Or, if you also want the aggregates by each single column included:

df.cube("A", "B").avg("Price")
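To make the difference between the two calls concrete, here is a minimal plain-Python sketch (no Spark involved) of what each one computes, using a few rows copied from the table in the question. The helper names `group_avg` and `cube_avg` are made up for illustration and are not part of PySpark. The one Spark behaviour it mirrors is that `cube` marks a rolled-up column with null, shown here as `None`.

```python
from collections import defaultdict
from itertools import product

# A few (Price, B, A) rows copied from the table in the question.
rows = [
    (2.2534, "m4.10xlarge", "x"),
    (2.1801, "m4.10xlarge", "x"),
    (0.108, "r4.large", "x"),
    (0.108, "r4.large", "x"),
    (0.3624, "m4.10xlarge", "y"),
    (0.3596, "m4.10xlarge", "y"),
]


def group_avg(rows):
    """Average Price per (A, B) pair, like groupBy("A", "B").avg("Price")."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for price, b, a in rows:
        acc = totals[(a, b)]
        acc[0] += price
        acc[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


def cube_avg(rows):
    """Like cube("A", "B").avg("Price"); None marks a rolled-up column."""
    totals = defaultdict(lambda: [0.0, 0])
    for price, b, a in rows:
        # Each row feeds four groups: (a, b), (a, None), (None, b), (None, None).
        for key in product((a, None), (b, None)):
            acc = totals[key]
            acc[0] += price
            acc[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


pair_avg = group_avg(rows)
cube = cube_avg(rows)

for key in sorted(pair_avg):
    print(key, round(pair_avg[key], 5))
```

So `groupBy` returns one row per (A, B) pair that actually occurs, while `cube` returns those same rows plus the averages per A alone, per B alone, and over the whole table.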


So what I tried was .reduce and .reduceByKey, but I think I was doing something wrong there :/

Wow, that was too easy... Damn, I have to learn more about this framework. Thank you very much!