Apache Spark: read data from a CSV file and compute an average


First, I need to compute the average of a column from a CSV file using Python and Spark.

I have this code:

from pyspark.sql import SparkSession

scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example: Reading CSV file without mentioning schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("document.csv", header=True, sep=",")
sdfData.show()
Then I see the following data on the screen:

   +---------+------+---------+------------------+
   |     Name| total| test val|             ratio|
   +---------+------+---------+------------------+
   |parimatch|     3|   test7 |0.6164045285312666|
   |parimatch|     4|   test6 |0.5829715240832467|
   |     leon|     3|   test5 |0.6164045285312666|
   |     leon|     4|   test4 |0.5829715240832467|
   |parimatch|     3|   test3 |0.6164045285312666|
   |parimatch|     4|    test |0.5829715240832467|
   +---------+------+---------+------------------+

How can I compute the average of the ratio column with Spark?

Apache Spark has an `avg` aggregate function for exactly this:

import pyspark.sql.functions as f

# "ratio" was read as a string (the CSV was loaded without inferSchema),
# so cast it to double before averaging
average = sdfData.agg(f.avg(f.col("ratio").cast("double")).alias("avg_ratio"))
average.show()
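As a quick sanity check, independent of Spark, the average of the six `ratio` values shown in the table above can be computed with Python's standard library; the `avg` aggregate should return the same number:

```python
from statistics import mean

# The six "ratio" values from the DataFrame shown above
ratios = [
    0.6164045285312666,
    0.5829715240832467,
    0.6164045285312666,
    0.5829715240832467,
    0.6164045285312666,
    0.5829715240832467,
]

avg_ratio = mean(ratios)
print(avg_ratio)  # ≈ 0.5997
```

Since each distinct value appears three times, this is just the midpoint of the two values, roughly 0.5997.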