Pyspark - average number of days per year and month


I have a CSV file stored in HDFS, in the following format:

Business Line,Requisition (Job Title),Year,Month,Actual (# of Days)
Communications,1012_Com_Specialist,2017,February,150
Information Technology,5781_Programmer_Associate,2017,March,80
Information Technology,2497_Programmer_Senior,2017,March,120
Services,6871_Business_Analyst_Jr,2018,May,33

I want to get the average of the actual number of days per year and month. Can someone help me with how to do this in Pyspark and save the output to a Parquet file?

You can convert the CSV to a DataFrame and run Spark SQL on it, as shown below (this example is in Scala):

// csvRDD is assumed to be an RDD[String] of the data rows (header line already removed)
import sqlContext.implicits._   // needed for the .toDF conversion

csvRDD.map(rec => {
  val i = rec.split(',')
  (i(0), i(1), i(2), i(3), i(4).toInt)
}).toDF("businessline", "jobtitle", "year", "month", "actual")
  .registerTempTable("input")

// Average the actual days per year and month, then write the result as Parquet
val resDF = sqlContext.sql("SELECT year, month, AVG(actual) AS avgactual FROM input GROUP BY year, month")
resDF.write.parquet("/user/path/solution1")
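
Since the question asks for Pyspark specifically, here is a rough equivalent using the DataFrame API. It is a minimal sketch assuming Spark 2.x with a SparkSession, and the input path /user/path/input.csv is a hypothetical placeholder for your HDFS location:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("avg-actual-days").getOrCreate()

# Read the CSV with its header row; inferSchema makes the day count numeric
df = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("/user/path/input.csv"))   # hypothetical HDFS path

# Rename the awkward header so it is easier to reference
df = df.withColumnRenamed("Actual (# of Days)", "actual")

# Average the actual days per year and month, then write the result as Parquet
res = df.groupBy("Year", "Month").agg(F.avg("actual").alias("avgactual"))
res.write.parquet("/user/path/solution1")

Grouping directly on the DataFrame avoids registering a temporary table, but the same aggregation could also be done with spark.sql after creating a temp view.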