Scala: Applying a function to all columns in Spark

Tags: scala, apache-spark, apache-spark-sql

I have written the code below. My first question is about casting data types: how can I cast all of the columns contained in the dataset, except the timestamp column, in one go? The other question is how to apply the avg function to all columns except the timestamp column. Many thanks.

val df = spark.read.option("header", true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest.csv")
val result = df.withColumn("new_time", ((unix_timestamp(col("time")) / 300).cast("long") * 300).cast("timestamp"))
result("value").cast("float") // here the first question
val finalresult = result.groupBy("new_time").agg(avg("value")).sort("new_time") // here the second question, about avg
finalresult.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("C:/mydata.csv")
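
Note that result("value").cast("float") on its own only builds a new Column expression; it does not change result unless the column is attached back, for example with withColumn. A minimal sketch for that single column (resultCast is just an illustrative name):

// apply the cast by re-attaching the column (assumes the column is named "value")
val resultCast = result.withColumn("value", result("value").cast("float"))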

This is easy to achieve in pyspark, but I had trouble trying to rewrite it as Scala code... I hope you can manage the translation.

from pyspark.sql.functions import *
df = spark.createDataFrame([(100, "4.5", "5.6")], ["new_time", "col1", "col2"])
columns = [col(c).cast('float') if c != 'new_time' else col(c) for c in df.columns]
aggs = [avg(c) for c in df.columns if c != 'new_time']
finalresult = df.select(columns).groupBy('new_time').agg(*aggs)
finalresult.explain()

*HashAggregate(keys=[new_time#0L], functions=[avg(cast(col1#14 as double)), avg(cast(col2#15 as double))])
+- Exchange hashpartitioning(new_time#0L, 200)
   +- *HashAggregate(keys=[new_time#0L], functions=[partial_avg(cast(col1#14 as double)), partial_avg(cast(col2#15 as double))])
      +- *Project [new_time#0L, cast(col1#1 as float) AS col1#14, cast(col2#2 as float) AS col2#15]
         +- Scan ExistingRDD[new_time#0L,col1#1,col2#2]
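
A rough Scala sketch of the same approach, assuming Spark 2.x and that the grouping column is named new_time as in the snippet above (df stands for a DataFrame that already has new_time; in the original code that would be result):

import org.apache.spark.sql.functions.{avg, col}

// Cast every column except new_time to float; leave new_time untouched
val castCols = df.columns.map {
  case "new_time" => col("new_time")
  case c          => col(c).cast("float")
}

// Build one avg(...) expression per column except new_time
val aggs = df.columns.filter(_ != "new_time").map(c => avg(c))

// Assumes there is at least one column besides new_time
val finalresult = df.select(castCols: _*)
  .groupBy("new_time")
  .agg(aggs.head, aggs.tail: _*)

finalresult.explain()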

Can't you just add a withColumn for every column you want to cast, and as many avg expressions in agg as there are columns?

@Mariusz The problem is that the dataset is very large and has a lot of columns; I just want something that handles all of the columns except the time column.