Aggregation and week of year in a PySpark DataFrame

apache-spark pyspark

I have a DataFrame with the following schema:

root
 |-- device_id: string (nullable = true)
 |-- eventName: string (nullable = true)
 |-- client_event_time: timestamp (nullable = true)
 |-- eventDate: date (nullable = true)
 |-- deviceType: string (nullable = true)

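I want to add the following two columns to this DataFrame:

WAU: the number of weekly active users (distinct device IDs, grouped by week)

week: the week of the year (this needs the appropriate SQL function)

I want to use an approximate count, with the optional keyword rsd set to .01.

I started with something like the following, but it raised errors: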
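To make the meaning of the week column concrete, here is a minimal plain-Python sketch, not Spark code. It assumes the week number follows ISO 8601 (weeks start on Monday, and week 1 is the first week with more than three days), which is how the Spark documentation describes weekofyear. The helper names week_of_year and year_and_week are made up for this illustration.

```python
from datetime import date
from typing import Tuple


def week_of_year(d: date) -> int:
    """ISO 8601 week number of a date (weeks start on Monday)."""
    return d.isocalendar()[1]


def year_and_week(d: date) -> Tuple[int, int]:
    """ISO year together with the ISO week number."""
    iso = d.isocalendar()
    return (iso[0], iso[1])


if __name__ == "__main__":
    samples = [
        date(2020, 1, 6),
        date(2021, 1, 3),
        date(2021, 1, 4),
        date(2021, 1, 11),
    ]
    for d in samples:
        print(d, week_of_year(d), year_and_week(d))
```

One thing this shows: the week number alone repeats every year, so if the data spans more than one year, grouping only by week would merge the same week of different years. Grouping by year and week together avoids that.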

spark.readStream
.format("delta")
.load(inputpath)
.groupBy(weekofyear('eventDate'))
.count()
.distinct()
.writeStream
.format("delta")
.option("checkpointLocation", outputpath)
.outputMode("complete")
.start(outputpath)

Following the discussion, the code below works:

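from pyspark.sql import functions as F

# Weekly active users: approximate distinct device IDs per week of the year.
(spark.readStream
  .format("delta")
  .load(inputdata)
  .groupBy(F.weekofyear('eventDate').alias('week'))
  # alias() has to be applied to the aggregate column, inside agg()
  .agg(F.approx_count_distinct('device_id', rsd=.01).alias('WAU'))
  .writeStream
  .format("delta")
  .option("checkpointLocation", outputdata)
  .outputMode("complete")
  .start(outputdata))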

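What error did you get?

AnalysisException: dropDuplicates is not supported after aggregation on a streaming DataFrame/Dataset. However, I want a distinct count of device IDs.

Another error is org.apache.spark.sql.AnalysisException: Attribute name "weekofyear(eventDate)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.

The second error is obviously easy to fix, isn't it? Just use an alias, as the error message suggests.

I still don't understand.

You call count() after groupBy(), but that is wrong. It creates one group per week and then only counts the rows in each group, not the distinct device IDs. After groupBy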
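The difference between counting rows per weekly group and counting distinct devices per weekly group can be sketched in plain Python. This is only an illustration with made-up sample data and hypothetical helper names (rows_per_week, distinct_devices_per_week). It computes exact counts, whereas approx_count_distinct in Spark returns an estimate whose maximum relative standard deviation is controlled by rsd.

```python
from collections import defaultdict
from datetime import date

# Made-up events: (device_id, eventDate)
events = [
    ("dev-a", date(2021, 3, 1)),
    ("dev-a", date(2021, 3, 2)),
    ("dev-b", date(2021, 3, 3)),
    ("dev-a", date(2021, 3, 8)),
]


def rows_per_week(rows):
    """What groupBy(week).count() yields: the number of rows in each week."""
    counts = defaultdict(int)
    for _device, day in rows:
        counts[day.isocalendar()[1]] += 1
    return dict(counts)


def distinct_devices_per_week(rows):
    """What WAU needs: the number of distinct device IDs in each week."""
    seen = defaultdict(set)
    for device, day in rows:
        seen[day.isocalendar()[1]].add(device)
    return {week: len(devices) for week, devices in seen.items()}


if __name__ == "__main__":
    print(rows_per_week(events))
    print(distinct_devices_per_week(events))
```

In the sample data, dev-a is active twice in the same week, so the row count for that week is higher than the number of active devices.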