Common table expressions (CTEs) in Databricks and Spark

Tags: apache-spark, apache-spark-sql, common-table-expression, databricks

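I have a Spark DataFrame in Databricks, and I am trying to run some SQL queries against it using common table expressions (CTEs). Here are the first 10 rows of the data: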

+----------+----------+------+---+---+---------+-----------------+
| data_date|   user_id|region|sex|age|age_group|sum(duration_min)|
+----------+----------+------+---+---+---------+-----------------+
|2020-01-01|22600560aa|     1|  1| 28|        2|              0.0|
|2020-01-01|17148900ab|     6|  2| 60|        5|           1138.0|
|2020-01-01|21900230aa|     5|  1| 43|        4|              0.0|
|2020-01-01|35900050ac|     8|  1| 16|        1|            224.0|
|2020-01-01|22300280ad|     6|  2| 44|        4|              8.0|
|2020-01-02|19702160ac|     2|  2| 55|        5|              0.0|
|2020-02-02|17900020aa|     5|  2| 64|        5|            264.0|
|2020-02-02|16900120aa|     3|  1| 69|        6|              0.0|
|2020-02-02|11160900aa|     6|  2| 52|        5|              0.0|
|2020-03-02|16900290aa|     5|  1| 37|        3|              0.0|
+----------+----------+------+---+---+---------+-----------------+
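Here, I store each user's registration date in the regs CTE and then count the number of registrations per month. This block, with a single CTE, runs in Databricks without any problems: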

%sql


    WITH regs AS (
      SELECT
        user_id,
        MIN(data_date) AS reg_date
      FROM df2
      GROUP BY user_id)
    
    SELECT
      month(reg_date)  AS reg_month,
      COUNT(DISTINCT user_id) AS users
    FROM regs
    GROUP BY reg_month
    ORDER BY reg_month ASC;
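However, when I add a second CTE to the previous query, it returns an error. (I tested this block in SQL Server and it worked fine.) I don't understand why it does not work in Spark on Databricks.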

%sql

WITH regs AS (
  SELECT
    user_id,
    MIN(data_date) AS reg_date
  FROM df2
  GROUP BY user_id
  ),

  regs_per_month AS (
    SELECT
      month(reg_date) AS reg_month,
      COUNT(DISTINCT user_id) AS users
    FROM regs
    GROUP BY reg_month
  )

SELECT
  reg_month,
  users,
  LAG(users, 1) OVER (ORDER BY regs_per_month ASC) AS previous_users
FROM regs_per_month
ORDER BY reg_month ASC;
This is the error message:

Error in SQL statement: AnalysisException: cannot resolve '`regs_per_month`' given input columns: [regs_per_month.reg_month, regs_per_month.users]; line 20 pos 31;
'Sort ['reg_month ASC NULLS FIRST], true

You can define several common table expressions (CTEs) in one Spark SQL statement simply by separating them with commas, e.g.

%sql
;WITH regs AS (
SELECT
  user_id,
  MIN(data_date) AS reg_date
FROM df2
GROUP BY user_id
),
regs_per_month AS (
SELECT
  month(reg_date) AS reg_month,
  COUNT(DISTINCT user_id) AS users
FROM regs
GROUP BY reg_month
)
SELECT
  reg_month,
  users,
  LAG(users, 1) OVER (ORDER BY reg_month ASC) AS previous_users
FROM regs_per_month
ORDER BY reg_month ASC;
As mentioned, your LAG expression should reference the reg_month column, not the regs_per_month CTE.
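To make the fix concrete, here is a minimal sketch that runs the corrected query against the ten sample rows from the question. It is not Spark: it uses Python's built-in sqlite3 module as a stand-in, on the assumption that the bundled SQLite is 3.25 or newer (needed for window functions). SQLite has no month() function, so strftime('%m', ...) is swapped in for it; everything else keeps the shape of the query above.

```python
import sqlite3

# (data_date, user_id) pairs copied from the ten sample rows in the question.
rows = [
    ("2020-01-01", "22600560aa"),
    ("2020-01-01", "17148900ab"),
    ("2020-01-01", "21900230aa"),
    ("2020-01-01", "35900050ac"),
    ("2020-01-01", "22300280ad"),
    ("2020-01-02", "19702160ac"),
    ("2020-02-02", "17900020aa"),
    ("2020-02-02", "16900120aa"),
    ("2020-02-02", "11160900aa"),
    ("2020-03-02", "16900290aa"),
]

# In-memory SQLite table standing in for the df2 view (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df2 (data_date TEXT, user_id TEXT)")
conn.executemany("INSERT INTO df2 VALUES (?, ?)", rows)

# Two comma-separated CTEs; the window ORDER BY names the reg_month column.
query = """
WITH regs AS (
  SELECT user_id, MIN(data_date) AS reg_date
  FROM df2
  GROUP BY user_id
),
regs_per_month AS (
  SELECT
    CAST(strftime('%m', reg_date) AS INTEGER) AS reg_month,  -- stand-in for month()
    COUNT(DISTINCT user_id) AS users
  FROM regs
  GROUP BY reg_month
)
SELECT
  reg_month,
  users,
  LAG(users, 1) OVER (ORDER BY reg_month ASC) AS previous_users
FROM regs_per_month
ORDER BY reg_month ASC
"""

result = conn.execute(query).fetchall()
print(result)  # [(1, 6, None), (2, 3, 6), (3, 1, 3)]
```

On this sample, January has six registrations, February three and March one, so previous_users is empty for the first month and then trails users by one row.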

As an alternative to the comma-separated form, you can nest one WITH clause inside another CTE definition, e.g.

%sql
;WITH regs_per_month AS ( 
  WITH regs AS ( 
  SELECT
    user_id,
    MIN(data_date) AS reg_date
  FROM df2
  GROUP BY user_id
  )
  SELECT 
    month(reg_date) AS reg_month,
    COUNT(DISTINCT user_id) AS users
  FROM regs
  GROUP BY reg_month
)
SELECT 
  reg_month, 
  users,
  LAG( users, 1 ) OVER ( ORDER BY reg_month ASC ) AS previous_users
FROM regs_per_month
ORDER BY reg_month ASC;
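
As a rough check that the comma-separated form and the nested form are interchangeable, the sketch below runs both against the same small table and compares the rows. As before, this uses Python's sqlite3 module as a stand-in for Spark SQL (assuming SQLite 3.25 or newer), with strftime('%m', ...) substituted for month(); the four rows are a subset of the question's sample data.

```python
import sqlite3

# In-memory SQLite table standing in for df2 (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df2 (data_date TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO df2 VALUES (?, ?)",
    [
        ("2020-01-01", "22600560aa"),
        ("2020-01-02", "19702160ac"),
        ("2020-02-02", "17900020aa"),
        ("2020-03-02", "16900290aa"),
    ],
)

# SQLite has no month(); this expression is a stand-in for it.
MONTH = "CAST(strftime('%m', reg_date) AS INTEGER)"

# The outer SELECT is identical in both variants.
final_select = """
SELECT reg_month, users,
       LAG(users, 1) OVER (ORDER BY reg_month ASC) AS previous_users
FROM regs_per_month
ORDER BY reg_month ASC
"""

# Variant 1: two CTEs separated by a comma.
flat = f"""
WITH regs AS (
  SELECT user_id, MIN(data_date) AS reg_date FROM df2 GROUP BY user_id
),
regs_per_month AS (
  SELECT {MONTH} AS reg_month, COUNT(DISTINCT user_id) AS users
  FROM regs GROUP BY reg_month
)
{final_select}
"""

# Variant 2: the regs CTE is defined inside the regs_per_month CTE.
nested = f"""
WITH regs_per_month AS (
  WITH regs AS (
    SELECT user_id, MIN(data_date) AS reg_date FROM df2 GROUP BY user_id
  )
  SELECT {MONTH} AS reg_month, COUNT(DISTINCT user_id) AS users
  FROM regs GROUP BY reg_month
)
{final_select}
"""

flat_rows = conn.execute(flat).fetchall()
nested_rows = conn.execute(nested).fetchall()
print(flat_rows == nested_rows)  # True
```

In the nested form regs is visible only inside regs_per_month, which can be useful when the helper CTE should not be reachable from the outer query.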

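Is another WITH needed?
I tried adding another WITH as well, but it didn't work. As I said, the current script runs on SQL Server without the "cannot resolve" error.
Look at your ... OVER (ORDER BY regs_per_month ASC) ... clause: it references a column named regs_per_month, which does not exist in the regs_per_month CTE.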
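
The failure described in that last comment can be reproduced outside Spark. The sketch below is only an illustration using Python's sqlite3 module (assuming SQLite 3.25 or newer for window functions), not Databricks itself, so the error text differs from Spark's AnalysisException. The cause is the same, though: the window ORDER BY names the CTE rather than one of its columns.

```python
import sqlite3

# In-memory SQLite table standing in for df2 (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df2 (data_date TEXT, user_id TEXT)")
conn.execute("INSERT INTO df2 VALUES ('2020-01-01', '22600560aa')")

# Same mistake as in the question: ORDER BY regs_per_month names the CTE,
# but the CTE only exposes the columns reg_month and users.
bad_query = """
WITH regs AS (
  SELECT user_id, MIN(data_date) AS reg_date FROM df2 GROUP BY user_id
),
regs_per_month AS (
  SELECT CAST(strftime('%m', reg_date) AS INTEGER) AS reg_month,
         COUNT(DISTINCT user_id) AS users
  FROM regs GROUP BY reg_month
)
SELECT reg_month, users,
       LAG(users, 1) OVER (ORDER BY regs_per_month ASC) AS previous_users
FROM regs_per_month
"""

try:
    conn.execute(bad_query).fetchall()
    error = None
except sqlite3.OperationalError as exc:
    # SQLite rejects the statement because the identifier is not a column.
    error = str(exc)

print(error)
```

Changing the window clause to ORDER BY reg_month makes the same statement run.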