Malformed SQL in Spark SQL: mismatched columns appear in the same row


I have a Spark SQL query like the one below. The goal is to list every name in MY_TABLE whose count crossed from below 10 to at least 10, or from below 100 to at least 100, between the previous day and today:

SELECT DISTINCT d2.name as NAME, d2.region_id as REGION_ID, d2.count as COUNT,
   date_format(to_timestamp(d2.VERSION_TIME, "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") as VERSION_TIME
FROM MY_TABLE AS d1
RIGHT JOIN MY_TABLE AS d2
   ON d1.VERSION_TIME = d2.VERSION_TIME - interval 1 day    
   AND d1.name = d2.name
   AND d1.region_id = d2.region_id
   AND d2.region_id = ${region_id}
WHERE (coalesce(d1.count, 0) < 10 AND d2.count >= 10
       OR coalesce(d1.count, 0) < 100 AND d2.count >= 100
      )
   AND d1.VERSION_TIME BETWEEN cast("${date}" as timestamp) - INTERVAL 8 DAYS AND cast("${date}" as timestamp) - INTERVAL 1 DAYS
   AND d2.VERSION_TIME BETWEEN cast("${date}" as timestamp) - INTERVAL 7 DAYS AND cast("${date}" as timestamp)
   AND d1.region_id = d2.region_id
   AND d2.region_id = ${region_id}
;
At first glance it looks fine, but in reality abc does not exist at all in region ID 5, and neither do def or ghi, yet they show up in the result.


I can't figure out why this is happening; any insight would be much appreciated.
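For reference, the day-over-day self-join the query above intends can be sketched with sqlite3 as a small stand-in for Spark SQL. The table name matches the question; the sample rows and dates are invented for illustration, and a plain inner JOIN is used since the RIGHT JOIN plus the WHERE filter effectively behaves as one.

```python
# Runnable sketch of the intended self-join logic, using sqlite3 as a
# stand-in for Spark SQL. Sample rows are made up for the demo.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MY_TABLE (name TEXT, region_id INT, count INT, version_time TEXT);
INSERT INTO MY_TABLE VALUES
  ('abc', 5, 5,  '2020-01-01'),
  ('abc', 5, 12, '2020-01-02'),  -- crosses the >= 10 threshold day over day
  ('def', 5, 50, '2020-01-01'),
  ('def', 5, 60, '2020-01-02');  -- stays below 100: no threshold crossed
""")

rows = conn.execute("""
SELECT d2.name, d2.count
FROM MY_TABLE d1
JOIN MY_TABLE d2
  ON d1.version_time = date(d2.version_time, '-1 day')  -- previous day
 AND d1.name = d2.name
 AND d1.region_id = d2.region_id
WHERE (coalesce(d1.count, 0) < 10  AND d2.count >= 10)
   OR (coalesce(d1.count, 0) < 100 AND d2.count >= 100)
""").fetchall()
print(rows)  # only abc crossed a threshold between the two days
```

When the join keys and date arithmetic are correct, only rows that actually crossed a threshold survive; if unrelated names leak through, the join or filter conditions are the first place to look.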

Use window functions. If I'm following the logic correctly:

SELECT t.*
FROM (SELECT t.*,
             LAG(count) OVER (PARTITION BY name, region_id ORDER BY version_time) as prev_count
      FROM MY_TABLE t
     ) t
WHERE count BETWEEN prev_count - 9 AND prev_count + 9 OR
      count > prev_count + 100 OR
      count < prev_count - 100;
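The LAG mechanics the answer relies on can be sketched with sqlite3 (window functions need SQLite 3.25+, which recent Python builds bundle). The table name follows the answer; the rows are invented for illustration.

```python
# Minimal sketch of LAG(count) OVER (PARTITION BY ... ORDER BY ...):
# each row carries the previous row's count within its partition.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MY_TABLE (name TEXT, region_id INT, count INT, version_time TEXT);
INSERT INTO MY_TABLE VALUES
  ('abc', 5, 5,   '2020-01-01'),
  ('abc', 5, 12,  '2020-01-02'),
  ('abc', 5, 200, '2020-01-03');
""")

rows = conn.execute("""
SELECT name, version_time, count,
       LAG(count) OVER (PARTITION BY name, region_id
                        ORDER BY version_time) AS prev_count
FROM MY_TABLE
ORDER BY version_time
""").fetchall()
for r in rows:
    print(r)
# prev_count is NULL (None) on each partition's first row, then the
# previous row's count - exactly what the outer WHERE filters on.
```

This avoids the self-join entirely: one scan of the table produces each day's count next to the previous day's, and the threshold check becomes a simple WHERE on the subquery.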

Please explain the logic you are trying to simplify.

Good idea! Just did that, thanks @Gordonlinoff

What does the data look like? One row per name per day, or multiple rows per day?

Side note - you have redundant criteria. For example, you don't need d1.name = d2.name in both the join condition and the WHERE clause. I know nothing about Spark, but I don't think the query shows "abc doesn't exist in region_id 5" the way you describe it - presumably you mean there are no rows in MY_TABLE with name = 'abc' and region_id = 5.

Thanks! What about these two conditions: d1.VERSION_TIME BETWEEN cast("${date}" as timestamp) - INTERVAL 8 DAYS AND cast("${date}" as timestamp) - INTERVAL 1 DAYS, and d2.VERSION_TIME BETWEEN cast("${date}" as timestamp) - INTERVAL 7 DAYS AND cast("${date}" as timestamp)?

If you want to add the date filters, add them to the subquery:
SELECT t.*
FROM (SELECT t.*,
             LAG(count) OVER (PARTITION BY name, region_id ORDER BY version_time) as prev_count
      FROM MY_TABLE t
      WHERE t.version_time BETWEEN cast("${date}" as timestamp) - INTERVAL 8 DAYS
                               AND cast("${date}" as timestamp)
     ) t
WHERE count BETWEEN prev_count - 9 AND prev_count + 9 OR
      count > prev_count + 100 OR
      count < prev_count - 100;