SQL: counting how many outliers there are in a group

Tags: sql, amazon-redshift

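I want to find outliers by counting, for each group, how many elements are greater than $\mu+\sigma$, $\mu+2\sigma$, and so on.

So far I have found a solution that first creates a table gp holding $\mu$ and $\sigma$ for each group: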

CREATE TABLE gp AS
SELECT col_a,
       col_b,
       AVG(y) AS y_mean,
       STDDEV(y) AS y_std
FROM my_table
GROUP BY col_a, col_b;
Then I LEFT JOIN gp to the original table and count the outliers with:

SELECT col_a,
       col_b,
       SUM(CASE
             WHEN y > y_mean + y_std THEN 1
             ELSE 0
           END) AS std1,
       SUM(CASE
             WHEN y > y_mean + 2 * y_std THEN 1
             ELSE 0
           END) AS std2,
       SUM(CASE
             WHEN y > y_mean + 3 * y_std THEN 1
             ELSE 0
           END) AS std3
FROM (
  -- attach each group's mean and standard deviation to every row
  SELECT a.col_a,
         a.col_b,
         a.y,
         b.y_mean,
         b.y_std
  FROM my_table a
  LEFT JOIN gp b
    ON a.col_a = b.col_a AND a.col_b = b.col_b
) j  -- the derived table needs an alias
GROUP BY col_a, col_b;
Is there a more efficient way to achieve the same result?

Use window functions:

SELECT col_a, col_b,
       SUM(CASE WHEN y > y_mean + y_std THEN 1 ELSE 0
           END) AS std1,
       SUM(CASE WHEN y > y_mean + 2 * y_std THEN 1 ELSE 0
           END) AS std2,
       SUM(CASE WHEN y > y_mean + 3 * y_std THEN 1 ELSE 0
           END) AS std3
FROM (SELECT t.*,
             AVG(y) OVER (PARTITION BY col_a, col_b) as y_mean,
             STDDEV(y) OVER (PARTITION BY col_a, col_b) as y_std
      FROM my_table t
     ) t
GROUP BY col_a, col_b;

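From a statistical point of view, you should look at the lower bound as well. And if the distribution is skewed only in the positive direction, the standard deviation may not be the best measure (although you do not have many alternatives when working in a database).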
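
As a rough illustration of the lower-bound remark, the window-function query above could be extended to count values on both sides of the group mean. This is only a sketch: it assumes the same my_table(col_a, col_b, y) as in the question, and the _hi and _lo column names are made up here, not part of the original answer.

```sql
-- Sketch: count values more than k standard deviations above (_hi)
-- and below (_lo) the group mean. Column names are illustrative only.
SELECT col_a, col_b,
       SUM(CASE WHEN y > y_mean + y_std     THEN 1 ELSE 0 END) AS std1_hi,
       SUM(CASE WHEN y < y_mean - y_std     THEN 1 ELSE 0 END) AS std1_lo,
       SUM(CASE WHEN y > y_mean + 2 * y_std THEN 1 ELSE 0 END) AS std2_hi,
       SUM(CASE WHEN y < y_mean - 2 * y_std THEN 1 ELSE 0 END) AS std2_lo,
       SUM(CASE WHEN y > y_mean + 3 * y_std THEN 1 ELSE 0 END) AS std3_hi,
       SUM(CASE WHEN y < y_mean - 3 * y_std THEN 1 ELSE 0 END) AS std3_lo
FROM (SELECT col_a, col_b, y,
             -- per-group mean and standard deviation, repeated on every row
             AVG(y)    OVER (PARTITION BY col_a, col_b) AS y_mean,
             STDDEV(y) OVER (PARTITION BY col_a, col_b) AS y_std
      FROM my_table
     ) s
GROUP BY col_a, col_b;
```

For a group with a single row the sample standard deviation is NULL, so every comparison is NULL and the CASE falls through to 0 for that group.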

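In similar situations, the mean absolute deviation has worked better for me. I will test this later. In this case y is never negative, and I need this count to know how many abnormal outliers I have. I could only run your query after removing t. It is faster than my solution. Accepted. I would still like more flexibility, but I will ask another question for that.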
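
The mean absolute deviation mentioned in the comments could be tried with the same pattern. The following is a sketch under assumptions, not something from the thread: it takes the mean absolute deviation around the group mean, keeps the one-sided thresholds of the question (mean plus k times the deviation), and uses the made-up names y_mad and mad1 to mad3. Window functions cannot be nested, so the group mean is computed in an inner query first.

```sql
-- Sketch: count values above mean + k * (mean absolute deviation) per group.
-- y_mad and mad1..mad3 are illustrative names, not from the original post.
SELECT col_a, col_b,
       SUM(CASE WHEN y > y_mean + y_mad     THEN 1 ELSE 0 END) AS mad1,
       SUM(CASE WHEN y > y_mean + 2 * y_mad THEN 1 ELSE 0 END) AS mad2,
       SUM(CASE WHEN y > y_mean + 3 * y_mad THEN 1 ELSE 0 END) AS mad3
FROM (SELECT col_a, col_b, y, y_mean,
             -- average distance from the group mean
             AVG(ABS(y - y_mean)) OVER (PARTITION BY col_a, col_b) AS y_mad
      FROM (SELECT col_a, col_b, y,
                   AVG(y) OVER (PARTITION BY col_a, col_b) AS y_mean
            FROM my_table
           ) d
     ) m
GROUP BY col_a, col_b;
```

The multipliers 1, 2 and 3 are carried over from the standard-deviation version purely for comparison; suitable cut-offs for a mean absolute deviation would need to be chosen for the data at hand.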