Amazon Redshift: How can I get a count of "brand new, never seen before" IDs per month in Redshift?


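There is a fair amount of material available on using dense_rank() and similar methods to count distinct things per month. However, I can't find anything that gives a per-month distinct count which also removes/discounts any id already seen in a previous month's group.

The data can be imagined like this: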

id (int8 type) | observed time (timestamp utc)
------------------
1  | 2017-01-01
2  | 2017-01-02
1  | 2017-01-02
1  | 2017-02-02
2  | 2017-02-03
3  | 2017-02-04
1  | 2017-03-01
3  | 2017-03-01
4  | 2017-03-01
5  | 2017-03-02

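The counting process can be thought of as:

1: In 2017-01 we saw devices 1 and 2, so the count is 2.

2: In 2017-02 we saw devices 1, 2 and 3. We already knew about devices 1 and 2, but not device 3, so the count is 1.

3: In 2017-03 we saw devices 1, 3, 4 and 5. We already knew about 1 and 3, but not 4 or 5, so the count is 2.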
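
The counting process above can be sketched in plain Python to pin down the expected semantics before writing any SQL. This is only an illustration on the sample data from the question; the function name `count_new_ids_per_month` and the in-memory list are assumptions made for the example, not part of the original setup.

```python
from collections import defaultdict

# Sample observations from the question: (id, observed date).
observations = [
    (1, "2017-01-01"), (2, "2017-01-02"), (1, "2017-01-02"),
    (1, "2017-02-02"), (2, "2017-02-03"), (3, "2017-02-04"),
    (1, "2017-03-01"), (3, "2017-03-01"), (4, "2017-03-01"),
    (5, "2017-03-02"),
]

def count_new_ids_per_month(rows):
    """Return {month: number of ids whose first observation falls in that month}."""
    seen = set()                      # ids already encountered in an earlier row
    new_per_month = defaultdict(int)
    # ISO-formatted dates sort chronologically as plain strings.
    for id_, day in sorted(rows, key=lambda r: r[1]):
        month = day[:7]               # "YYYY-MM"
        if id_ not in seen:           # repeat sightings never add to the count
            seen.add(id_)
            new_per_month[month] += 1
    return dict(new_per_month)

result = count_new_ids_per_month(observations)
print(result)  # {'2017-01': 2, '2017-02': 1, '2017-03': 2}
```

Note that seeing the same id several times within a month, or again in a later month, does not change the result, because each id is counted only at its first observation.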

The desired output looks like this:

observed time | count of new id
--------------------------
2017-01       | 2
2017-02       | 1
2017-03       | 2

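To be explicit, I would like a new table with one row per aggregated month, counting how many new IDs appeared in that month that had never been seen before.

The IRL case allows a device to be seen more than once per month, and this should not affect the count. It also uses integers to store the id (both positive and negative), and the time values are real timestamps with second resolution. The dataset is also of considerable size.

My initial attempt was: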

WITH records_months AS (
    SELECT *,
        date_trunc('month', observed_time) AS month_group
    FROM my_table
    WHERE observed_time > '2017-01-01'
),
id_months AS (
    -- One row per (month, id) pair; DISTINCT alone is enough here
    SELECT DISTINCT
        month_group,
        id
    FROM records_months
)
SELECT *
FROM id_months;

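However, I'm stuck on the next part, which is counting the number of new IDs that were not seen in previous months. I believe the solution may be a window function, but I'm having trouble working out which one or how to use it.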

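Here's the first thing that came to mind. The idea is to:

(innermost query) calculate the earliest month in which each `id` was seen,
(next level up) join that back to the main `my_table` dataset, and then
(outer query) count distinct `id`s by month after nulling out the `id`s that were already seen.

I tested it and got the desired result set. Joining the earliest month back to the original table seemed like the most natural approach (as opposed to a window function). Hopefully this will be performant enough for your Redshift.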

select observed_month,
    -- Null out the id if the observed_month that we're grouping by
    -- is NOT the earliest month that the id was seen.
    -- Then count distinct id
    count(distinct(case when observed_month != earliest_month then null else id end)) as num_new_ids
from (
    select t.id,
        date_trunc('month', t.observed_time) as observed_month,
        earliest.earliest_month
    from my_table t
        join (
            -- What's the earliest month an id was seen?
            select id,
                date_trunc('month', min(observed_time)) as earliest_month
            from my_table
            group by 1
        ) earliest
        on t.id = earliest.id
)
group by 1
order by 1;
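
For anyone who wants to try the earliest-month join locally, here is a small sketch that runs the same query shape against an in-memory SQLite database from Python, using the sample data from the question. It is not Redshift: SQLite has no `date_trunc`, so `strftime('%Y-%m', ...)` stands in for it, the timestamps are stored as text, and the subquery alias `joined` is added just for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, observed_time TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [
        (1, "2017-01-01"), (2, "2017-01-02"), (1, "2017-01-02"),
        (1, "2017-02-02"), (2, "2017-02-03"), (3, "2017-02-04"),
        (1, "2017-03-01"), (3, "2017-03-01"), (4, "2017-03-01"),
        (5, "2017-03-02"),
    ],
)

# Assumption: strftime('%Y-%m', ...) replaces Redshift's date_trunc('month', ...).
query = """
SELECT observed_month,
       -- Keep the id only in the month it was first seen, then count distinct
       COUNT(DISTINCT CASE WHEN observed_month = earliest_month THEN id END) AS num_new_ids
FROM (
    SELECT t.id,
           strftime('%Y-%m', t.observed_time) AS observed_month,
           earliest.earliest_month
    FROM my_table t
    JOIN (
        -- Earliest month in which each id was seen
        SELECT id, strftime('%Y-%m', MIN(observed_time)) AS earliest_month
        FROM my_table
        GROUP BY id
    ) earliest ON t.id = earliest.id
) joined
GROUP BY observed_month
ORDER BY observed_month
"""

rows = conn.execute(query).fetchall()
print(rows)  # [('2017-01', 2), ('2017-02', 1), ('2017-03', 2)]
```

Because the outer query groups every observation row rather than only the first sightings, a month in which no new id appears still shows up in the result with a count of 0.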


Thank you very much. I believe this is correct. It is also fast enough for the requirements of my current IRL dataset.