SQL: querying DAU/MAU over time (daily)
I have a daily sessions table with user_id and date columns. I'd like to graph DAU/MAU (daily active users / monthly active users) on a daily basis. For example:
Date MAU DAU DAU/MAU
2014-06-01 20,000 5,000 20%
2014-06-02 21,000 4,000 19%
2014-06-03 20,050 3,050 17%
... ... ... ...
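Calculating daily actives is straightforward, but calculating monthly actives (e.g. the number of users who logged in within the last 30 days) is causing problems. How can this be achieved without a left join for each day?
Edit: I'm using Postgres.
You didn't show us the complete table definition, but it could be something like this: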
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
order by date;
To get the percentage without repeating the window functions, just wrap it in a derived table:
select date,
dau,
mau,
dau::numeric / (case when mau = 0 then null else mau end) as pct
from (
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
) t
order by date;
Here is a sample output:
postgres=> select * from sessions;
session_date | user_id
--------------+---------
2014-05-01 | 1
2014-05-01 | 2
2014-05-01 | 3
2014-05-02 | 1
2014-05-02 | 2
2014-05-02 | 3
2014-05-02 | 4
2014-05-02 | 5
2014-06-01 | 1
2014-06-01 | 2
2014-06-01 | 3
2014-06-02 | 1
2014-06-02 | 2
2014-06-02 | 3
2014-06-02 | 4
2014-06-03 | 1
2014-06-03 | 2
2014-06-03 | 3
2014-06-03 | 4
2014-06-03 | 5
(20 rows)
postgres=> select session_date,
postgres-> dau,
postgres-> mau,
postgres-> round(dau::numeric / (case when mau = 0 then null else mau end),2) as pct
postgres-> from (
postgres(> select session_date,
postgres(> count(*) over (partition by date_trunc('day', session_date) order by session_date) as dau,
postgres(> count(*) over (partition by date_trunc('month', session_date) order by session_date) as mau
postgres(> from sessions
postgres(> ) t
postgres-> order by session_date;
session_date | dau | mau | pct
--------------+-----+-----+------
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
(20 rows)
postgres=>
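Note that in the transcript above the mau column is a running count of session rows within the calendar month, not a count of distinct users. A quick way to see the difference, as a sketch against the same sample sessions table (session_date, user_id), is to compare the two counts per month:

```sql
-- Compare raw session rows with distinct users per calendar month.
-- Uses the sample table from the psql transcript: sessions(session_date, user_id).
SELECT date_trunc('month', session_date)::date AS month,
       count(*)                 AS session_rows,
       count(DISTINCT user_id)  AS distinct_users
FROM sessions
GROUP BY 1
ORDER BY 1;

-- With the 20 sample rows shown above this returns:
--    month    | session_rows | distinct_users
--  2014-05-01 |            8 |              5
--  2014-06-01 |           12 |              5
```

The row counts (8 and 12) match the last mau values in the transcript, while only five distinct users were active in each month.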
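Assuming you have values for each day, you can get the total counts using a subquery and a window frame (range between).
Unfortunately, I think you want distinct users rather than just user counts. That makes the problem much more difficult, especially because Postgres doesn't support COUNT(DISTINCT) as a window function.
I think you have to do some sort of self join for this. Here is one method: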
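with dau as (
-- one row per day with its distinct user count
select date, count(distinct user_id) as dau
from dailysessions ds
group by date
)
select dau.date, dau.dau,
-- distinct users in the 30 days ending on this day;
-- the outer column must be qualified, otherwise "date" resolves to ds.date
(select count(distinct ds.user_id)
from dailysessions ds
where ds.date between dau.date - 29 * interval '1 day' and dau.date
) as mau
from dau
order by dau.date;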
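The same idea can also be written as a single range join that returns all the columns from the question, including the ratio. This is only a sketch: it assumes the table is dailysessions(user_id, date) with date of type date, and it relies on the FILTER clause, which needs Postgres 9.4 or later. The names days, day and dau_mau_pct are made up for illustration.

```sql
-- Rolling 30-day DAU/MAU, one row per day that has at least one session.
-- Assumed schema: dailysessions(user_id, date), with "date" of type date.
WITH days AS (
    SELECT DISTINCT date AS day
    FROM dailysessions
)
SELECT d.day,
       -- distinct users seen in the 30-day window ending on d.day
       count(DISTINCT s.user_id) AS mau,
       -- distinct users seen on d.day itself
       count(DISTINCT s.user_id) FILTER (WHERE s.date = d.day) AS dau,
       round(100.0 * count(DISTINCT s.user_id) FILTER (WHERE s.date = d.day)
             / count(DISTINCT s.user_id), 1) AS dau_mau_pct
FROM days d
JOIN dailysessions s
  ON s.date BETWEEN d.day - 29 AND d.day   -- date - integer yields a date
GROUP BY d.day
ORDER BY d.day;
```

Because every day in days has at least one session of its own, mau is never zero and the division is safe. Days without any sessions do not appear in the result; joining against a generate_series of dates instead of days would fill those gaps.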
This one uses COUNT DISTINCT to get a rolling 30-day DAU/MAU. It computes user engagement for reddit in BigQuery, but the SQL is standard enough to be adapted to other databases:
SELECT day, dau, mau, INTEGER(100*dau/mau) daumau
FROM (
SELECT day, EXACT_COUNT_DISTINCT(author) dau, FIRST(mau) mau
FROM (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) day, author
FROM [fh-bigquery:reddit_comments.2015_09]
WHERE subreddit='AskReddit') a
JOIN (
SELECT stopday, EXACT_COUNT_DISTINCT(author) mau
FROM (SELECT created_utc, subreddit, author FROM [fh-bigquery:reddit_comments.2015_09], [fh-bigquery:reddit_comments.2015_08]) a
CROSS JOIN (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) stopday
FROM [fh-bigquery:reddit_comments.2015_09]
GROUP BY 1
) b
WHERE subreddit='AskReddit'
AND SEC_TO_TIMESTAMP(created_utc) BETWEEN DATE_ADD(stopday, -30, 'day') AND TIMESTAMP(stopday)
GROUP BY 1
) b
ON a.day=b.stopday
GROUP BY 1
)
ORDER BY 1
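I have discussed this in more detail elsewhere.
As you noticed, DAU is easy. The MAU problem can be solved by first creating a view with boolean flags for when a user activates and deactivates, like so: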
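CREATE OR REPLACE VIEW "vw_login" AS
SELECT *
-- the active period ends at the next login or 30 days after this one, whichever comes first
, LEAST (LEAD("date") OVER w, "date" + 30) AS "activeExpiry"
-- true on the user's first login
, CASE WHEN LAG("date") OVER w IS NULL THEN true ELSE false END AS "activated"
-- true when there is no later login, or the next login is more than 30 days away
, CASE
WHEN LEAD("date") OVER w IS NULL THEN true
WHEN LEAD("date") OVER w - "date" > 30 THEN true
ELSE false
END AS "churned"
-- true when the user returns after a gap of more than 30 days
, CASE
WHEN LAG("date") OVER w IS NULL THEN false
WHEN "date" - LAG("date") OVER w <= 30 THEN false
WHEN row_number() OVER w > 1 THEN true
ELSE false
END AS "resurrected"
FROM "login"
WINDOW w AS (PARTITION BY "user_id" ORDER BY "date");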
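Finally, calculate the running total of active MAUs by taking the cumulative sum over these columns. You need to join vw_activity twice, since the second one is joined on the day the user becomes inactive (i.e. 30 days after their last login).
I included a date series to make sure that every date is present in the dataset. You can do without it, but you might then skip some days in the dataset.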
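SELECT
d."date"
-- running total: activations and resurrections add, churn (30 days after the last login) subtracts
, SUM(COALESCE(a.activated::int,0)
- COALESCE(a2.churned::int,0)
+ COALESCE(a.resurrected::int,0)) OVER w AS "mau"
, a."activated", a2."churned", a."resurrected" FROM
-- generate_series with an interval step returns timestamps, so cast back to date and name the column
(SELECT generate_series('2010-01-01'::date, CURRENT_DATE, '1 day'::interval)::date AS "date") d
LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '30 days')::date
WINDOW w AS (ORDER BY d."date") ORDER BY d."date";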
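Of course you can do this in a single query, but this helps to understand the structure better.
Is the database you are using MySQL or Postgres?
Looks great. Question: is the MAU by calendar month, or the preceding month for each day? Ideally it would be the preceding month for that day.
I have run this query and it doesn't work. The MAU goes up from 0 at the start of the month to the cumulative total of users at the end of the month. Also, the wrapper needs a group by date, dau, mau.
@DavidBailey: then you need to provide more details, especially the table structure and more sample data. And no, the wrapper does not need a GROUP BY, because I'm using a window function that produces a cumulative count. Unfortunately SQLFiddle isn't working right now, otherwise I would provide a live example.
I have added the transcript of a psql session to show you my sample table.
The data structure you assumed is quite accurate. In the live example, the monthly active users are 3 on May 1, 8 on May 2 and 3 on June 1. It is not clear what this represents...
There are three entries in the sessions table for May 1 and five for May 2. So the result shows a DAU of 3 on May 1 and 5 on May 2, but the MAU is a cumulative count, which means 8 sessions by May 2.
When I run this query over multiple days, the MAU column in the result set doesn't change. Ideally it should, since the MAU should be different for each day. Any suggestions on how to fix this?
@Patthebug. Try asking a new question with sample data and desired results.
I have asked this as a new question.
@GordonLinoff, very elegant solution as usual, thanks!
The premise of this solution ignores that MAU is not the sum of the DAUs over a month. Otherwise, if the same single user were active every day of a month, the MAU would be 30, when it should actually be 1.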
-- Per-day totals of the vw_login flags; this is the vw_activity view joined in the final query
CREATE OR REPLACE VIEW "vw_activity" AS
SELECT
SUM("activated"::int) "activated"
, SUM("churned"::int) "churned"
, SUM("resurrected"::int) "resurrected"
, "date"
FROM "vw_login"
GROUP BY "date"
;