Two approaches to cohort analysis in MySQL using joins

mysql sql analytics

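I am building a cohort analysis processor. Input parameters: a time range and a step, a condition (the initial event) used to extract the cohorts, and an additional condition (the retention event) to check after every N hours/days/months. Output: a cohort analysis grid like this: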

               0h |   16h |   32h |   48h |   64h |   80h |   96h | 
cohort #00     15 |     6 |     4 |     1 |     1 |     2 |     2 | 
cohort #01      1 |    35 |     8 |     0 |     2 |     0 |     1 | 
cohort #02      0 |     3 |    31 |    11 |     5 |     3 |     0 | 
cohort #03      0 |     0 |     4 |    27 |     7 |     6 |     2 | 
cohort #04      0 |     1 |     1 |     4 |    29 |     4 |     3 |


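Basically:

1. Get the cohorts: the unique users who did thing 1 during each period, going from the start of the time range to its end with the given step.
2. Find how many users in each cohort did thing 2 after N seconds, N*2 seconds, N*3 seconds and so on, up to now.

In short, I have two solutions. The first one is too slow: it runs a heavy select with joins for every data step (1 day, 2 days, 3 days and so on). The second one is my attempt to optimize it by joining the results for all data steps to the cohorts at once. It appears to work, but I am not sure it is the best approach, or that it gives the same result when cohorts intersect. Please take a look.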

Here is the whole story. I have a table with more than 100,000 events that looks roughly like this:


#user-id, timestamp, event_name
events_view (uid varchar(64), tm int(11), e varchar(64))


Sample input row:

"user_sampleid1", 1423836540, "level_end:001:win"

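To do the cohort analysis, I first extract the cohorts: for example, the users who sent the special event 1st_launch within each 10-hour period from 2015-02-13 to 2015-02-16. All code in this post is simplified and shortened to convey the idea.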

-- Body of a stored procedure. beg, en, t_start, t_end, period, cnt_c and i
-- are assumed to be INT variables declared with DECLARE (omitted here).
DROP TABLE IF EXISTS tmp_c;
CREATE TEMPORARY TABLE tmp_c (uid varchar(64), tm int(11), c int(11));

SET beg = UNIX_TIMESTAMP('2015-02-13 00:00:00');
SET en = UNIX_TIMESTAMP('2015-02-16 00:00:00');

-- Clamp the requested range to the range actually present in the data.
SELECT min(tm) INTO t_start FROM events_view;
SELECT max(tm) INTO t_end FROM events_view;
IF beg < t_start THEN
    SET beg = t_start;
END IF;
IF en > t_end THEN
    SET en = t_end;
END IF;

SET period = 3600 * 10;
SET cnt_c = ceil((en - beg) / period);

/* works quick enough */
SET i = 0;
WHILE i < cnt_c DO
    -- Cohort i: users with a 1st_launch in (beg + period*i, beg + period*(i+1)].
    INSERT INTO tmp_c
    SELECT uid, min(tm), i
    FROM events_view
    WHERE locate("1st_launch", e) > 0
      AND tm > (beg + period * i)
      AND tm <= (beg + period * (i+1))
    GROUP BY uid;
    SET i = i+1;
END WHILE;
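The loop above issues one INSERT per cohort. As a sketch only, the same bucketing can be done in a single statement by deriving the cohort number from the timestamp with integer division. The user variables (@beg, @en, @period, @last) and the alias cohort_no are assumptions introduced here so the statement can run outside a procedure. The clamping of beg and en to the data range is left out, and because it moves beg, it changes where the bucket boundaries fall.

```sql
-- Assumed stand-ins for the procedure's local variables.
SET @beg    = UNIX_TIMESTAMP('2015-02-13 00:00:00');
SET @en     = UNIX_TIMESTAMP('2015-02-16 00:00:00');
SET @period = 3600 * 10;
-- End of the last period, which may reach past @en exactly as in the loop.
SET @last   = @beg + @period * CEIL((@en - @beg) / @period);

CREATE TEMPORARY TABLE IF NOT EXISTS tmp_c (uid VARCHAR(64), tm INT, c INT);

-- For tm in (@beg + @period*i, @beg + @period*(i+1)] the expression
-- (tm - @beg - 1) DIV @period evaluates to i, matching the loop's bounds.
INSERT INTO tmp_c (uid, tm, c)
SELECT uid,
       MIN(tm),
       (tm - @beg - 1) DIV @period AS cohort_no
FROM   events_view
WHERE  LOCATE('1st_launch', e) > 0
  AND  tm > @beg
  AND  tm <= @last
GROUP  BY uid, cohort_no;
```

As in the loop, a user who sent 1st_launch in several periods ends up in several cohorts, one row per period.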

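Then I need to split the time range into periods again and count, for each period, how many users in each cohort sent the event level_end:001:win. For each small period I select all unique users who sent level_end:001:win and left join them to the cohort table tmp_c. So I get something like this: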

user_id  |  1st timestamp |  cohort_no 
uid1        1423836540          0
uid2        1423839540          0
uid3        1423841160          1
uid4        1423841460          2
...
uidN        1423843080          M

user_id  |  1st timestamp |  cohort_no  |   user_id    | other fields...
uid1        1423836540          0           uid1
uid2        1423839540          0           null
uid3        1423841160          1           null
uid4        1423841460          2           uid4
...
uidN        1423843080          M           null

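This way I can see which users from my cohorts sent level_end:001:win. Rows without a match are excluded by the where clause: where t2.uid is not null. Finally I group the rows and count, for each cohort, the users who sent level_end:001:win during this particular period. The code: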

DROP TABLE IF EXISTS tmp_res;
-- Note: the first column is named uid but holds the period label "start_end".
CREATE TEMPORARY TABLE tmp_res (uid varchar(64) CHARACTER SET cp1251 NOT NULL, c int(11), cnt int(11));
SET i = 0;
SET cnt_c = ceil((t_end - beg) / period);
WHILE i < cnt_c DO
    -- One query per period: count, per cohort, the users with a win in period i.
    INSERT INTO tmp_res
    SELECT concat(beg + period * i, "_", beg + period * (i+1)), c, count(DISTINCT uid)
    FROM (SELECT t1.uid, t1.c
          FROM tmp_c t1
          LEFT JOIN (SELECT uid, min(tm)
                     FROM events_view
                     WHERE locate("level_end:001:win", e) > 0
                       AND tm > (beg + period * i)
                       AND tm <= (beg + period * (i+1))
                     GROUP BY uid) t2
                 ON t1.uid = t2.uid
          -- Dropping the unmatched rows makes this behave like an inner join.
          WHERE t2.uid IS NOT NULL) t3
    GROUP BY c;
    SET i = i+1;
END WHILE;

/*getting result of the first method: tooo slooooow!*/
select * from  tmp_res;

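It works, but it takes too long to run because it issues many queries instead of one, so I need to rewrite it.

I think it can be rewritten with joins, but I am still not sure how. I decided to create a temporary table and store the period boundaries in it: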

DROP TABLE IF EXISTS tmp_times;
CREATE TEMPORARY TABLE tmp_times (tm_start int(11), tm_end int(11));

SET cnt_c = ceil((t_end - beg) / period);
SET i = 0;
WHILE i < cnt_c DO
    INSERT INTO tmp_times VALUES (beg + period * i, beg + period * (i+1));
    SET i = i+1;
END WHILE;
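If the server is MySQL 8.0 or later, the period boundaries could also be generated without a loop by using a recursive common table expression. This is a sketch under that version assumption. The user variables and the CTE name seq are introduced here for illustration, and @t_end is taken from the data as in the procedure.

```sql
-- Assumed stand-ins for the procedure's local variables.
SET @beg    = UNIX_TIMESTAMP('2015-02-13 00:00:00');
SET @t_end  = (SELECT MAX(tm) FROM events_view);
SET @period = 3600 * 10;
SET @cnt_c  = CEIL((@t_end - @beg) / @period);

CREATE TEMPORARY TABLE IF NOT EXISTS tmp_times (tm_start INT, tm_end INT);

-- seq yields 0, 1, ..., @cnt_c - 1. This sketch assumes @cnt_c >= 1 and
-- @cnt_c below the default cte_max_recursion_depth of 1000.
INSERT INTO tmp_times (tm_start, tm_end)
WITH RECURSIVE seq (i) AS (
    SELECT 0
    UNION ALL
    SELECT i + 1 FROM seq WHERE i + 1 < @cnt_c
)
SELECT @beg + @period * i,
       @beg + @period * (i + 1)
FROM   seq;
```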

Then I map the periods to the events (user_id + timestamp of the particular event) in a derived table, left join that to the cohort table, and group the result:

SELECT Concat(tm_start, "_", tm_end) per,
       t1.c                          coh,
       Count(DISTINCT( t2.uid ))     cnt
FROM   tmp_c t1
       -- t2: every matching event paired with the period it falls into
       LEFT JOIN (SELECT *
                  FROM   tmp_times t3
                         LEFT JOIN (SELECT uid,
                                           tm
                                    FROM   events_view
                                    -- same event as in method 1
                                    WHERE  Locate("level_end:001:win", e) > 0)
                                   t4
                                ON ( t4.tm > t3.tm_start
                                     AND t4.tm <= t3.tm_end )
                  WHERE  t4.uid IS NOT NULL
                  ORDER  BY t3.tm_start) t2
              ON t1.uid = t2.uid
WHERE  t2.uid IS NOT NULL
GROUP  BY per,
          coh
ORDER  BY per,
          coh;

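In my tests this returns the same result as method 1. I cannot check the result by hand, but I understand how method 1 works and, as far as I can tell, it gives me what I want. Method 2 is faster, but I am not sure it is the best approach, or that it gives the same result even when cohorts intersect.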
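One way to compare the two methods mechanically rather than by eye is to store the output of method 2 in a second table and list every row that does not appear in both result sets. This is only a sketch: tmp_res2 and its column names are hypothetical, and it has to be filled with the method 2 query first. Each temporary table is referenced only once, because MySQL restricts referring to the same TEMPORARY table more than once in a single query.

```sql
-- Hypothetical table holding the output of method 2. Populate it with
--   INSERT INTO tmp_res2 <the method 2 SELECT>;
-- The character set matches tmp_res so the UNION does not mix collations.
CREATE TEMPORARY TABLE IF NOT EXISTS tmp_res2 (
    per VARCHAR(64) CHARACTER SET cp1251 NOT NULL,
    coh INT,
    cnt INT
);

-- A (period, cohort, count) triple present in both results occurs twice in
-- the UNION ALL. Anything else is a difference between the two methods.
SELECT per, coh, cnt, MIN(src) AS found_only_in
FROM (
    SELECT uid AS per, c AS coh, cnt, 'method 1' AS src FROM tmp_res
    UNION ALL
    SELECT per, coh, cnt, 'method 2' AS src FROM tmp_res2
) AS both_results
GROUP BY per, coh, cnt
HAVING COUNT(*) <> 2;
```

An empty result means the two tables hold the same rows. On intersecting cohorts: both methods join events to tmp_c on uid alone, so a user who belongs to two cohorts should be counted once in each of them by either method.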

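Perhaps there is a well-known, common way to do cohort analysis in SQL? Is method 1 more reliable than method 2? I do not work with joins very often, which is why I still do not fully understand join magic. Method 2 looks like pure magic, and I tend not to trust things I do not understand.

Thanks for your answers.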
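A commonly used shape for this kind of query joins the cohort table straight to the events and derives the period from the timestamp arithmetically, which removes the need for tmp_times. The following is a sketch of that idea and not a verified replacement for either method. The user variables and the aliases hits and k are assumptions introduced here.

```sql
-- Assumed stand-ins for the procedure's local variables.
SET @beg    = UNIX_TIMESTAMP('2015-02-13 00:00:00');
SET @t_end  = (SELECT MAX(tm) FROM events_view);
SET @period = 3600 * 10;
SET @last   = @beg + @period * CEIL((@t_end - @beg) / @period);

-- k is the zero-based period index: tm in (@beg + @period*k, @beg + @period*(k+1)].
SELECT CONCAT(@beg + @period * k, '_', @beg + @period * (k + 1)) AS per,
       coh,
       COUNT(DISTINCT uid) AS cnt
FROM (
    SELECT t1.c AS coh,
           ev.uid,
           (ev.tm - @beg - 1) DIV @period AS k
    FROM   tmp_c AS t1
           JOIN events_view AS ev ON ev.uid = t1.uid
    WHERE  LOCATE('level_end:001:win', ev.e) > 0
      AND  ev.tm > @beg
      AND  ev.tm <= @last
) AS hits
GROUP BY per, k, coh
ORDER BY k, coh;
```

Sorting by k keeps the periods in chronological order, which sorting by the text label does not guarantee if the timestamps ever differ in digit count.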

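Could you start by outlining your task, rather than jumping straight into solutions for "I don't know what"?

@ivan_pozdeev Cohort analysis is a well-known term, so I thought the question was obvious. OK, I'll edit.

Can you provide a few lines of sample input data? Honestly, I can't parse your description of it: #user-id, timestamp, event_name events_view (uid varchar(64), tm int(11), e varchar(64))

@OllieJones That is the table structure. I added a sample row. create table events_view (uid varchar(64), tm int(11), e varchar(64)) will create a table with these data types. The real table has more columns, but events_view gives you the idea.

You are asking an optimization question. Unfortunately, those can't be answered in a useful way without knowing what your tables look like. The data hiding implied by a view can also make it hard for the RDBMS to handle the join, selection and aggregation operations your requirement needs. By the way, if you search for MySQL cohort analysis, your favorite search engine turns up some good results.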