MySQL: fast grouped RANK() function


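There are many ways people try to emulate the MSSQL RANK() or ROW_NUMBER() functions in MySQL, but every approach I have tried so far has been slow.

I have a table like this: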

CREATE TABLE ratings
    (`id` int, `category` varchar(1), `rating` int)
;

INSERT INTO ratings
    (`id`, `category`, `rating`)
VALUES
    (3, '*', 54),
    (4, '*', 45),
    (1, '*', 43),
    (2, '*', 24),
    (2, 'A', 68),
    (3, 'A', 43),
    (1, 'A', 12),
    (3, 'B', 22),
    (4, 'B', 22),
    (4, 'C', 44)
;
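But it has 220,000 records, with about 90,000 unique ids.

I want to rank the ids by first looking at the categories other than '*', where a higher rating means a lower rank number.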

-- `rank` is backtick-quoted because RANK is a reserved word in MySQL 8.0+.
SELECT g1.id,
       g1.category,
       g1.rating,
       Count(*) AS `rank`
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
GROUP BY g1.id,
         g1.category,
         g1.rating
ORDER BY g1.category,
         `rank`
Output:

id  category  rating  rank
2   A         68      1
3   A         43      2
1   A         12      3
4   B         22      1
3   B         22      2
4   C         44      1
Then I want to take the minimum rank each id has and average it with that id's rank in the '*' category. That gives the full query:

-- `rank` is backtick-quoted because RANK is a reserved word in MySQL 8.0+.
SELECT X1.id,
       (X1.`rank` + X2.MinRank) / 2 AS OverallRank
FROM
  (SELECT g1.id,
          g1.category,
          g1.rating,
          Count(*) AS `rank`
   FROM ratings AS g1
   JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
   AND g1.category = g2.category
   WHERE g1.category = '*'
   GROUP BY g1.id,
            g1.category,
            g1.rating
   ORDER BY g1.category,
            `rank`) X1
JOIN
  (SELECT id,
          Min(`rank`) AS MinRank
   FROM
     (SELECT g1.id,
             g1.category,
             g1.rating,
             Count(*) AS `rank`
      FROM ratings AS g1
      JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
      AND g1.category = g2.category
      WHERE g1.category != '*'
      GROUP BY g1.id,
               g1.category,
               g1.rating
      ORDER BY g1.category,
               `rank`) X
   GROUP BY id) X2 ON X1.id = X2.id
ORDER BY OverallRank
This gives me:

id  OverallRank
3   1.5000
4   1.5000
2   2.5000
1   3.0000
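This query is correct and the output is what I want, but it simply hangs on my real table of 220,000 records. How can I optimize it? My real table has an index on id, rating and category, and one on id, category.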

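Edit:

Result of SHOW CREATE TABLE ratings:

The primary key covers the most common use case for querying this table, which is why it is the clustered key. It is worth noting that the server is a RAID 10 array of SSDs with 9 GB/s FIO random reads, so I doubt that the index not being clustered has much effect.

The output of select count(distinct category) from ratings is 50.

In case this comes down to the shape of the data or an oversight on my part, I have included an export of the entire table. It is only 200 KB zipped:


The first query takes 27 seconds to run.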

You can use temporary tables with an AUTO_INCREMENT column to generate ranks (row numbers).

For example, to generate ranks for the '*' category:

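-- Rank the ids within the '*' category: rows are inserted in rating order,
-- so the auto-increment value serves as the rank.
-- `rank` is backtick-quoted because RANK is a reserved word in MySQL 8.0+.
drop temporary table if exists tmp_main_cat_rank;
create temporary table tmp_main_cat_rank (
    `rank` int unsigned auto_increment primary key,
    id int NOT NULL
) engine=memory
    select null as `rank`, id
    from ratings r
    where r.category = '*'
    order by r.category, r.rating desc, r.id desc;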
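This takes about 30 ms, whereas your self-join approach takes 45 seconds on my machine. Even with a new index on (category, rating, id), the self-join still takes 14 seconds to run.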
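For reference, an index on those three columns could be created with a statement like the one below. The index name idx_category_rating_id is a hypothetical choice; only the column list (category, rating, id) comes from the text.

```sql
-- Hypothetical index name; columns as listed in the text: (category, rating, id).
create index idx_category_rating_id on ratings (category, rating, id);
```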

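Generating ranks per group (per category) is a bit more complicated. We can still use an AUTO_INCREMENT column, but we need to compute and subtract an offset for each category: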

-- Number all non-'*' rows in one sequence, ordered by category and then by rating.
drop temporary table if exists tmp_pos;
create temporary table tmp_pos (
    pos int unsigned auto_increment primary key,
    category varchar(50) not null,
    id int NOT NULL
) engine=memory
    select null as pos, category, id
    from ratings r
    where r.category <> '*'
    order by r.category, r.rating desc, r.id desc;

-- For each category, the number of rows that precede its first row.
drop temporary table if exists tmp_cat_offset;
create temporary table tmp_cat_offset engine=memory
    select category, min(pos) - 1 as `offset`
    from tmp_pos
    group by category;

-- Rank within a category = position minus that category's offset; keep each id's best rank.
select t.id, min(t.pos - o.`offset`) as min_rank
from tmp_pos t
join tmp_cat_offset o using(category)
group by t.id;
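To get the overall rank asked for in the question, the '*' ranks and the per-id minimum ranks still have to be joined and averaged. A minimal sketch of that step, assuming tmp_main_cat_rank, tmp_pos and tmp_cat_offset were created as above; the aliases r, m and OverallRank are illustrative, and this is not necessarily the exact query behind the timings quoted here:

```sql
-- Sketch only: average each id's rank in '*' with its best rank in any other category.
-- Assumes the three temporary tables from the preceding statements already exist.
select r.id,
       (r.`rank` + m.min_rank) / 2 as OverallRank
from tmp_main_cat_rank r
join (
    -- best (smallest) per-category rank for each id
    select t.id, min(t.pos - o.`offset`) as min_rank
    from tmp_pos t
    join tmp_cat_offset o using (category)
    group by t.id
) m on m.id = r.id
order by OverallRank;
```

With the ten sample rows from the question, this should produce the same four id/OverallRank pairs as the self-join query (ids 3 and 4 at 1.5, id 2 at 2.5, id 1 at 3).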
Total runtime is about 280 ms without an additional index, and about 240 ms with an index on (category, rating, id).
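On MySQL 8.0 or later, which added window functions and common table expressions, the temporary tables may not be needed at all. A minimal sketch, assuming MySQL 8.0+ and the ratings table from the question; the names ranked, rn and min_rank are illustrative, and it has not been timed against the 220,000-row table:

```sql
-- Assumes MySQL 8.0+ (ROW_NUMBER() and WITH are not available in earlier versions).
with ranked as (
    -- rank every row inside its category; ties on rating are broken by id descending
    select id,
           category,
           row_number() over (partition by category
                              order by rating desc, id desc) as rn
    from ratings
)
select s.id,
       (s.rn + m.min_rank) / 2 as OverallRank
from ranked s
join (
    -- best (smallest) rank each id has outside the '*' category
    select id, min(rn) as min_rank
    from ranked
    where category <> '*'
    group by id
) m on m.id = s.id
where s.category = '*'
order by OverallRank;
```

The ordering (rating desc, id desc) mirrors the tuple comparison used in the self-join, so with the sample rows this should give the same four id/OverallRank pairs.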


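A note on the self-join approach: it is an elegant solution and performs well when groups are small, so it is fast when the average group size is low. To see how many rows the join produces before GROUP BY collapses them, count them directly:

SELECT Count(*)
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'

Do you have a primary key or a unique key? Please post the result of SHOW CREATE TABLE ratings. How many distinct categories are in the table (select count(distinct category) from ratings)? How long does the first query take to execute?

Updated with both.

The Dropbox link is broken.

Sorry, Dropbox now defaults to private. It should be fixed.

That works, Paul. It ran in 3 ms on my server. I am used to SQL Server (I do not have it on this RAID SSD server), where this runs in 500 ms using RANK and all the joins. I have been struggling to emulate rank at speed in MySQL, and this is a great approach.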