MySQL: fast grouped RANK() function
There are many ways people have tried to emulate the MSSQL RANK or ROW_NUMBER functions in MySQL, but every one I have tried so far is slow. I have a table like this:
CREATE TABLE ratings
(`id` int, `category` varchar(1), `rating` int)
;
INSERT INTO ratings
(`id`, `category`, `rating`)
VALUES
(3, '*', 54),
(4, '*', 45),
(1, '*', 43),
(2, '*', 24),
(2, 'A', 68),
(3, 'A', 43),
(1, 'A', 12),
(3, 'B', 22),
(4, 'B', 22),
(4, 'C', 44)
;
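But the real table has 220,000 records, with about 90,000 unique ids.
I want to rank the ids by first looking at the categories that are not '*', where a higher rating gives a lower rank number: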
SELECT g1.id,
       g1.category,
       g1.rating,
       Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
                  AND g1.category = g2.category
WHERE g1.category != '*'
GROUP BY g1.id,
         g1.category,
         g1.rating
ORDER BY g1.category,
         rank
Output:
id category rating rank
2 A 68 1
3 A 43 2
1 A 12 3
4 B 22 1
3 B 22 2
4 C 44 1
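For comparison, MySQL 8.0 and later support window functions natively, so the same per-category ranking can be written without a self-join. This is only a sketch and assumes MySQL 8.0+, which may not match the server in the question. The alias rnk is an assumption, chosen because RANK is a reserved word in MySQL 8.0.

```sql
-- Assumes MySQL 8.0+ (window functions); `rnk` is a hypothetical alias.
-- ROW_NUMBER() with the tie-breaker id DESC mirrors the (rating, id)
-- comparison of the self-join: ties on rating put the higher id first.
SELECT id,
       category,
       rating,
       ROW_NUMBER() OVER (PARTITION BY category
                          ORDER BY rating DESC, id DESC) AS rnk
FROM ratings
WHERE category <> '*'
ORDER BY category, rnk;
```

On the sample data this should return the same six rows as the self-join output, with both rows of category B (rating 22) ordered by id descending.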
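Then I want to take the lowest rank each id has and average it with that id's rank in the '*' category. That gives the following overall query: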
SELECT X1.id,
       (X1.rank + X2.minrank) / 2 AS OverallRank
FROM
  (SELECT g1.id,
          g1.category,
          g1.rating,
          Count(*) AS rank
   FROM ratings AS g1
   JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
                     AND g1.category = g2.category
   WHERE g1.category = '*'
   GROUP BY g1.id,
            g1.category,
            g1.rating
   ORDER BY g1.category,
            rank) X1
JOIN
  (SELECT id,
          Min(rank) AS MinRank
   FROM
     (SELECT g1.id,
             g1.category,
             g1.rating,
             Count(*) AS rank
      FROM ratings AS g1
      JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
                        AND g1.category = g2.category
      WHERE g1.category != '*'
      GROUP BY g1.id,
               g1.category,
               g1.rating
      ORDER BY g1.category,
               rank) X
   GROUP BY id) X2 ON X1.id = X2.id
ORDER BY overallrank
This gives me:
id OverallRank
3 1.5000
4 1.5000
2 2.5000
1 3.0000
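If a MySQL 8.0+ server is available (an assumption; the question appears to target an older version), the whole calculation could be sketched with one common table expression and a window function instead of three self-joins. The names ranked, rnk, min_rank, m and x are illustrative, not from the original post.

```sql
-- Assumes MySQL 8.0+ (CTEs and window functions). All aliases are hypothetical.
WITH ranked AS (
    -- One pass over the table: row number within each category,
    -- highest rating first, ties broken by the higher id.
    SELECT id,
           category,
           ROW_NUMBER() OVER (PARTITION BY category
                              ORDER BY rating DESC, id DESC) AS rnk
    FROM ratings
)
SELECT m.id,
       (m.rnk + x.min_rank) / 2 AS OverallRank
FROM ranked AS m
JOIN (
    -- Best (lowest) rank of each id across the non-'*' categories.
    SELECT id, MIN(rnk) AS min_rank
    FROM ranked
    WHERE category <> '*'
    GROUP BY id
) AS x ON x.id = m.id
WHERE m.category = '*'
ORDER BY OverallRank;
```

Worked by hand on the sample rows, id 3 has '*' rank 1 and minimum category rank 2, so (1 + 2) / 2 = 1.5, which agrees with the output shown for the original query.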
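This query is correct and the output is what I want, but it just hangs on my real table of 220,000 records. How can I optimize it? My real table has an index on id, rating and category, and one on id, category.
Edit:
Result of SHOW CREATE TABLE ratings:
The primary key covers the most common use case for querying this table, which is why it is the clustered key. It is worth noting that the server is a RAID 10 of SSDs with 9GB/s FIO random reads, so I doubt that the other indexes not being clustered has much effect.
The output of SELECT COUNT(DISTINCT category) FROM ratings is 50.
In case this is down to the shape of the data or an oversight on my part, I have included an export of the whole table. It is only 200KB zipped:
The first query takes 27 seconds to run.
You can use temporary tables with an AUTO_INCREMENT column to generate ranks (row numbers). For example, to generate the ranks for the '*' category: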
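-- Ranks for the '*' category: this relies on the AUTO_INCREMENT column
-- numbering the rows in the order given by the ORDER BY of the SELECT.
drop temporary table if exists tmp_main_cat_rank;
create temporary table tmp_main_cat_rank (
    `rank` int unsigned auto_increment primary key,  -- quoted: RANK is a reserved word in MySQL 8.0+
    id int NOT NULL
) engine=memory
    select null as `rank`, id
    from ratings r
    where r.category = '*'
    order by r.category, r.rating desc, r.id desc;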
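This takes about 30 ms, whereas your self-join approach takes 45 seconds on my machine. Even with a new index on (category, rating, id) it still takes 14 seconds to run.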
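For reference, an index on those three columns could be added roughly as follows. This is a sketch; the index name idx_category_rating_id is an assumption, since the answer does not give one.

```sql
-- Hypothetical index name. The column order (category, rating, id) matches
-- the grouping and ordering columns used by the ranking queries.
ALTER TABLE ratings
    ADD INDEX idx_category_rating_id (category, rating, id);
```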
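Generating ranks per group (per category) is a little more complicated. We can still use an AUTO_INCREMENT column, but we need to compute and subtract an offset for each category: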
-- Number all non-'*' rows consecutively, ordered by category, then by rating.
drop temporary table if exists tmp_pos;
create temporary table tmp_pos (
    pos int unsigned auto_increment primary key,
    category varchar(50) not null,
    id int NOT NULL
) engine=memory
    select null as pos, category, id
    from ratings r
    where r.category <> '*'
    order by r.category, r.rating desc, r.id desc;
-- Offset of each category: the number of rows that precede its first row.
drop temporary table if exists tmp_cat_offset;
create temporary table tmp_cat_offset engine=memory
    select category, min(pos) - 1 as `offset`
    from tmp_pos
    group by category;
-- Rank within a category = pos - offset; keep the lowest rank per id.
select t.id, min(t.pos - o.offset) as min_rank
from tmp_pos t
join tmp_cat_offset o using(category)
group by t.id;
The total runtime is about 280 ms without additional indexes, and about 240 ms with an index on (category, rating, id).
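The statements in the answer stop at the minimum per-category rank. Assuming the temporary tables tmp_main_cat_rank, tmp_pos and tmp_cat_offset have been created as shown, the OverallRank from the question could be assembled along these lines. This is a sketch and not part of the original answer; the aliases m and x are assumptions.

```sql
-- On the sample data, tmp_pos numbers category A as pos 1-3 (offset 0),
-- B as pos 4-5 (offset 3) and C as pos 6 (offset 5), so pos - offset
-- restarts at 1 for every category.
-- Each temporary table is referenced only once here, because MySQL does not
-- allow referring to a TEMPORARY table more than once in the same query.
SELECT m.id,
       (m.`rank` + x.min_rank) / 2 AS OverallRank
FROM tmp_main_cat_rank AS m
JOIN (
    SELECT t.id,
           MIN(t.pos - o.`offset`) AS min_rank
    FROM tmp_pos AS t
    JOIN tmp_cat_offset AS o USING (category)
    GROUP BY t.id
) AS x ON x.id = m.id
ORDER BY OverallRank;
```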
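A note on the self-join approach: it is an elegant solution and performs well when the groups are small. It is fast with an average group size [...]
The following query counts the rows that the self-join produces before grouping:
SELECT Count(*)
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
                  AND g1.category = g2.category
WHERE g1.category != '*'
Comments:
Do you have a primary or unique key? Please post the result of SHOW CREATE TABLE ratings. How many distinct categories are in the table (select count(distinct category) from ratings)? How long does the first query take to execute?
Updated with both.
The Dropbox link is broken.
Sorry, Dropbox now defaults to private. It should be fixed.
It works, Paul. It ran in 3 ms on my server. I am used to SQL Server (which I do not have on this RAID SSD server), where this runs in 500 ms using RANK and all the joins. I have been struggling to emulate RANK at a reasonable speed in MySQL, and this is a great approach.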