MySQL: join an integer column to a range in another table and avoid a table scan - A join B on A.val between B.lo and B.hi

drop schema if exists dropMe; create schema dropMe; use dropMe;
create table A ( id serial, val int );
create table B ( id serial, lo int, hi int, primary key ( lo, hi ) );

select A.id as aId, B.id as bId from A join B on A.val between B.lo and B.hi;




# init
set @testRecotds = 100000;
set cte_max_recursion_depth = @testRecotds;

# DDL - creates tmp schema then creates A and B tables
drop schema if exists dropMe; create schema dropMe; use dropMe;
create table A ( id serial, val int unique );
create table B ( id serial, lo int, hi int, primary key ( lo, hi ) );

# DML - inserts semi-random 100k integers in A table and ranges in B table
insert into A( val ) with recursive r as ( select 1 i, 1 n union all select i + 1, n + 1 + 80 * rand() from r where i < @testRecotds ) select n from r;

insert into B( lo, hi )
  with recursive
    r as (
          select 1 i, 1 lo, 1 + 40 * rand() hi
        union all
          select i + 1, lo + 41 + 40 * rand() nLo, ( select nLo ) + 40 * rand()
            from r
            where i < @testRecotds
        )
  select lo, hi from r;

# The actual query - optimize the join
select count( * ) from A join B on val between lo and hi;

# MySQL does a full table scan on A and a full index scan on B (on the id column), so the index brings no practical performance improvement

drop schema dropMe;



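Postgres joins all the integers and sums them in 1 second; MySQL takes 27 minutes 6 seconds. Internally, MySQL scans both tables, whereas PG scans one table and uses the index on the second.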


set cte_max_recursion_depth = 100000;

drop schema if exists dropMe; create schema dropMe; use dropMe;
create table x( x int primary key );
create table y( y int );

insert into x with recursive r as ( select 1 i, 1 n union all select i + 1, n + 1 + 40 * rand() from r where i < 100000 ) select n from r;
insert into y with recursive r as ( select 1 i, 1 n union all select i + 1, n + 1 + 40 * rand() from r where i < 100000 ) select n from r;

select sum( y ), sum( x ) from ( select y, ( select max( x ) from x where x <= y ) x from y ) z;

drop schema dropMe;

-- PostgreSQL version of the same test
drop schema if exists dropMe cascade; create schema dropMe;
create table dropMe.x( x int primary key );
create table dropMe.y( y int );

insert into dropMe.x with recursive r as ( select 1 i, 1 n union all select i + 1, n + 1 + ( 40 * random() ) :: int from r where i < 100000 ) select n from r;
insert into dropMe.y with recursive r as ( select 1 i, 1 n union all select i + 1, n + 1 + ( 40 * random() ) :: int from r where i < 100000 ) select n from r;

select sum( y ), sum( x ) from ( select y, ( select max( x ) from dropMe.x where x <= y ) x from dropMe.y ) z; -- 1 second

drop schema dropMe cascade;

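Have fun, everyone. It gave me a lot, so I am sharing it here.

There is an O(1) solution for determining whether a value from A falls inside B, but it requires a different way of storing the non-overlapping ranges of table B. The approach is to make B cover all possible ranges, gaps as well as included spans, and then keep only lo or hi, because the other one is redundant with the adjacent row. I pick lo. Each row states whether it is "included" or a gap. The lookup is then a simple query: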

SELECT included FROM B WHERE some_val >= lo ORDER BY lo DESC LIMIT 1

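For more discussion, see the referenced write-up, in particular its code for IPv4 ranges, which can serve as a model for B if the ranges cover the entire INT UNSIGNED range.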


I assume that included holds true (1) or false (0). In the reference, I return an owner_id instead, which is 0 for "not owned".
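To make the storage scheme concrete, here is a minimal Python sketch of the same idea outside the database. It is only a model, not MySQL code: the helper names `to_boundaries` and `included` and the sample ranges are made up for illustration, and a sorted list with binary search stands in for the primary key on lo.

```python
from bisect import bisect_right

def to_boundaries(ranges):
    """Convert non-overlapping inclusive (lo, hi) ranges into (lo, included) rows.

    Every range contributes an 'included' row at lo and a 'gap' row at hi + 1,
    so hi no longer needs to be stored.
    """
    rows = []
    for lo, hi in sorted(ranges):
        if rows and rows[-1][0] == lo:
            rows.pop()  # the gap before this range would be empty, so drop it
        rows.append((lo, True))
        rows.append((hi + 1, False))
    return rows

def included(los, flags, val):
    """Model of: SELECT included FROM B WHERE lo <= val ORDER BY lo DESC LIMIT 1."""
    pos = bisect_right(los, val) - 1  # last row whose lo <= val
    # Assumption: a value below the first lo counts as 'not included'
    # (the SQL query would return no row in that case).
    return pos >= 0 and flags[pos]

# Hypothetical sample data: three ranges, the first two adjacent.
rows = to_boundaries([(10, 19), (20, 25), (40, 50)])
los = [lo for lo, _ in rows]
flags = [flag for _, flag in rows]
print(rows)
print(included(los, flags, 25), included(los, flags, 26))
```

Each lookup touches only the single row whose lo is the closest one at or below the value, which is what the ORDER BY lo DESC LIMIT 1 query asks the index to do.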




create table A (
    id serial, 
    val int );
create table B (
    id serial, 
    lo int primary key, 
    included boolean );
insert into A( val )
  with recursive r as 
  ( select 1 i, 1 n 
    union all
    select i + 1, n + 1 + 80 * rand() 
        from r where i < 100000
  ) select n from r;
insert into B( lo, included )
  with recursive r as
  ( select 1 i, 1 n
    union all
    select i + 1, n + 1 + 40 * rand() from r where i < 200000
  ) select n, i % 2 from r;

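As proof, I see that Handler_read_prev was about N*M before, whereas afterwards Handler_read_prev and Handler_read_rnd_next are each about N. FLUSH STATUS and SHOW STATUS LIKE 'Handler%' are very handy for performance testing with small data sets.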

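What version of MySQL, and which engine do the tables use?

Can you really create table B like that? Doesn't a serial column have to be the primary key? Wouldn't adding an index on A.val help?

@Strawberry, my playground runs on MySQL 8.0.18. @Barmar - no, serial automatically makes the column unique; you can also make it the primary key, or make other columns the primary key. Making the val column the primary key has no practical effect on performance. Please run the sample code I added.

Thanks for the suggestion, but it does not seem to solve the problem. I tested it with 100k values in table A and 100k ranges. Each range is represented by 2 rows in B: odd rows are included, even rows are excluded. There is a primary key on the lo column. The MySQL optimizer ignores the possible index seek and scans both tables sequentially. So the complexity is O(N*M) -> 100k * 200k = 20 billion steps, and the sample below runs for 36 minutes instead of sub-second time; the desired complexity is O(N*log2 M) -> 100k * 17.6 -> ~1.8 million steps; drop table if exists A, B; create table A ( id serial, val int ); create table B ( id serial, lo int primary key, included boolean ); insert into A( val ) with recursive r as ( select 1 i, 1 n union all select i + 1, n + 1 + 80 * rand() from r where i < 100000 ) select n from r; insert into B( lo, included ) with recursive r as ( select 1 i, 1 n union all select i + 1, n + 1 + 40 * rand() from r where i < 200000 ) select n, i % 2 from r; select count( * ) from A where ( select included from B where lo ...

@user9526573 - See the update. Using a function seems to matter for performance.

Well, replacing the join with a user-defined function that wraps an independent inner query makes the query 406 times faster. Forcing the planner to use the index this way feels rather clumsy, but it does work.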
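The step counts quoted in the comments can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are made up, and "steps" is a rough proxy for row reads, not a measured MySQL cost.

```python
import math

n_values = 100_000  # rows in A
m_rows = 200_000    # rows in B (two boundary rows per range)

# Scanning B for every row of A: O(N * M)
nested_scan_steps = n_values * m_rows

# One index seek into B per row of A: O(N * log2 M)
seek_depth = math.log2(m_rows)
index_seek_steps = n_values * seek_depth

print(f"{nested_scan_steps:,} steps with a scan per row")
print(f"log2(M) is about {seek_depth:.1f}")
print(f"about {index_seek_steps:,.0f} steps with an index seek per row")
```

The ratio between the two counts is roughly four orders of magnitude, which is consistent with the gap between a 36-minute run and the sub-second time the asker expected.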
select count( * ) from A where
   ( select B.included from B
        where B.lo <= A.val order by B.lo desc limit 1 );
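DELIMITER //
CREATE DEFINER = `ip`@`localhost` FUNCTION Included(
        _val INT UNSIGNED)
    RETURNS BOOLEAN
    READS SQL DATA
BEGIN
    -- Look up the closest boundary row at or below _val
    DECLARE _included BOOLEAN;
    SELECT included INTO _included
        FROM B
        WHERE lo <= _val
        ORDER BY lo DESC
        LIMIT 1;
    RETURN _included;
END //
DELIMITER ;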
    select count( * ) from A
        WHERE Included(A.val);