SQL: How to optimize a 2-table query where the data is only selective when both tables are combined?

sql postgresql performance

I have the following two tables and data distribution:

drop table if exists line;
drop table if exists header;

create table header (header_id serial primary key, type character);
create table line (line_id serial primary key, header_id serial not null, type character, constraint line_header foreign key (header_id) references header (header_id)) ;
create index inv_type_idx on header (type);
create index line_type_idx on line (type);

insert into header (type) select case when floor(random()*2+1) = 1 then  'A' else 'B' end from generate_series(1,100000);
insert into line (header_id, type) select header_id,  case when floor(random()*10000+1) = 1 then (case when type ='A' then 'B' else 'A' end) else type end from header, generate_series(1,5);

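The `header` table has 100K rows: 50% of `type` A and 50% of `type` B.
The `line` table has 500K rows:
- Each `header` has 5 `line`s
- Overall, 50% of the lines are of `type` A and 50% are of `type` B
- In 99.99% of cases the `type` of a line is the same as that of its header; only 0.01% differ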


Data distribution:
# select h.type header_type, l.type line_type, count(*) from line l inner join header h on l.header_id = h.header_id group by 1,2 order by 1,2;
 header_type | line_type | count  
-------------+-----------+--------
 A           | A         | 250865
 A           | B         |     25
 B           | A         |     29
 B           | B         | 249081
(4 rows)

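I need to get all `line`s of type B whose `header` is of type A. Even though the total is very small (25 rows out of 500,000), the plan I get (PostgreSQL 10) is the one shown below, which performs a sequential scan on both tables: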
explain
select * from line l
   inner join header h on l.header_id = h.header_id
where h.type ='A' and l.type='B';

                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Hash Join  (cost=2323.29..14632.89 rows=125545 width=19)
   Hash Cond: (l.header_id = h.header_id)
   ->  Seq Scan on line l  (cost=0.00..11656.00 rows=248983 width=13)
         Filter: (type = 'B'::bpchar)
   ->  Hash  (cost=1693.00..1693.00 rows=50423 width=6)
         ->  Seq Scan on header h  (cost=0.00..1693.00 rows=50423 width=6)
               Filter: (type = 'A'::bpchar)
(7 rows)
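
The join estimate in that plan (rows=125545) is consistent with the planner treating the two filters as independent: the estimated `line` rows times the estimated `header` rows, divided by the number of headers. A minimal sketch for checking this, assuming the tables from the setup script above; the arithmetic uses only numbers copied from the plan and the table definition:

```sql
-- Compare the planner's estimate with the actual row counts.
-- EXPLAIN (ANALYZE) really executes the query, so run it on test data.
explain (analyze, buffers)
select *
  from line l
 inner join header h on l.header_id = h.header_id
 where h.type = 'A'
   and l.type = 'B';

-- Reproduce the join estimate by hand:
--   248983 = estimated line rows with type = 'B'
--    50423 = estimated header rows with type = 'A'
--   100000 = number of rows in header
select round(248983::numeric * 50423 / 100000) as estimated_join_rows;  -- 125545
```

The actual row count reported by `explain (analyze)` should be close to the 25 rows seen in the data distribution, which shows how far off the independence assumption is for this data.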

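Is there any way to optimize this kind of query, where the data is highly selective, but only once information from several tables is combined?
Of course, as a workaround, I could denormalize the information from `header` into `line`, which would make this query much more efficient. But I would rather not do that if possible, because I would then have to maintain the duplicated information.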
alter table line add column compound_type char(2);
create index compound_idx on line (compound_type);

update line l
   set compound_type = h.type || l.type
  from header h
 where h.header_id = l.header_id;

# explain select * from line where compound_type = 'BA';
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Index Scan using compound_idx on line  (cost=0.42..155.58 rows=50 width=13)
   Index Cond: (compound_type = 'BA'::bpchar)
(2 rows)
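
If the denormalized column is kept, the maintenance burden could be moved into triggers. The following is only a sketch under assumptions: the function and trigger names are hypothetical, and it uses the `execute procedure` spelling accepted by PostgreSQL 10. Note that with `h.type || l.type` the header's type comes first, so a type B line under a type A header is stored as `'AB'`.

```sql
-- Hypothetical helper: fill compound_type whenever a line is inserted
-- or its header_id / type changes.
create or replace function line_set_compound_type() returns trigger as $$
begin
    select h.type || new.type
      into new.compound_type
      from header h
     where h.header_id = new.header_id;
    return new;
end;
$$ language plpgsql;

create trigger line_compound_type_trg
before insert or update of header_id, type on line
for each row
execute procedure line_set_compound_type();

-- Hypothetical helper: when a header changes its type, refresh the
-- compound_type of all of its lines.
create or replace function header_propagate_type() returns trigger as $$
begin
    update line l
       set compound_type = new.type || l.type
     where l.header_id = new.header_id;
    return null;  -- the return value of an AFTER row trigger is ignored
end;
$$ language plpgsql;

create trigger header_type_trg
after update of type on header
for each row
when (old.type is distinct from new.type)
execute procedure header_propagate_type();
```

The second trigger only updates `compound_type`, so it does not re-fire the first one, which is restricted to changes of `header_id` and `type`.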

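1) You can use a materialized view with appropriate indexes. It can be refreshed "in the background". Otherwise it is similar to the compound column in line.
2) If you create an index on (line.header_id, line.type) and force a subquery like the one below, you can reverse the search so that it goes from header to line: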
select header_id 
from header h 
where type='A' and 
    exists(select * from line l where l.header_id=h.header_id and l.type='B')

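Once you have all the headers, select the lines again using the corresponding header ids.
It is also a good idea to include type in some header index, so that everything needed for the lookup is contained in the two indexes.
Even so, it will still read ~50K rows from the header index and probe the second index for each of them. In general this is inefficient, but it may not be that bad if the indexes fit entirely in memory.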
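
Both suggestions can be sketched roughly as follows. All object names here are hypothetical, and storing only the mismatching rows in the materialized view is an assumption of this sketch, not something the answer specifies. The planner settings in the second part are real PostgreSQL parameters, but switching them off is meant for experimentation only, not as a production setting.

```sql
-- 1) Materialized view holding only lines whose type differs from the header's.
create materialized view line_mismatch as
select l.line_id, l.header_id, h.type as header_type, l.type as line_type
  from line l
  join header h on h.header_id = l.header_id
 where h.type <> l.type;

-- A unique index is required for REFRESH ... CONCURRENTLY.
create unique index line_mismatch_line_id_idx on line_mismatch (line_id);
create index line_mismatch_types_idx on line_mismatch (header_type, line_type);

-- Can be run periodically; readers are not blocked while it refreshes.
refresh materialized view concurrently line_mismatch;

select * from line_mismatch where header_type = 'A' and line_type = 'B';

-- 2) Indexes for the header-to-line lookup.
create index line_header_type_idx on line (header_id, type);
create index header_type_id_idx on header (type, header_id);

-- Discourage seq scans and hash/merge joins for this transaction only,
-- to see what the index-driven plan looks like and what it costs.
begin;
set local enable_seqscan = off;
set local enable_hashjoin = off;
set local enable_mergejoin = off;
explain (analyze, buffers)
select h.header_id
  from header h
 where h.type = 'A'
   and exists (select 1 from line l
                where l.header_id = h.header_id and l.type = 'B');
rollback;
```

Comparing the timings with and without the `set local` lines indicates whether the index-driven plan is actually cheaper on a given machine; without them the planner may well keep choosing the sequential scans.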
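I don't know of any way to avoid a full scan of at least one table (plus looking up the corresponding rows in the other). You don't know which of the 50% have the wrong type. Instead, design your data structure so that the type is stored only in the header and is looked up when needed.
@GordonLinoff I don't know how to structure it so that it stays performant without duplicating data.
With a trigger, I find your compound_type good enough. Since it carries some redundancy, a boolean column holding h.type != l.type might be better style, although you would then need to check two columns. Alternatively, if there is a timestamp, you could also keep a table of the lines that were found to differ as of some point in time, and update that table as needed. That requires an index on the timestamp. Your existing indexes will not help much, because their selectivity is around 50%, so the planner will most likely never use them.
1. A materialized view is a form of data denormalization or duplication, which is exactly what I want to avoid, and in this case I don't see how it is better than a compound type updated by a trigger. 2. In the plan for that query I still see 2 seq scans (one on each table).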
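
The boolean-column variant mentioned in the comments might look like the sketch below. The column name `type_differs` and the index name are hypothetical, and the partial index is an assumption added here to keep the index limited to the few mismatching rows; the column would still need a trigger (or a periodic update) to stay correct.

```sql
-- Hypothetical flag: true when the line's type differs from its header's type.
alter table line add column type_differs boolean;

update line l
   set type_differs = (h.type <> l.type)
  from header h
 where h.header_id = l.header_id;

-- Partial index: only rows with type_differs = true are indexed,
-- so the index stays tiny (about 0.01% of the table in this data set).
create index line_type_differs_idx on line (type) where type_differs;

analyze line;

-- "Checking two columns": the flag narrows the lines down to the mismatches,
-- and the type filters pick the wanted combination.
select l.*
  from line l
  join header h on h.header_id = l.header_id
 where l.type_differs
   and l.type = 'B'
   and h.type = 'A';
```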