
SQLite3 query optimization: join vs. subselect


I'm trying to figure out the best way (it may not matter much in this case) to find the rows of one table based on the presence of a flag and a relationship id to a row in another table.

Here is the schema:

    CREATE TABLE files (
        id INTEGER PRIMARY KEY,
        dirty INTEGER NOT NULL);

    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);
I'm using SQLite3.

The files table will be very large, typically 10K-5M rows; resume_points will be small.

Since files.id is the primary key and resume_points will be small, try grouping by that field instead of selecting DISTINCT files.*:

SELECT f.*
FROM resume_points rp
INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1
GROUP BY f.id

Another option to consider for performance is adding an index on resume_points.scan_file_id:

CREATE INDEX index_resume_points_scan_file_id ON resume_points (scan_file_id)
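As a quick sanity check (a sketch using Python's built-in sqlite3 module; the table and index names are taken from the schema above), EXPLAIN QUERY PLAN shows whether SQLite actually uses the index for this query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
CREATE INDEX index_resume_points_scan_file_id ON resume_points (scan_file_id);
""")

# Ask the planner how it would execute the GROUP BY query.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT f.* FROM resume_points rp
    INNER JOIN files f ON rp.scan_file_id = f.id
    WHERE f.dirty = 1
    GROUP BY f.id
""").fetchall()

for row in plan:
    print(row[-1])  # the last column is the human-readable plan step
```

The exact wording of the plan output varies between SQLite versions, but it tells you whether each table is walked with SCAN or probed with SEARCH, and which index (if any) is used.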

You could try EXISTS, which will not produce any duplicate files rows:
select * from files
where exists (
    select * from resume_points 
    where files.id = resume_points.scan_file_id
)
and dirty = 1;
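A tiny demonstration of the no-duplicates claim (a sketch with Python's sqlite3 module; the sample rows are invented): even with three resume_points pointing at the same file, that file comes back exactly once:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
-- file 1 is dirty and referenced three times; file 2 is dirty but unreferenced
INSERT INTO files (id, dirty) VALUES (1, 1), (2, 1);
INSERT INTO resume_points (scan_file_id) VALUES (1), (1), (1);
""")

rows = con.execute("""
    SELECT * FROM files
    WHERE EXISTS (
        SELECT * FROM resume_points
        WHERE files.id = resume_points.scan_file_id)
    AND dirty = 1
""").fetchall()
print(rows)  # [(1, 1)] -- one row, no duplicates
```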
Of course, having the appropriate indexes may help:

files.dirty
resume_points.scan_file_id

Whether an index is useful depends on your data.

If the resume_points table has only one or two distinct file id numbers, it seems it needs only one or two rows, and scan_file_id should be its primary key. The table has only two columns, and the id number is meaningless.

If that's the case, you don't need either id number:

pragma foreign_keys = on;
CREATE TABLE resume_points (
  scan_file_id integer primary key
);

CREATE TABLE files (
  scan_file_id integer not null references resume_points (scan_file_id),
  dirty INTEGER NOT NULL,
  primary key (scan_file_id, dirty)
);

Now you don't need a join at all, either. Just query the files table.
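A minimal sketch of that restructured schema (Python's sqlite3 module, invented sample data), where a plain query on files replaces the join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE resume_points (
    scan_file_id INTEGER PRIMARY KEY);
CREATE TABLE files (
    scan_file_id INTEGER NOT NULL REFERENCES resume_points (scan_file_id),
    dirty INTEGER NOT NULL,
    PRIMARY KEY (scan_file_id, dirty));
INSERT INTO resume_points VALUES (1), (2);
INSERT INTO files VALUES (1, 1), (2, 0);
""")

# No join needed: the files table already carries both columns.
dirty_ids = con.execute(
    "SELECT scan_file_id FROM files WHERE dirty = 1").fetchall()
print(dirty_ids)  # [(1,)]
```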

TL;DR: the best query and index are:

create index uniqueFiles on resume_points (scan_file_id);
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
Since I usually work with SQL Server, at first I thought the query optimizer would find the optimal execution plan for such a simple query no matter which of these equivalent SQL statements you wrote. So I downloaded SQLite and started playing around. Much to my surprise, there was a huge difference in performance.

Here is the setup code:

CREATE TABLE files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dirty INTEGER NOT NULL);

CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);

insert into files (dirty) values (0);
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
Here are the queries I tried, along with execution times, on my i5 laptop. The database file is only about 200MB since it contains no other data.

select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
4.3 - 4.5ms with and without index

select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
4.4 - 4.7ms with and without index

select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
2.0 - 2.5ms with uniqueFiles
2.6-2.9ms without uniqueFiles

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
2.1 - 2.5ms with uniqueFiles
2.6-3ms without uniqueFiles

SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1 GROUP BY f.id
4500 - 6190 ms with uniqueFiles
8.8-9.5 ms without uniqueFiles
    14000 ms with uniqueFiles and fileLookup

select * from files where exists (
select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
8400 ms with uniqueFiles
7400 ms without uniqueFiles

It looks like SQLite's query optimizer isn't very advanced at all. The best queries first reduce resume_points to a small number of rows (two in the test case; 1-2 according to the OP), and only then look up the file to see whether it is dirty. The dirtyFiles index didn't make much difference for any of the queries. That may be because of how the data is arranged in the test tables; it might make a difference on the production tables. However, the difference isn't too great, since there will be less than a handful of lookups. uniqueFiles does make the difference, since it can reduce 10,000 resume_points rows to 2 without scanning most of them. fileLookup did make some queries slightly faster, but not enough to change the results significantly. Notably, it made the GROUP BY query very slow. In conclusion: reduce the result set early to make the biggest difference.

I think jtseng gave the solution:

select * from (select distinct scan_file_id from resume_points) d
join files on d.scan_file_id = files.id and files.dirty = 1
Basically it's the same as the last option you posted:

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
This is because you have to avoid full table scans/joins.

So first you get the 1-2 distinct ids:

select distinct scan_file_id from resume_points
After that, only those 1-2 rows are joined against the other table, rather than all 10K, which gives the performance optimization.

If you need this statement multiple times, I would put it into a view. A view does not change the performance, but it looks cleaner and is easier to read.
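For example (a sketch using Python's sqlite3 module; the view name dirty_resume_files and the sample rows are my own invention):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);

-- Wrap the fast query in a view for readability; performance is unchanged.
CREATE VIEW dirty_resume_files AS
    SELECT files.*
    FROM (SELECT DISTINCT scan_file_id FROM resume_points) d
    JOIN files ON d.scan_file_id = files.id AND files.dirty = 1;

INSERT INTO files (id, dirty) VALUES (1, 1), (2, 0), (3, 1);
INSERT INTO resume_points (scan_file_id) VALUES (1), (1), (2);
""")

rows = con.execute("SELECT id FROM dirty_resume_files ORDER BY id").fetchall()
print(rows)  # [(1,)] -- file 1 is dirty and referenced; 2 is clean; 3 is unreferenced
```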


Also have a look at the query-optimization documentation:

That depends on your data and hardware. You have to measure it yourself. In the last query you left out "and files.dirty = 1".