PostgreSQL: is GROUP BY on integer columns faster than GROUP BY on character columns?
I have four tables:
create table web_content_3 ( content integer, hits bigint, bytes bigint, appid varchar(32) );
create table web_content_4 ( content character varying (128 ), hits bigint, bytes bigint, appid varchar(32) );
create table web_content_5 ( content character varying (128 ), hits bigint, bytes bigint, appid integer );
create table web_content_6 ( content integer, hits bigint, bytes bigint, appid integer );
I run the same GROUP BY query against about 2 million records in each table, i.e.
SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid from web_content_{3,4,5,6} GROUP BY content,appid;
The results are:
Table Name    | content   | appid     | Time Taken [in ms]
===========================================================
web_content_3 | integer   | character | 27277.931
web_content_4 | character | character | 151219.388
web_content_5 | character | integer   | 127252.023
web_content_6 | integer   | integer   | 5412.096
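Here the web_content_6 query takes only about 5 seconds, far less than the other three combinations. From these statistics we can say that the integer, integer combination is much faster for GROUP BY, but the question is: why?
I have the EXPLAIN results as well, but they don't explain the drastic difference between the web_content_4 and web_content_6 queries.
Here they are: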
test=# EXPLAIN ANALYSE SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid from web_content_4 GROUP BY content,appid;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=482173.36..507552.31 rows=17680 width=63) (actual time=138099.612..151565.655 rows=17680 loops=1)
  -> Sort (cost=482173.36..487196.11 rows=2009100 width=63) (actual time=138099.202..149256.707 rows=2009100 loops=1)
       Sort Key: content, appid
       Sort Method: external merge Disk: 152488kB
       -> Seq Scan on web_content_4 (cost=0.00..45218.00 rows=2009100 width=63) (actual time=0.010..349.144 rows=2009100 loops=1)
Total runtime: 151613.569 ms
(6 rows)
Time: 151614.106 ms
test=# EXPLAIN ANALYSE SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid from web_content_6 GROUP BY content,appid;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=368814.36..394194.51 rows=17760 width=24) (actual time=3282.333..5840.953 rows=17760 loops=1)
  -> Sort (cost=368814.36..373837.11 rows=2009100 width=24) (actual time=3282.176..3946.025 rows=2009100 loops=1)
       Sort Key: content, appid
       Sort Method: external merge Disk: 74632kB
       -> Seq Scan on web_content_6 (cost=0.00..34864.00 rows=2009100 width=24) (actual time=0.011..297.235 rows=2009100 loops=1)
Total runtime: 6172.960 ms
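The performance of this aggregation is going to be driven by the speed of the sort. All other things being equal, larger data takes more time to sort than smaller data. The "fast" case is sorting 74 Mbytes; the "slow" case, 152 Mbytes.
That would account for some of the difference in performance, but in most cases not for a 30x difference. The one case where you would see a huge difference is when the smaller data fits into memory and the larger data does not. Spilling to disk is expensive.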
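Whether a sort fit in memory or spilled to disk can be read directly from EXPLAIN ANALYZE output. The sketch below assumes a psql session against the question's database. It prints the current per-operation sort memory limit (`work_mem`) and then re-runs the query so the `Sort Method` line can be inspected. In the two plans quoted in the question, both sorts report `external merge`, so both of them spilled to disk.

```sql
-- Show how much memory a single sort or hash operation may use
-- before PostgreSQL spills to temporary files on disk.
SHOW work_mem;

-- Re-run the aggregation and read the "Sort Method" line of the plan:
--   "Sort Method: external merge  Disk: ...kB"  -> the sort spilled to disk
--   "Sort Method: quicksort  Memory: ...kB"     -> the sort ran in memory
EXPLAIN ANALYZE
SELECT content, sum(hits) AS hits, sum(bytes) AS bytes, appid
FROM web_content_6
GROUP BY content, appid;
```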
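One suspicion is that the data is already sorted, or nearly sorted, on web_content_6(content, appid). That would shorten the time needed for the sort. If you compare the actual times with the "costs" of the two sorts, you will see that the "fast" version runs much faster than expected (assuming the costs are comparable).
Gordon Linoff is right, of course. Spilling to disk is expensive.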
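If you can spare the memory, you can tell PostgreSQL to use more memory for sorting and similar operations. I built a table, populated it with random data, and analyzed it before running this query.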
EXPLAIN ANALYSE
SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid
from web_content_4
GROUP BY content,appid;
"GroupAggregate (cost=364323.43..398360.86 rows=903791 width=96) (actual time=25059.086..29789.234 rows=1998067 loops=1)"
" -> Sort (cost=364323.43..369323.34 rows=1999961 width=96) (actual time=25057.540..27907.143 rows=2000000 loops=1)"
" Sort Key: content, appid"
" Sort Method: external merge Disk: 216016kB"
" -> Seq Scan on web_content_4 (cost=0.00..52472.61 rows=1999961 width=96) (actual time=0.010..475.187 rows=2000000 loops=1)"
"Total runtime: 30012.427 ms"
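The answer does not show how its test table was built. The following is only a sketch of one way to get a comparable data set, not the author's actual script. The table name `web_content_test`, the value ranges, and the use of `md5(random()::text)` for the character columns are all assumptions made for illustration.

```sql
-- Hypothetical test table with the same column layout as web_content_4.
CREATE TABLE web_content_test (
    content character varying(128),
    hits    bigint,
    bytes   bigint,
    appid   varchar(32)
);

-- Fill it with 2 million rows of random data.
-- md5() returns a 32-character hex string, which fits both character columns.
INSERT INTO web_content_test (content, hits, bytes, appid)
SELECT md5(random()::text),
       (random() * 1000)::bigint,
       (random() * 1000000)::bigint,
       md5(random()::text)
FROM generate_series(1, 2000000);

-- Refresh planner statistics before timing any query.
ANALYZE web_content_test;
```

With fully random strings nearly every row forms its own group, so timings from such a table show the cost of the sort rather than reproducing the question's real data distribution.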
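I got the same execution plan you did. In my case, this query does an external merge sort that needs about 216MB of disk. I can tell PostgreSQL to allow more memory for this query by setting the value of work_mem. (Setting work_mem this way affects only my current connection.)
set work_mem = '250MB';
EXPLAIN ANALYSE
SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid
from web_content_4
GROUP BY content,appid;
"HashAggregate (cost=72472.22..81510.13 rows=903791 width=96) (actual time=3196.777..4505.290 rows=1998067 loops=1)"
" -> Seq Scan on web_content_4 (cost=0.00..52472.61 rows=1999961 width=96) (actual time=0.019..437.252 rows=2000000 loops=1)"
"Total runtime: 4726.401 ms"
Now PostgreSQL uses a hash aggregate, and execution time dropped by a factor of 6, from 30 seconds to 5 seconds.
I didn't test web_content_6, because replacing text with integers will usually require a couple of joins to recover the text. So I'm not sure we'd be comparing apples to apples there.
Because of comparison. Comparing integers is faster than comparing strings. With strings, the comparison may be done character by character, so the sort takes more time as well. You can see that in the explain plan, too.
Are there any indexes on these tables?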
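The work_mem change used in the answer applies to the whole session. It can also be limited to a single transaction, or undone explicitly. The sketch below reuses the 250MB figure from the answer; a suitable value depends on how much memory the server can spare, since each sort or hash operation in each connection may use up to that amount.

```sql
-- Option 1: scope the change to one transaction with SET LOCAL.
BEGIN;
SET LOCAL work_mem = '250MB';

EXPLAIN ANALYZE
SELECT content, sum(hits) AS hits, sum(bytes) AS bytes, appid
FROM web_content_4
GROUP BY content, appid;

COMMIT;          -- the SET LOCAL value is discarded at transaction end
SHOW work_mem;   -- reports the previous session value again

-- Option 2: change it for the session, then restore the default afterwards.
SET work_mem = '250MB';
-- ... run the query here ...
RESET work_mem;
```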