Performance: using SAMPLE in ClickHouse seems to read all rows and more bytes. Is this expected, or is it due to a suboptimal table definition?
I hoped to trade precision for speed by using SAMPLE. While some queries do run faster, the query log shows that all rows are still being read, and that more bytes are read than without SAMPLE. I assumed sampling would lead to fewer rows being read.

My questions are:

Does using SAMPLE cause more rows and bytes to be read?
Is the extra reading due to a suboptimal table definition that can be corrected?

I am using ClickHouse version 20.5.3 revision 54435. My query log, without and with SAMPLE, is shown further below.
Short answer: yes, this is expected. CH needs to read one extra column, sample_hash.

Long answer: sampling is hard to use well. It is very useful if you have 100 billion rows a day and 400 servers. It helps with GROUP BYs. It does not help with filtering, because in your case it cannot work together with the primary index. Yandex designed sampling for themselves. They have forced use of the partition key and primary key enabled (force_index_by_date / force_primary_key), so queries like yours are impossible in their system, and that is why sampling helps them even at the level of disk reads. This is why I do not use sampling in my systems.

But:

ORDER BY (datestamp, timestamp, sample_hash)

An ORDER BY like this is of no use at all. The whole table is badly designed.
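The mechanics behind the short answer can be sketched outside ClickHouse: SAMPLE f is, conceptually, a filter that keeps rows whose sampling-expression hash falls into the first f-th of the 64-bit hash space, which means the hash must be computed or read for every candidate row. The sketch below is not ClickHouse code; blake2b stands in for cityHash64 and a plain list of dicts stands in for a table.

```python
import hashlib

U64_MAX = 2**64 - 1

def h64(x: int) -> int:
    # Stand-in for cityHash64: any uniform 64-bit hash works for this sketch.
    return int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")

def sample(rows, fraction):
    # SAMPLE f keeps rows whose hash lands in the first f-th of the hash space.
    cutoff = int(fraction * U64_MAX)
    return [r for r in rows if h64(r["transaction_id"]) < cutoff]

rows = [{"transaction_id": i, "value": i % 7} for i in range(100_000)]
picked = sample(rows, 0.01)

# The kept fraction is close to 1%, but selecting it required hashing every
# row -- the analogue of CH reading the sampling column for every row.
print(len(picked) / len(rows))
```

Note the sample is deterministic: the same rows are kept on every run, which is what makes SAMPLE repeatable across queries.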
Putting datestamp at the front of the index makes no sense, because the table is partitioned by datestamp, so each partition holds only a single datestamp value.

The timestamp in the index prefix is a bigger problem, because placing a high-cardinality column at the start of the primary index is very unwise.
So I can build a synthetic example that shows how sampling works. But does it make any sense for your table?
CREATE TABLE table_one
( timestamp UInt64,
transaction_id UInt64,
banner_id UInt16,
value UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(toDateTime(timestamp))
ORDER BY (banner_id, toStartOfHour(toDateTime(timestamp)), cityHash64(transaction_id))
SAMPLE BY cityHash64(transaction_id)
SETTINGS index_granularity = 8192;
insert into table_one select 1602809234+intDiv(number,100000), number, number%991, toUInt32(rand())
from numbers(10000000000);
select banner_id, sum(value), count(value), max(value)
from table_one
group by banner_id format Null;
0 rows in set. Elapsed: 11.490 sec. Processed 10.00 billion rows, 60.00 GB (870.30 million rows/s., 5.22 GB/s.)
select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id format Null;
0 rows in set. Elapsed: 1.316 sec. Processed 452.67 million rows, 6.34 GB (343.85 million rows/s., 4.81 GB/s.)
select banner_id, sum(value), count(value), max(value)
from table_one
WHERE banner_id = 42
group by banner_id format Null;
0 rows in set. Elapsed: 0.020 sec. Processed 10.30 million rows, 61.78 MB (514.37 million rows/s., 3.09 GB/s.)
select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
WHERE banner_id = 42
group by banner_id format Null;
0 rows in set. Elapsed: 0.008 sec. Processed 696.32 thousand rows, 9.75 MB (92.49 million rows/s., 1.29 GB/s.)
select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one
group by banner_id, hr format Null;
0 rows in set. Elapsed: 36.660 sec. Processed 10.00 billion rows, 140.00 GB (272.77 million rows/s., 3.82 GB/s.)
select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id, hr format Null;
0 rows in set. Elapsed: 3.741 sec. Processed 452.67 million rows, 9.96 GB (121.00 million rows/s., 2.66 GB/s.)
select count()
from table_one
where value = 666 format Null;
1 rows in set. Elapsed: 6.056 sec. Processed 10.00 billion rows, 40.00 GB (1.65 billion rows/s., 6.61 GB/s.)
select count()
from table_one SAMPLE 0.01
where value = 666 format Null;
1 rows in set. Elapsed: 1.214 sec. Processed 452.67 million rows, 5.43 GB (372.88 million rows/s., 4.47 GB/s.)
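A side note on why SAMPLE 0.01 reads 452.67 million rows rather than the 100 million you might expect from 1% of 10 billion: reads happen in whole granules (index_granularity = 8192), and the 1% hash range has to be read separately under each distinct (banner_id, hour) key prefix. The back-of-envelope arithmetic below is my own estimate, not from the answer; the factor of two granules per prefix is an assumption.

```python
# Data layout implied by the INSERT: 10 billion rows, 100,000 rows per second,
# so the data spans 100,000 s, i.e. roughly 28 distinct hours; banner_id takes
# 991 distinct values.
rows = 10_000_000_000
banners = 991
hours = 100_000 // 3600 + 1           # ~28 distinct toStartOfHour values
prefixes = banners * hours            # distinct (banner_id, hour) key prefixes

rows_per_prefix = rows / prefixes     # ~360k rows under each prefix
wanted = rows_per_prefix * 0.01       # ~3.6k rows actually in the sample

# Reads round up to whole granules of 8192 rows; assume the 1% hash range
# per prefix typically touches about two granules.
granule = 8192
read_estimate = prefixes * 2 * granule
print(f"~{read_estimate/1e6:.0f}M rows read to sample ~{wanted * prefixes/1e6:.0f}M rows")
```

The estimate lands in the mid-400-millions, the same order as the 452.67 million rows in the query log above.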
The hard part:

Here is an example of how a high-cardinality column in the primary index hurts. Same table, same data, but instead of

ORDER BY (banner_id, toStartOfHour(toDateTime(timestamp)), cityHash64(transaction_id))

I used

ORDER BY (banner_id, timestamp, cityHash64(transaction_id))

The high-cardinality timestamp column makes it impossible to use a range search over cityHash64(transaction_id) in the index, so CH has to read every index mark to find the 0.01 slice. This is expected behavior, and the same would happen in any database or with any sorted list. As a result, CH reads all rows both with sampling and without it. See the query logs and results below.
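This pruning failure can be simulated with a toy model of a MergeTree-style sparse index (this is illustrative Python, not ClickHouse internals): sort (prefix, hash) keys, cut them into fixed-size granules, and count how many granules the index can skip for a hash-range predicate. With a coarse prefix like toStartOfHour, whole hash ranges can be pruned; with a raw per-second timestamp prefix, no granule is skippable.

```python
import random

random.seed(1)
GRANULE = 100
N = 100_000
# One row per second, each with a uniform 64-bit "cityHash64" stand-in.
rows = [(ts, random.getrandbits(64)) for ts in range(N)]

def granules_read(keys, cutoff):
    # A sparse index only sees the boundary marks of each granule. A granule
    # is provably free of rows matching `hash < cutoff` only when both ends
    # share the same prefix value and its smallest hash is already >= cutoff.
    keys = sorted(keys)
    read = 0
    for i in range(0, len(keys), GRANULE):
        first = keys[i]
        last = keys[min(i + GRANULE, len(keys)) - 1]
        if first[0] == last[0] and first[1] >= cutoff:
            continue  # skippable: entire granule is above the hash range
        read += 1
    return read

cutoff = int(0.01 * 2**64)                    # SAMPLE 0.01 as a hash range
coarse = [(ts // 3600, h) for ts, h in rows]  # prefix like toStartOfHour(...)
fine = [(ts, h) for ts, h in rows]            # raw timestamp prefix

coarse_read = granules_read(coarse, cutoff)
fine_read = granules_read(fine, cutoff)
print(coarse_read, fine_read, N // GRANULE)
```

With the coarse prefix only a handful of granules per hour are read; with the per-second prefix every granule's key range straddles a prefix change, so all of them must be read, which mirrors the full scans in the results below.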
SELECT avg(value) FROM default.table_one;
query_duration_ms: 166
rows_read: 100,000,000
read_bytes: 800,000,000

SELECT avg(value) FROM default.table_one SAMPLE 0.1;
query_duration_ms: 358
rows_read: 100,000,000
read_bytes: 1,600,000,000
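The query log above pins the short answer down numerically: both queries touch all 100 million rows, and the sampled one reads exactly one extra 8-byte column per row, consistent with CH having to read the column behind the sampling expression. The per-row arithmetic (taking the logged byte counts at face value; the column widths are inferred, not stated in the log):

```python
rows = 100_000_000
plain_bytes = 800_000_000      # avg(value) without SAMPLE
sampled_bytes = 1_600_000_000  # avg(value) with SAMPLE 0.1

assert plain_bytes // rows == 8       # just the value column (8 bytes/row here)
assert sampled_bytes // rows == 16    # value plus one extra 8-byte column

extra = (sampled_bytes - plain_bytes) // rows
print(f"extra bytes per row with SAMPLE: {extra}")  # 8 = one UInt64-sized column
```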
CREATE TABLE table_one
( timestamp UInt64,
transaction_id UInt64,
banner_id UInt16,
value UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(toDateTime(timestamp))
ORDER BY (banner_id, timestamp, cityHash64(transaction_id))
SAMPLE BY cityHash64(transaction_id)
SETTINGS index_granularity = 8192;
insert into table_one select 1602809234+intDiv(number,100000), number, number%991, toUInt32(rand())
from numbers(10000000000);
select banner_id, sum(value), count(value), max(value)
from table_one
group by banner_id format Null;
0 rows in set. Elapsed: 11.196 sec. Processed 10.00 billion rows, 60.00 GB (893.15 million rows/s., 5.36 GB/s.)
select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id format Null;
0 rows in set. Elapsed: 24.378 sec. Processed 10.00 billion rows, 140.00 GB (410.21 million rows/s., 5.74 GB/s.)
select banner_id, sum(value), count(value), max(value)
from table_one
WHERE banner_id = 42
group by banner_id format Null;
0 rows in set. Elapsed: 0.022 sec. Processed 10.27 million rows, 61.64 MB (459.28 million rows/s., 2.76 GB/s.)
select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
WHERE banner_id = 42
group by banner_id format Null;
0 rows in set. Elapsed: 0.037 sec. Processed 10.27 million rows, 143.82 MB (275.16 million rows/s., 3.85 GB/s.)
select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one
group by banner_id, hr format Null;
0 rows in set. Elapsed: 21.663 sec. Processed 10.00 billion rows, 140.00 GB (461.62 million rows/s., 6.46 GB/s.)
select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id, hr format Null;
0 rows in set. Elapsed: 26.697 sec. Processed 10.00 billion rows, 220.00 GB (374.57 million rows/s., 8.24 GB/s.)
select count()
from table_one
where value = 666 format Null;
0 rows in set. Elapsed: 7.679 sec. Processed 10.00 billion rows, 40.00 GB (1.30 billion rows/s., 5.21 GB/s.)
select count()
from table_one SAMPLE 0.01
where value = 666 format Null;
0 rows in set. Elapsed: 21.668 sec. Processed 10.00 billion rows, 120.00 GB (461.51 million rows/s., 5.54 GB/s.)
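The last pair of runs shows the cost in bytes as well as rows. Without SAMPLE, count() ... WHERE value = 666 reads only the 4-byte value column (40 GB over 10 billion rows). With SAMPLE 0.01 it still scans every row and must additionally read the 8-byte transaction_id column to evaluate cityHash64(transaction_id), giving 120 GB. A quick check that the logged byte counts match that column arithmetic:

```python
rows = 10_000_000_000

no_sample_gb = rows * 4 / 1e9          # value is UInt32: 4 bytes per row
with_sample_gb = rows * (4 + 8) / 1e9  # plus transaction_id (UInt64) for the hash

print(no_sample_gb, with_sample_gb)    # matches the 40.00 GB and 120.00 GB logged
```

So with the high-cardinality ORDER BY, SAMPLE not only fails to cut the row count; it triples the bytes read.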