Performance: using SAMPLE in ClickHouse appears to read all rows and more bytes. Is this expected, or is it due to a suboptimal table definition?


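I'd like to trade accuracy for speed by using SAMPLE. While some queries do run faster, the query log shows that all rows are being read, and that more bytes are read than when SAMPLE is not used. I expected SAMPLE to result in fewer rows being read.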

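My questions are:
  • Should using SAMPLE result in more rows and bytes being read?

  • Is the increased reading due to a suboptimal table definition that can be corrected?

I'm using ClickHouse version 20.5.3 revision 54435.

Table definition:

Query without SAMPLE:

Query with SAMPLE:
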
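Short answer: yes, because CH needs to read one more column, sample_hash.

Long answer: sampling is hard. It is useful if you have 100 billion rows per day and 400 servers. It helps a lot with GROUP BYs. It does not help with filtering, because in your case it does not work together with the primary index. Yandex designed sampling for themselves. They have enabled forced use of partition keys / primary keys (force_index_by_date / force_primary_key). So queries like yours are impossible in their system, and therefore sampling helps them even with disk reads.

That is why I don't use sampling in my systems.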
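As a rough illustration of the forced-index setup mentioned above, the two settings can be switched on per session. This is only a sketch: it assumes the synthetic table_one defined later in this answer, and the timestamp bounds are arbitrary values picked to fall inside that table's generated data.

```sql
-- Sketch only: session-level settings; table_one is the synthetic table from this answer.
SET force_primary_key = 1;    -- refuse queries that cannot use the primary key
SET force_index_by_date = 1;  -- refuse queries that cannot use the date/partition index

-- No condition on any key column: with the settings above this query is
-- expected to fail with an exception instead of scanning every part.
SELECT avg(value)
FROM table_one SAMPLE 0.1;

-- A condition on the leading key column (banner_id) and on the column behind
-- the partition expression (timestamp), so both checks should pass and
-- sampling runs on top of an index-restricted range.
SELECT avg(value)
FROM table_one SAMPLE 0.1
WHERE banner_id = 42
  AND timestamp >= 1602809234
  AND timestamp <  1602812834;
```

With both settings on, a full-scan query with SAMPLE never gets as far as reading data, which is the situation described for Yandex above.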

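But

    ORDER BY (datestamp, timestamp, sample_hash)

Also, such an ORDER BY is not useful at all. This whole table is designed improperly. There is no point in putting datestamp in the index prefix, because the table is partitioned by datestamp, so each partition contains only one datestamp value.

timestamp in the index prefix is an even bigger problem, because it is very unwise to put a high-cardinality column at the beginning of a primary index.

So I can create a synthetic example that shows how sampling works. But does it make any sense?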

    CREATE TABLE table_one
    ( timestamp UInt64,
      transaction_id UInt64,
      banner_id UInt16,
      value UInt32
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(toDateTime(timestamp))
    ORDER BY (banner_id, toStartOfHour(toDateTime(timestamp)),  cityHash64(transaction_id))
    SAMPLE BY cityHash64(transaction_id)
    SETTINGS index_granularity = 8192
    
    
    insert into table_one select 1602809234+intDiv(number,100000), number, number%991, toUInt32(rand())
    from numbers(10000000000);
    
    
    
    select banner_id, sum(value), count(value), max(value)
    from table_one 
    group by banner_id format Null;
    
    0 rows in set. Elapsed: 11.490 sec. Processed 10.00 billion rows, 60.00 GB (870.30 million rows/s., 5.22 GB/s.)
    
    select banner_id, sum(value), count(value), max(value)
    from table_one SAMPLE 0.01
    group by banner_id format Null;
    
    0 rows in set. Elapsed: 1.316 sec. Processed 452.67 million rows, 6.34 GB (343.85 million rows/s., 4.81 GB/s.)
    
    
    
    select banner_id, sum(value), count(value), max(value)
    from table_one 
    WHERE banner_id = 42
    group by banner_id format Null;
    
    0 rows in set. Elapsed: 0.020 sec. Processed 10.30 million rows, 61.78 MB (514.37 million rows/s., 3.09 GB/s.)
    
    select banner_id, sum(value), count(value), max(value)
    from table_one SAMPLE 0.01
    WHERE banner_id = 42
    group by banner_id format Null;
    
    0 rows in set. Elapsed: 0.008 sec. Processed 696.32 thousand rows, 9.75 MB (92.49 million rows/s., 1.29 GB/s.)
    
    
    
    
    select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
    from table_one 
    group by banner_id, hr format Null;
    0 rows in set. Elapsed: 36.660 sec. Processed 10.00 billion rows, 140.00 GB (272.77 million rows/s., 3.82 GB/s.)
    
    select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
    from table_one SAMPLE 0.01
    group by banner_id, hr format Null;
    0 rows in set. Elapsed: 3.741 sec. Processed 452.67 million rows, 9.96 GB (121.00 million rows/s., 2.66 GB/s.)
    
    
    
    
    select count()
    from table_one 
    where value = 666 format Null;
    1 rows in set. Elapsed: 6.056 sec. Processed 10.00 billion rows, 40.00 GB (1.65 billion rows/s., 6.61 GB/s.)
    
    select count()
    from table_one  SAMPLE 0.01
    where value = 666 format Null;
    1 rows in set. Elapsed: 1.214 sec. Processed 452.67 million rows, 5.43 GB (372.88 million rows/s., 4.47 GB/s.)
    
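The hard part: here is an example of how a high-cardinality column in the primary index affects sampling. The same table and the same data, but instead of

    ORDER BY (banner_id, toStartOfHour(toDateTime(timestamp)), cityHash64(transaction_id))

I used

    ORDER BY (banner_id, timestamp, cityHash64(transaction_id))

The high-cardinality column timestamp makes it impossible to use a range search in the index for cityHash64(transaction_id). CH reads every mark for the 0.01 piece. This is expected behavior, and it is the same for any database or any sorted list.

Now CH reads all rows both with 0.01 sampling and without sampling.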
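To see why the two sort keys behave so differently, it may help to compare how many distinct values each candidate key column has for a single banner. This is a rough sketch against the synthetic table_one from this answer; the column aliases are made up for illustration.

```sql
-- Sketch only: compare the cardinality of the raw timestamp with the
-- hour-truncated timestamp for one banner_id in the synthetic table.
SELECT
    uniqExact(timestamp)                            AS distinct_seconds,
    uniqExact(toStartOfHour(toDateTime(timestamp))) AS distinct_hours
FROM table_one
WHERE banner_id = 42;
```

Given the INSERT used here, timestamp takes 100,000 distinct values (10 billion rows, 100,000 rows per value), which leaves only about 100 rows per (banner_id, timestamp) pair, far fewer than one 8192-row granule. No granule then holds a contiguous run of hashes, so every mark has to be read. Truncating to the hour leaves only a few dozen buckets, so the rows inside each bucket are sorted by cityHash64(transaction_id) across many granules and a narrow hash range can be located through the index.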

See the related discussion.
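    -- Query from the question, without SAMPLE (figures from the query log)
    SELECT
      avg(value)
    FROM default.table_one;
    
    query_duration_ms: 166
    rows_read: 100,000,000
    read_bytes: 800,000,000
    
    -- Query from the question, with SAMPLE (figures from the query log)
    SELECT
      avg(value)
    FROM default.table_one
    SAMPLE 0.1;
    
    query_duration_ms: 358
    rows_read: 100,000,000
    read_bytes: 1,600,000,000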
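Figures like the duration, rows read and bytes read quoted from the query log can be pulled with a query along these lines. This is a minimal sketch, assuming query logging is enabled on the server; the LIKE filters are just one way to find the relevant entries.

```sql
-- Sketch only: make sure recent entries are written to the log table first.
SYSTEM FLUSH LOGS;

-- Latest finished queries that touched table_one, with their read statistics.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    read_bytes,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%table_one%'
  AND query NOT LIKE '%system.query_log%'
ORDER BY event_time DESC
LIMIT 10;
```

Comparing read_rows and read_bytes for the same query with and without SAMPLE shows directly whether sampling reduced the amount of data read, or only added the sampling column on top of a full scan.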
    
    -- Same table and data, but with the raw timestamp in ORDER BY
    CREATE TABLE table_one
    ( timestamp UInt64,
      transaction_id UInt64,
      banner_id UInt16,
      value UInt32
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(toDateTime(timestamp))
    ORDER BY (banner_id, timestamp, cityHash64(transaction_id))
    SAMPLE BY cityHash64(transaction_id)
    SETTINGS index_granularity = 8192
    
    insert into table_one select 1602809234+intDiv(number,100000), number, number%991, toUInt32(rand())
    from numbers(10000000000);
    
    
    
    select banner_id, sum(value), count(value), max(value)
    from table_one 
    group by banner_id format Null;
    0 rows in set. Elapsed: 11.196 sec. Processed 10.00 billion rows, 60.00 GB (893.15 million rows/s., 5.36 GB/s.)
    
    select banner_id, sum(value), count(value), max(value)
    from table_one SAMPLE 0.01
    group by banner_id format Null;
    0 rows in set. Elapsed: 24.378 sec. Processed 10.00 billion rows, 140.00 GB (410.21 million rows/s., 5.74 GB/s.)
    
    
    
    select banner_id, sum(value), count(value), max(value)
    from table_one 
    WHERE banner_id = 42
    group by banner_id format Null;
    0 rows in set. Elapsed: 0.022 sec. Processed 10.27 million rows, 61.64 MB (459.28 million rows/s., 2.76 GB/s.)
    
    select banner_id, sum(value), count(value), max(value)
    from table_one SAMPLE 0.01
    WHERE banner_id = 42
    group by banner_id format Null;
    0 rows in set. Elapsed: 0.037 sec. Processed 10.27 million rows, 143.82 MB (275.16 million rows/s., 3.85 GB/s.)
    
    
    
    select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
    from table_one 
    group by banner_id, hr format Null;
    0 rows in set. Elapsed: 21.663 sec. Processed 10.00 billion rows, 140.00 GB (461.62 million rows/s., 6.46 GB/s.)
    
    
    select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
    from table_one SAMPLE 0.01
    group by banner_id, hr format Null;
    0 rows in set. Elapsed: 26.697 sec. Processed 10.00 billion rows, 220.00 GB (374.57 million rows/s., 8.24 GB/s.)
    
    
    
    select count()
    from table_one 
    where value = 666 format Null;
    0 rows in set. Elapsed: 7.679 sec. Processed 10.00 billion rows, 40.00 GB (1.30 billion rows/s., 5.21 GB/s.)
    
    select count()
    from table_one  SAMPLE 0.01
    where value = 666 format Null;
    0 rows in set. Elapsed: 21.668 sec. Processed 10.00 billion rows, 120.00 GB (461.51 million rows/s., 5.54 GB/s.)