Cassandra: Finding the partition key


We are currently testing Cassandra with the following table schema:

CREATE TABLE coreglead_v2.stats_by_site_user (
    d_tally text, -- ex.: '2016-01', '2016-02', etc..
    site_id int,
    d_date timestamp,
    site_user_id int,
    accepted counter,
    error counter,
    impressions_negative counter,
    impressions_positive counter,
    rejected counter,
    revenue counter,
    reversals_rejected counter,
    reversals_revenue counter,
    PRIMARY KEY (d_tally, site_id, d_date, site_user_id)
) WITH CLUSTERING ORDER BY (site_id ASC, d_date ASC, site_user_id ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
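For testing purposes, we wrote a Python script that distributes data randomly across the 2016 calendar year (12 months in total). We expect the partition key to be the d_tally column, and we therefore expect the number of keys to be 12 (from '2016-01' to '2016-12').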
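The original test script is not shown, so the following is only a minimal sketch of how such data might be generated, assuming d_date is a random day in 2016 and d_tally is derived from it as 'YYYY-MM'. The function names are hypothetical. It shows why 12 distinct partition keys are expected.

```python
import random
from datetime import datetime, timedelta, timezone


def random_2016_timestamp(rng):
    """Return midnight UTC of a uniformly random day in 2016 (assumed granularity)."""
    start = datetime(2016, 1, 1, tzinfo=timezone.utc)
    # 2016 is a leap year, so there are 366 possible day offsets (0..365).
    return start + timedelta(days=rng.randrange(366))


def partition_key_for(ts):
    """Derive the d_tally partition key ('YYYY-MM') from a d_date value."""
    return ts.strftime("%Y-%m")


def distinct_partition_keys(n_rows, seed=0):
    """Generate n_rows random dates and collect the distinct d_tally values."""
    rng = random.Random(seed)
    return {partition_key_for(random_2016_timestamp(rng)) for _ in range(n_rows)}


if __name__ == "__main__":
    keys = distinct_partition_keys(10_000)
    print(sorted(keys))
    print(len(keys))
```

With any reasonably large number of rows, every month of 2016 is hit, so the table should hold exactly 12 partitions regardless of how many rows are written.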

Running nodetool cfstats shows us the following:

Table: stats_by_site_user
        SSTable count: 4
        Space used (live): 131977793
        Space used (total): 131977793
        Space used by snapshots (total): 0
        Off heap memory used (total): 89116
        SSTable Compression Ratio: 0.18667406304929424
        Number of keys (estimate): 24
        Memtable cell count: 120353
        Memtable data size: 23228804
        Memtable off heap memory used: 0
        Memtable switch count: 10
        Local read count: 169
        Local read latency: 1.938 ms
        Local write count: 4912464
        Local write latency: 0.066 ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 128
        Bloom filter off heap memory used: 96
        Index summary off heap memory used: 76
        Compression metadata off heap memory used: 88944
        Compacted partition minimum bytes: 5839589
        Compacted partition maximum bytes: 43388628
        Compacted partition mean bytes: 16102786
        Average live cells per slice (last five minutes): 102.91627247589237
        Maximum live cells per slice (last five minutes): 103
        Average tombstones per slice (last five minutes): 1.0
        Maximum tombstones per slice (last five minutes): 1
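What confuses us is the "Number of keys (estimate): 24" line. Given our schema, and assuming our test data (over 5 million writes) consists only of 2016 data, where does the estimate of 24 keys come from?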

Here is a sample of our data:

d_tally | site_id | d_date                   | site_user_id | accepted | error | impressions_negative | impressions_positive | rejected | revenue | reversals_rejected | reversals_revenue
---------+---------+--------------------------+--------------+----------+-------+----------------------+----------------------+----------+---------+--------------------+-------------------
 2016-01 |       1 | 2016-01-01 00:00:00+0000 |       240054 |        1 |  null |                 null |                    1 |     null |     553 |               null |              null
 2016-01 |       1 | 2016-01-01 00:00:00+0000 |      1263968 |        1 |  null |                 null |                    1 |     null |    1093 |               null |              null
 2016-01 |       1 | 2016-01-01 00:00:00+0000 |      1267841 |        1 |  null |                 null |                    1 |     null |     861 |               null |              null
 2016-01 |       1 | 2016-01-01 00:00:00+0000 |      1728725 |        1 |  null |                 null |                    1 |     null |     425 |               null |              null

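The number of keys is an estimate (although it should be very close). Cassandra takes a sketch of the data from each SSTable and merges the sketches to estimate the cardinality.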

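Unfortunately, there is no equivalent sketch for the memtable, so the memtable's cardinality is simply added to the SSTable estimate. This means that partitions present in both the memtables and the SSTables are counted twice. That is why you see 24 rather than 12.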
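The arithmetic described above can be illustrated with a small simulation. This is a sketch, not Cassandra's actual code: exact Python sets stand in for the per-SSTable cardinality sketches, and the function name is hypothetical.

```python
def estimate_number_of_keys(sstable_key_sets, memtable_keys):
    """Mimic the described estimate: merged SSTable cardinality plus memtable count."""
    # Merging the per-SSTable sketches behaves like a set union, so a
    # partition key that appears in several SSTables is counted only once.
    merged_sstable_keys = set().union(*sstable_key_sets)
    # The memtable has no mergeable sketch, so its key count is added on top,
    # even when those keys already exist in the SSTables.
    return len(merged_sstable_keys) + len(memtable_keys)


months = [f"2016-{m:02d}" for m in range(1, 13)]
sstables = [set(months) for _ in range(4)]  # 4 SSTables, as in the cfstats output
memtable = set(months)                      # the same 12 partitions are still being written

print(estimate_number_of_keys(sstables, memtable))  # 24
print(estimate_number_of_keys(sstables, set()))     # 12
```

If this explanation applies, the estimate should drop back to roughly 12 once the memtable is empty, for example after running nodetool flush on the table. The real partition keys can also be listed with SELECT DISTINCT d_tally FROM coreglead_v2.stats_by_site_user;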