Cassandra: list the 10 most recently modified records


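I'm having trouble modeling my data so that I can efficiently query Cassandra for the 10 (actually, any number of) most recently modified records. Each record has a last_modified_date column that the application sets whenever it inserts or updates the record.

I have left the data columns out of this sample code.

Main data table (contains only one row per record):

Solution 1 (FAIL)

I tried creating a separate table that relies on clustering key ordering.

Table (one row per record; only the last modified date is inserted):

Query: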

SELECT * FROM record_by_last_modified_index LIMIT 10

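This solution does not work, because clustering order only applies to the ordering of records that share the same partition key. Since every row has a different partition key (the record id), the query results do not include the expected records.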

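Solution 2 (INEFFICIENT)

Another solution I tried is to simply query Cassandra for every record id and last modified date, sort them, and pick the first 10 records in my application. This is clearly inefficient and won't scale well.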
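To make the cost of this approach concrete, here is a minimal Python sketch of the client-side step. The `rows` list is made-up data standing in for the result of a full table scan, and `most_recent` is a hypothetical helper, not part of any driver. Even with a heap-based selection, every row still has to be read from the cluster first.

```python
import heapq
from datetime import datetime, timedelta


def most_recent(rows, n=10):
    """Return the n rows with the largest last_modified_date, newest first."""
    return heapq.nlargest(n, rows, key=lambda row: row[1])


# Made-up stand-in for a full table scan: (record_id, last_modified_date).
base = datetime(2015, 8, 14, 19, 0, 0)
rows = [(record_id, base + timedelta(seconds=(record_id * 37) % 1000))
        for record_id in range(1000)]

top10 = most_recent(rows)
```

The selection itself is cheap; the problem is that the size of `rows` grows with the table.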

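Solution 3

The last solution I considered is to use the same partition key for all records and rely on clustering order to keep the records sorted correctly. The problem with this solution is that the data will not be distributed across the nodes properly, since all of the records share one partition key. This seems like a non-starter to me.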

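I think what you are trying to do is more of a relational database model and is somewhat of an anti-pattern in Cassandra.

Cassandra only sorts things based on clustering columns, and the sort order is not expected to change. This is because when memtables are written to disk as SSTables (Sorted String Tables), the SSTables are immutable and cannot be re-sorted efficiently. This is why you are not allowed to update the value of a clustering column.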

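If you want to re-sort the clustered rows, the only way I know of is to delete the old row and insert a new one in a batch. To make that even less efficient, you would probably need to do a read first to find out what the last_modified_date was for the record_id, so that you could delete the old row.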
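The read, delete, insert sequence described above can be sketched in plain Python. This is only an in-memory stand-in, not Cassandra driver code: the class `LastModifiedIndex` and its methods are hypothetical, and timestamps are plain integers for brevity.

```python
import bisect


class LastModifiedIndex:
    """In-memory stand-in for a table clustered by (last_modified_date, record_id)."""

    def __init__(self):
        self._current = {}  # record_id -> last_modified_date (the main table)
        self._sorted = []   # ascending list of (last_modified_date, record_id)

    def upsert(self, record_id, modified_at):
        old = self._current.get(record_id)              # 1. read the previous date
        if old is not None:
            pos = bisect.bisect_left(self._sorted, (old, record_id))
            del self._sorted[pos]                       # 2. delete the old index row
        bisect.insort(self._sorted, (modified_at, record_id))  # 3. insert the new row
        self._current[record_id] = modified_at

    def latest(self, n):
        """Return the n most recently modified record ids, newest first."""
        return [record_id for _, record_id in reversed(self._sorted[-n:])]


index = LastModifiedIndex()
index.upsert(100, 10)
index.upsert(200, 25)
index.upsert(300, 41)
index.upsert(200, 73)  # record 200 is modified again and moves to the top
```

Every update pays for a lookup before it can write, which is the extra cost mentioned above.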

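So I would look for a different approach, such as writing each update as a new clustered row and leaving the old ones in place (possibly cleaning them up over time with a TTL). That way, your newest updates will always be at the top when you run a LIMIT query.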
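One consequence of leaving old versions in place is that the same record can appear more than once near the top, so a reader may want to skip stale versions. A minimal sketch of that read path, assuming rows arrive newest first as (last_modified_date, record_id) pairs; the helper `latest_distinct` and the sample data are hypothetical:

```python
def latest_distinct(rows_newest_first, n):
    """Return up to n record ids, newest first, skipping older versions."""
    seen = set()
    result = []
    for _modified_at, record_id in rows_newest_first:
        if record_id in seen:
            continue  # an older version of a record we already returned
        seen.add(record_id)
        result.append(record_id)
        if len(result) == n:
            break
    return result


# Record 200 was written twice; the older version (25) is still present.
versions = [(73, 200), (41, 300), (25, 200), (10, 100)]
recent = latest_distinct(versions, 3)
```

In practice the client would have to fetch somewhat more than N rows to be sure of finding N distinct records.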

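In terms of partitioning, you will need to break your data into a few categories so that it is spread over your nodes. That means you won't get a global sort of your table, only a sort within each category, which is a consequence of the distributed model. If you really need global sorting, consider something like pairing Cassandra with Spark. Sorting is very expensive in time and resources, so think carefully about whether you really need it.

Update:

Thinking about this some more, you should be able to do this in Cassandra 3.0 using materialized views. The view takes care of the messy delete and insert for you in order to re-sort the clustered rows. Here is what it looks like in the 3.0 alpha release:

First create the base table: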

CREATE TABLE record_ids (
    record_type int,
    last_modified_date timestamp,
    record_id int,
    PRIMARY KEY(record_type, record_id));

Then create a view of that table, using last_modified_date as a clustering column:

CREATE MATERIALIZED VIEW last_modified AS
    SELECT record_type FROM record_ids
    WHERE record_type IS NOT NULL AND last_modified_date IS NOT NULL AND record_id IS NOT NULL
    PRIMARY KEY (record_type, last_modified_date, record_id)
    WITH CLUSTERING ORDER BY (last_modified_date DESC);

Now insert some records:

insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 100);
insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 200);
insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 300);

SELECT * FROM record_ids;

 record_type | record_id | last_modified_date
-------------+-----------+--------------------------
           1 |       100 | 2015-08-14 19:41:10+0000
           1 |       200 | 2015-08-14 19:41:25+0000
           1 |       300 | 2015-08-14 19:41:41+0000

SELECT * FROM last_modified;

 record_type | last_modified_date       | record_id
-------------+--------------------------+-----------
           1 | 2015-08-14 19:41:41+0000 |       300
           1 | 2015-08-14 19:41:25+0000 |       200
           1 | 2015-08-14 19:41:10+0000 |       100

Now we update a record in the base table, and we should see it move to the top of the list in the view:

UPDATE record_ids SET last_modified_date = dateof(now()) 
WHERE record_type=1 AND record_id=200;

So in the base table, we see that the timestamp for record_id=200 was updated:

SELECT * FROM record_ids;

 record_type | record_id | last_modified_date
-------------+-----------+--------------------------
           1 |       100 | 2015-08-14 19:41:10+0000
           1 |       200 | 2015-08-14 19:43:13+0000
           1 |       300 | 2015-08-14 19:41:41+0000

And in the view we see:

 SELECT * FROM last_modified;

 record_type | last_modified_date       | record_id
-------------+--------------------------+-----------
           1 | 2015-08-14 19:43:13+0000 |       200
           1 | 2015-08-14 19:41:41+0000 |       300
           1 | 2015-08-14 19:41:10+0000 |       100

So you can see that record_id=200 moved up in the view, and if you run a LIMIT N query against it, you get the N most recently modified rows.

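The only way to run a CQL query over an entire table/view sorted by a field is to keep the partition key constant. Exactly one machine (times the replication factor) will then hold the entire table. For example, use a partition INT partition key that is always zero, and make the field that needs sorting the clustering key. Even if your cluster has more nodes, you should observe read/write/capacity performance similar to a single-node database with an index on the sorted field. This does not entirely defeat the purpose of Cassandra, because it still helps you scale in the future.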

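If performance is insufficient, you can scale by increasing the variety of partition values. For example, randomly choosing from 0, 1, 2, 3 on insert can improve read/write/capacity performance by up to a factor of four when 4 nodes are used. Then, to find the "10 most recent" items, you have to query all 4 partitions manually and merge-sort the results.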
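The client-side part of this scheme might look like the following Python sketch. It assumes each partition query already returns its rows newest first (as a DESC clustering order would) with a LIMIT of 10. The names `NUM_BUCKETS`, `pick_bucket`, and `merge_latest` are hypothetical, and the in-memory `buckets` list stands in for the four partitions.

```python
import heapq
import random
from itertools import islice

NUM_BUCKETS = 4  # assumption: one partition-key value per node


def pick_bucket():
    """Choose the partition key value for an insert at random."""
    return random.randrange(NUM_BUCKETS)


def merge_latest(per_bucket_rows, n):
    """Merge per-partition results (each sorted newest first) and keep the top n."""
    merged = heapq.merge(*per_bucket_rows, key=lambda row: row[0], reverse=True)
    return list(islice(merged, n))


# Simulate 40 inserts as (last_modified_date, record_id), spread over the buckets.
buckets = [[] for _ in range(NUM_BUCKETS)]
for ts in range(1, 41):
    buckets[pick_bucket()].append((ts, 1000 + ts))

# Stand-in for "SELECT ... WHERE partition = ? LIMIT 10" against each partition.
per_bucket = [sorted(bucket, reverse=True)[:10] for bucket in buckets]
top10 = merge_latest(per_bucket, 10)
```

Because the global top 10 is always contained in the union of each partition's own top 10, a LIMIT of 10 per partition is enough.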

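In theory, Cassandra could provide this facility itself: a dynamic partition key taken modulo the maximum node count for INSERT, and a merge-sort for SELECT (with ALLOW FILTERING).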

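Cassandra's design goals disallow global sort

To allow write, read, and storage capacity to scale linearly with node count, Cassandra requires that:

Every insert lands on a single node
Every select lands on a single node
Clients distribute workload similarly across all nodes

If I understand correctly, the consequence is that a full-table query sorted by a single field will always require reading from the entire cluster and merge-sorting.

Note that materialized views are equivalent to tables; they have no magical property that makes them better at global sorting. Aaron Ploetz agrees that Cassandra and CQL cannot sort on a single field without a partition and still scale.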

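Example solution

Note that without a WHERE clause, you get results in token (partition key) order.

Other database distribution models

If I understand correctly, CockroachDB would similarly bottleneck read/write performance for monotonically increasing data on one node at any given time, but its storage capacity would scale linearly. There are also other range queries to consider, such as "oldest 10" or "between date X and date Y".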

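-- Example solution: spread records over a few partition values;
-- rows are sorted by last_modified_date only within each partition
CREATE KEYSPACE IF NOT EXISTS
    tmpsort
WITH REPLICATION =
    {'class':'SimpleStrategy', 'replication_factor' : 1};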

USE tmpsort;

CREATE TABLE record_ids (
    partition int,
    last_modified_date timestamp,
    record_id int,
    PRIMARY KEY((partition), last_modified_date, record_id))
    WITH CLUSTERING ORDER BY (last_modified_date DESC);

INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 1, DATEOF(NOW()), 100);
INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 2, DATEOF(NOW()), 101);
INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 3, DATEOF(NOW()), 102);
INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 1, DATEOF(NOW()), 103);
INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 2, DATEOF(NOW()), 104);
INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 3, DATEOF(NOW()), 105);
INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 3, DATEOF(NOW()), 106);
INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 3, DATEOF(NOW()), 107);
INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 2, DATEOF(NOW()), 108);
INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 3, DATEOF(NOW()), 109);
INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 1, DATEOF(NOW()), 110);
INSERT INTO record_ids (partition, last_modified_date, record_id) VALUES ( 1, DATEOF(NOW()), 111);

SELECT * FROM record_ids;

-- Note the results are only sorted in their partition
-- To try again:
-- DROP KEYSPACE tmpsort;