Why is my data insertion into my Cassandra database so slow?

cassandra apache-kafka

This is the query I use to check whether the current data ID already exists in the Cassandra database:

row = session.execute("SELECT * FROM articles where id = %s", [id])

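I parse the messages from Kafka and then check whether each message already exists in the Cassandra database. If it does not exist, it should be inserted; if it does exist, it should not be inserted again.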

while True:
    messages = consumer.get_messages(count=25)

    if len(messages) == 0:
        print 'IDLE'
        sleep(1)
        continue

    for message in messages:
        try:
            message = json.loads(message.message.value)
            data = message['data']
            if data:
                for article in data:
                    source = article['source']
                    id = article['id']
                    title = article['title']
                    thumbnail = article['thumbnail']
                    #url = article['url']
                    text = article['text']
                    print article['created_at'],type(article['created_at'])
                    created_at = parse(article['created_at'])
                    last_crawled = article['last_crawled']
                    channel = article['channel']#userid
                    category = article['category']
                    #scheduled_for = created_at.replace(minute=created_at.minute + 5, second=0, microsecond=0)
                    scheduled_for=(datetime.utcnow() + timedelta(minutes=5)).replace(second=0, microsecond=0)
                    row = session.execute("SELECT * FROM articles where id = %s", [id])
                    if len(list(row))==0:
                    #id parse base62
                        ids = [id[0:2],id[2:9],id[9:16]]
                        idstr=''
                        for argv in ids:
                            num = int(argv)
                            idstr=idstr+encode(num)
                        url='http://weibo.com/%s/%s?type=comment' % (channel,idstr)
                        session.execute("INSERT INTO articles(source, id, title,thumbnail, url, text, created_at, last_crawled,channel,category) VALUES (%s,%s, %s, %s, %s, %s, %s, %s, %s, %s)", (source, id, title,thumbnail, url, text, created_at, scheduled_for,channel,category))
                        session.execute("INSERT INTO schedules(source,type,scheduled_for,id) VALUES (%s, %s, %s,%s) USING TTL 86400", (source,'article', scheduled_for, id))
                        log.info('%s %s %s %s %s %s %s %s %s %s' % (source, id, title,thumbnail, url, text, created_at, scheduled_for,channel,category))


        except Exception, e:
            log.exception(e)
            #log.info('error %s %s' % (message['url'],body))
            print e
            continue
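
The snippet above calls an `encode` helper that is not shown. Going by the `#id parse base62` comment, it presumably converts an integer to base62. A minimal sketch under that assumption follows; the alphabet order (digits, lowercase, uppercase) and the name `id_to_slug` are guesses for illustration, not taken from the post.

```python
import string

# Assumed alphabet: digits, then lowercase, then uppercase (62 symbols).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode(num):
    """Return the base62 representation of a non-negative integer."""
    if num < 0:
        raise ValueError("num must be non-negative")
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num:
        num, rem = divmod(num, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def id_to_slug(article_id):
    """Split the id string the same way the loop above does and encode each chunk."""
    chunks = [article_id[0:2], article_id[2:9], article_id[9:16]]
    return "".join(encode(int(chunk)) for chunk in chunks)
```

This conversion is pure CPU work on a 16-character string, so it is unlikely to be what slows the loop down; the per-article round trips to Cassandra are the more probable cost.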

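Edit:

I have an ID, and I want it to have exactly one row in the table. As soon as I add different scheduled_for times for the same unique ID, my system breaks. Adding if len(list(row))==0: is the right idea, but my system is very slow after that.

This is my table definition: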

DROP TABLE IF EXISTS schedules;

CREATE TABLE schedules (
 source text,
 type text,
 scheduled_for timestamp,
 id text,
 PRIMARY KEY (source, type, scheduled_for, id)
);

The scheduled_for value can change. Here is a concrete example:

Hao article 2016-01-12 02:09:00+0800 3930462206848285
Hao article 2016-01-12 03:09:00+0801 3930462206848285
Hao article 2016-01-12 04:09:00+0802 3930462206848285
Hao article 2016-01-12 05:09:00+0803 3930462206848285

Thanks for your reply.
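
The four example rows share one id because scheduled_for is part of the primary key of `schedules`: an INSERT whose key differs in any column creates a new row instead of overwriting the old one. The toy model below mimics that with a plain dict keyed by the primary-key tuple. It is not driver code, and the key without scheduled_for is a hypothetical alternative shown only for contrast.

```python
from datetime import datetime


def insert(table, key_columns, row):
    """Mimic a CQL INSERT: a row with an existing primary key is overwritten."""
    key = tuple(row[col] for col in key_columns)
    table[key] = row


# The four rows from the example, one per scheduled hour.
rows = [
    {"source": "Hao", "type": "article",
     "scheduled_for": datetime(2016, 1, 12, hour, 9),
     "id": "3930462206848285"}
    for hour in (2, 3, 4, 5)
]

# Primary key as declared in the question: scheduled_for is part of it.
as_declared = {}
for row in rows:
    insert(as_declared, ("source", "type", "scheduled_for", "id"), row)

# Hypothetical key without scheduled_for: later inserts replace earlier ones.
without_time = {}
for row in rows:
    insert(without_time, ("source", "type", "id"), row)

print(len(as_declared), len(without_time))
```

With the declared key the model keeps all four rows; with the shorter key only the last one survives. Dropping scheduled_for from the key would also remove the ability to range over it as a clustering column, so whether that trade is acceptable depends on how the schedules are read.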

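Why don't you use INSERT ... IF NOT EXISTS?

Given that writes are cheap while reads may not be, I think the kind of optimization you are attempting is pointless.

@Ralf OK, so what do you suggest? Thanks for your reply!

Just insert the record again? Or at least don't SELECT * from the table; select only the id. That saves some network bandwidth. (I think Cassandra will still load the whole row; maybe someone can comment on that.) Depending on your application, selecting every row before inserting has the additional downside of diluting the Cassandra cache, which lowers read performance for your users.

Note that IF NOT EXISTS also comes with a performance penalty.

I fully agree, this is a typical case of "read before write", but at least from the application's point of view it is easier and probably more optimized.

I think it is worse than "just" read before write. For IF NOT EXISTS to work, Cassandra has to make sure of it. If you set up your cluster for eventual consistency, you lose all the performance benefits of that setup. But I assume IF NOT EXISTS does not mess up the cache contents.

@CedricH @ralf I added the example above to my post: I have an ID, and I want it to have exactly one row in the table. As soon as I add different scheduled_for times for the same unique ID, our system breaks. Adding if len(list(row))==0: is the right idea, but my system is very slow after that. Not sure what to do. Thanks for your help!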
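
The three options raised in the comments can be sketched side by side. This is a rough sketch, not a tested fix: it assumes `session` is a DataStax Python driver session (or anything with a compatible `execute`), that `id` alone is the primary key of `articles` (the table definition is not shown in the post), and the function names are made up for illustration.

```python
SELECT_ID_CQL = "SELECT id FROM articles WHERE id = %s"

INSERT_ARTICLE_CQL = (
    "INSERT INTO articles (source, id, title, thumbnail, url, text, "
    "created_at, last_crawled, channel, category) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
)


def article_exists(session, article_id):
    """Read-before-write check that fetches only the id column."""
    return session.execute(SELECT_ID_CQL, [article_id]).one() is not None


def insert_if_absent(session, values):
    """Let Cassandra do the check through a lightweight transaction."""
    result = session.execute(INSERT_ARTICLE_CQL + " IF NOT EXISTS", values)
    # was_applied is False when a row with this key already existed.
    return result.was_applied


def upsert(session, values):
    """Write unconditionally; a row with the same key is overwritten."""
    session.execute(INSERT_ARTICLE_CQL, values)
```

`upsert` avoids the read entirely but replaces the stored url and last_crawled each time the same article arrives, which differs from the "do not insert again" behaviour the question asks for. `insert_if_absent` keeps that behaviour in a single statement, at the cost of the lightweight-transaction overhead mentioned in the comments. Which one is faster for this workload would need measuring.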