Modeling a Twitter timeline in C*

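Just for fun, I am building a Twitter clone to get a better understanding of C*.

All the suggested C* schemas I have seen use more or less the same modeling technique. The problem is that I have doubts about how well modeling the Twitter timeline this way scales.

Question: What happens if I have one or more very popular users (rock stars) such as userA, each followed by 10k+ users? Every time userA posts a tweet, we have to insert 10k+ rows into the timeline table, one for each follower.

Question: Will this model really scale? Can anyone suggest an alternative way of modeling the timeline that really does scale?

C* schema: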

CREATE TABLE users (
 uname text, -- UserA
 followers set<text>, -- Users who follow userA
 following set<text>, -- UserA is following userX
 PRIMARY KEY (uname)
);
-- View of tweets created by user
CREATE TABLE userline (
 tweetid timeuuid,
 uname text,
 body text,
 PRIMARY KEY(uname, tweetid)
);
-- View of tweets created by user, and users he/she follows
CREATE TABLE timeline (
 uname text,
 tweetid timeuuid,
 posted_by text,
 body text,
 PRIMARY KEY(uname, tweetid)
);


-- Example of userA posting a tweet (pseudocode: the follower loop runs in application code)
-- Note: each now() call yields a different timeuuid, so real code would
-- generate the tweet id once and bind it in every statement.
-- BATCH START
-- Store the tweet in the tweets table (its definition is not shown in the schema above)
INSERT INTO tweets (tweetid, uname, body) VALUES (now(), 'userA', 'Test tweet #1');

-- Store the tweet in this user's userline
INSERT INTO userline (uname, tweetid, body) VALUES ('userA', now(), 'Test tweet #1');

-- Store the tweet in this user's timeline
INSERT INTO timeline (uname, tweetid, posted_by, body) VALUES ('userA', now(), 'userA', 'Test tweet #1');

-- Store the tweet in the public timeline
INSERT INTO timeline (uname, tweetid, posted_by, body) VALUES ('#PUBLIC', now(), 'userA', 'Test tweet #1');

-- Insert the tweet into follower timelines
-- findUserFollowers = SELECT followers FROM users WHERE uname = 'userA';
for (String follower : findUserFollowers('userA')) {
  INSERT INTO timeline (uname, tweetid, posted_by, body) VALUES (follower, now(), 'userA', 'Test tweet #1');
}
-- BATCH END

Thanks in advance for any suggestions.
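To put a number on the concern, here is a rough sketch of the write amplification implied by the batch above. It assumes four fixed inserts per tweet (tweets, userline, the author's own timeline, and the '#PUBLIC' timeline) plus one insert per follower; the function names and the tweets-per-day figure are hypothetical and only there for illustration.

```python
# Fixed writes per tweet in the batch above: tweets, userline,
# the author's own timeline, and the '#PUBLIC' timeline.
FIXED_WRITES_PER_TWEET = 4


def writes_per_tweet(follower_count: int) -> int:
    """Total INSERTs issued for one tweet under fan-out on write."""
    return FIXED_WRITES_PER_TWEET + follower_count


def writes_per_day(follower_count: int, tweets_per_day: int) -> int:
    """Total INSERTs per day for one author (tweets_per_day is an assumed input)."""
    return writes_per_tweet(follower_count) * tweets_per_day
```

With 10,000 followers, a single tweet costs 10,004 inserts, and the cost grows linearly with both the follower count and the posting rate.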

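In my opinion, the schema you outlined, or something similar, is the best fit for the use cases (see the latest tweets from the users that user X subscribes to, plus see my own tweets).

There are two gotchas, however.

I don't think Twitter uses Cassandra to store tweets, probably for the same reasons you are starting to think about. Running the feed on Cassandra does not seem like a great idea, because you don't want to keep countless copies of other people's tweets forever; you want to maintain some kind of sliding window for each user (I would guess most users don't read 1000 tweets down from the top of their feed). So we are talking about a queue, and one that in some cases is updated essentially in real time. Cassandra can only be coerced into supporting this pattern up to a point. I don't think it was designed for massive churn.

In production, one would probably pick another database with better support for queues, perhaps something like sharded Redis with its list support.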
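As a rough illustration of the sliding-window idea, the sketch below keeps a capped, newest-first list per user. It is an in-memory stand-in for the kind of capped list one might build on Redis with LPUSH followed by LTRIM, not Redis client code; the class name, method names, and window size are assumptions made for this example.

```python
from collections import deque


class CappedTimelines:
    """Per-user feeds that keep only the newest max_len tweets."""

    def __init__(self, max_len: int):
        self.max_len = max_len
        self._feeds = {}  # uname -> deque of tweets, newest first

    def push(self, uname: str, tweet: str) -> None:
        feed = self._feeds.setdefault(uname, deque(maxlen=self.max_len))
        # appendleft on a full deque silently drops the oldest item on the right.
        feed.appendleft(tweet)

    def latest(self, uname: str, count: int) -> list:
        """Return up to count tweets, newest first."""
        return list(self._feeds.get(uname, ()))[:count]
```

Pushing five tweets into a window of three leaves only the newest three, so storage per user stays bounded no matter how many tweets arrive.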

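For the example you gave, the problem is not as bad as it looks, because you don't need to perform this update in a synchronous batch. You can post to the author's own lists, return quickly, and then perform all the other updates with asynchronous workers running in a cluster, pushing the updates out on a best-effort basis.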
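A minimal sketch of that split, assuming an in-process queue and threads in place of a real job queue and worker cluster. The dictionaries stand in for the userline and timeline tables, the follower data is made up, and there is no retry logic, which is what "best effort" amounts to here.

```python
import queue
import threading
from collections import defaultdict

followers = {"userA": ["bob", "carol", "dave"]}  # hypothetical follower data
userline = defaultdict(list)   # stand-in for the userline table
timeline = defaultdict(list)   # stand-in for the timeline table
_timeline_lock = threading.Lock()
fanout_jobs = queue.Queue()


def post_tweet(author, body):
    """Synchronous part: write the author's own rows, then enqueue the fan-out."""
    tweet = (author, body)
    userline[author].append(tweet)
    with _timeline_lock:
        timeline[author].append(tweet)
    fanout_jobs.put(tweet)  # returns immediately; followers are updated later


def fanout_worker():
    """Asynchronous part: copy each queued tweet into every follower's timeline."""
    while True:
        job = fanout_jobs.get()
        if job is None:  # sentinel: shut this worker down
            fanout_jobs.task_done()
            break
        author, _body = job
        for follower in followers.get(author, []):
            with _timeline_lock:
                timeline[follower].append(job)
        fanout_jobs.task_done()


def run_demo(worker_count=2):
    workers = [threading.Thread(target=fanout_worker) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    post_tweet("userA", "Test tweet #1")
    fanout_jobs.join()  # wait until the queued fan-out has been processed
    for _ in workers:
        fanout_jobs.put(None)
    for worker in workers:
        worker.join()
    return dict(timeline)
```

The author sees the tweet as soon as `post_tweet` returns; followers see it once a worker gets to the job, so a slow fan-out delays delivery without blocking the poster.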

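Finally, since you asked about alternatives, here is one variation I can think of. Conceptually it may be closer to the queue I mentioned, but under the hood it will run into many of the same problems related to heavy data churn.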

CREATE TABLE users(
 uname text,
 mru_timeline_slot int,
 followers set<text>,
 following set<text>,
 PRIMARY KEY (uname)
);

// circular buffer:  keep at most X slots for every user.  
CREATE TABLE timeline_most_recent(
 uname text,
 timeline_slot int, 
 tweeted timeuuid,
 posted_by text,
 body text,
 PRIMARY KEY(uname, timeline_slot)
);
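The slot arithmetic for this circular buffer could look roughly like the sketch below. It models the two tables with dictionaries; `MAX_SLOTS` is a hypothetical value for X, and the function names are made up for illustration. In Cassandra, the read of mru_timeline_slot followed by the write is not atomic, so concurrent writers to the same user could pick the same slot; this sketch ignores that.

```python
MAX_SLOTS = 3  # hypothetical value for "X", the number of slots kept per user

users = {}                 # uname -> mru_timeline_slot
timeline_most_recent = {}  # (uname, timeline_slot) -> (posted_by, body)


def push_to_timeline(uname, posted_by, body):
    """Write a tweet into the next slot, overwriting the oldest entry when full."""
    previous = users.get(uname, -1)
    slot = (previous + 1) % MAX_SLOTS
    timeline_most_recent[(uname, slot)] = (posted_by, body)  # behaves like an upsert
    users[uname] = slot
    return slot


def read_timeline(uname):
    """Return the stored tweets newest first, walking back from the MRU slot."""
    mru = users.get(uname)
    if mru is None:
        return []
    rows = []
    for offset in range(MAX_SLOTS):
        row = timeline_most_recent.get((uname, (mru - offset) % MAX_SLOTS))
        if row is not None:
            rows.append(row)
    return rows
```

Each user never holds more than `MAX_SLOTS` rows, but every new tweet overwrites an existing row, which is the churn the answer warns about.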