Architecture: how to use a MessageQueue in a web crawler?
A MessageQueue seems like it should be a good architectural solution for building a web crawler, but I still don't know how to do it.
Let's consider the first case, **shared database**. It's pretty clear how to do this; the algorithm would be a classic graph traversal:
There are multiple Workers and a shared database.
- I manually put the first url into the database
while true
- worker gets a random discovered url from the database.
- worker parses it and gets a list of all links on the page.
- worker updates the url in the database as processed.
- worker looks the found links up in the database and separates them
  into processed, discovered and new ones.
- worker adds the new links to the database as discovered.
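The steps above can be sketched as follows. This is a minimal illustration, assuming a single `urls` table with a `status` column (`'discovered'` or `'processed'`); `fetch_links()` is a hypothetical placeholder for downloading a page and extracting its links.

```python
import sqlite3

def fetch_links(url):
    # Placeholder: a real crawler would download `url` and extract its <a href> links.
    return {"http://example.com/a", "http://example.com/b"}

def crawl(db):
    cur = db.cursor()
    while True:
        # worker gets a random discovered url from the database
        row = cur.execute(
            "SELECT url FROM urls WHERE status = 'discovered' "
            "ORDER BY RANDOM() LIMIT 1").fetchone()
        if row is None:
            break  # nothing left to crawl
        url = row[0]
        links = fetch_links(url)  # worker parses the page, collects links
        # worker marks the url as processed
        cur.execute("UPDATE urls SET status = 'processed' WHERE url = ?", (url,))
        # separating the links is just a lookup: INSERT OR IGNORE adds only
        # the genuinely new ones as 'discovered', already-known urls are skipped
        cur.executemany(
            "INSERT OR IGNORE INTO urls (url, status) VALUES (?, 'discovered')",
            [(link,) for link in links])
        db.commit()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, status TEXT)")
# manually put the first url into the database
db.execute("INSERT INTO urls VALUES ('http://example.com/', 'discovered')")
crawl(db)
```

The `PRIMARY KEY` on `url` is what makes the processed/discovered/new separation a single statement here; with multiple workers you would additionally need some claiming mechanism so two workers don't pick the same row.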
Let's consider the second case, **with a MessageQueue**:
There is a MessageQueue containing urls that should be processed,
and multiple Workers.
- I manually put the first url in the Queue.
while true
- worker takes next discovered url from the Queue.
- worker parses it and gets a list of all links on the page.
- what does it do next? How does it separate the found links into
  processed, discovered and new ones?
- worker puts the list of new urls into the Queue as discovered.
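To make the gap concrete, here is a sketch of that worker loop, using Python's in-process `queue.Queue` as a stand-in for a real broker and a hypothetical `fetch_links()` placeholder. The dedup step is exactly the open question: the queue alone can't answer it, so this sketch assumes (my assumption, not from the original) a separate shared "seen" set alongside the queue:

```python
from queue import Queue, Empty

def fetch_links(url):
    # Placeholder: a real crawler would download `url` and extract its links.
    return ["http://example.com/a", "http://example.com/b"]

def worker(queue, seen):
    while True:
        try:
            url = queue.get(timeout=0.1)  # worker takes the next url from the queue
        except Empty:
            return  # queue drained (a real worker would keep waiting)
        links = fetch_links(url)  # worker parses the page, collects links
        # Open question from the post: the queue cannot tell us which links
        # are processed / discovered / new. This sketch consults a shared
        # `seen` set (e.g. a Redis SET in a distributed setup) instead:
        for link in links:
            if link not in seen:   # new link
                seen.add(link)     # now counts as 'discovered'
                queue.put(link)    # worker puts new urls into the queue

queue = Queue()
seen = {"http://example.com/"}
queue.put("http://example.com/")  # manually put the first url in the queue
worker(queue, seen)
```

The point of the sketch is that the queue only orders the work; some auxiliary lookup structure still has to carry the processed/discovered state.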
Questions:
- Is it OK to keep all discovered URLs in the MessageQueue? If there are thousands of sites with thousands of pages each, there will be millions of messages waiting in the queue.
- How do I separate the links found on a page into processed, discovered and new ones? With a DB it's clear how to do this: just look each link up in the DB and check it. But how do I do it with a MessageQueue?