Architecture: How to use a MessageQueue in a crawler?

architecture web-crawler

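It seems that a MessageQueue should be a good architectural solution for building a web crawler, but I still can't work out how to do it.

Let's consider the first case, a shared database. It is pretty clear how to do this; the algorithm would be a classic graph traversal: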

There are multiple workers and a shared database.

- I manually put the first URL into the database.

while true

  - the worker gets a random discovered URL from the database.
  - the worker parses it and gets a list of all links on the page.
  - the worker marks the URL in the database as processed.
  - the worker looks up the found links in the database and separates
    them into processed, discovered, and new ones.
  - the worker adds the new links to the database as discovered.
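
The shared-database loop above could be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionary `db` stands in for the real database, and `SAMPLE_LINKS` and `fetch_links` are hypothetical placeholders for downloading and parsing real pages. It runs a single worker and stops once nothing is left in the "discovered" state.

```python
import random

# Hypothetical stand-in for the web: page URL -> links found on that page.
SAMPLE_LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}

def fetch_links(url):
    """Placeholder for downloading and parsing a page (an assumption)."""
    return SAMPLE_LINKS.get(url, [])

def crawl_with_shared_db(seed):
    # The "database": URL -> status, either "discovered" or "processed".
    db = {seed: "discovered"}
    while True:
        pending = [u for u, status in db.items() if status == "discovered"]
        if not pending:
            break  # nothing left to crawl
        url = random.choice(pending)   # take a random discovered URL
        links = fetch_links(url)       # parse the page and collect its links
        db[url] = "processed"          # mark the URL as processed
        for link in links:
            # Only URLs missing from the database are new; known ones keep their status.
            db.setdefault(link, "discovered")
    return db
```

With several real workers, the "take a discovered URL" step would also need some form of locking or claiming in the database so that two workers do not pick the same URL; the sketch leaves that out.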

Now let's consider the second case, with a MessageQueue.

There is a MessageQueue containing the URLs that should be processed,
and multiple workers.

- I manually put the first URL into the queue.

while true

  - the worker takes the next discovered URL from the queue.
  - the worker parses it and gets a list of all links on the page.
  - what does it do next? How does it separate the found links into
    processed, discovered, and new ones?
  - the worker puts the list of new URLs into the queue as discovered.

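Questions:

- Is it OK to keep all discovered URLs in the MessageQueue? If there are thousands of sites with thousands of pages each, there will be millions of messages waiting in the queue.
- How do I separate the links found on a page into processed, discovered, and new ones? With a database it is clear how to do it: just look up each link in the database and check it. But how do I do it with a MessageQueue?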

What does it do next? How does it separate the found links into processed, discovered, and new ones?

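You would set up separate queues for these, and they would flow back into your database. The idea is that you can have multiple workers running, with a feedback loop that sends newly discovered URLs back to the queue for processing and then on to the database for storage.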

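How do I separate the links found on a page into processed, discovered, and new ones? With a database it is clear how to do it: just look up each link in the database and check it. But how do I do it with a MessageQueue?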

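You would probably still look up the links coming off the queue in the database.

So the workflow looks like this: A link gets dropped onto the queue. A queue worker picks it up and checks the database to see whether the link has already been processed. If it has not, the worker calls the website to retrieve the other outbound links, parses the page, and drops each outbound link onto the queue for processing.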
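
The workflow just described might look roughly like the sketch below. It is a single-process illustration, not a real deployment: Python's `queue.Queue` stands in for the message broker, the `processed` set stands in for the database table of processed URLs, and `SAMPLE_LINKS` and `fetch_links` are hypothetical placeholders for fetching and parsing real pages.

```python
import queue

# Hypothetical stand-in for the web: page URL -> links found on that page.
SAMPLE_LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}

def fetch_links(url):
    """Placeholder for downloading and parsing a page (an assumption)."""
    return SAMPLE_LINKS.get(url, [])

def crawl_with_queue(seed):
    url_queue = queue.Queue()  # transport only: URLs waiting for a worker
    processed = set()          # stand-in for the database of processed URLs
    fetches = 0                # how many pages were actually fetched
    url_queue.put(seed)
    while not url_queue.empty():
        url = url_queue.get()
        if url in processed:
            continue           # database says it is done: drop the message
        links = fetch_links(url)
        fetches += 1
        processed.add(url)     # record the result in the database
        for link in links:
            url_queue.put(link)  # feed every outbound link back into the queue
    return processed, fetches
```

In this sketch the queue may briefly hold duplicates (the same URL enqueued from two different pages), but the database check on pickup ensures each page is fetched only once.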

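Is it OK to keep all discovered URLs in the MessageQueue? If there are thousands of sites with thousands of pages each, there will be millions of messages waiting in the queue.

Probably not; that is what the database is for. Once things are processed, you should remove them from the queue. Queues are for... queuing. Message transport. Not for data storage. Databases are for data storage.

Until they are processed, though, you can leave them in the queue. If you are worried about queue capacity, you can modify the workflow so that the queue worker removes any links that have already been processed, which would reduce the depth of the queue. It might even be more efficient.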
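
One possible reading of that last suggestion, and it is only an assumption about what is meant, is to check each link against the database before it is enqueued, so that already-known URLs never reach the queue at all. The sketch below illustrates this with a `seen` set standing in for the database and a `collections.deque` standing in for the message queue; `SAMPLE_LINKS` and `fetch_links` are hypothetical placeholders.

```python
from collections import deque

# Hypothetical stand-in for the web: page URL -> links found on that page.
SAMPLE_LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}

def fetch_links(url):
    """Placeholder for downloading and parsing a page (an assumption)."""
    return SAMPLE_LINKS.get(url, [])

def crawl_with_enqueue_filter(seed):
    url_queue = deque([seed])  # stand-in for the message queue
    seen = {seed}              # stand-in for the database: every URL ever discovered
    processed = set()
    enqueued = 1               # total messages ever placed on the queue
    while url_queue:
        url = url_queue.popleft()
        processed.add(url)
        for link in fetch_links(url):
            if link not in seen:       # database lookup before enqueueing
                seen.add(link)
                url_queue.append(link)
                enqueued += 1
    return processed, enqueued
```

On this four-page sample, only four messages are ever enqueued, one per distinct URL, so the queue depth is bounded by the number of not-yet-processed URLs rather than by the number of links found.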