Using Redis NoSQL in a web crawler

nosql redis web-crawler

I am building a simple Wikipedia page crawler that writes its details to a remote server running Redis.

 1 The crawler asks the server for a page that needs crawling
 2 The crawler loads the page and adds the pages that are found to an internal buffer
 3 When the page has finished being parsed the results are sent to the server
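
Step 2 above (parse a page and buffer the links it contains) can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the `LinkCollector` class, the `collect_links` helper, and the `/wiki/` prefix filter are all assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags into an internal buffer."""

    def __init__(self):
        super().__init__()
        self.buffer = []  # links found on the page currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Assumption: only relative article links ("/wiki/...") are of interest.
            if name == "href" and value and value.startswith("/wiki/"):
                self.buffer.append(value)


def collect_links(html):
    """Parse one page and return the buffered links, ready to send to the server."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.buffer
```

Once parsing finishes, the returned buffer is what would be sent to the server in step 3.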

How can I do the following:

Keep every page that has been found on the server, together with a flag indicating whether the page has been crawled

e.g.

1,
0,
1,

My questions are:

How do I ask Redis for the first link whose status is 0 (not yet crawled)?

Then, how do I tell Redis to change that status to 1 (after I have crawled it)?

You can use a list to hold the pages that are waiting to be processed:

RPUSH mylist "http:// ...."

You can then use LPOP to take the first item off the list:

LPOP mylist
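
From a crawler written in Python, the same two commands might be wrapped as in the minimal sketch below. It assumes a redis-py style client object (one that exposes `rpush` and `lpop`) is passed in; the helper names `enqueue_page` and `next_page` are hypothetical.

```python
QUEUE_KEY = "mylist"  # same key name as in the commands above


def enqueue_page(client, url):
    """Append a URL to the tail of the pending list (RPUSH)."""
    return client.rpush(QUEUE_KEY, url)


def next_page(client):
    """Pop the oldest pending URL (LPOP); returns None when the list is empty."""
    return client.lpop(QUEUE_KEY)
```

Because RPUSH adds at the tail and LPOP removes from the head, pages come out in the order they were found. LPOP also removes the item as it returns it, so two crawlers asking at the same time will not receive the same page. With redis-py, creating the client with `decode_responses=True` makes `lpop` return `str` rather than `bytes`.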

To keep track of the pages that have already been processed, you can use a set:

SADD myset "http://....."

Finally, check whether an address is already in the processed set:

SISMEMBER myset "http://...."
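
Putting the list and the set together, a crawler could skip anything that has already been processed. The sketch below is one possible arrangement, not a definitive implementation: it again assumes a redis-py style client (with `lpop`, `sadd` and `sismember`), and the function names are hypothetical.

```python
QUEUE_KEY = "mylist"     # pages waiting to be crawled
PROCESSED_KEY = "myset"  # pages that have already been crawled


def mark_processed(client, url):
    """Record that a page has been crawled (SADD)."""
    client.sadd(PROCESSED_KEY, url)


def is_processed(client, url):
    """Return True if the page is already in the processed set (SISMEMBER)."""
    return bool(client.sismember(PROCESSED_KEY, url))


def next_unprocessed(client):
    """Pop URLs until one turns up that has not been crawled yet.

    Returns None once the pending list is exhausted.
    """
    while True:
        url = client.lpop(QUEUE_KEY)
        if url is None or not is_processed(client, url):
            return url
```

In this arrangement the "status 0" pages are simply those still in the list, and "changing the status to 1" corresponds to calling `mark_processed` after the crawl. The same URL may be pushed onto the list more than once, which is why `next_unprocessed` checks the set before handing a page out.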