Python: Scrapy SgmlLinkExtractor rule does not crawl all defined links


python web-scraping scrapy


I want to crawl all links of the following forms:

http://example.com/index.php/comments/XXXXX
http://example.com/XXX1/index.php/comments/XXXXX
http://example.com/XXX2/index.php/comments/XXXX
http://example.com/XXX3/index.php/comments/XXXX

I defined the following rule:

start_urls = ['http://example.com/']

rules = [Rule(SgmlLinkExtractor(allow=[r'\w+/index.php/comments/\w+']), callback='parse_blogpost', follow=True)]

However, the crawler only seems to visit links like this () and not links like this ().

Any help would be appreciated.

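Try using

index.php/comments

instead of

\w+/index.php/comments/\w+

Hi, thanks for your reply. I tried that, but it did not work. After investigating more closely, I think the reason is that the pages contain no links like (), so the crawler cannot follow links of that form.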
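One way to rule out the pattern itself is to test both regular expressions against sample URLs outside of Scrapy. The sketch below uses only the standard `re` module and assumes, as I understand the link extractor's behaviour, that `allow` patterns are applied as a regex search against the absolute URL. The numeric IDs in the sample URLs are made up for illustration; they stand in for the `XXXXX` placeholders in the question.

```python
import re

# Hypothetical sample URLs following the four formats from the question;
# the trailing IDs are invented for illustration only.
urls = [
    "http://example.com/index.php/comments/12345",
    "http://example.com/XXX1/index.php/comments/12345",
    "http://example.com/XXX2/index.php/comments/1234",
    "http://example.com/XXX3/index.php/comments/1234",
]

# The pattern from the question and the pattern suggested in the answer.
original = re.compile(r'\w+/index.php/comments/\w+')
suggested = re.compile(r'index.php/comments')


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Map each URL to (matches original pattern, matches suggested pattern).
results = {url: (matches(original, url), matches(suggested, url)) for url in urls}

for url, (orig_ok, sugg_ok) in results.items():
    print(url, "original:", orig_ok, "suggested:", sugg_ok)
```

Under that assumption both patterns accept all four forms, since the `\w+` before `/index.php` can match the `com` of the host name. That would be consistent with the conclusion above: the rule is not filtering the links out, the crawled pages simply never link to them.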