Python: Scrapy SgmlLinkExtractor rule does not crawl all defined links
I want to crawl all links of the following form:
http://example.com/index.php/comments/XXXXX
http://example.com/XXX1/index.php/comments/XXXXX
http://example.com/XXX2/index.php/comments/XXXX
http://example.com/XXX3/index.php/comments/XXXX
I defined the following rule:
start_urls = ['http://example.com/']
rules = [Rule(SgmlLinkExtractor(allow=[r'\w+/index.php/comments/\w+']), callback='parse_blogpost', follow=True)]
But the crawler seems to visit only links like (), and not links like ().
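Any help would be greatly appreciated.
Try using
index.php/comments
instead of \w+/index.php/comments/\w+.
Hi, thanks for the reply. I tried that, but it did not work. After a closer look, I think the reason is that the page has no links like (), so the crawler cannot follow links of that kind.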
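One way to check the conclusion above, that the pattern itself is not the problem, is to test the regular expression against the sample URLs outside of Scrapy. The sketch below is a minimal check, assuming the link extractor applies each allow pattern to the absolute URL with search semantics (a match anywhere in the string). The names ALLOW_PATTERN, SAMPLE_URLS and is_allowed are hypothetical and used only for illustration.

```python
import re

# Same pattern as the `allow` argument in the question.
ALLOW_PATTERN = re.compile(r'\w+/index.php/comments/\w+')

# Sample URLs copied from the question; XXXXX stands in for real IDs.
SAMPLE_URLS = [
    'http://example.com/index.php/comments/XXXXX',
    'http://example.com/XXX1/index.php/comments/XXXXX',
    'http://example.com/XXX2/index.php/comments/XXXX',
    'http://example.com/XXX3/index.php/comments/XXXX',
]


def is_allowed(url):
    """Return True if the pattern occurs anywhere in the URL.

    Assumption: this mirrors how the allow patterns are applied
    (search on the full URL, not a match anchored at the start).
    """
    return ALLOW_PATTERN.search(url) is not None


if __name__ == '__main__':
    for url in SAMPLE_URLS:
        print(url, is_allowed(url))
```

If every sample URL passes this check, the rule is not filtering them out, and the likelier cause is the one described above: a crawler can only follow links that actually appear on pages it has already visited.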