Python SgmlLinkExtractor not displaying results or following links

python web-crawler scrapy

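I can't fully understand how the SGML link extractor works. When building a crawler with Scrapy, I can successfully extract data from pages using specific URLs. The problem is using a Rule to follow the next-page link found on a given URL.

I think the problem lies with the allow() attribute. When the Rule is added to the code, no results are displayed in the command line and the link to the next page is not followed.

Any help is greatly appreciated.

Here is the code: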

import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.spiders import Rule

from tutorial.items import TutorialItem

class AllGigsSpider(CrawlSpider):
    name = "allGigs"
    allowed_domains = ["http://www.allgigs.co.uk/"]
    start_urls = [
        "http://www.allgigs.co.uk/whats_on/London/clubbing-1.html",
        "http://www.allgigs.co.uk/whats_on/London/festivals-1.html",
        "http://www.allgigs.co.uk/whats_on/London/comedy-1.html",
        "http://www.allgigs.co.uk/whats_on/London/theatre_and_opera-1.html",
        "http://www.allgigs.co.uk/whats_on/London/dance_and_ballet-1.html"
    ]    
    rules = (Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//div[@class="more"]',)), callback="parse_me", follow= True),
    )

    def parse_me(self, response):
        hxs = HtmlXPathSelector(response)
        infos = hxs.xpath('//div[@class="entry vevent"]')
        items = []
        for info in infos:
            item = TutorialItem()
            item ['artist'] = hxs.xpath('//span[@class="summary"]//text()').extract()
            item ['date'] = hxs.xpath('//abbr[@class="dtstart dtend"]//text()').extract()
            item ['endDate'] = hxs.xpath('//abbr[@class="dtend"]//text()').extract()            
            item ['startDate'] = hxs.xpath('//abbr[@class="dtstart"]//text()').extract()
            items.append(item)
        return items
        print items

The problem is in restrict_xpaths: it should point to the block in which the link extractor should look for links. Do not specify allow at all:
rules = [
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'), 
         callback="parse_me", 
         follow=True),
]

You also need to fix your allowed_domains:
allowed_domains = ["www.allgigs.co.uk"]
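To see why a full URL in allowed_domains stops the crawl, here is a minimal sketch of a hostname-based offsite check. It is a simplified stand-in written for illustration, not Scrapy's actual implementation; the helper name is_allowed and the next-page URL are assumptions. The start URLs are still fetched in the question, which is consistent with this kind of filter only being applied to the links discovered afterwards.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical, simplified offsite check: compare the URL's hostname
    against each allowed entry (exact match or subdomain match)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Assumed example of a next-page link found by the link extractor.
next_page = "http://www.allgigs.co.uk/whats_on/London/clubbing-2.html"

# A full URL never equals a bare hostname, so the link is treated as offsite.
url_entry_result = is_allowed(next_page, ["http://www.allgigs.co.uk/"])

# A plain domain string matches the hostname, so the link is followed.
domain_entry_result = is_allowed(next_page, ["www.allgigs.co.uk"])

print(url_entry_result, domain_entry_result)
```

With the URL-style entry the check fails for every extracted link, which matches the symptom of only the initial pages being crawled.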

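Also note that the print items statement in the parse_me() callback is unreachable, because it comes after the return statement. In addition, inside the loop you should not apply the XPath expressions using hxs; they should be applied in the context of info. You can simplify parse_me():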
def parse_me(self, response):
    for info in response.xpath('//div[@class="entry vevent"]'):
        item = TutorialItem()
        item['artist'] = info.xpath('.//span[@class="summary"]//text()').extract()
        item['date'] = info.xpath('.//abbr[@class="dtstart dtend"]//text()').extract()
        item['endDate'] = info.xpath('.//abbr[@class="dtend"]//text()').extract()
        item['startDate'] = info.xpath('.//abbr[@class="dtstart"]//text()').extract()
        yield item

Thank you very much for your reply, but I am still not getting any results in the command line. It still only crawls the initial URLs. Thanks a lot for your help.

@DanielParkin Updated the answer; it was just an allowed_domains problem.

Sir, you are a legend. May I ask why it has to be in that format? Thank you very much.

@DanielParkin Yes, it should be a domain string, not a URL. This is a problem that many users starting out with Scrapy run into.

Ah, that makes a lot of sense! :) Cheers.
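As an illustration of the answer's point about applying XPath in the info context rather than on hxs, here is a small sketch. It uses the standard library's xml.etree.ElementTree instead of Scrapy selectors so that it runs on its own, and the markup is a made-up, simplified stand-in for the listing page. The effect should be analogous: a query run against the whole document inside the loop returns the same full list for every entry, while a query relative to the entry returns only that entry's values.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is assumed to be similar.
PAGE = """
<html><body>
  <div class="entry vevent"><span class="summary">Artist A</span></div>
  <div class="entry vevent"><span class="summary">Artist B</span></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
entries = root.findall(".//div[@class='entry vevent']")

# Querying the whole document inside the loop: every entry gets all artists.
from_root = [
    [span.text for span in root.findall(".//span[@class='summary']")]
    for _ in entries
]

# Querying relative to each entry: each entry gets only its own artist.
from_entry = [
    [span.text for span in entry.findall(".//span[@class='summary']")]
    for entry in entries
]

print(from_root)
print(from_entry)
```

In the original parse_me(), hxs plays the role of root here, which is why each item would end up holding the values of every entry on the page.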