Python Scrapy - parsing items that are paginated


I have a URL of the form:

example.com/foo/bar/page_1.html

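There are 53 pages in total, and each one has about 20 rows.

I basically want to get all the rows from all the pages, i.e. ~53*20 items.

I have working code in my parse method that parses a single page and also goes one page deeper per item to get more info about the item: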

  def parse(self, response):
    hxs = HtmlXPathSelector(response)

    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

    for rest in restaurants:
      item = DegustaItem()
      item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
      # some items don't have category associated with them
      try:
        item['category'] = rest.select('td[3]/a/text()').extract()[0]
      except:
        item['category'] = ''
      item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

      # get profile url
      rel_url = rest.select('td[2]/a/@href').extract()[0]
      # join with base url since profile url is relative
      base_url = get_base_url(response)
      follow = urljoin_rfc(base_url,rel_url)

      request = Request(follow, callback=self.parse_profile)
      request.meta['item'] = item
      return request


  def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item

The question is, how do I crawl each page?

example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html

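You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example in the Scrapy documentation.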
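To make the difference concrete, here is a minimal plain-Python sketch that does not use Scrapy at all. The function names and the dict standing in for a Request are made up for illustration. It only shows why a callback that uses return inside its loop hands back a single request, while a generator hands back one per row.

```python
# Plain-Python stand-ins; none of these names come from Scrapy.
def parse_with_return(rows):
    """Mimics the question's code: `return` exits on the first row."""
    for row in rows:
        return {"follow": row}


def parse_with_yield(rows):
    """Generator version: produces one request-like dict per row."""
    for row in rows:
        yield {"follow": row}


rows = ["/rest/1", "/rest/2", "/rest/3"]

first_only = parse_with_return(rows)          # a single dict
all_requests = list(parse_with_yield(rows))   # one dict per row

print(first_only)
print(len(all_requests))
```

In the spider from the question, the equivalent change would be replacing `return request` with `yield request` inside the for loop.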

In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:

class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1,54)]
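For Python 3, where xrange no longer exists, the same list can be built with range. This is a small sketch that only builds the list. The page count of 53 and the URL pattern are taken from the question, and LAST_PAGE is a name introduced here for illustration.

```python
# Assumed from the question: pages are numbered 1..53 under this URL pattern.
LAST_PAGE = 53

start_urls = [
    "http://example.com/foo/bar/page_%d.html" % page
    for page in range(1, LAST_PAGE + 1)  # range's upper bound is exclusive
]

print(start_urls[0])
print(start_urls[-1])
```

The resulting list can be assigned to the start_urls attribute of the spider class exactly as in the snippet above.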

You could use the CrawlSpider instead of the BaseSpider and use SgmlLinkExtractor to extract the pages in the pagination.

For instance:

start_urls = ["http://www.example.com/page1"]
rules = (
    # follow the "next page" links
    Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)),
         follow=True),
    # call parse_call for every link found inside the matching div
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)),
         callback='parse_call'),
)

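The first rule tells Scrapy to follow the links contained in the XPath expression. The second rule tells Scrapy to call parse_call for the links contained in its XPath expression, in case you want to parse something on each page.

For more information, see the Scrapy documentation.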

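There can be two use cases for "Scrapy - parsing items that are paginated".

A) We just want to move across the table and fetch data. This is relatively straightforward.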

import scrapy

class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        ''' do something with this parser '''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

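Observe the last 4 lines. Here:

1. We get the next page link from the "next page" pagination button via the next_page XPath.
2. The if condition checks that this is not the end of the pagination.
3. We join this link (obtained in step 1) with the main URL using urljoin.
4. We make a recursive call to the parse callback method.

B) Not only do we want to move across pages, but we also want to extract data from one or more links on each page.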

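from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherwebsite']
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
    )

    def parse_trains(self, response):
        ''' do your parsing here '''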

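Here, notice that:

We are using the CrawlSpider subclass of the scrapy.Spider parent class.

We have set up "Rules":

a) The first rule just checks whether a "next_page" is available and follows it.

b) The second rule requests all the links on a page that match the given format, e.g. /trains/12343, and then calls parse_trains to perform the parsing.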
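To see which links the pattern in the second rule would pick up, the regular expression can be tried on its own with Python's re module. This is only a sketch of the matching step, under the assumption that the allow pattern is applied as a regex search against each URL. The sample URLs are made up for illustration.

```python
import re

# Same pattern as in the second rule above.
TRAIN_URL = re.compile(r"/trains/\d+$")

# Hypothetical links, invented for this example.
links = [
    "http://example.com/trains/12343",
    "http://example.com/trains/12343/photos",
    "http://example.com/stations/7",
    "http://example.com/trains/",
]

# Keep only URLs that end in /trains/<digits>.
matching = [link for link in links if TRAIN_URL.search(link)]
print(matching)
```

Because of the trailing $, only URLs that end right after the digits are kept, so sub-pages of a train are not passed to parse_trains.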

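Important: note that we don't want to use the regular parse method here, because we are using the CrawlSpider subclass. This class also has a parse method of its own, so we don't want to override it. Just remember to name your callback method something other than parse.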

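I'm having a similar issue. I did as you said, but it still crawls only the start_url pages.

SgmlLinkExtractor and all the other classes from the contrib module raise errors. Please use the LinkExtractor class instead.

Great answer. Thanks a lot. The link extractors on the Scrapy website did not work for me. This did.

How do I check whether a page is not found? It only has 53 pages, but what if I call xrange(1, 60)?

In Python 3, xrange() was renamed to range().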