Python: How do I scrape the results for each search item and return them?

python scrapy web-crawler

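I have been trying to scrape some information from a company register. That works, but I want to repeat it for every result that a search returns. I have been trying to use LinkExtractor, but I could not get it to work.

The search results page is:

Scraping a single result works (if I click through to one result item), but how do I repeat this for every result item?

Here is my code: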

import scrapy
import re
from scrapy.linkextractors import LinkExtractor


class QuotesSpider(scrapy.Spider):
    name = 'CYRecursive'
    start_urls = [
        'https://www.companiesintheuk.co.uk/ltd/a-2']

    def parse(self, response):
        # Loop through the search result blocks and yield one item per block
        for i in response.css('div.col-md-9'):
            for i in response.css('div.col-md-6'):
                yield {
                    'company_name': re.sub(r'\s+', ' ', ''.join(i.css('#content2 > strong:nth-child(2) > strong:nth-child(1) > div:nth-child(1)::text').get())),
                    'address': re.sub(r'\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > span:nth-child(1)::text").extract_first())),
                    'location': re.sub(r'\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > span:nth-child(3)::text").extract_first())),
                    'postal_code': re.sub(r'\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > a:nth-child(5) > span:nth-child(1)::text").extract_first())),
                }

Of course, you can use start_requests to automatically generate all the searches from a to z.
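
As a rough illustration of generating the a to z searches, the sketch below builds one search URL per letter. The URL pattern (`/ltd/<letter>`) is an assumption inferred from the start URL in the question, and `letter_search_urls` is a hypothetical helper name, so check both against the real site.

```python
import string

# Assumed URL pattern, inferred from the start URL in the question.
BASE_URL = "https://www.companiesintheuk.co.uk/ltd/"


def letter_search_urls(base_url=BASE_URL):
    """Return one search URL per letter, from a to z."""
    return [f"{base_url}{letter}" for letter in string.ascii_lowercase]


if __name__ == "__main__":
    for url in letter_search_urls():
        print(url)
```

In a spider, this list could be assigned to start_urls, or each URL could be yielded as a scrapy.Request from start_requests.
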
Your CSS expressions are wrong:
            yield {
                'company_name': response.xpath('//div[@itemprop="name"]/text()').extract_first(),
                'address': response.xpath('//span[@itemprop="streetAddress"]/text()').extract_first(),
                'location': response.xpath('//span[@itemprop="addressLocality"]/text()').extract_first(),
                'postal_code': response.xpath('//span[@itemprop="postalCode"]/text()').extract_first(),
            }

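I tested it and it works! Thank you very much for your effort! But there is something I cannot explain. When I export the results, it scrapes one search result and skips some others. Do you know why? Please refer to the dump file below. Note: I removed the following line of code because it had no effect: for i in response.css('div.col-md-9'):

Also, when I run it without the next-page URL chain (so only the first page, but all the search results on that page), I only get the result for the first item at the top of the list.

I did not check the CSS/XPath expressions. As I can see from your output, you had 111 errors when running the spider.

I checked my CSS paths, and they are correct for the other results. I did find that it does not like my join, because the error I get on all of those lines is "can only join an iterable". Do you know why it only shows up in some cases? Thank you very much.

What is also strange is that Scrapy shows it successfully visited the site and scraped the data, and then failed? I don't know what to do, haha.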
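
The "can only join an iterable" error mentioned in the comments is the TypeError that Python raises when str.join receives None. Scrapy's .get() and .extract_first() return None when a selector matches nothing, which would likely explain why the error appears only on some pages: the request succeeds, but the callback fails while building the item. A minimal sketch of a None-safe cleanup helper follows; `clean_text` is a hypothetical name, not part of Scrapy.

```python
import re


def clean_text(value):
    """Collapse runs of whitespace; pass None through unchanged."""
    if value is None:
        # The selector matched nothing, so there is no text to clean.
        return None
    return re.sub(r"\s+", " ", value).strip()


if __name__ == "__main__":
    print(clean_text("  ACME \n   LTD "))
    print(clean_text(None))
```

Wrapping each extracted value in such a helper, for example clean_text(response.xpath('//div[@itemprop="name"]/text()').get()), avoids the join call entirely.
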