Python Scrapy feed output contains the expected output several times instead of once

I wrote a spider whose sole purpose is to extract a single number from the pager at the bottom of the page, namely the maximum page number (e.g. the number 255 in the example below).

I do this with a LinkExtractor whose allow pattern is a regular expression matching the URLs of those pager pages. The spider looks like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from Funda.items import MaxPageItem

class FundaMaxPagesSpider(CrawlSpider):
    name = "Funda_max_pages"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])   # Link to a page containing thumbnails of several houses, such as http://www.funda.nl/koop/amsterdam/p10/

    rules = (
    Rule(le_maxpage, callback='get_max_page_number'),
    )

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0                                                 # Initialize the maximum page number
        page_numbers=[]
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):         # Select only pages with a link depth of 3
                page_number = int(link.url.split("/")[-2].strip('p'))       # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                page_numbers.append(page_number)
                # if page_number > max_page_number:
                #     max_page_number = page_number                           # Update the maximum page number if the current value is larger than its previous value
        max_page_number = max(page_numbers)
        print("The maximum page number is %s" % max_page_number)
        yield {'max_page_number': max_page_number}
If I run this with feed output by entering
scrapy crawl Funda_max_pages -o Funda_max_pages.json
on the command line, the resulting JSON file looks like this:

[
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257}
]
What I find strange is that the dict is output 7 times instead of once, even though the yield statement is outside the for loop. Can someone explain this behavior?

  • Your spider goes to the first start URL
  • It uses the LinkExtractor to extract 7 URLs from that page
  • It downloads each of those 7 URLs and calls get_max_page_number on each of them
  • For each URL, get_max_page_number yields one dictionary, which is why the item appears 7 times (one way to yield it only once is sketched below)
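
    A minimal sketch (not part of the original answer; the spider name is illustrative) of one way to get the dictionary only once: drop the Rule and parse the pager links straight from the start page by overriding parse_start_url, the method CrawlSpider calls for each response of start_urls. This assumes the link to the last page also appears in the pager on the start page:

    import scrapy
    from scrapy.spiders import CrawlSpider
    from scrapy.linkextractors import LinkExtractor

    class FundaMaxPagesOnceSpider(CrawlSpider):
        name = "Funda_max_pages_once"
        allowed_domains = ["funda.nl"]
        start_urls = ["http://www.funda.nl/koop/amsterdam/"]

        le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])    # Same pager-link pattern as in the question

        def parse_start_url(self, response):
            # With no rules defined, only the start URL is downloaded, so this runs exactly once.
            links = self.le_maxpage.extract_links(response)
            page_numbers = [int(link.url.split("/")[-2].strip('p'))
                            for link in links
                            if link.url.count('/') == 6 and link.url.endswith('/')]
            if page_numbers:
                yield {'max_page_number': max(page_numbers)}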

  • As a workaround, I write the output to a text file instead of using the JSON feed output:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.crawler import CrawlerProcess
    
    class FundaMaxPagesSpider(CrawlSpider):
        name = "Funda_max_pages"
        allowed_domains = ["funda.nl"]
        start_urls = ["http://www.funda.nl/koop/amsterdam/"]
    
        le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])   # Link to a page containing thumbnails of several houses, such as http://www.funda.nl/koop/amsterdam/p10/
    
        rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
        )
    
        def get_max_page_number(self, response):
            links = self.le_maxpage.extract_links(response)
            max_page_number = 0                                                 # Initialize the maximum page number
            for link in links:
                if link.url.count('/') == 6 and link.url.endswith('/'):         # Select only pages with a link depth of 3
                    print("The link is %s" % link.url)
                    page_number = int(link.url.split("/")[-2].strip('p'))       # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                    if page_number > max_page_number:
                        max_page_number = page_number                           # Update the maximum page number if the current value is larger than its previous value
            print("The maximum page number is %s" % max_page_number)
            place_name = link.url.split("/")[-3]                                # For example, "amsterdam" in 'http://www.funda.nl/koop/amsterdam/p10/'
            print("The place name is %s" % place_name)
            filename = str(place_name)+"_max_pages.txt"                         # File name with as prefix the place name
            with open(filename, 'w') as f:                                  # Open in text mode so the formatted string can be written directly
                f.write('max_page_number = %s' % max_page_number)           # Write the maximum page number to a text file
            yield {'max_page_number': max_page_number}
    
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    
    process.crawl(FundaMaxPagesSpider)
    process.start() # the script will block here until the crawling is finished
    

    I also adapted the spider so that it runs as a standalone script. The script produces a text file amsterdam_max_pages.txt containing the single line max_page_number = 257.

    You are still crawling 7 URLs, though; you are just overwriting the same file 7 times with max_page_number = 257...
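
    A minimal sketch (an assumption building on this comment, not code from the thread) that avoids both the extra requests and the repeated writes: use a plain scrapy.Spider that requests only the start page, pulls the page numbers out of the pager links with a regex, and writes the file a single time. The spider name and the '/p<number>/' regex are illustrative:

    import re
    import scrapy

    class FundaMaxPageFileSpider(scrapy.Spider):
        name = "Funda_max_page_file"                                    # Hypothetical name, for illustration only
        allowed_domains = ["funda.nl"]
        start_urls = ["http://www.funda.nl/koop/amsterdam/"]

        def parse(self, response):
            # Collect every '/p<number>/' pager link that appears on the start page itself.
            page_numbers = [int(n) for n in re.findall(r'/p(\d+)/', response.text)]
            if page_numbers:
                max_page_number = max(page_numbers)
                place_name = response.url.rstrip('/').split('/')[-1]    # For example, "amsterdam"
                with open('%s_max_pages.txt' % place_name, 'w') as f:   # Only one page is crawled, so this runs once
                    f.write('max_page_number = %s' % max_page_number)
                yield {'max_page_number': max_page_number}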