Python Scrapy - wait for all yielded requests to finish


Hey, I just started using Scrapy. I'm writing a basic crawler for a website. The site uses AJAX requests to fetch all the data related to a single product in JSON format. Here is my code:

def parse_item(self, response):
    self.n += 1
    print("inside parse_item => ", self.n)

    popupitem = PopupItem()
    popupitem["url"] = response.url
    item_desc_api = self.get_item_desc_api(response)
    print("url to call =>", item_desc_api)
    # call the API URL to get the item's description
    yield scrapy.Request(item_desc_api, callback=self.parse_item_from_api,
                         meta={"popupitem": popupitem})

def parse_item_from_api(self, response):
    self.m += 1
    print("inside parse_item_from_api =>", self.m)
    popupitem = response.meta["popupitem"]
    # parse the JSON API response (not yet used to fill popupitem here)
    jsonresponse = json.loads(response.body_as_unicode())
    yield popupitem
I used two counters, n and m, to show how many times parse_item (n) and parse_item_from_api (m) are called.

Problem

When I run this code, it only shows n -> 116 and m -> 37. So the program exits before all the yielded requests have been processed, and only 37 items are stored in the output JSON file. How can I make sure that all yielded requests are processed before the program exits?

Scrapy log

2017-06-13 13:37:40 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-13 13:37:40 [scrapy.extensions.feedexport] INFO: Stored json feed (37 items) in: out.json
2017-06-13 13:37:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 93446,
'downloader/request_count': 194,
'downloader/request_method_count/GET': 194,
'downloader/response_bytes': 1808706,
'downloader/response_count': 194,
'downloader/response_status_count/200': 193,
'downloader/response_status_count/301': 1,
'dupefilter/filtered': 154,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 6, 13, 8, 37, 40, 576449),
'item_scraped_count': 37,
'log_count/DEBUG': 233,
'log_count/INFO': 8,
'request_depth_max': 3,
'response_received_count': 193,
'scheduler/dequeued': 193,
'scheduler/dequeued/memory': 193,
'scheduler/enqueued': 193,
'scheduler/enqueued/memory': 193,
'start_time': datetime.datetime(2017, 6, 13, 8, 37, 17, 124336)}
2017-06-13 13:37:40 [scrapy.core.engine] INFO: Spider closed (finished)
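
One detail in these stats is worth noting: dupefilter/filtered: 154 means Scrapy silently dropped 154 requests as duplicates of URLs it had already seen, which is consistent with parse_item running 116 times while parse_item_from_api only ran 37 times. If several product pages resolve to the same API URL and every one of those requests should still be made, a minimal sketch (reusing the parse_item shown above; dont_filter is a standard scrapy.Request argument) would be:

def parse_item(self, response):
    popupitem = PopupItem()
    popupitem["url"] = response.url
    item_desc_api = self.get_item_desc_api(response)
    # dont_filter=True tells the duplicate filter not to drop this request,
    # even if the same API URL has already been requested before
    yield scrapy.Request(item_desc_api,
                         callback=self.parse_item_from_api,
                         meta={"popupitem": popupitem},
                         dont_filter=True)

Whether that is the right fix depends on whether those API calls are real duplicates; if they are, the 37 items may already be the complete set.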

Create a list of all the requests you want to make, then chain them: each callback pulls the next URL off the list and yields it, so the spider only finishes once the list is empty.

all_requests = ['https://website.com/1', 'https://website.com/2', 'https://website.com/3']

link = all_requests.pop()  # extract the one request to make

# make the first request
yield Request(url=link, callback=self.parse_1,
              meta={'remaining_links': all_requests, 'data': []})

def parse_1(self, response):

    # data accumulated so far, passed along through meta
    data = response.meta['data']

    # ... GRAB YOUR DATA FROM RESPONSE and add it to data

    remaining_links = response.meta['remaining_links']

    # if there are more requests to make
    if len(remaining_links) > 0:
        link = remaining_links.pop()  # extract one request to make

        yield Request(url=link, callback=self.parse_1,
                      meta={'remaining_links': remaining_links, 'data': data})

    else:
        # nothing left: everything has been collected
        yield data
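
For completeness, here is a self-contained sketch of that chaining idea as a full spider (the ChainSpider name and the example URLs are placeholders, not taken from the question). Because each callback yields the next Request only after handling the previous response, Scrapy cannot close the spider until the whole list has been worked through:

import scrapy

class ChainSpider(scrapy.Spider):
    # hypothetical spider, purely for illustration
    name = "chain_example"

    def start_requests(self):
        all_requests = ['https://website.com/1',
                        'https://website.com/2',
                        'https://website.com/3']
        link = all_requests.pop()
        yield scrapy.Request(url=link, callback=self.parse_1,
                             meta={'remaining_links': all_requests, 'data': []})

    def parse_1(self, response):
        data = response.meta['data']
        # grab whatever you need from this response and accumulate it
        data.append({'url': response.url})

        remaining_links = response.meta['remaining_links']
        if remaining_links:
            link = remaining_links.pop()
            # chain the next request; the spider stays open until the list is empty
            yield scrapy.Request(url=link, callback=self.parse_1,
                                 meta={'remaining_links': remaining_links, 'data': data})
        else:
            # all requests done: yield everything as one item
            yield {'products': data}

The trade-off is that the requests are now made strictly one at a time, so you lose Scrapy's concurrency; this pattern is mainly useful when you genuinely need all the data gathered into a single item at the end.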

Anyone, please help. I'm stuck at a dead end :(

Does your crawl log show some requests being filtered? Did you get all HTTP 200s, or did some requests fail (404, 500, ...)? That should show up in the stats at the end.

Your crawl did receive 193 HTTP 200 responses, so everything is fine on that front. You'll have to share your whole spider, or show the full crawl log of a crawl that ends too early.

@paultrmbrth Please find my whole code in the following gist.

I'm not going to run your program myself, so I think you also need to share a crawl log with LOG_LEVEL='DEBUG'. Scrapy will not finish if there are still requests to process (requests that you have yielded in your callbacks), so make sure you are really yielding the expected number of requests (for example with self.logger.debug('yielding request to %r' % someurl) statements).
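
As a concrete way to follow that last suggestion, here is a small sketch (again reusing the asker's parse_item; the log message text is just an example) that logs every request before yielding it, so the DEBUG log can be compared against the request and item counts in the final stats:

def parse_item(self, response):
    popupitem = PopupItem()
    popupitem["url"] = response.url
    item_desc_api = self.get_item_desc_api(response)
    # log each API request we are about to yield, so the crawl log shows
    # exactly how many requests this callback generated
    self.logger.debug('yielding request to %r', item_desc_api)
    yield scrapy.Request(item_desc_api, callback=self.parse_item_from_api,
                         meta={"popupitem": popupitem})

Counting those log lines and comparing them with dupefilter/filtered and item_scraped_count in the stats quickly shows whether requests are being dropped as duplicates or simply never yielded.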