Python scraper crawls but doesn't scrape


I am scraping data from openingstijden.nl and found their AJAX source link, which has this format:

f"https://www.openingstijden.nl/ajax-resultaten/?page={page_num}&business_id={business_id}&no-limit=1"

where page_num is the page number and business_id is the store ID.
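
For context, here is a minimal sketch (plain requests, outside Scrapy, and not part of the original spider) of fetching one page of this endpoint to inspect what it returns; business_id 5688 and page 1 match the values used in the spider below:

import requests

# Sketch only: fetch a single page of the AJAX listing endpoint
# and print its status and body size for inspection.
url = 'https://www.openingstijden.nl/ajax-resultaten/?page=1&business_id=5688&no-limit=1'
response = requests.get(url)
print(response.status_code, len(response.text))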

Here is my code:

import ast

import requests
import scrapy
import sentry_sdk

# Project-specific names assumed from the original code: pkg_resources,
# resources, SENTRY_SDK_DSN, business_ids, LocationItem and
# SupermarketItem are defined elsewhere in the project.


class SupermarketScraper(scrapy.Spider):
    name = "supermarkets"

    def __init__(self):
        # Load known opening times and prepare a dedup set for coordinates.
        self.images = ast.literal_eval(pkg_resources.read_text(resources, 'times.json'))
        self.geodata = set()
        sentry_sdk.init(dsn=SENTRY_SDK_DSN)
        self.session = requests.Session()

    def start_requests(self):
        b_ids = ['5688']  # [item['b_id'] for item in business_ids]
        page_num = 1
        for b_id in b_ids:
            url = f'https://www.openingstijden.nl/ajax-resultaten/?page={page_num}&business_id={b_id}&no-limit=1'
            yield scrapy.Request(url=url, callback=self.parse, cb_kwargs=dict(b_id=b_id, page_num=page_num))

    def parse(self, response, b_id, page_num):
        items = response.css('li')
        if items:
            for item in items:
                link = item.css('a.btn::attr(href)').get()
                geodata = item.css('li::attr(data-geo)').get().split(',')
                if (geodata[0], geodata[1]) not in self.geodata:
                    self.geodata.add((geodata[0], geodata[1]))
                else:
                    # Note: this return exits parse() entirely, so the
                    # remaining items on this page and the next-page
                    # request below are never yielded.
                    return

                yield response.follow(link, callback=self.parse_store, cb_kwargs=dict(geodata=geodata, b_id=b_id, page_num=page_num))

            new_url = f'http://www.openingstijden.nl/ajax-resultaten/?page={page_num+1}&business_id={b_id}&no-limit=1'
            yield response.follow(new_url, callback=self.parse, cb_kwargs=dict(b_id=b_id, page_num=page_num+1))

    def parse_store(self, response, geodata, b_id, page_num):
        name = response.css('#est-name::text')[0].get()
        location = response.css('span.locate span::text').getall()
        street, postal_code, city = location[0:3]

        location = LocationItem()
        location['street'] = street
        location['postal_code'] = postal_code
        location['city'] = city
        location['lat'] = geodata[0]
        location['lng'] = geodata[1]

        item = SupermarketItem()
        item['business_id'] = b_id
        item['name'] = name
        item['location'] = location
        item['page_num'] = page_num
        yield item
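
For reference, assuming the spider lives in a standard Scrapy project, it would be run with something like:

scrapy crawl supermarkets -o supermarkets.json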
The code runs without errors; however, it does not scrape all the data. Some pages appear to have been crawled without any data being scraped, as shown below:

2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Den-Haag/Prinses-Beatrixlaan-590/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/De-Meern/Mereveldplein-13/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Voorschoten/Schoolstraat-1/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Den-Haag/Stationsplein-75/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Brielle/Thoelaverweg-2/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Oude-Tonge/Dabbehof-34/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.openingstijden.nl/Albert-Heijn/Den-Haag/Johanna-Westerdijkplein-64/>
{'business_id': '5688',
 'location': {'city': 'Den Haag',
 'lat': '52.0674',
 'lng': '4.32342',
 'postal_code': '2521EN',
 'street': 'Johanna Westerdijkplein 64'},
 'name': 'Albert Heijn',
 'page_num': 20}

The last link is scraped correctly, but all the links before it are crawled without returning anything. No errors are thrown, and the spider closes without having scraped all the data.
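
One way to quantify the crawled-versus-scraped gap is to compare Scrapy's built-in crawl stats when the spider closes. This is a sketch added to the spider class for illustration, not part of the original code:

    def closed(self, reason):
        # Sketch: compare responses received with items scraped.
        # A large gap means many callbacks returned without yielding.
        stats = self.crawler.stats
        self.logger.info('responses: %s, items: %s',
                         stats.get_value('response_received_count'),
                         stats.get_value('item_scraped_count'))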

I have already tried changing my user agent to

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'

but without success. All of these pages also share the same web structure and can easily be scraped individually.
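
For what it's worth, in Scrapy the user agent is normally applied through the USER_AGENT setting, either project-wide in settings.py or per spider via custom_settings; a sketch of the per-spider variant:

class SupermarketScraper(scrapy.Spider):
    name = "supermarkets"
    # Overrides the project-wide USER_AGENT for this spider only.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/86.0.4240.111 Safari/537.36'),
    }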


Is the server actively blocking the scraper? Why is the data from these URLs being missed by the scraper?

Comment: If it says "crawled", your callback gets called with the corresponding response. You can log the response content to see what you actually received from the server and check whether it matches your expectations.

Reply: When I log the response content, it does match my expectations, and scraping these missing URLs one by one returns exactly what I want. So the scraper is simply skipping them for unclear reasons.
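
As a sketch of the logging suggested above (added to the existing parse callback; not in the original code), one way to check what each "crawled" response actually contains:

    def parse(self, response, b_id, page_num):
        # Sketch: log status, size and listing count before parsing,
        # to confirm the server returned the expected listing HTML.
        self.logger.info('page %s status=%s bytes=%s li-count=%s',
                         page_num, response.status, len(response.body),
                         len(response.css('li')))
        ...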