Python: Scrapy crawls pages but doesn't scrape them
I am scraping data from a website and found their AJAX source link, which has this format:
f"https://www.openingstijden.nl/ajax-resultaten/?page={page_num}&business_id={business_id}&no-limit=1"
where page_num is the page number and business_id is the id of the store. Example:
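As a quick illustration of the format above (a minimal sketch; ajax_url is a hypothetical helper, not something the site or Scrapy provides):

```python
# Hypothetical helper that expands the AJAX URL format described above.
def ajax_url(page_num: int, business_id: str) -> str:
    return (
        "https://www.openingstijden.nl/ajax-resultaten/"
        f"?page={page_num}&business_id={business_id}&no-limit=1"
    )

print(ajax_url(1, "5688"))
# → https://www.openingstijden.nl/ajax-resultaten/?page=1&business_id=5688&no-limit=1
```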
Here is my code:
class SupermarketScraper(scrapy.Spider):
    name = "supermarkets"

    def __init__(self):
        self.images = ast.literal_eval(pkg_resources.read_text(resources, 'times.json'))
        self.geodata = set()
        sentry_sdk.init(dsn=SENTRY_SDK_DSN)
        self.session = requests.Session()

    def start_requests(self):
        b_ids = ['5688']  # [item['b_id'] for item in business_ids]
        page_num = 1
        for b_id in b_ids:
            url = f'https://www.openingstijden.nl/ajax-resultaten/?page={page_num}&business_id={b_id}&no-limit=1'
            yield scrapy.Request(url=url, callback=self.parse, cb_kwargs=dict(b_id=b_id, page_num=page_num))

    def parse(self, response, b_id, page_num):
        items = response.css('li')
        if items:
            for item in items:
                link = item.css('a.btn::attr(href)').get()
                geodata = item.css('li::attr(data-geo)').get().split(',')
                if (geodata[0], geodata[1]) not in self.geodata:
                    self.geodata.add((geodata[0], geodata[1]))
                else:
                    return
                yield response.follow(link, callback=self.parse_store, cb_kwargs=dict(geodata=geodata, b_id=b_id, page_num=page_num))
            new_url = f'http://www.openingstijden.nl/ajax-resultaten/?page={page_num+1}&business_id={b_id}&no-limit=1'
            yield response.follow(new_url, callback=self.parse, cb_kwargs=dict(b_id=b_id, page_num=page_num+1))

    def parse_store(self, response, geodata, b_id, page_num):
        name = response.css('#est-name::text')[0].get()
        location = response.css('span.locate span::text').getall()
        street, postal_code, city = location[0:3]
        location = LocationItem()
        location['street'] = street
        location['postal_code'] = postal_code
        location['city'] = city
        location['lat'] = geodata[0]
        location['lng'] = geodata[1]
        item = SupermarketItem()
        item['business_id'] = b_id
        item['name'] = name
        item['location'] = location
        item['page_num'] = page_num
        yield item
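One detail of the parse() loop worth noting in isolation (it may or may not relate to the missing items): the else: return ends the whole callback on the first already-seen geodata pair, so every remaining li on that page, and the pagination request after the loop, is skipped. A minimal sketch of just that loop with the Scrapy parts stripped out (filter_new is a hypothetical stand-in):

```python
# Hypothetical stand-in for the dedup loop in parse(): note the early return.
def filter_new(pairs, seen):
    out = []
    for lat, lng in pairs:
        if (lat, lng) not in seen:
            seen.add((lat, lng))
        else:
            return out  # first duplicate ends the whole page early
        out.append((lat, lng))
    return out

seen = set()
page = [("52.0", "4.3"), ("52.0", "4.3"), ("51.9", "4.2")]
print(filter_new(page, seen))
# → [('52.0', '4.3')] — the unseen third pair is never reached
```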
The code runs without errors; however, it doesn't scrape all the data. Some pages appear to be crawled without any data being scraped from them, as shown here:
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Den-Haag/Prinses-Beatrixlaan-590/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/De-Meern/Mereveldplein-13/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Voorschoten/Schoolstraat-1/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Den-Haag/Stationsplein-75/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Brielle/Thoelaverweg-2/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Oude-Tonge/Dabbehof-34/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.openingstijden.nl/Albert-Heijn/Den-Haag/Johanna-Westerdijkplein-64/>
{'business_id': '5688',
'location': {'city': 'Den Haag',
'lat': '52.0674',
'lng': '4.32342',
'postal_code': '2521EN',
'street': 'Johanna Westerdijkplein 64'},
'name': 'Albert Heijn',
'page_num': 20}
The last link is scraped correctly; however, all of the links before it are crawled but return nothing. No errors are thrown, and the spider closes without having scraped all the data.
I have already tried changing my user agent to
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
but without success. All of these pages also have the same web structure and can easily be scraped individually.
Is the server actively blocking the scraper? Why is the data from these URLs being missed by the scraper?

Comment: If the log says "Crawled", your callback does get called with the corresponding response. You could log the response content to see what you actually got back from the server and check whether it matches your expectations.

Reply: When I log the response content, it does match my expectations, and scraping the missed URLs one by one returns what I want. So the scraper is simply skipping them for no clear reason.
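The logging suggestion above can be sketched without Scrapy; everything here (check_pages, the marker string, the example URLs) is hypothetical and only illustrates the idea of recording which responses lack the element the spider selects on:

```python
# Hypothetical diagnostic: flag responses whose body lacks the expected
# element (a crude stand-in for response.css('#est-name') matching nothing).
def check_pages(pages):
    missing = []
    for url, body in pages.items():
        if 'id="est-name"' not in body:
            missing.append(url)
    return missing

pages = {
    "https://example.com/store-a": '<h1 id="est-name">Albert Heijn</h1>',
    "https://example.com/store-b": "<html>empty or blocked</html>",
}
print(check_pages(pages))
# → ['https://example.com/store-b']
```

In a real spider the same check would go inside the callback, logging response.url whenever the selector comes back empty.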