Python: Scrapy crawls pages but doesn't scrape them
I am scraping data from a website and found their AJAX source link, which has this format:
f"https://www.openingstijden.nl/ajax-resultaten/?page={page_num}&business_id={business_id}&no-limit=1"
where page_num is the page number and business_id is the id of the store. Example:
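As a quick illustration of the format above (a minimal sketch; ajax_url is a hypothetical helper, not something the site or Scrapy provides):

```python
# Hypothetical helper that expands the AJAX URL format described above.
def ajax_url(page_num: int, business_id: str) -> str:
    return (
        "https://www.openingstijden.nl/ajax-resultaten/"
        f"?page={page_num}&business_id={business_id}&no-limit=1"
    )

print(ajax_url(1, "5688"))
# → https://www.openingstijden.nl/ajax-resultaten/?page=1&business_id=5688&no-limit=1
```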
Here is my code:
class SupermarketScraper(scrapy.Spider):
    name = "supermarkets"

    def __init__(self):
        self.images = ast.literal_eval(pkg_resources.read_text(resources, 'times.json'))
        self.geodata = set()
        sentry_sdk.init(dsn=SENTRY_SDK_DSN)
        self.session = requests.Session()

    def start_requests(self):
        b_ids = ['5688']  # [item['b_id'] for item in business_ids]
        page_num = 1
        for b_id in b_ids:
            url = f'https://www.openingstijden.nl/ajax-resultaten/?page={page_num}&business_id={b_id}&no-limit=1'
            yield scrapy.Request(url=url, callback=self.parse, cb_kwargs=dict(b_id=b_id, page_num=page_num))

    def parse(self, response, b_id, page_num):
        items = response.css('li')
        if items:
            for item in items:
                link = item.css('a.btn::attr(href)').get()
                geodata = item.css('li::attr(data-geo)').get().split(',')
                if (geodata[0], geodata[1]) not in self.geodata:
                    self.geodata.add((geodata[0], geodata[1]))
                else:
                    return
                yield response.follow(link, callback=self.parse_store, cb_kwargs=dict(geodata=geodata, b_id=b_id, page_num=page_num))
            new_url = f'http://www.openingstijden.nl/ajax-resultaten/?page={page_num+1}&business_id={b_id}&no-limit=1'
            yield response.follow(new_url, callback=self.parse, cb_kwargs=dict(b_id=b_id, page_num=page_num+1))

    def parse_store(self, response, geodata, b_id, page_num):
        name = response.css('#est-name::text')[0].get()
        location = response.css('span.locate span::text').getall()
        street, postal_code, city = location[0:3]
        location = LocationItem()
        location['street'] = street
        location['postal_code'] = postal_code
        location['city'] = city
        location['lat'] = geodata[0]
        location['lng'] = geodata[1]
        item = SupermarketItem()
        item['business_id'] = b_id
        item['name'] = name
        item['location'] = location
        item['page_num'] = page_num
        yield item
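One detail of the parse() loop worth noting in isolation (it may or may not relate to the missing items): the else: return ends the whole callback on the first already-seen geodata pair, so every remaining li on that page, and the pagination request after the loop, is skipped. A minimal sketch of just that loop with the Scrapy parts stripped out (filter_new is a hypothetical stand-in):

```python
# Hypothetical stand-in for the dedup loop in parse(): note the early return.
def filter_new(pairs, seen):
    out = []
    for lat, lng in pairs:
        if (lat, lng) not in seen:
            seen.add((lat, lng))
        else:
            return out  # first duplicate ends the whole page early
        out.append((lat, lng))
    return out

seen = set()
page = [("52.0", "4.3"), ("52.0", "4.3"), ("51.9", "4.2")]
print(filter_new(page, seen))
# → [('52.0', '4.3')] — the unseen third pair is never reached
```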
The code runs without errors; however, it doesn't scrape all the data. Some pages appear to be crawled without any data being scraped from them, as shown here:
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Den-Haag/Prinses-Beatrixlaan-590/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/De-Meern/Mereveldplein-13/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Voorschoten/Schoolstraat-1/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Den-Haag/Stationsplein-75/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Brielle/Thoelaverweg-2/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.openingstijden.nl/Albert-Heijn/Oude-Tonge/Dabbehof-34/> (referer: https://www.openingstijden.nl/ajax-resultaten/?page=20&business_id=5688&no-limit=1)
2020-11-11 09:23:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.openingstijden.nl/Albert-Heijn/Den-Haag/Johanna-Westerdijkplein-64/>
{'business_id': '5688',
'location': {'city': 'Den Haag',
'lat': '52.0674',
'lng': '4.32342',
'postal_code': '2521EN',
'street': 'Johanna Westerdijkplein 64'},
'name': 'Albert Heijn',
'page_num': 20}
The last link is scraped correctly; however, all of the links before it are crawled but return nothing. No errors are thrown, and the spider closes without having scraped all the data.
I have already tried changing my user agent to
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
but without success. All of these pages also have the same web structure and can easily be scraped individually.
Is the server actively blocking the scraper? Why is the data from these URLs being missed by the scraper?

Comment: If the log says "Crawled", your callback does get called with the corresponding response. You could log the response content to see what you actually got back from the server and check whether it matches your expectations.

Reply: When I log the response content, it does match my expectations, and scraping the missed URLs one by one returns what I want. So the scraper is simply skipping them for no clear reason.
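The logging suggestion above can be sketched without Scrapy; everything here (check_pages, the marker string, the example URLs) is hypothetical and only illustrates the idea of recording which responses lack the element the spider selects on:

```python
# Hypothetical diagnostic: flag responses whose body lacks the expected
# element (a crude stand-in for response.css('#est-name') matching nothing).
def check_pages(pages):
    missing = []
    for url, body in pages.items():
        if 'id="est-name"' not in body:
            missing.append(url)
    return missing

pages = {
    "https://example.com/store-a": '<h1 id="est-name">Albert Heijn</h1>',
    "https://example.com/store-b": "<html>empty or blocked</html>",
}
print(check_pages(pages))
# → ['https://example.com/store-b']
```

In a real spider the same check would go inside the callback, logging response.url whenever the selector comes back empty.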