Python 从ajax中提取数据
我试图从ajax中提取数据(标题、价格和描述),但即使通过更改用户代理也无法实现 链接:Python 从ajax中提取数据,python,python-3.x,ajax,scrapy,Python,Python 3.x,Ajax,Scrapy,我试图从ajax中提取数据(标题、价格和描述),但即使通过更改用户代理也无法实现 链接: Ajax(要提取的数据): 错误日志: 2020-09-07 20:34:39 [scrapy.core.engine] INFO: Spider opened 2020-09-07 20:34:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 20
Ajax(要提取的数据): 错误日志:
2020-09-07 20:34:39 [scrapy.core.engine] INFO: Spider opened
2020-09-07 20:34:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-07 20:34:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-07 20:34:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://scrapingclub.com/robots.txt> (referer: None)
2020-09-07 20:34:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://scrapingclub.com/exercise/ajaxdetail_header/> (referer: None)
2020-09-07 20:34:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://scrapingclub.com/exercise/ajaxdetail_header/>: HTTP status code is not handled or not allowed
2020-09-07 20:34:39[刮屑核心引擎]信息:蜘蛛打开
2020-09-07 20:34:39[scrapy.extensions.logstats]信息:抓取0页(以0页/分钟的速度),抓取0项(以0项/分钟的速度)
2020-09-07 20:34:39[scrapy.extensions.telnet]信息:telnet控制台监听127.0.0.1:6023
2020-09-07 20:34:40[scrapy.core.engine]调试:爬网(404)(参考:无)
2020-09-07 20:34:40[scrapy.core.engine]调试:爬网(403)(参考:无)
2020-09-07 20:34:40[scrapy.spidermiddleware.httperror]信息:忽略响应:HTTP状态代码未处理或不允许
AJAX
请求应发送头
'X-Requested-With': 'XMLHttpRequest'
但并非所有服务器都会检查它。但是这个服务器检查一下。但它不检查用户代理
服务器以JSON
的形式发送数据,因此xpath
将毫无用处
我用
requests
而不是scrapy
测试它,因为它对我来说更简单
import requests
headers = {
#'User-Agent': 'Mozilla/5.0',
'X-Requested-With': 'XMLHttpRequest',
}
url = 'https://scrapingclub.com/exercise/ajaxdetail_header/'
response = requests.get(url, headers=headers)
data = response.json()
print(data)
print('type:', type(data))
print('keys:', data.keys())
print('--- manually ---')
print('price:', data['price'])
print('title:', data['title'])
print('--- for-loop ---')
for key, value in data.items():
print('{}: {}'.format(key, value))
结果:
{'img_path': '/static/img/00959-A.jpg', 'price': '$24.99', 'description': 'Blouse in airy, crinkled fabric with a printed pattern. Small stand-up collar, concealed buttons at front, and flounces at front. Long sleeves with buttons at cuffs. Rounded hem. 100% polyester. Machine wash cold.', 'title': 'Crinkled Flounced Blouse'}
type: <class 'dict'>
keys: dict_keys(['img_path', 'price', 'description', 'title'])
--- manually ---
price: $24.99
title: Crinkled Flounced Blouse
--- for-loop ---
img_path: /static/img/00959-A.jpg
price: $24.99
description: Blouse in airy, crinkled fabric with a printed pattern. Small stand-up collar, concealed buttons at front, and flounces at front. Long sleeves with buttons at cuffs. Rounded hem. 100% polyester. Machine wash cold.
title: Crinkled Flounced Blouse
编辑: 使用相同的设置
如果它需要AJAX请求,那么它可能需要标题
“X-request-With':“XMLHttpRequest”
谢谢您的帮助!!
{'img_path': '/static/img/00959-A.jpg', 'price': '$24.99', 'description': 'Blouse in airy, crinkled fabric with a printed pattern. Small stand-up collar, concealed buttons at front, and flounces at front. Long sleeves with buttons at cuffs. Rounded hem. 100% polyester. Machine wash cold.', 'title': 'Crinkled Flounced Blouse'}
type: <class 'dict'>
keys: dict_keys(['img_path', 'price', 'description', 'title'])
--- manually ---
price: $24.99
title: Crinkled Flounced Blouse
--- for-loop ---
img_path: /static/img/00959-A.jpg
price: $24.99
description: Blouse in airy, crinkled fabric with a printed pattern. Small stand-up collar, concealed buttons at front, and flounces at front. Long sleeves with buttons at cuffs. Rounded hem. 100% polyester. Machine wash cold.
title: Crinkled Flounced Blouse
import scrapy
import json
class MySpider(scrapy.Spider):
name = 'myspider'
def start_requests(self):
url = 'https://scrapingclub.com/exercise/ajaxdetail_header/'
headers = {
#'User-Agent': 'Mozilla/5.0',
'X-Requested-With': 'XMLHttpRequest',
}
yield scrapy.http.Request(url, headers=headers)
def parse(self, response):
print('url:', response.url)
data = response.json()
print(data)
print('type:', type(data))
print('keys:', data.keys())
print('--- manually ---')
print('price:', data['price'])
print('title:', data['title'])
print('--- for-loop ---')
for key, value in data.items():
print('{}: {}'.format(key, value))
# --- run without project and save in `output.csv` ---
from scrapy.crawler import CrawlerProcess
c = CrawlerProcess({
#'USER_AGENT': 'Mozilla/5.0',
# save in file CSV, JSON or XML
#'FEED_FORMAT': 'csv', # csv, json, xml
#'FEED_URI': 'output.csv', #
})
c.crawl(MySpider)
c.start()
import scrapy
import json
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['https://scrapingclub.com/exercise/ajaxdetail_header/']
def parse(self, response):
print('url:', response.url)
#print('headers:', response.request.headers)
data = response.json()
print(data)
print('type:', type(data))
print('keys:', data.keys())
print('--- manually ---')
print('price:', data['price'])
print('title:', data['title'])
print('--- for-loop ---')
for key, value in data.items():
print('{}: {}'.format(key, value))
# --- run without project and save in `output.csv` ---
from scrapy.crawler import CrawlerProcess
c = CrawlerProcess({
#'USER_AGENT': 'Mozilla/5.0',
'DEFAULT_REQUEST_HEADERS': {
#'User-Agent': 'Mozilla/5.0',
'X-Requested-With': 'XMLHttpRequest',
}
# save in file CSV, JSON or XML
#'FEED_FORMAT': 'csv', # csv, json, xml
#'FEED_URI': 'output.csv', #
})
c.crawl(MySpider)
c.start()