Extracting data from AJAX with Python


I am trying to extract data (title, price, and description) from an AJAX response, but I can't manage it, even after changing the User-Agent.

Link:
AJAX (data to extract):

Error log:

2020-09-07 20:34:39 [scrapy.core.engine] INFO: Spider opened
2020-09-07 20:34:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-07 20:34:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-07 20:34:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://scrapingclub.com/robots.txt> (referer: None)
2020-09-07 20:34:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://scrapingclub.com/exercise/ajaxdetail_header/> (referer: None)
2020-09-07 20:34:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://scrapingclub.com/exercise/ajaxdetail_header/>: HTTP status code is not handled or not allowed

An AJAX request should send the header

 'X-Requested-With': 'XMLHttpRequest'

Not all servers check for it, but this server does. It does not, however, check the
User-Agent.

The server sends the data as
JSON
, so
xpath
will be useless.
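Since the payload is JSON, it gets parsed into a plain dict rather than queried with selectors. A minimal illustration (the payload string here is a hypothetical sample in the same shape as the server's response):

```python
import json

# hypothetical payload in the same shape as the server's JSON response
payload = '{"title": "Crinkled Flounced Blouse", "price": "$24.99"}'

data = json.loads(payload)  # a plain dict, not an HTML tree
print(data['price'])        # fields are read by key; no XPath needed
```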


I tested it with
requests
instead of
scrapy
because it is simpler for me:

import requests

headers = {
    #'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',
}

url = 'https://scrapingclub.com/exercise/ajaxdetail_header/'

response = requests.get(url, headers=headers)
data = response.json()

print(data)
print('type:', type(data))
print('keys:', data.keys())
print('--- manually ---')
print('price:', data['price'])
print('title:', data['title'])
print('--- for-loop ---')
for key, value in data.items():
    print('{}: {}'.format(key, value))
Result:

{'img_path': '/static/img/00959-A.jpg', 'price': '$24.99', 'description': 'Blouse in airy, crinkled fabric with a printed pattern. Small stand-up collar, concealed buttons at front, and flounces at front. Long sleeves with buttons at cuffs. Rounded hem. 100% polyester. Machine wash cold.', 'title': 'Crinkled Flounced Blouse'}
type: <class 'dict'>
keys: dict_keys(['img_path', 'price', 'description', 'title'])
--- manually ---
price: $24.99
title: Crinkled Flounced Blouse
--- for-loop ---
img_path: /static/img/00959-A.jpg
price: $24.99
description: Blouse in airy, crinkled fabric with a printed pattern. Small stand-up collar, concealed buttons at front, and flounces at front. Long sleeves with buttons at cuffs. Rounded hem. 100% polyester. Machine wash cold.
title: Crinkled Flounced Blouse
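To save the extracted fields instead of only printing them, the dict can be written out with the stdlib csv module. A minimal sketch (the sample record and the filename output.csv are my choices, not from the original code):

```python
import csv

# sample record in the same shape as the JSON the server returns
data = {
    'img_path': '/static/img/00959-A.jpg',
    'price': '$24.99',
    'title': 'Crinkled Flounced Blouse',
}

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price'])
    writer.writeheader()
    # keep only the columns declared in fieldnames
    writer.writerow({k: data[k] for k in ('title', 'price')})
```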

EDIT:

The same using Scrapy:


import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    def start_requests(self):
        url = 'https://scrapingclub.com/exercise/ajaxdetail_header/'

        headers = {
            #'User-Agent': 'Mozilla/5.0',
            'X-Requested-With': 'XMLHttpRequest',
        }

        yield scrapy.http.Request(url, headers=headers)
        
    def parse(self, response):
        print('url:', response.url)

        data = response.json()

        print(data)
        print('type:', type(data))
        print('keys:', data.keys())
        print('--- manually ---')
        print('price:', data['price'])
        print('title:', data['title'])
        print('--- for-loop ---')
        for key, value in data.items():
            print('{}: {}'.format(key, value))

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    #'FEED_FORMAT': 'csv',     # csv, json, xml
    #'FEED_URI': 'output.csv', #
})
c.crawl(MySpider)
c.start()
And the same using start_urls with DEFAULT_REQUEST_HEADERS:

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = ['https://scrapingclub.com/exercise/ajaxdetail_header/']

    def parse(self, response):
        print('url:', response.url)
        #print('headers:', response.request.headers)
        
        data = response.json()

        print(data)
        print('type:', type(data))
        print('keys:', data.keys())
        print('--- manually ---')
        print('price:', data['price'])
        print('title:', data['title'])
        print('--- for-loop ---')
        for key, value in data.items():
            print('{}: {}'.format(key, value))

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    'DEFAULT_REQUEST_HEADERS': {
        #'User-Agent': 'Mozilla/5.0',
        'X-Requested-With': 'XMLHttpRequest',
    },
    # save in file CSV, JSON or XML
    #'FEED_FORMAT': 'csv',     # csv, json, xml
    #'FEED_URI': 'output.csv', #
})
c.crawl(MySpider)
c.start()
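A note on the commented-out feed settings above: the feed exporters only write rows for items that parse() yields (printing alone produces an empty file), and on newer Scrapy versions (2.1+) the FEED_FORMAT/FEED_URI pair is replaced by the FEEDS setting. A sketch of the settings dict under that assumption:

```python
from scrapy.crawler import CrawlerProcess

# sketch: feed exporters only record items that parse() yields,
# so `yield data` would be needed in place of the print() calls
c = CrawlerProcess({
    'DEFAULT_REQUEST_HEADERS': {
        'X-Requested-With': 'XMLHttpRequest',
    },
    # Scrapy >= 2.1 style; older versions use FEED_FORMAT/FEED_URI
    'FEEDS': {
        'output.csv': {'format': 'csv'},
    },
})
```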