Python: extracting data from AJAX

python python-3.x ajax scrapy

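I am trying to extract data (title, price, and description) from an AJAX response, but I cannot get it to work, even after changing the User-Agent.

Link:
AJAX (the data to extract):

Error log: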

2020-09-07 20:34:39 [scrapy.core.engine] INFO: Spider opened
2020-09-07 20:34:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-07 20:34:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-07 20:34:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://scrapingclub.com/robots.txt> (referer: None)
2020-09-07 20:34:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://scrapingclub.com/exercise/ajaxdetail_header/> (referer: None)
2020-09-07 20:34:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://scrapingclub.com/exercise/ajaxdetail_header/>: HTTP status code is not handled or not allowed

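An AJAX request should send the header

 'X-Requested-With': 'XMLHttpRequest'

Not every server checks for it, but this one does. It does not check the User-Agent, however.

The server sends the data as JSON, so xpath is useless here.

I tested it with requests instead of scrapy because that is simpler for me.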

import requests

headers = {
    #'User-Agent': 'Mozilla/5.0',          # not needed: this server does not check it
    'X-Requested-With': 'XMLHttpRequest',  # marks the request as AJAX; this server checks it
}

url = 'https://scrapingclub.com/exercise/ajaxdetail_header/'

response = requests.get(url, headers=headers)
data = response.json()  # the body is JSON, so it is parsed into a dict

print(data)
print('type:', type(data))
print('keys:', data.keys())
print('--- manually ---')
print('price:', data['price'])
print('title:', data['title'])
print('--- for-loop ---')
for key, value in data.items():
    print('{}: {}'.format(key, value))

Result:

{'img_path': '/static/img/00959-A.jpg', 'price': '$24.99', 'description': 'Blouse in airy, crinkled fabric with a printed pattern. Small stand-up collar, concealed buttons at front, and flounces at front. Long sleeves with buttons at cuffs. Rounded hem. 100% polyester. Machine wash cold.', 'title': 'Crinkled Flounced Blouse'}
type: <class 'dict'>
keys: dict_keys(['img_path', 'price', 'description', 'title'])
--- manually ---
price: $24.99
title: Crinkled Flounced Blouse
--- for-loop ---
img_path: /static/img/00959-A.jpg
price: $24.99
description: Blouse in airy, crinkled fabric with a printed pattern. Small stand-up collar, concealed buttons at front, and flounces at front. Long sleeves with buttons at cuffs. Rounded hem. 100% polyester. Machine wash cold.
title: Crinkled Flounced Blouse
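
Because the body is JSON rather than HTML, the three fields the question asks about can be picked out with the standard json module, with no xpath involved. The following is only a minimal sketch: the helper name extract_product is hypothetical, and the body string is the result shown above rewritten as JSON text with the description shortened.

```python
import json

# Response body as in the result above, written as JSON text.
# The description is shortened here for brevity.
body = '''{"img_path": "/static/img/00959-A.jpg", "price": "$24.99",
"description": "Blouse in airy, crinkled fabric with a printed pattern.",
"title": "Crinkled Flounced Blouse"}'''


def extract_product(text):
    """Parse a JSON body and keep only title, price and description."""
    data = json.loads(text)
    # .get() returns None instead of raising if a key is missing
    return {key: data.get(key) for key in ('title', 'price', 'description')}


product = extract_product(body)
print(product['title'])
print(product['price'])
```

With requests, the same helper could presumably be given response.text; response.json() already performs the json.loads step.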

EDIT:

The same using scrapy and its settings.

If the URL expects an AJAX request, then it may need the header

 'X-Requested-With': 'XMLHttpRequest'

Thank you for your help!!

# The same with scrapy: send the header with the request in start_requests()

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):

    name = 'myspider'

    def start_requests(self):
        url = 'https://scrapingclub.com/exercise/ajaxdetail_header/'

        headers = {
            #'User-Agent': 'Mozilla/5.0',
            'X-Requested-With': 'XMLHttpRequest',
        }

        yield scrapy.http.Request(url, headers=headers)

    def parse(self, response):
        print('url:', response.url)

        # response.json() needs Scrapy 2.2+; on older versions
        # import json and use json.loads(response.text) instead
        data = response.json()

        print(data)
        print('type:', type(data))
        print('keys:', data.keys())
        print('--- manually ---')
        print('price:', data['price'])
        print('title:', data['title'])
        print('--- for-loop ---')
        for key, value in data.items():
            print('{}: {}'.format(key, value))

        # yield the dict as an item so a feed export (if enabled below) has data to save
        yield data

# --- run without project and save in `output.csv` ---

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    #'FEED_FORMAT': 'csv',     # csv, json, xml
    #'FEED_URI': 'output.csv', #
})
c.crawl(MySpider)
c.start()

# The same using the DEFAULT_REQUEST_HEADERS setting,
# so the request generated from start_urls carries the header

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = ['https://scrapingclub.com/exercise/ajaxdetail_header/']

    def parse(self, response):
        print('url:', response.url)
        #print('headers:', response.request.headers)

        # response.json() needs Scrapy 2.2+; on older versions
        # import json and use json.loads(response.text) instead
        data = response.json()

        print(data)
        print('type:', type(data))
        print('keys:', data.keys())
        print('--- manually ---')
        print('price:', data['price'])
        print('title:', data['title'])
        print('--- for-loop ---')
        for key, value in data.items():
            print('{}: {}'.format(key, value))

        # yield the dict as an item so a feed export (if enabled below) has data to save
        yield data

# --- run without project and save in `output.csv` ---

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    'DEFAULT_REQUEST_HEADERS': {
        #'User-Agent': 'Mozilla/5.0',
        'X-Requested-With': 'XMLHttpRequest',
    },  # trailing comma so the lines below can be uncommented safely
    # save in file CSV, JSON or XML
    #'FEED_FORMAT': 'csv',     # csv, json, xml
    #'FEED_URI': 'output.csv', #
})
c.crawl(MySpider)
c.start()
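
The commented-out FEED_FORMAT / FEED_URI lines hint at saving the data as CSV. As a rough illustration of what such a file might contain for title, price and description, here is a sketch using only the standard csv module. It is not Scrapy's exporter; the helper name items_to_csv and the shortened description are assumptions made for this example.

```python
import csv
import io


def items_to_csv(items, fieldnames=('title', 'price', 'description')):
    """Return CSV text with one row per item, keeping only the given fields."""
    buffer = io.StringIO()
    # extrasaction='ignore' drops keys such as img_path that are not in fieldnames
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


# Item shaped like the JSON returned by the server (description shortened)
item = {
    'img_path': '/static/img/00959-A.jpg',
    'price': '$24.99',
    'description': 'Machine wash cold.',
    'title': 'Crinkled Flounced Blouse',
}

print(items_to_csv([item]))
```

Fields that contain commas, such as the full description, are quoted automatically by the csv module.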