Web scraping: how can Amazon data be scraped based on location?


web-scraping scrapy


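Whenever I try to scrape amazon.com, I fail, because the product information changes depending on the location set on amazon.com.

The information that changes is:

1-Price
2-Shipping fee
3-Customs fee
4-Shipping status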

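Changing the location with Selenium is simple, but it is very slow. That is why I need to scrape with scrapy or requests.

However, even though I imitate the browser's cookies and headers, amazon.com does not let me change the location.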

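There are two big problems.

There is a value named "ubid-main" that I cannot obtain. Without this value, Amazon does not allow the location to be changed.

Although I send the same header data, the responses differ. For example, I use exactly the same headers as in the browser, but in the browser the content type is JSON, while in the code I wrote it comes back as text/html; charset=UTF-8.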

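Strangely, there is no information on this topic. It seems you cannot do location-based scraping on the world's number one shopping site.

If anyone knows the answer, please tell me. A solution using scrapy or requests would be enough. Seriously, I have not been able to solve this problem for a year.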

import requests
import urllib3
from urllib3.exceptions import InsecureRequestWarning

# Silence the warning that verify=False triggers in the requests below.
urllib3.disable_warnings(InsecureRequestWarning)

    

def location():
    """Try to set the delivery location and return the session that was used."""
    # Headers copied from the browser's address-change XHR request.
    headersdelivery = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
        'content-type': 'application/x-www-form-urlencoded;charset=utf-8',
        'accept': 'text/html,*/*',
        'x-requested-with': 'XMLHttpRequest',
        'origin': 'https://www.amazon.com',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.amazon.com/',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7',
    }

    # Form fields sent by the location widget; zipCode is the target location.
    payload = {
        'locationType': 'LOCATION_INPUT',
        'zipCode': '34249',
        'storeContext': 'generic',
        'deviceType': 'web',
        'pageType': 'Gateway',
        'actionSource': 'glow',
        'almBrandId': 'undefined',
    }

    # A session keeps the cookies set by this request for later requests.
    sessionid = requests.session()
    url = "https://www.amazon.com/gp/delivery/ajax/address-change.html"
    sessionid.post(url, headers=headersdelivery, data=payload, verify=False)

    return sessionid


def response(locationsession):
    """Ask Amazon which location label the session currently has and print it."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'TE': 'Trailers',
    }

    postdata = {
        'storeContext': 'generic',
        'pageType': 'Gateway',
    }
    req = locationsession.post(
        "https://www.amazon.com/gp/glow/get-location-label.html",
        headers=headers, data=postdata, verify=False)
    print(req.content)


# Reuse the same session so the cookies from location() are sent again.
locationsession = location()
response(locationsession)

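I see a CSRF token in the request headers (anti-csrftoken-a2z) that you are not sending in the location request, and location() is also missing an additional request. You should implement all of the requests exactly as the browser makes them.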
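As a rough illustration of the point above, here is a minimal sketch of reading the token and sending it with the location request. It rests on assumptions that need checking in DevTools: that the token is embedded in the home page source as an `anti-csrftoken-a2z` key/value pair, and that loading the home page is the additional request that is missing. The helper names (`extract_csrf_token`, `with_csrf`, `change_location`) are made up for this example; the address-change URL is the one from the question.

```python
import html
import re
from typing import Optional

# Assumption: the token shows up in the page source as a pair such as
# "anti-csrftoken-a2z":"<value>" (possibly HTML-escaped inside an attribute).
# Verify the real location of the token in DevTools before relying on this.
CSRF_PATTERN = re.compile(
    r'anti-csrftoken-a2z["\']?\s*[:=]\s*["\']([^"\']+)["\']'
)


def extract_csrf_token(page_source: str) -> Optional[str]:
    """Return the first anti-csrftoken-a2z value in the page, or None."""
    # Unescape first so &quot;-encoded attribute values match as well.
    match = CSRF_PATTERN.search(html.unescape(page_source))
    return match.group(1) if match else None


def with_csrf(headers: dict, token: str) -> dict:
    """Return a copy of the headers with the CSRF token header added."""
    merged = dict(headers)
    merged["anti-csrftoken-a2z"] = token
    return merged


def change_location(session, base_headers: dict, payload: dict):
    """Load the home page, read the token, then post the location change.

    `session` is expected to be a requests.Session, or any object with
    compatible get() and post() methods.
    """
    home = session.get("https://www.amazon.com/", headers=base_headers)
    token = extract_csrf_token(home.text)
    if token is None:
        raise RuntimeError("CSRF token not found; inspect the page in DevTools")
    return session.post(
        "https://www.amazon.com/gp/delivery/ajax/address-change.html",
        headers=with_csrf(base_headers, token),
        data=payload,
    )
```

With the code from the question, this would be called as `change_location(requests.session(), headersdelivery, payload)`. If the browser performs further intermediate requests, they would need to be added in the same order.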

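A simple way to do this in Chrome:

Chrome -> DevTools -> Network -> XHR

Use "Copy as cURL" on the request, then paste the command into a cURL-to-Python converter to turn it into requests library code.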
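To show what such a conversion amounts to, here is a small sketch that turns a simple "Copy as cURL (bash)" command into keyword arguments for requests. It is not a full converter: the function name `curl_to_request_kwargs` is invented for this example, and it only understands the URL, `-X`, `-H`, `-b` and `-d`/`--data`/`--data-raw`. Other flags that take a value, and `$'...'` quoting, are not handled.

```python
import shlex


def curl_to_request_kwargs(command: str) -> dict:
    """Parse a simple cURL command into requests-style keyword arguments."""
    # Join lines that end with a backslash continuation, then split like a shell.
    tokens = shlex.split(command.replace("\\\n", " "))
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")

    kwargs = {"method": None, "url": None, "headers": {}, "data": None}
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-H", "--header"):
            # Header values may contain ':' themselves, so split only once.
            name, _, value = tokens[i + 1].partition(":")
            kwargs["headers"][name.strip()] = value.strip()
            i += 2
        elif tok in ("-b", "--cookie"):
            kwargs["headers"]["cookie"] = tokens[i + 1]
            i += 2
        elif tok in ("-d", "--data", "--data-raw"):
            kwargs["data"] = tokens[i + 1]
            i += 2
        elif tok in ("-X", "--request"):
            kwargs["method"] = tokens[i + 1]
            i += 2
        elif tok.startswith("-"):
            i += 1  # flag not handled by this sketch, e.g. --compressed
        else:
            kwargs["url"] = tok
            i += 1

    # curl switches to POST when a body is given and no method is set.
    if kwargs["method"] is None:
        kwargs["method"] = "POST" if kwargs["data"] is not None else "GET"
    return kwargs
```

The resulting dictionary can be passed on as `session.request(**kwargs)` with a `requests.Session`, which makes it easy to compare the headers the browser really sent with the ones in your own code.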