Python XPath cannot access certain divs
Introduction
I have to add the "Others also bought" items of certain product links to my crawler. This is really strange to me, because there are divs like "open-on-mobile" and "inner generated", which does tell me something.
Goal
I already get all the important information I need, except the "Others also bought" part. After trying for hours, I decided to ask here before I waste even more time and get even more frustrated.
HTML construction
My code
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DuifcsvItem
import csv


class DuifSpider(scrapy.Spider):
    name = "duif"
    allowed_domains = ['duif.nl']
    # the Scrapy setting is FEED_EXPORT_FIELDS (not FIELD_EXPORT_FIELDS)
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Title_small', 'NL_PL_PC', 'Description']}

    with open("duifonlylinks.csv", "r") as f:
        reader = csv.DictReader(f)
        start_urls = [items['Link'] for items in reader]

    # note: Rule/LinkExtractor only take effect on a CrawlSpider subclass
    rules = (
        Rule(LinkExtractor(), callback='parse'),
    )

    def parse(self, response):
        card = response.xpath('//div[@class="heading"]')
        if not card:
            print('No productlink', response.url)

        items = DuifcsvItem()
        items['Link'] = response.url
        items['SKU'] = response.xpath('//p[@class="desc"]/text()').get().strip()
        items['Title'] = response.xpath('//h1[@class="product-title"]/text()').get()
        items['Title_small'] = response.xpath('//div[@class="left"]/p/text()').get()
        items['NL_PL_PC'] = response.xpath('//div[@class="desc"]/ul/li/em/text()').getall()
        items['Description'] = response.xpath('//div[@class="item"]/p/text()').getall()
        yield items
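The spider above builds start_urls from a "Link" column in a CSV file. As a quick offline check of that pattern (the two URLs below are invented stand-ins for the real duifonlylinks.csv, which isn't shown here):

```python
import csv
import io

# Invented stand-in for duifonlylinks.csv; the 'Link' column name
# is taken from the spider above.
csv_text = "Link\nhttps://www.duif.nl/product/a\nhttps://www.duif.nl/product/b\n"
reader = csv.DictReader(io.StringIO(csv_text))
start_urls = [row["Link"] for row in reader]
print(start_urls)
```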
Actual webpage:
It would be perfect if I could access this href with XPath.
XPaths I have already tried:
>>> response.xpath('//div[@class="title"]/h3/text()').get()
>>> response.xpath('//div[@class="inner generated"]/div//h3/text()').get()
>>> response.xpath('//div[@class="wrap-products"]/div/div/a/@href').get()
>>> response.xpath('/div[@class="description"]/div/h3/text()').get()
>>> response.xpath('//div[@class="open-on-mobile"]/div/div/div/a/@href').get()
>>> response.xpath('//div[@class="product cross-square white"]/a/@href').get()
>>> response.xpath('//a[@class="product-link"]').get()
>>> response.xpath('//a[@class="product-link"]').getall()
You can find the "Others also bought" product IDs in this part of the HTML (see the createCrossSellItems
section):
Did you check the actual source (not the DOM)? Your HTML elements are apparently loaded via AJAX. Since Scrapy only fetches the basic HTML code, your
"other products"
cannot appear in the Scrapy response. You have two ways to reach your goal: scrapy-splash, or a tool that drives a browser, e.g. Selenium or Puppeteer.
Pipped me to the post! I see there is an API endpoint and the IDs are in a script.
Hey @gangabass, could you elaborate on that for me? I'm a beginner, I've never worked with JSON, and I don't know where to start.
@gangabass wow, thank you for the edit! I'll learn JSON today and try to understand your code :) I'll edit my question and post my actual code. Is it possible to mix your solution with mine?
Sorry, I don't think so. But I can add my other items, like delivery status, to def parse,
right? Thank you very much, sir! :)
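A quick way to confirm the advice above is to search Scrapy's raw response body (response.text) for the markers the browser DOM shows. The raw_html below is an invented stand-in for the server-rendered page:

```python
# If the div the browser renders is absent from the raw HTML while the
# product IDs sit in an inline script, the content is injected client-side.
raw_html = (
    '<html><body>'
    '<div class="heading">...</div>'
    '<script>createCrossSellItems("885034347 | 480010600 |")</script>'
    '</body></html>'
)
has_rendered_div = 'wrap-products' in raw_html        # the div the browser shows
has_script_ids = 'createCrossSellItems' in raw_html   # the IDs in the inline script
print(has_rendered_div, has_script_ids)
```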
<script>
$(function () {
createUpsellItems("885034747 | 885034800 | 885034900 |")
createCrossSellItems("885034347 | 480010600 | 480010700 | 010046700 | 500061967 | 480011000 |")
})
</script>
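The IDs inside that inline script can be pulled out with plain regexes; this minimal offline sketch mirrors the approach the answer's spider uses:

```python
import re

# The inline script from the page, verbatim.
script_text = '''
$(function () {
    createUpsellItems("885034747 | 885034800 | 885034900 |")
    createCrossSellItems("885034347 | 480010600 | 480010700 | 010046700 | 500061967 | 480011000 |")
})
'''

# Capture only the createCrossSellItems(...) argument, then pull out the digit runs.
raw = re.search(r'createCrossSellItems\("([^"]+)', script_text).group(1)
ids = re.findall(r"\d+", raw)
print(ids)
```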
import scrapy
import json
import re


class DuifSpider(scrapy.Spider):
    name = "duif"
    start_urls = ['https://www.duif.nl/product/pot-seal-matt-finish-light-pink-large']

    def parse(self, response):
        item = {}
        item['title'] = response.xpath('//h1[@class="product-title"]/text()').get()
        item['url'] = response.url
        item['cross_sell'] = []
        cross_sell_items_raw = response.xpath('//script[contains(., "createCrossSellItems(")]/text()').re_first(r'createCrossSellItems\("([^"]+)')
        cross_sell_items = re.findall(r"\d+", cross_sell_items_raw)
        if cross_sell_items:
            cross_sell_item_id = cross_sell_items.pop(0)
            yield scrapy.Request(
                f"https://www.duif.nl/api/v2/catalog/product?itemcode={cross_sell_item_id}_Parent",
                headers={
                    'referer': response.url,
                    'Content-type': 'application/json',
                    'Authorization': 'bearer null',
                    'Accept': '*/*',
                },
                callback=self.parse_cross_sell,
                meta={
                    'item': item,
                    'referer': response.url,
                    'cross_sell_items': cross_sell_items,
                },
            )
        else:
            # There are no "Others also bought" items for this page, just save the main item
            yield item

    def parse_cross_sell(self, response):
        main_item = response.meta["item"]
        cross_sell_items = response.meta["cross_sell_items"]
        data = json.loads(response.text)
        current_cross_sell_item = {}
        current_cross_sell_item['title'] = data["_embedded"]["products"][0]["name"]
        current_cross_sell_item['url'] = data["_embedded"]["products"][0]["url"]
        current_cross_sell_item['description'] = data["_embedded"]["products"][0]["description"]
        main_item['cross_sell'].append(current_cross_sell_item)
        if cross_sell_items:
            cross_sell_item_id = cross_sell_items.pop(0)
            yield scrapy.Request(
                f"https://www.duif.nl/api/v2/catalog/product?itemcode={cross_sell_item_id}_Parent",
                headers={
                    'referer': response.meta['referer'],
                    'Content-type': 'application/json',
                    'Authorization': 'bearer null',
                    'Accept': '*/*',
                },
                callback=self.parse_cross_sell,
                meta={
                    'item': main_item,
                    'referer': response.meta['referer'],
                    'cross_sell_items': cross_sell_items,
                },
            )
        else:
            # no more cross sell items to process, save output
            yield main_item
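parse_cross_sell assumes an API payload shaped roughly like the sample below: the key path _embedded -> products -> [0] is taken from the code above, while the field values are invented for illustration:

```python
import json

# Trimmed, invented payload in the shape of the duif.nl API response.
sample = '''
{"_embedded": {"products": [
    {"name": "Pot Seal", "url": "/product/pot-seal", "description": "Matt finish"}
]}}
'''
data = json.loads(sample)
product = data["_embedded"]["products"][0]
cross_sell_entry = {
    "title": product["name"],
    "url": product["url"],
    "description": product["description"],
}
print(cross_sell_entry)
```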