
HTML XPath: extracting links from href tags


I am working with the following web page and would like to collect data on each hotel page:

The links to the hotel pages are located in the href attribute:

<h3 class="sr-hotel__title-wrap">
  <a class="hotel_name_link url" href=" /hotel/ch/hirschen-za1-4rich.de.html?label=gen173nr-1DCAQoggJCC2NvdW50cnlfMjA0SAdYBGgsiAEBmAEHuAEHyAEN2AED6AEB-AECiAIBqAIDuAKy29byBcACAQ&dest_id=204&dest_type=country&group_adults=2&group_children=0&hapos=1&hpos=1&no_rooms=1&sr_order=popularity&srepoch=1582673331&srpvid=b5d3a51914210067&ucfs=1&from=searchresults ;highlight_room=#hotelTmpl" target="_blank" rel="noopener">
    <span class="sr-hotel__name " data-et-click=" "> Hotel Hirschen </span>
    <span class="invisible_spoken"> Wird in neuem Fenster geöffnet </span>
  </a>
</h3>
Or should I use the level above (div) in my XPath?

Thanks in advance for your advice.
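
For reference, both variants (matching the a element directly by its class, or starting from the level above and stepping down to its a child) can be tried offline against the snippet above with parsel, the selector library that Scrapy uses under the hood. A minimal sketch, with the long query string in the href shortened to keep the line readable:

from parsel import Selector

# the <h3> block from the question, with the href query string shortened
html = '''
<h3 class="sr-hotel__title-wrap">
  <a class="hotel_name_link url" href=" /hotel/ch/hirschen-za1-4rich.de.html?dest_id=204#hotelTmpl" target="_blank" rel="noopener">
    <span class="sr-hotel__name "> Hotel Hirschen </span>
  </a>
</h3>
'''
sel = Selector(text=html)

# match the <a> element directly by its class
print(sel.xpath('//a[@class="hotel_name_link url"]/@href').get())

# or start from the level above (the <h3>) and take its <a> child
print(sel.xpath('//*[@class="sr-hotel__title-wrap"]/a/@href').get())

# the href value starts with a space, so strip it before using it
print(sel.xpath('//a[@class="hotel_name_link url"]/@href').get().strip())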

The second XPath works for me, but only if I set the correct User-Agent:

Mozilla/5.0 (X11; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0

Without a User-Agent, or with the shorter string Mozilla/5.0, it redirects to https://www.booking.com/searchresults.de.html (without the parameters ?dest_id=204;dest_type=country&) and gets an empty page with no hotels.
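
One quick way to see this outside Scrapy (this uses the requests library, which is not part of the question, so treat it as a side check) is to compare where the request ends up with and without a full browser User-Agent:

import requests

url = 'https://www.booking.com/searchresults.de.html?dest_id=204;dest_type=country&'

# default requests User-Agent (python-requests/x.y) - may be redirected or served an empty page
r_plain = requests.get(url)
print('without UA:', r_plain.url, len(r_plain.text))

# the full browser User-Agent string from above
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}
r_browser = requests.get(url, headers=headers)
print('with UA:', r_browser.url, len(r_browser.text))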

Maybe you should first check what you actually get from that URL, e.g. save the HTML to a file and open it in a browser; you may be getting an empty page or some bot warning as well.
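
A minimal sketch of that check inside a spider's parse method; CheckSpider is a made-up name, and open_in_browser is a Scrapy debugging helper that opens the received response in the default browser:

import scrapy
from scrapy.utils.response import open_in_browser


class CheckSpider(scrapy.Spider):
    # hypothetical spider used only to inspect what the site actually returns
    name = 'check'
    start_urls = ['https://www.booking.com/searchresults.de.html?dest_id=204;dest_type=country&']

    def parse(self, response):
        # save the raw HTML so it can be inspected later
        with open('page.html', 'wb') as f:
            f.write(response.body)
        # or open the response in the default browser right away
        open_in_browser(response)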


Minimal working code

You can put it all in one file and run it as a normal script, without creating a project.

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = ['https://www.booking.com/searchresults.de.html?dest_id=204;dest_type=country&']

    def parse(self, response):
        print('url:', response.url)

        #items = response.xpath('.//*[@class="sr-hotel__title "]/a/@href').extract()
        items = response.xpath('//a[@class="hotel_name_link url"]/@href').extract()
        for item in items:
            yield {'url': item.strip()}  # to save in CSV


# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0',
    # save in file CSV, JSON or XML
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'output.csv', #
})
c.crawl(MySpider)
c.start()
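
Run the script directly with python script.py; the extracted links end up in output.csv. Since the goal is to collect data on each hotel page, the parse method could also follow each extracted link instead of only yielding it. A minimal sketch, assuming the same search results page; HotelSpider and parse_hotel are hypothetical names, and the fields to extract on the hotel page are left open:

import scrapy
from scrapy.crawler import CrawlerProcess


class HotelSpider(scrapy.Spider):

    name = 'hotels'

    start_urls = ['https://www.booking.com/searchresults.de.html?dest_id=204;dest_type=country&']

    def parse(self, response):
        # follow every hotel link found on the search results page
        for href in response.xpath('//a[@class="hotel_name_link url"]/@href').extract():
            yield response.follow(href.strip(), callback=self.parse_hotel)

    def parse_hotel(self, response):
        # placeholder: pick the actual fields you need from the hotel page
        yield {'url': response.url}


c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0',
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'hotels.csv', # hypothetical output file for this sketch
})
c.crawl(HotelSpider)
c.start()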

@zx485 OK..., thanks! But that is the link in the HTML code. How can I extract it, or is there another way to get to the hotel pages? Sorry, I edited the link again. Now it is correct.

Your second expression works for me on the provided HTML. The first expression has a typo (it should be //*[@class="sr-hotel__title-wrap"]/a/@href), but once that is fixed it works too. The first expression could also be fixed as //h3[@class[normalize-space(.)='sr-hotel__title']]//@href.

It does not work for me... Is xpath('//h3[@class[normalize-space(.)="sr-hotel__title"]]//@href') correct?

Thank you very much!!! Setting the correct User-Agent worked for me! Now I get the correct URLs!
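
The normalize-space() trick matters because the class value on the page apparently carries a trailing space (the commented-out XPath in the code above quotes it as "sr-hotel__title "). A minimal sketch with parsel and a made-up one-line snippet shows the difference:

from parsel import Selector

# made-up snippet: class value with a trailing space, as the commented-out XPath suggests
html = '<h3 class="sr-hotel__title "><a href="/hotel/example.html">Hotel</a></h3>'
sel = Selector(text=html)

# exact comparison fails because of the trailing space in the attribute value
print(sel.xpath('//h3[@class="sr-hotel__title"]/a/@href').getall())   # []

# normalize-space() trims the attribute value before comparing, so this matches
print(sel.xpath('//h3[@class[normalize-space(.)="sr-hotel__title"]]//@href').getall())
# ['/hotel/example.html']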