Python: How do I exclude a particular HTML tag (without any id) from several tags while using Scrapy?


python html web-scraping scrapy


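I would use the text of the `div` elements that start with "Timings", selected with `.//div[starts-with(., "Timings")]/text()`.

Note that the page's HTML structure makes it hard to tell the locations apart: there is no location-specific container to iterate over. In that case, I would find every `h2` or `strong` tag with `*[self::h2 or self::strong]` and read its `following-sibling` `div` elements.

If you want to extract the time range values, call `.re()` on the selected text with a pattern such as `(\d+:\d+)\s*\-\s*(\d+:\d+)`.

Also, make sure the `yield` is inside the loop body (see the code you posted).

If you want to exclude the timings and get the rest of the location description, restrict the following siblings with `position() < 4 and not(starts-with(., "Timings"))`.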


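Is it possible, using re() or any other way, to extract the text from only the first three tags and skip the fourth one, which contains "Timings..."? I only want the address, not the timings.
@Aditya: Sure, I added an example, take a look.
You're welcome.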

import scrapy
from job.items import StarbucksItem

class StarbucksSpider(scrapy.Spider):
    name = "starbucks"
    allowed_domains = ["starbucks.in"]
    start_urls = ["http://www.starbucks.in/coffeehouse/store-locations/"]

    def parse(self, response):
        for sel in response.xpath('//div[@class="region size2of3"]'):
            item = StarbucksItem()
            item['title'] = sel.xpath('div/text()').extract()
        yield item
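
The spider above yields only once, after the loop has finished, which is why the answer asks for `yield` inside the loop body. The effect can be reproduced without Scrapy. The following is a minimal sketch with plain Python generators; the function names and the sample rows are hypothetical and stand in for the spider's `parse` method and the selected `div` elements.

```python
def parse_buggy(rows):
    # Same shape as the spider above: the yield is dedented,
    # so it runs once, after the loop has finished.
    for row in rows:
        item = {"title": row}
    yield item


def parse_fixed(rows):
    # The yield is inside the loop body, so every row produces an item.
    for row in rows:
        item = {"title": row}
        yield item


rows = ["Fort", "Colaba", "Goregaon"]  # hypothetical stand-ins for selectors
buggy_items = list(parse_buggy(rows))
fixed_items = list(parse_fixed(rows))
print(buggy_items)  # only the last row survives
print(fixed_items)  # one item per row
```

In the spider itself, the equivalent change is to indent `yield item` one more level so that it lines up with `item['title'] = ...`.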

sel.xpath('.//div[starts-with(., "Timings")]/text()').extract()

In [10]: for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
            name = sel.xpath('text()').extract()[0]
            timings = sel.xpath('./following-sibling::div[starts-with(., "Timings")]/text()').extract()[0]
            print name, timings
   ....:     
Mumbai Timings: 08:00-00:30 hrs (Mon-Sun)
Fort Timings: 08:00-00:30 hrs (Mon-Sun)
Colaba Timings: 07:00-01:00 hrs (Mon-Sun)
Goregaon Timings: 10:00-23:30 hrs (Mon-Sun)
Powai Timings: 07:00-00:00 hrs (Mon-Sun)
...
Hi-Tech City Timings: 09:00 - 22:30 hrs (Mon - Sun)
Madhapur Timings: 11:00 -23:00 hrs (Mon - Sun)
Banjara Hills Timings: 10:00 -22:30 hrs (Mon - Sun)

In [18]: for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
        name = sel.xpath('text()').extract()[0]
        timings = sel.xpath('./following-sibling::div[starts-with(., "Timings")]/text()')[0].re(r'(\d+:\d+)\s*\-\s*(\d+:\d+)')[:2]
        print name, timings
Mumbai [u'08:00', u'00:30']
Fort [u'08:00', u'00:30']
Colaba [u'07:00', u'01:00']
Goregaon [u'10:00', u'23:30']
...
Hi-Tech City [u'09:00', u'22:30']
Madhapur [u'11:00', u'23:00']
Banjara Hills [u'10:00', u'22:30']
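
The time range pattern can also be checked outside a Scrapy shell. The sketch below applies the same regular expression with the standard `re` module to a few of the "Timings" strings printed earlier; the helper name `parse_time_range` is an assumption made for illustration. Unlike indexing with `[0]`, it returns `None` when no range is present, which may be useful if some locations have no "Timings" line.

```python
import re

# Matches "HH:MM-HH:MM" with optional whitespace around the dash.
TIME_RANGE = re.compile(r"(\d+:\d+)\s*-\s*(\d+:\d+)")


def parse_time_range(text):
    """Return (opening, closing) as strings, or None if no range is found."""
    match = TIME_RANGE.search(text)
    return match.groups() if match else None


samples = [
    "Timings: 08:00-00:30 hrs (Mon-Sun)",
    "Timings: 09:00 - 22:30 hrs (Mon - Sun)",
    "Timings: 11:00 -23:00 hrs (Mon - Sun)",
]
results = [parse_time_range(s) for s in samples]
missing = parse_time_range("No opening hours listed")
print(results)
print(missing)
```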

for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
    # Join the first three following div siblings, skipping any that start with "Timings".
    print(" ".join(item.strip() for item in sel.xpath('following-sibling::div[position() < 4 and not(starts-with(., "Timings"))]/text()').extract()))
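
To see what the combined predicate selects, the same logic can be written out in plain Python. This is a sketch using the standard library's `xml.etree.ElementTree` instead of Scrapy selectors, because ElementTree does not support the `following-sibling` axis; the sample markup and addresses are hypothetical and only imitate the flat sibling structure described in the answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: headings and their detail divs are flat siblings
# inside one "region" container, with no per-location wrapper.
SAMPLE = """
<div class="region size2of3">
  <h2>Fort</h2>
  <div>Ground Floor, Example House</div>
  <div>Example Road, Fort</div>
  <div>Mumbai 400001</div>
  <div>Timings: 08:00-00:30 hrs (Mon-Sun)</div>
  <strong>Colaba</strong>
  <div>Shop 1, Sample Building</div>
  <div>Colaba Causeway</div>
  <div>Timings: 07:00-01:00 hrs (Mon-Sun)</div>
</div>
"""


def addresses(region):
    """Map each h2/strong heading to its address text, without timings."""
    children = list(region)
    result = {}
    for index, child in enumerate(children):
        if child.tag not in ("h2", "strong"):
            continue
        # Same idea as following-sibling::div[position() < 4]:
        # the first three div siblings after the heading.
        following_divs = [c for c in children[index + 1:] if c.tag == "div"][:3]
        # Same idea as not(starts-with(., "Timings")).
        parts = [
            (d.text or "").strip()
            for d in following_divs
            if not "".join(d.itertext()).startswith("Timings")
        ]
        result[child.text] = " ".join(parts)
    return result


found = addresses(ET.fromstring(SAMPLE))
print(found)
```

The second location shows why both conditions matter: it has only two address lines, so its "Timings" line falls within the first three siblings and is dropped by the `starts-with` check rather than by the position limit.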