Python: providing a fallback path for a Scrapy XPath


python xpath scrapy


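I have just started learning Scrapy, and as a learning exercise I want to pull down the home team, the away team and the score for each match.

Everything works, except that the XPath I use to get the teams depends on the "a" tag: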

match.xpath('.//*[@class="team-home teams"]/a/text()').extract_first()

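Some teams have no link, so the query occasionally returns None.

The XPath below drops the /a/ and picks up the unlinked teams, but it also selects a lot of newline strings: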

match.xpath('.//*[@class="team-home teams"]/text()').extract_first()

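How can I modify the code to fall back to an alternative XPath when the first one returns nothing? Or is there a smarter XPath that returns the correct result whether or not the "a" tag is present?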

import scrapy


class FootballresultsSpider(scrapy.Spider):
    name = "footballResults"
    start_urls = ['http://www.bbc.com/sport/football/results/']

    def parse(self, response):

        for match in response.xpath('//td[@class="match-details"]'):
            yield {
                'home_team': match.xpath('.//*[@class="team-home teams"]/a/text()').extract_first(),
                'score': match.xpath('.//span[@class="score"]/abbr/text()').extract_first(),
                'away_team': match.xpath('.//*[@class="team-away teams"]/a/text()').extract_first(),
            }

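*Edit*

Below is the code that tries to use "|" between the two XPaths, but it still returns None for any entry without an anchor tag. For brevity, I show only one field, home.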

import scrapy

class ResultsSpider(scrapy.Spider):
    name = "results"
    #allowed_domains = ["www.bbc.com"]
    start_urls = ['http://www.bbc.com/sport/football/results/']

    def parse(self, response):

        match_details = response.xpath('//td[@class="match-details"]')

        for match in match_details:

            a_xpath = './/span[@class="team-home teams"]/a/text()'
            text_xpath = './/span[@class="team-home teams"]/a/text()'


            home = match.xpath(a_xpath + ' | ' + text_xpath).extract_first()

            yield {
                'Home': home
            }

Below is the code that works, although it is a bit verbose and I am sure there is a more concise way to do this:

import scrapy


class ResultsSpider(scrapy.Spider):
    name = "results"
    #allowed_domains = ["www.bbc.com"]
    start_urls = ['http://www.bbc.com/sport/football/results/']

    def parse(self, response):

        match_details = response.xpath('//td[@class="match-details"]')

        for match in match_details:

            if match.xpath('.//span[@class="team-home teams"]/a/text()').extract_first() is None:
                home = match.xpath('.//span[@class="team-home teams"]/text()').extract_first().strip()
            else:
                home = match.xpath('.//span[@class="team-home teams"]/a/text()').extract_first()

            yield {
                'Home': home,
            }

Answering from my phone, so I have not tested this.

Option 1: a regular expression

Option 2: use an item loader

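from scrapy.loader import ItemLoader

# TeamItem is assumed to be an Item subclass with a 'name' field
l = ItemLoader(item=TeamItem(), response=response)

l.add_xpath('name', '//your_first_xpath')
l.add_xpath('name', '//your_second_xpath')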

Then, in the item class, you can drop the names that do not qualify.
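As a rough illustration of that last step: the loader collects the values from both add_xpath calls into one list, and a processor can then pick the one to keep. The function below is a hypothetical, plain-Python sketch of such a processor (the name first_non_blank is not part of Scrapy). I believe it could be attached with something like scrapy.Field(output_processor=first_non_blank), but that wiring is an assumption and is untested here.

```python
def first_non_blank(values):
    """Strip each collected value and return the first one that is not empty.

    `values` is expected to be an iterable of strings, e.g. the list an item
    loader has gathered for one field. Returns None if nothing usable is found.
    """
    for value in values:
        stripped = value.strip()
        if stripped:  # ignore whitespace-only strings such as "\n"
            return stripped
    return None
```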

You can use the | operator in XPath:

first_xpath = './/*[@class="team-home teams"]/a/text()'
second_xpath = ... # The alternative xpath
match.xpath(first_xpath + ' | ' + second_xpath).extract_first()
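One caveat worth spelling out: as far as I know, "|" is a union, not a "try the first, then the second" operator, and the matched nodes typically come back in document order. A whitespace-only text node sitting before the link can therefore still be the first result. A minimal sketch of one way around that is to filter blank text nodes with a normalize-space() predicate on both branches. The helper name below is hypothetical; it only builds the expression string.

```python
def linked_or_plain_text_xpath(container):
    """Build a union XPath for the non-blank text of `container`.

    Matches text inside a child <a> as well as text placed directly in the
    container, and skips text nodes that contain only whitespace.
    """
    template = '{0}/a/text()[normalize-space()] | {0}/text()[normalize-space()]'
    return template.format(container)


# Intended use inside parse() (not executed here):
#   query = linked_or_plain_text_xpath('.//span[@class="team-home teams"]')
#   home = match.xpath(query).extract_first()
```

The predicate only drops blank nodes; it does not trim the ones it keeps, so a strip() on the result may still be needed.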

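Thanks @VMRuiz, but I can't get it to work. Is the idea that if the first XPath resolves to None, the second one is used? I will edit my post with the code I tried, along with the approach I used to get the scraper working, although it is a bit verbose.

@itzafugazi You set the same value for a_xpath and text_xpath, which is why it always returns None for you.

Ah, you are right, a bad copy/paste, thanks. In any case I will stick with the if/else approach, because here I also need to call strip() on the non-linked results.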
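For what it is worth, the if/else fallback plus strip() can be folded into a small helper so it does not have to be repeated for the home and away teams. This is only a sketch: first_text is a hypothetical name, and it assumes the object passed in behaves like a Scrapy selector, i.e. it has an xpath() method whose result has extract().

```python
def first_text(selector, xpaths):
    """Return the first non-blank, stripped string found by any of `xpaths`.

    The XPaths are tried in the order given, so earlier ones take priority.
    Returns None when none of them yields usable text.
    """
    for query in xpaths:
        for value in selector.xpath(query).extract():
            value = value.strip()
            if value:  # skip whitespace-only text nodes such as "\n"
                return value
    return None


# Intended use inside parse() (not executed here):
#   base = './/span[@class="team-home teams"]'
#   home = first_text(match, [base + '/a/text()', base + '/text()'])
```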