Extracting data from multiple tables with Scrapy using XPath


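I'm extracting metadata and URLs from 12 tables on a web page. I have it working, but I'm new to both XPath and Scrapy, so is there a more concise way to do this?

While trying out various XPaths I initially got a large number of duplicates, and then realised that every row of every table was being repeated. My solution was to enumerate the tables and loop over each one, fetching only that table's rows. It feels like there should be a simpler way, but I'm not sure what it is.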

import scrapy

class LinkCheckerSpider(scrapy.Spider):
    name = 'foodstandardsagency'
    allowed_domains = ['ratings.food.gov.uk']
    start_urls = ['https://ratings.food.gov.uk/open-data/en-gb/']

    def parse(self, response):

        print(response.url)
        tables = response.xpath('//*[@id="openDataStatic"]//table')

        num_tables = len(tables)

        # XPath positions are 1-based, so count from 1 to num_tables inclusive.
        for tabno in range(1, num_tables + 1):

            search_path = '//*[@id="openDataStatic"]/table[%d]/tr' % tabno

            rows = response.xpath(search_path)


            for row in rows:
                local_authority = row.xpath('td[1]//text()').extract()
                last_update = row.xpath('td[2]//text()').extract()
                num_businesses = row.xpath('td[3]//text()').extract()
                xml_file_descr = row.xpath('td[4]//text()').extract()
                xml_file = row.xpath('td[4]/a/@href').extract()

                yield {
                    'local_authority': local_authority[1],
                    'last_update': last_update[1],
                    'num_businesses': num_businesses[1],
                    'xml_file': xml_file[0],
                    'xml_file_descr': xml_file_descr[1],
                }


I'm running it with:

scrapy runspider fsa_xpath.py

You can iterate over the table selectors returned by your first XPath:

tables = response.xpath('//*[@id="openDataStatic"]//table')
for table in tables:
    for row in table.xpath('./tr'):
        local_authority = row.xpath('td[1]//text()').extract()

This is the same thing you already do with the rows.


Great, thanks. I figured I must have been missing something subtle. I had tried this before, but I left out the . before /tr. For some reason that returned all the rows of all the tables, but with the . it seems to work fine.