Python: scraping table rows by column name


python web-scraping scrapy


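I want to extract this table, but the problem is that each column sits at a different position from one table to the next. Is it possible to scrape by column name, taking all the rows of that column?

Here is an example:

As you can see, the columns are at different positions in each table.

Here is my code: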

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LiSpider(CrawlSpider):
    name = 'li'
    allowed_domains = ['en.wikipedia.org']
    start_urls = [
        'https://en.wikipedia.org/wiki/List_of_defunct_airlines_of_the_Americas',
        'https://en.wikipedia.org/wiki/List_of_defunct_airlines_of_Asia',
        'https://en.wikipedia.org/wiki/List_of_defunct_airlines_of_Europe',
        'https://en.wikipedia.org/wiki/List_of_defunct_airlines_of_Oceania',
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[text() = "Main article: "]/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for data in response.xpath('//table[@class="wikitable sortable"]/tbody/tr'):
            yield {
                'Airline': data.xpath('./td[1]/a/text()').get(),
                'IATA': data.xpath('./td[2]/text()').get(),
                'ICAO': data.xpath('./td[3]/text()').get(),
                # Attempt to locate the "Image" column by header name: its position is
                # the number of <th> cells before it in the same table, plus one.
                'Image': data.xpath('./td[position() = count(ancestor::table[1]//th[contains(., "Image")]/preceding-sibling::th) + 1]/a/@href').get(),
                'Callsign': data.xpath('./td[5]/text()').get(),
                'Commenced Operations': data.xpath('./td[6]/text()').get(),
                'Ceased Operations': data.xpath('./td[7]/text()').get(),
                'Notes': data.xpath('./td[8]/text()').get(),
            }

You can use pandas. Try this:

import pandas as pd 
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_defunct_airlines_of_Africa")

You will get a list of DataFrames, one per table found on the page.
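Since each DataFrame carries its own column labels, the columns can be picked by name rather than by position. A minimal sketch, assuming the tables share column names; the `COLUMNS` list, the `normalize` helper, and the sample rows are made up for illustration and stand in for the frames returned by `pd.read_html`:

```python
import pandas as pd

# Hypothetical target order; the names follow the keys used in the question.
COLUMNS = ["Airline", "IATA", "ICAO", "Callsign"]


def normalize(table: pd.DataFrame) -> pd.DataFrame:
    """Return the table with COLUMNS in a fixed order, selected by name."""
    # reindex selects columns by label, so their original position does not
    # matter; extra columns are dropped and missing ones are filled with NaN.
    return table.reindex(columns=COLUMNS)


# Two stand-in tables with the same columns at different positions (made-up rows).
a = pd.DataFrame({"Airline": ["Alpha Air"], "IATA": ["AA"],
                  "ICAO": ["ALP"], "Callsign": ["ALPHA"]})
b = pd.DataFrame({"Callsign": ["BRAVO"], "ICAO": ["BRV"],
                  "Airline": ["Bravo Air"], "IATA": ["BB"], "Notes": ["n/a"]})

combined = pd.concat([normalize(t) for t in (a, b)], ignore_index=True)
print(combined)
```

With real pages, the same `normalize` call could be applied to each frame in the list before concatenating, provided the header text is spelled the same way in every table.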

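But I want all the columns in the same order when I get the tables in pandas.

With a DataFrame you can do whatever you want with it.

I can only get the tables as pandas frames. The code above fetches all the data from that page, but I only want the table.

Try it this way: df = pd.read_html(link, match="Airline")[0] to get the first table that contains text matching Airline.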