Python: Copying a Scrapy HTML table into a DataFrame with duplicate headers

python pandas dataframe csv scrapy

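I've hit a brick wall scraping an HTML table. Basically, I have a piece of code that first collects the column names into a dict, using them as keys, and then appends them, together with the corresponding XPath entries, to a separate object. These are then put into a pandas DataFrame and finally converted to CSV for end use.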

import scrapy
from scrapy.selector import Selector
import re
import pandas as pd

class PostSpider(scrapy.Spider):

    name = "standard_squads"

    start_urls = [
        "https://fbref.com/en/comps/11/stats/Serie-A-Stats",
    ]

    def parse(self, response):

        column_index = 1
        columns = {}
        for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
            column_name = column_node.xpath("./text()").extract_first()
            print("column name is: " + column_name)
            columns[column_name] = column_index
            column_index += 1
            
            matches = []

        for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
            match = {}
            for column_name in columns.keys():

                if column_name=='Squad':
                    match[column_name]=row.xpath('th/a/text()').extract_first()
                else:
                    match[column_name] = row.xpath(
                        "./td[{index}]//text()".format(index=columns[column_name]-1)
                    ).extract_first()

            matches.append(match)
        
        print(matches)

        df = pd.DataFrame(matches,columns=columns.keys())

        yield df.to_csv("test_squads.csv",sep=",", index=False)

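However, I've just realized that the column header names in the XPath response (//*[@id="stats_standard_squads"]/thead/tr[2]/th) actually contain duplicates (for example, xG appears twice in the table on the page, as does xA). Because of this, when I loop over columns.keys(), the duplicates are thrown away, so I end up with only 20 columns in the final CSV instead of 25.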
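The dropped columns follow from how Python dicts behave: assigning to a key that already exists overwrites it rather than adding a second entry. A minimal sketch of that effect, using a made-up subset of the header names rather than the real table:

```python
# Hypothetical subset of the header row; the real table has 25 headers.
headers = ["Squad", "Gls", "xG", "xA", "xG", "xA"]

columns = {}
for position, name in enumerate(headers, start=1):
    # A repeated name overwrites the earlier entry instead of adding a new key.
    columns[name] = position

print(len(headers))   # 6
print(len(columns))   # 4
print(columns["xG"])  # 5, the position of the second xG
```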

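I'm not sure what to do now. I've tried adding the column names to a list, using them as the DataFrame header, and then appending a new row each time, but that seems like a lot of work for something this simple. I was hoping there is a simpler way to do this automated scrape that allows duplicate names in the DataFrame columns?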
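For reference, pandas itself does accept repeated column labels when each row is passed as a list, because values are then matched to headers by position. A small sketch with made-up team names and values (not data from the page):

```python
import pandas as pd

# Made-up header names and values, only to illustrate duplicate labels.
headers = ["Squad", "xG", "xA", "xG", "xA"]
rows = [
    ["Team A", "1.0", "2.0", "3.0", "4.0"],
    ["Team B", "5.0", "6.0", "7.0", "8.0"],
]

# Rows are lists, so values are matched to headers by position, not by name.
df = pd.DataFrame(rows, columns=headers)

print(list(df.columns))  # ['Squad', 'xG', 'xA', 'xG', 'xA']
print(df.shape)          # (2, 5)

csv_text = df.to_csv(index=False)
print(csv_text.splitlines()[0])  # Squad,xG,xA,xG,xA
```

One trade-off: with duplicate labels, df["xG"] returns a two-column DataFrame rather than a single Series, so later lookups by name become ambiguous.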

How about reading the column names into a list and adding a suffix to the duplicates:

def parse(self, response):
    columns = []
    for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
        column_name = column_node.xpath("./text()").extract_first()
        columns.append(column_name)            

    matches = []
    for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
        match = {}
        suffixes = {}
        for column_index, column_name in enumerate(columns):
            # Work out the DataFrame name for the current column
            if column_name not in suffixes:
                suffixes[column_name] = 1
                df_name = column_name # no suffix for the first catch
            else:
                suffixes[column_name] += 1
                df_name = f'{column_name}_{suffixes[column_name]}'

            if column_name=='Squad':
                match[df_name]=row.xpath('th/a/text()').extract_first()
            else:
                match[df_name] = row.xpath(
                    "./td[{index}]//text()".format(index=column_index)
                ).extract_first()

        matches.append(match)
    
    print(matches)

    # columns is a list here, so it has no .keys(); the dicts in matches
    # already carry the suffixed names in column order.
    df = pd.DataFrame(matches)

    yield df.to_csv("test_squads.csv",sep=",", index=False)
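The suffix logic above runs again for every row. As a variation, the unique names could be computed once from the header list and reused. The sketch below assumes a hypothetical helper called make_unique, which is not part of Scrapy or pandas:

```python
from collections import Counter


def make_unique(names):
    """Return names with _2, _3, ... appended to repeated entries."""
    seen = Counter()
    unique = []
    for name in names:
        seen[name] += 1
        # The first occurrence keeps its name; later ones get a numeric suffix.
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique


print(make_unique(["Squad", "xG", "xA", "xG", "xA"]))
# ['Squad', 'xG', 'xA', 'xG_2', 'xA_2']
```

This does not guard against a header that is already literally named xG_2, so it is only a starting point if the table might contain such names.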