Python: Appending XPath values to a list with a for loop in Scrapy


python pandas numpy scrapy


I'm trying to automate scraping an HTML table with Scrapy. This is what I have so far:

import scrapy
import pandas as pd

class XGSpider(scrapy.Spider):

    name = 'expectedGoals'

    start_urls = [
        'https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures',
    ]

    def parse(self, response):

        matches = []

        for row in response.xpath('//*[@id="sched_ks_3232_1"]//tbody/tr'):

            match = {
                'home': row.xpath('td[4]//text()').extract_first(),
                'homeXg': row.xpath('td[5]//text()').extract_first(),
                'score': row.xpath('td[6]//text()').extract_first(),
                'awayXg': row.xpath('td[7]//text()').extract_first(),
                'away': row.xpath('td[8]//text()').extract_first()
            }

            matches.append(match)

        x = pd.DataFrame(
            matches, columns=['home', 'homeXg', 'score', 'awayXg', 'away'])

        # to_csv writes the file and returns None, so there is nothing to yield.
        x.to_csv("xG.csv", sep=",", index=False)

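It works fine, but as you can see, I'm hardcoding the keys (home, homeXg, etc.) for the match object. I'd like to scrape the keys into a list automatically and then initialize the dict with the keys from that list. The problem is that I don't know how to loop over an XPath by index. For example: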
        headers = []
        for row in response.xpath('//*[@id="sched_ks_3260_1"]/thead/tr'):
            yield {
                'first': row.xpath('th[1]/text()').extract_first(),
                'second': row.xpath('th[2]/text()').extract_first()
            }

Is it possible to put th[1], th[2], th[3], etc. into a for loop, using the number as an index, and append the values to a list? For example, row.xpath('th[i]/text()').extract_first()?
Untested, but this should work:
# Map each header name to its 1-based column position (XPath positions start at 1).
columns = {}
for column_index, column_node in enumerate(
        response.xpath('//*[@id="sched_ks_3260_1"]/thead/tr/th'), start=1):
    column_name = column_node.xpath('./text()').extract_first()
    columns[column_name] = column_index

# Initialise the result list once, outside both loops.
matches = []

for row in response.xpath('//*[@id="sched_ks_3232_1"]//tbody/tr'):
    match = {}
    for column_name, column_index in columns.items():
        # Format the column position into the XPath expression.
        match[column_name] = row.xpath(
            './td[{index}]//text()'.format(index=column_index)).extract_first()
    matches.append(match)

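I'm not sure I understand the problem. Wouldn't an f-string solve it? Something like: row.xpath(f'th[{index_var}]/text()')?

Sorry, I'm new to Python, so the question may not have been clear. The header keys are currently hardcoded and I'd like to scrape them automatically. To do that, I have to work out how to count the number of columns in the table and then loop over each XPath, and I don't know how to do that.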
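
The counting-and-looping step described in the last comment can be sketched roughly as follows. This is only an illustration, not the original poster's spider: it uses the standard library's xml.etree.ElementTree (which supports positional paths such as th[1]) instead of Scrapy's selectors so that it runs without a network request, and SAMPLE_TABLE and table_to_dicts are hypothetical names with made-up data. With Scrapy, the same f-string idea would apply to row.xpath(...).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the scraped table (not real data).
SAMPLE_TABLE = """
<table id="demo">
  <thead><tr><th>home</th><th>score</th><th>away</th></tr></thead>
  <tbody>
    <tr><td>Team A</td><td>2-1</td><td>Team B</td></tr>
    <tr><td>Team C</td><td>0-0</td><td>Team D</td></tr>
  </tbody>
</table>
"""


def table_to_dicts(markup):
    """Return (headers, rows) where each row is a dict keyed by header text."""
    table = ET.fromstring(markup.strip())
    header_row = table.find("./thead/tr")

    # Count the columns instead of hardcoding them.
    n_columns = len(header_row.findall("th"))

    # Path positions are 1-based, so loop from 1 to n_columns inclusive
    # and format the index into the expression with an f-string.
    headers = [header_row.find(f"th[{i}]").text for i in range(1, n_columns + 1)]

    rows = []
    for tr in table.findall("./tbody/tr"):
        # Reuse the same positions to pick the matching cell in each row.
        rows.append({name: tr.find(f"td[{i}]").text
                     for i, name in enumerate(headers, start=1)})
    return headers, rows


headers, rows = table_to_dicts(SAMPLE_TABLE)
print(headers)
print(rows[0])
```

The sketch assumes every header cell holds plain text and that header and body cells line up one to one. On a real page, header text may sit inside nested elements, or the first cell of each body row may be a th rather than a td, in which case the positions would need adjusting.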