Python: Using a for loop in Scrapy to append XPath values to a list
I'm trying to automate scraping an HTML table with Scrapy. This is what I have so far:
import scrapy
import pandas as pd


class XGSpider(scrapy.Spider):
    name = 'expectedGoals'
    start_urls = [
        'https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures',
    ]

    def parse(self, response):
        matches = []
        for row in response.xpath('//*[@id="sched_ks_3232_1"]//tbody/tr'):
            match = {
                'home': row.xpath('td[4]//text()').extract_first(),
                'homeXg': row.xpath('td[5]//text()').extract_first(),
                'score': row.xpath('td[6]//text()').extract_first(),
                'awayXg': row.xpath('td[7]//text()').extract_first(),
                'away': row.xpath('td[8]//text()').extract_first()
            }
            matches.append(match)
        x = pd.DataFrame(
            matches, columns=['home', 'homeXg', 'score', 'awayXg', 'away'])
        yield x.to_csv("xG.csv", sep=",", index=False)
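It works fine, but as you can see, I'm hard-coding the keys (home, homeXg, and so on) for the match object. I'd like to scrape the keys into a list automatically and then initialize the dict with the keys from that list. The problem is that I don't know how to loop over an XPath by index. For example: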
headers = []
for row in response.xpath('//*[@id="sched_ks_3260_1"]/thead/tr'):
    yield {
        'first': row.xpath('th[1]/text()').extract_first(),
        'second': row.xpath('th[2]/text()').extract_first()
    }
Is it possible to put th[1], th[2], th[3], etc. into a for loop, using the number as the index, and append the values to a list, e.g. row.xpath('th[i]/text()').extract_first()?

Untested, but this should work:
# Map each header name to its 1-based column position
column_index = 1
columns = {}
for column_node in response.xpath('//*[@id="sched_ks_3260_1"]/thead/tr/th'):
    column_name = column_node.xpath('./text()').extract_first()
    columns[column_name] = column_index
    column_index += 1

# Build one dict per body row, keyed by the scraped header names
matches = []
for row in response.xpath('//*[@id="sched_ks_3232_1"]//tbody/tr'):
    match = {}
    for column_name in columns.keys():
        match[column_name] = row.xpath('./td[{index}]//text()'.format(index=columns[column_name])).extract_first()
    matches.append(match)
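
The header-driven approach can be tried offline on a small made-up table. The sketch below is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree in place of Scrapy selectors so that it runs without a network request, and the table contents and the rows_to_dicts helper name are hypothetical, not taken from the fbref page.

```python
import xml.etree.ElementTree as ET

# Made-up sample table (assumption); the real fbref markup is not reproduced here.
SAMPLE_TABLE = """
<table id="demo">
  <thead><tr><th>home</th><th>homeXg</th><th>score</th></tr></thead>
  <tbody>
    <tr><td>Arsenal</td><td>1.8</td><td>2-0</td></tr>
    <tr><td>Everton</td><td>0.9</td><td>1-1</td></tr>
  </tbody>
</table>
"""


def rows_to_dicts(table_xml):
    """Hypothetical helper: return one dict per body row, keyed by header text."""
    table = ET.fromstring(table_xml.strip())
    # Map header name -> 1-based position, because XPath counts from 1.
    columns = {
        th.text: index
        for index, th in enumerate(table.findall("./thead/tr/th"), start=1)
    }
    matches = []
    for row in table.findall("./tbody/tr"):
        match = {}
        for name, index in columns.items():
            cell = row.find(f"./td[{index}]")
            # A missing cell becomes None, like extract_first() on no match.
            match[name] = cell.text if cell is not None else None
        matches.append(match)
    return matches


matches = rows_to_dicts(SAMPLE_TABLE)
print(matches)
```

In a spider, the ElementTree lookups would be replaced by response.xpath(...) and row.xpath(...).extract_first() calls like the ones used elsewhere in this thread.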
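I'm not sure I understand the question. Wouldn't an f-string solve your problem? Something like: row.xpath(f'th[{index_var}]/text()')?

Sorry, I'm new to Python and the question may not have been clear. The header keys are currently hard-coded and I'd like to scrape them automatically, but to do that I have to figure out how to count the number of columns in the table and then loop over each XPath, and I don't know how to do that.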
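
A minimal sketch of the "count the columns, then loop by index" idea, combining it with the f-string suggestion. It is an assumption-laden illustration: the header row is made up, and xml.etree.ElementTree stands in for a Scrapy selector so the snippet is runnable on its own.

```python
import xml.etree.ElementTree as ET

# Made-up header row (assumption), standing in for the table's thead/tr.
HEADER_ROW = (
    "<tr><th>home</th><th>homeXg</th><th>score</th>"
    "<th>awayXg</th><th>away</th></tr>"
)
row = ET.fromstring(HEADER_ROW)

# Count the header cells first, then visit th[1] .. th[n] by index.
n_columns = len(row.findall("./th"))
headers = []
for i in range(1, n_columns + 1):  # XPath positions start at 1, not 0
    headers.append(row.find(f"./th[{i}]").text)

print(headers)
```

With a Scrapy selector, the count would be len(row.xpath('th')) and the loop body would be row.xpath(f'th[{i}]/text()').extract_first(). Iterating directly over row.xpath('th') avoids the index entirely, which is what the answer above does.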