Python: Scraping an HTML table with Scrapy into a DataFrame with duplicate headers
I've hit a brick wall scraping an HTML table. Basically, I have a piece of code that first stores the column names in a dict, using them as keys, and then uses those keys to collect the corresponding xpath entries for each row into a separate dict. The rows are then put into a pandas DataFrame and finally written out as a CSV for end use.
import scrapy
from scrapy.selector import Selector
import re
import pandas as pd

class PostSpider(scrapy.Spider):
    name = "standard_squads"
    start_urls = [
        "https://fbref.com/en/comps/11/stats/Serie-A-Stats",
    ]

    def parse(self, response):
        column_index = 1
        columns = {}
        for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
            column_name = column_node.xpath("./text()").extract_first()
            print("column name is: " + column_name)
            columns[column_name] = column_index
            column_index += 1

        matches = []
        for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
            match = {}
            for column_name in columns.keys():
                if column_name=='Squad':
                    match[column_name]=row.xpath('th/a/text()').extract_first()
                else:
                    match[column_name] = row.xpath(
                        "./td[{index}]//text()".format(index=columns[column_name]-1)
                    ).extract_first()
            matches.append(match)

        print(matches)
        df = pd.DataFrame(matches,columns=columns.keys())
        yield df.to_csv("test_squads.csv",sep=",", index=False)
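However, I've just realised that the column header names returned by the xpath (//*[@id="stats_standard_squads"]/thead/tr[2]/th) actually contain duplicates (for example, xG appears twice in the table on the page, as does xA). Because the names are used as dict keys, each repeated name overwrites the earlier entry, so when I loop over columns.keys() the duplicates are gone and I end up with only 20 columns in the final CSV instead of 25.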
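The collapse described above can be reproduced without Scrapy. The following is a minimal sketch; the `headers` list is a made-up sample, not the real table header, and it only illustrates how a dict keeps one entry per key.

```python
# Hypothetical header row with repeated names (not the real fbref header)
headers = ["Squad", "Gls", "xG", "xA", "xG", "xA"]

columns = {}
for position, name in enumerate(headers, start=1):
    # A repeated name overwrites the earlier entry instead of adding a new one
    columns[name] = position

unique_count = len(columns)   # fewer entries than len(headers)
xg_position = columns["xG"]   # position of the last xG, not the first
```

Here six headers produce only four keys, and the stored position for `xG` is the one from its second occurrence, which is why the first `xG` column never reaches the DataFrame.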
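I'm not sure what to do now. I've tried adding the column names to a list, setting them as the DataFrame header, and then appending a new row each time, but that feels far more involved than it should be for something this simple. I was hoping there is a simpler solution for this kind of automated scrape that allows duplicate names in the DataFrame columns?

How about reading the column names into a list and adding a suffix to the repeated ones: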
def parse(self, response):
    # Keep the headers in a list so repeated names such as xG and xA survive
    columns = []
    for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
        column_name = column_node.xpath("./text()").extract_first()
        columns.append(column_name)

    matches = []
    for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
        match = {}
        suffixes = {}
        for column_index, column_name in enumerate(columns):
            # Build a unique DataFrame key for the current column
            if column_name not in suffixes:
                suffixes[column_name] = 1
                df_name = column_name  # no suffix for the first occurrence
            else:
                suffixes[column_name] += 1
                df_name = f'{column_name}_{suffixes[column_name]}'

            if column_name == 'Squad':
                match[df_name] = row.xpath('th/a/text()').extract_first()
            else:
                # Header 0 is the <th> cell, so header N lines up with td[N]
                match[df_name] = row.xpath(
                    "./td[{index}]//text()".format(index=column_index)
                ).extract_first()
        matches.append(match)

    print(matches)
    # columns is a list and has no .keys(); the dict keys already hold the suffixed names
    df = pd.DataFrame(matches)
    # to_csv returns None when given a path, so there is nothing to yield
    df.to_csv("test_squads.csv", sep=",", index=False)
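The suffix logic does not depend on the row, so it could also be computed once from the header list rather than rebuilt inside the row loop. A minimal sketch of that variant follows; `dedupe_headers` is a hypothetical helper name, and the header and row values are made-up samples rather than data from the page.

```python
from collections import Counter


def dedupe_headers(names):
    """Return names with _2, _3, ... appended to repeated entries."""
    seen = Counter()
    unique = []
    for name in names:
        seen[name] += 1
        # The first occurrence keeps its name; later ones get a numeric suffix
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique


# Hypothetical header row and one row of cell text
headers = ["Squad", "Gls", "xG", "xA", "xG", "xA"]
row_values = ["Atalanta", "10", "9.5", "7.1", "8.8", "6.4"]

unique_headers = dedupe_headers(headers)
match = dict(zip(unique_headers, row_values))
```

A list of such `match` dicts can be passed to `pd.DataFrame` in the same way as `matches` in the spider, and every column then has a distinct name.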