Python: problem reading HTML files from a folder
I have two HTML files that I want to read from as if they were websites, but I get an error on the line that starts with date_part, which makes me think I am not reading the files correctly. The code I used to save the pages to files:
import requests
from bs4 import BeautifulSoup

game_links = [
    'https://rugby.statbunker.com/competitions/MatchDetails/Gallagher-Premiership-19/20/Harlequins-VS-Bristol-Bears?comp_id=609&match_id=39862&date=26-Oct-2019',
    'https://rugby.statbunker.com/competitions/MatchDetails/World-Cup-2007/France-VS-Argentina?comp_id=239&match_id=15479&date=07-Sep-2007'
]
for link in game_links:
    # Download each page and print the parsed HTML
    response = requests.get(link)
    html_loop = response.content
    soup_loop = BeautifulSoup(html_loop, 'html.parser')
    print(soup_loop)
Each output was saved as its own HTML file. The code I run to extract data from them:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import time
import uuid
game_links = [open('test1.html', 'r', encoding='utf-8'), open('test2.html', 'r', encoding='utf-8')]
game = {}
for link in game_links:
    soup_loop = link.read()
    game['uuid'] = uuid.uuid1()
    date_part = soup_loop.find('img', {'src': '/images/date.png'}).text
    time_part = soup_loop.find('img', {'src': '/images/kickoff.png'}).text
    if time_part == '':
        game['datetime'] = datetime.strptime(date_part, '%d %b %Y')
    else:
        game['datetime'] = datetime.combine(datetime.strptime(date_part, '%d %b %Y'), datetime.strptime(time_part, '%H:%M').time())
    print(game)
After reading the file, you still need to parse its contents with BeautifulSoup:
for link in game_links:
    text = link.read()
    soup_loop = BeautifulSoup(text, 'html.parser')
    game['uuid'] = uuid.uuid1()
    date_part = soup_loop.find('img', {'src': '/images/date.png'}).text
    time_part = soup_loop.find('img', {'src': '/images/kickoff.png'}).text
    if time_part == '':
        game['datetime'] = datetime.strptime(date_part, '%d %b %Y')
    else:
        game['datetime'] = datetime.combine(datetime.strptime(date_part, '%d %b %Y'), datetime.strptime(time_part, '%H:%M').time())
    print(game)
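A likely reason the unparsed version fails on the date_part line: link.read() returns a plain str, so soup_loop.find(...) is actually str.find, which expects a substring and integer offsets rather than a tag name and an attribute dictionary. The short sketch below reproduces that behaviour with a made-up HTML snippet (the snippet and the variable names are illustrative, not taken from the real pages):

```python
# Illustrative snippet; the real pages may be structured differently.
html_text = '<p><img src="/images/date.png"/> 26 Oct 2019</p>'

# On a str, .find() is the substring search str.find(sub, start, end).
print(html_text.find('img'))  # index of the first 'img' substring

# Passing an attribute dict as the second argument makes it the `start`
# offset, which must be an integer, so Python raises TypeError.
try:
    html_text.find('img', {'src': '/images/date.png'})
    raised_type_error = False
except TypeError:
    raised_type_error = True

print(raised_type_error)
```

Once the text is passed through BeautifulSoup first, find() is the tag search that the rest of the code expects.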
You need to create the soup object first. Most of the time you can do it like this:
soup = BeautifulSoup(soup_loop, 'html.parser')
Then the rest of your code follows.
First of all, it is very bad practice to open files and leave them hanging long before you use them. Unless you have a specific reason, you should have game_links contain the file names, and open, process, and close the files one by one inside the loop. As for the problem itself, please add the error message you are getting.
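The advice about file handling could be sketched roughly as follows, assuming the two saved pages are named test1.html and test2.html as in the question. The helper names read_html_files and parse_html_files are hypothetical, introduced only for this illustration:

```python
def read_html_files(filenames):
    """Yield (filename, text) pairs, opening and closing one file at a time."""
    for name in filenames:
        # The with-block closes the file as soon as its contents are read.
        with open(name, 'r', encoding='utf-8') as handle:
            yield name, handle.read()


def parse_html_files(filenames):
    """Return a dict mapping each filename to its parsed BeautifulSoup object."""
    # Third-party dependency (bs4); imported here so that the file-reading
    # helper above can be used without it.
    from bs4 import BeautifulSoup
    return {name: BeautifulSoup(text, 'html.parser')
            for name, text in read_html_files(filenames)}
```

With this layout, game_links would hold file names such as ['test1.html', 'test2.html'], and the loop body from the question would work on each soup object returned by parse_html_files(game_links).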