Problem reading HTML files from a folder in Python

python html

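I have two HTML files that I want to read as though they were websites, but I get an error on the line that starts with date_part, which makes me think I am not reading the files correctly. The code I used to save the files: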

game_links = [
    'https://rugby.statbunker.com/competitions/MatchDetails/Gallagher-Premiership-19/20/Harlequins-VS-Bristol-Bears?comp_id=609&match_id=39862&date=26-Oct-2019',
    'https://rugby.statbunker.com/competitions/MatchDetails/World-Cup-2007/France-VS-Argentina?comp_id=239&match_id=15479&date=07-Sep-2007'
]
for link in game_links:
    response = requests.get(link)
    html_loop = response.content
    soup_loop = BeautifulSoup(html_loop, 'html.parser')
    print(soup_loop)

Each output was saved as its own HTML file. The code I run to extract data from them:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import time
import uuid

game_links = [open('test1.html', 'r', encoding='utf-8'), open('test2.html', 'r', encoding='utf-8')]

game = {}

for link in game_links:
    soup_loop = link.read()

    game['uuid'] = uuid.uuid1()
    date_part = soup_loop.find('img', {'src': '/images/date.png'}).text
    time_part = soup_loop.find('img', {'src': '/images/kickoff.png'}).text
    if time_part == '':
        game['datetime'] = datetime.strptime(date_part, '%d %b %Y')
    else:
        game['datetime'] = datetime.combine(datetime.strptime(date_part, '%d %b %Y'), datetime.strptime(time_part, '%H:%M').time())
    print(game)

After reading a file you get only its text, so you still have to parse that text with BeautifulSoup:

for link in game_links:
    text = link.read()
    soup_loop = BeautifulSoup(text, 'html.parser')
    game['uuid'] = uuid.uuid1()
    date_part = soup_loop.find('img', {'src': '/images/date.png'}).text
    time_part = soup_loop.find('img', {'src': '/images/kickoff.png'}).text
    if time_part == '':
        game['datetime'] = datetime.strptime(date_part, '%d %b %Y')
    else:
        game['datetime'] = datetime.combine(datetime.strptime(date_part, '%d %b %Y'), datetime.strptime(time_part, '%H:%M').time())
    print(game)
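
To illustrate why the original loop fails: link.read() returns a plain str, and str.find expects a substring plus optional integer start and end positions, so passing a dict as the second argument raises a TypeError. The snippet below is a minimal sketch; the HTML string is made up for illustration and is not taken from the saved pages.

```python
# Hypothetical markup, standing in for the text returned by link.read().
html_text = '<p><img src="/images/date.png"/>26 Oct 2019</p>'

error_type = None
try:
    # On a str, the second argument of find() must be an integer start index,
    # so a dict of attributes is rejected.
    html_text.find('img', {'src': '/images/date.png'})
except TypeError as exc:
    error_type = type(exc).__name__
    print('str.find failed:', exc)

# str.find only searches for a substring and returns its position (or -1).
index = html_text.find('img')
print(index)
```

Only after the text has been passed to BeautifulSoup does find accept a tag name and an attribute dict.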

You need to create the soup object first. In most cases you can do it like this:

soup = BeautifulSoup(soup_loop)

and then continue with the rest of your code.

First, opening files long before you use them and leaving them hanging open is very bad practice. Unless you have a special reason, game_links should contain the file names, and the loop should open, process, and close the files one at a time. As for the problem itself, please add the error message you are getting.
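
The advice about keeping file names in the list and opening each file inside the loop could look roughly like the sketch below. The helper name load_soups is hypothetical, and the file names are assumed to match the ones used in the question; a with block closes each file as soon as its text has been read.

```python
from bs4 import BeautifulSoup


def load_soups(file_names):
    """Open each file by name, parse it, and close it before moving on.

    Hypothetical helper for illustration; returns {file name: soup object}.
    """
    soups = {}
    for name in file_names:
        # The with block closes the file even if reading or parsing fails.
        with open(name, 'r', encoding='utf-8') as handle:
            soups[name] = BeautifulSoup(handle.read(), 'html.parser')
    return soups


if __name__ == '__main__':
    # Assumed file names, taken from the question.
    game_links = ['test1.html', 'test2.html']
    for name, soup in load_soups(game_links).items():
        print(name, soup.find('img', {'src': '/images/date.png'}))
```

Storing names instead of open file objects also means the list can be reused, since a file object that has already been read to the end returns an empty string on the next read().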