Python: incorrect data when using BeautifulSoup


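I want to parse a website that lists movie sessions (showtimes). I am using the BeautifulSoup parser for this, but it returns incorrect data. For example, if I check the page source manually, the times for the 27th are 23:45 and 19:40. But the parser returns the incorrect list

['21:00', '23:00']

and incorrect data from the div: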

<div class="showtimes-line has-21 has-23">
 <div class="showtimes-line-technology t-cinetech t-2d">
  <div class="showtimes-line-technology-title ">
   Cinetech+, 2D
  </div>
  <div class="showtimes-line-hours-wrapper">
   <a class="time h-21 " data-brand="Планета Кіно" data-category="2d" data-id="00000000000000000000000000000631" data-list="movie" data-name="Дедпул 2 (18+)" data-position="4" data-seat="" href="https://pay.planetakino.ua/hall/imax-kiev/484437" rel="nofollow">
    21:00
   </a>
   <a class="time h-23 " data-brand="Планета Кіно" data-category="2d" data-id="00000000000000000000000000000631" data-list="movie" data-name="Дедпул 2 (18+)" data-position="5" data-seat="" href="https://pay.planetakino.ua/hall/imax-kiev/486327" rel="nofollow">
    23:00
   </a>
  </div>
 </div>
</div>

The request is made as follows:

url='https://planetakino.ua/lvov2/movies/deadpool_2/#cinetech_2d_3d_4dx_week'
response = requests.get(url)
sessions = get_sessions(response, film.period)
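A quick way to check which cinema the returned markup belongs to is to print each showtime together with the hall slug from its order link. The sketch below does this on a trimmed copy of the markup above; times_with_hall is a hypothetical helper, not part of the original code. In this sample the links point to imax-kiev even though the requested URL names lvov2, which suggests (an assumption, not verified against the site) that the server returned another cinema's schedule.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (most data-* attributes omitted).
HTML = """
<div class="showtimes-line has-21 has-23">
 <div class="showtimes-line-hours-wrapper">
  <a class="time h-21 " href="https://pay.planetakino.ua/hall/imax-kiev/484437" rel="nofollow">
   21:00
  </a>
  <a class="time h-23 " href="https://pay.planetakino.ua/hall/imax-kiev/486327" rel="nofollow">
   23:00
  </a>
 </div>
</div>
"""


def times_with_hall(html):
    """Return (time, hall slug) pairs for every a.time link in the markup."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for link in soup.select('a.time'):
        # The hall slug is the second-to-last path segment of the order URL.
        hall = link['href'].rstrip('/').split('/')[-2]
        pairs.append((link.get_text(strip=True), hall))
    return pairs


pairs = times_with_hall(HTML)
print(pairs)
```

If the hall slug does not match the cinema you asked for, the parsing itself is probably fine and the problem lies in what the server sent back.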

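I did not notice that you had provided the code (hosted on GitHub, where film.period comes from), so I did not bother debugging it and decided to implement this from scratch instead.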

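After a quick search, I found that the Planeta Kino website publishes XML files containing movie showtimes. You can find some of them there. I don't know why, but there is no listed entry for the lvov2 cinema, the one that corresponds to the link in your question. However, I found it by simply changing part of the URL (see the url variable in get_show_times below).

The following code should do exactly what you want: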

import datetime
import pprint
from typing import List

import dateparser
import requests
from bs4 import BeautifulSoup, Tag

Date = datetime.datetime
Screenings = List[Tag]


def get_movie_id(soup: BeautifulSoup, searched_movie: str) -> int:
    # Find the first <movie> element whose <title> contains the searched name.
    movie = soup.find(
        lambda elem: elem.name == 'movie'
        and elem.title is not None
        and searched_movie in (elem.title.string or '')
    )
    if movie is None:
        raise ValueError(f'Movie not found: {searched_movie}')
    return int(movie['id'])


def get_movie_screenings(soup: BeautifulSoup, movie_id: int, searched_date: Date) -> Screenings:
    formatted_date = searched_date.strftime('%Y-%m-%d')
    # Attribute values are quoted: values starting with a digit are not valid
    # unquoted CSS identifiers, and recent BeautifulSoup versions reject them.
    screenings = soup.select(f'showtimes '
                             f'> day[date="{formatted_date}"] '
                             f'> show[movie-id="{movie_id}"]')
    return screenings


def get_show_times(searched_movie: str, searched_date: Date) -> Screenings:
    url = 'http://planetakino.ua/lvov2/ua/showtimes/xml/'
    response = requests.get(url)
    response.raise_for_status()  # fail early on HTTP errors
    # The 'xml' parser requires lxml to be installed.
    soup = BeautifulSoup(response.text, 'xml')

    movie_id = get_movie_id(soup, searched_movie)
    screenings = get_movie_screenings(soup, movie_id, searched_date)
    return screenings


# dateparser.parse returns None when it cannot understand the input.
date = dateparser.parse(input('Type the date: '))
if date is not None:
    pprint.pprint(get_show_times('Дедпул 2', date))
else:
    print('Sorry, I cannot parse the date you gave me.')

Output:

Type the date: 27 червня, середа
[<show full-date="2018-06-27 19:40:00" hall-id="104" movie-id="2385" order-url="https://pay.planetakino.ua/hall/pk-lvov2/485693" technology="Cinetech+2D" theatre-id="pk-lvov2" time="19:40"/>,
 <show full-date="2018-06-27 23:45:00" hall-id="101" movie-id="2385" order-url="https://pay.planetakino.ua/hall/pk-lvov2/485506" technology="4dx" theatre-id="pk-lvov2" time="23:45"/>]

Type the date: 27th
[<show ... />,
 <show ... />]
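The function returns whole show tags, while the question asked for a plain list of times. A minimal sketch of pulling out just the time attribute follows. It deliberately swaps BeautifulSoup for the standard library's xml.etree.ElementTree so it runs without extra packages, and it works on an inline sample: the showtimes > day > show layout is assumed from the selector and output above, while the root element name and the third show entry are made up for illustration. extract_times is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Inline sample. The root element name and the third <show> are invented;
# the first two <show> entries are abbreviated from the output above.
SAMPLE = """
<planeta-kino>
  <showtimes>
    <day date="2018-06-27">
      <show full-date="2018-06-27 19:40:00" movie-id="2385" technology="Cinetech+2D" time="19:40"/>
      <show full-date="2018-06-27 23:45:00" movie-id="2385" technology="4dx" time="23:45"/>
      <show full-date="2018-06-27 21:00:00" movie-id="1111" technology="2d" time="21:00"/>
    </day>
  </showtimes>
</planeta-kino>
"""


def extract_times(xml_text, movie_id, date):
    """Return the time attribute of every show for one movie on one day."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    path = f".//showtimes/day[@date='{date}']/show[@movie-id='{movie_id}']"
    return [show.get('time') for show in root.findall(path)]


times = extract_times(SAMPLE, 2385, '2018-06-27')
print(times)
```

With BeautifulSoup the equivalent step would be reading screening['time'] from each returned tag.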

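I used dateparser to parse the input date, so it works with different formats and languages, such as June 27, 27th, 27 червня, and so on. It is really great and I love it.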

Take some time to read and understand the code.

Note: you need Python 3.6+ because I used formatted string literals (f-strings) and type hints (3.5+).
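For readers who have not met these two features, here is a small sketch that uses both to build the CSS selector for one movie on one day. build_selector is a hypothetical helper name. The attribute values are quoted, since values that start with a digit are not valid unquoted CSS identifiers.

```python
import datetime


def build_selector(movie_id: int, day: datetime.date) -> str:
    """Build a CSS selector for the shows of one movie on one day."""
    formatted_date = day.strftime('%Y-%m-%d')
    # Adjacent f-strings are concatenated into a single string.
    return (f'showtimes '
            f'> day[date="{formatted_date}"] '
            f'> show[movie-id="{movie_id}"]')


selector = build_selector(2385, datetime.date(2018, 6, 27))
print(selector)
# prints: showtimes > day[date="2018-06-27"] > show[movie-id="2385"]
```

The type hints are not enforced at runtime; they only document the expected argument and return types.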
