Python: How can I clean up this result?
I am using a web API (a movie API). When I send a POST request to a specific URL with requests, I get the following response:
<a href='\"https:\/\/xdede.co\/peliculas\/p284052-ver-doctor-strange-online\"' up-target='\"body\"'>\n\t\t\t\t\t\t
<div class='\"SPoster\"'>\n\t\t\t\t\t\t\t
<img src='\"https:\/\/image.tmdb.org\/t\/p\/w45\/7OpmunCEZo93nyRIbx59QRaFvZz.jpg\"'/>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t
<h2>Doctor Strange<\/h2>\n\t\t\t\t\t\t<span>Pelicula<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/div>\n\t\t\t\t"}</span>
</h2></div></a>
How can I filter this mess to get the href and the h2 tag? I tried BeautifulSoup, but could not get it to work. Any suggestions?
Using BeautifulSoup and regex:
import re
import bs4
html = """<a href='\"https:\/\/xdede.co\/peliculas\/p284052-ver-doctor-strange-online\"' up-target='\"body\"'>\n\t\t\t\t\t\t<div class='\"SPoster\"'>\n\t\t\t\t\t\t\t<img src='\"https:\/\/image.tmdb.org\/t\/p\/w45\/7OpmunCEZo93nyRIbx59QRaFvZz.jpg\"'/>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t<h2>Doctor Strange<\/h2>\n\t\t\t\t\t\t<span>Pelicula<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/div>\n\t\t\t\t"}</span></h2></div></a>"""
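# Parse the fragment; html.parser tolerates the malformed markup.
soup = bs4.BeautifulSoup(html, features='html.parser')
# The href value still carries stray quotes and backslashes, so strip them.
href = re.sub(r'[\\"]', '', soup.a['href'])
# Escaped closing tags such as <\/h2> are not recognised as tags, so the <h2> text
# runs on into the following <span>; remove any tag-like leftovers.
h2 = re.sub(r'<[^>]*>', '', soup.a.h2.text)
# Keep only the words, joined by single spaces.
h2 = ' '.join(re.findall(r'(\w+)', h2))
print(href)
print(h2)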
Output:
https://xdede.co/peliculas/p284052-ver-doctor-strange-online
Doctor Strange Pelicula
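The escapes in the response (\", \/, \n, \t) look like JSON string escaping, so another option may be to decode the JSON first and only then parse the HTML. The sketch below is a minimal illustration under assumptions: the payload is rebuilt by hand from the fragment in the question, the key name "html" is hypothetical, and the attribute quoting is assumed to be plain escaped double quotes. It uses only the standard library (json and html.parser) so that it runs without extra packages.

```python
import json
from html.parser import HTMLParser

# Hypothetical payload: the key name "html" and the exact quoting are assumptions,
# modelled on the escaped fragment shown in the question.
raw = (
    r'{"html": "<a href=\"https:\/\/xdede.co\/peliculas\/p284052-ver-doctor-strange-online\" up-target=\"body\">'
    r'\n\t<div class=\"SPoster\">'
    r'<img src=\"https:\/\/image.tmdb.org\/t\/p\/w45\/7OpmunCEZo93nyRIbx59QRaFvZz.jpg\"\/>'
    r'<\/div>\n\t<h2>Doctor Strange<\/h2>\n\t<span>Pelicula<\/span>\n<\/a>"}'
)

# Decoding the JSON turns \" into ", \/ into /, and \n, \t into real whitespace.
fragment = json.loads(raw)["html"]


class LinkTitleParser(HTMLParser):
    """Collect the first <a href> value and the text found inside <h2> elements."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.title_parts = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")
        elif tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.title_parts.append(data)


parser = LinkTitleParser()
parser.feed(fragment)
parser.close()

href = parser.href
title = "".join(parser.title_parts).strip()
print(href)   # https://xdede.co/peliculas/p284052-ver-doctor-strange-online
print(title)  # Doctor Strange
```

If the real response is a JSON document, response.json() from requests should perform the decoding step directly. The decoded fragment can equally be passed to BeautifulSoup, in which case soup.a['href'] and soup.h2.text should need no regex cleanup, and the title no longer picks up the text of the following span.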