
Scraping URLs with BeautifulSoup in Python 3

python python-3.x

I tried this code, but the list of URLs stays empty. There is no error, nothing at all.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

soup = BeautifulSoup(html_page, features="xml")
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^https://www.metacritic.com/movie/")}):
    links.append(link.get('href'))

print(links)

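I want to scrape every URL on the given page that starts with the prefix used in the regular expression above.

What am I doing wrong?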

Your code is correct.

The list stays empty because no URL on that page matches the pattern. Try re.compile("^/movie/") instead.
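To see why the original pattern finds nothing, here is a minimal sketch that runs both patterns over a few href values. The values in sample_hrefs are hypothetical stand-ins shaped like relative links, not data taken from the live page, and the sketch assumes the pattern is searched against the raw href string.

```python
import re

# Hypothetical href values, shaped like relative links on a listing page.
sample_hrefs = [
    "/movie/example-title",
    "/movie/another-title/trailers",
    "/browse/movies/genre/date?page=1",
]

# Pattern from the question: expects an absolute URL.
absolute_pattern = re.compile(r"^https://www\.metacritic\.com/movie/")
# Pattern from the answer: expects a relative path.
relative_pattern = re.compile(r"^/movie/")

absolute_matches = [h for h in sample_hrefs if absolute_pattern.search(h)]
relative_matches = [h for h in sample_hrefs if relative_pattern.search(h)]

print(absolute_matches)  # []
print(relative_matches)  # ['/movie/example-title', '/movie/another-title/trailers']
```

None of the relative paths begin with the scheme and host, so the absolute pattern can never match them.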

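First, you should parse the page content with the standard library's html.parser instead of xml. It copes better with broken HTML.

Then look at the source of the page you are parsing. The links you are looking for use relative href values that begin with /movie/, not the full https://www.metacritic.com/movie/ prefix.

So change your code as follows: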

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

# Send a browser-like User-Agent header with the request
req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

# Parse the response as HTML rather than XML
soup = BeautifulSoup(html_page, 'html.parser')
links = []
# The hrefs are relative, so anchor the pattern at /movie/
for link in soup.find_all('a', attrs={'href': re.compile("^/movie/")}):
    links.append(link.get('href'))

print(links)
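The hrefs collected this way are relative paths. If you need full URLs, as the question seems to want, one possible approach is to join each path with the page URL using urllib.parse.urljoin. This is only a sketch: the entries in relative_links are hypothetical examples, not output from the real page.

```python
from urllib.parse import urljoin

# URL of the page that was scraped (taken from the question).
base_url = "https://www.metacritic.com/browse/movies/genre/date?page=0"

# Hypothetical relative hrefs, standing in for the contents of `links`.
relative_links = ["/movie/example-title", "/movie/another-title"]

# urljoin resolves each path against the scheme and host of base_url.
absolute_links = [urljoin(base_url, href) for href in relative_links]

print(absolute_links)
```

Because each href starts with a slash, urljoin replaces the whole path and query of base_url and keeps only its scheme and host.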

@leiropi is right, features="xml" was also causing you problems. soup = BeautifulSoup(html_page, "lxml") does give the correct result.

Thanks a lot. One more question, since you seem to be good with regular expressions: how can I leave out every URL that is not of the form /movie/movie-name, for example /movie/movie-name/trailers? I tried re.compile("^/movie/+[^\/]") but it keeps all the unwanted URLs.

You can use a regular expression like ^/movie/[a-zA-Z0-9\-]+$ to match only links that contain nothing but letters, digits and the minus sign after /movie/.

if '/trailers/' not in link.get('href'): links.append(link.get('href'))
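A short sketch comparing the attempted pattern with the one suggested in the comments. The href values are hypothetical, and the sub-page names (trailers, critic-reviews) are assumptions used only for illustration.

```python
import re

# Hypothetical hrefs: two movie pages and two sub-pages.
hrefs = [
    "/movie/example-title",
    "/movie/example-title/trailers",
    "/movie/example-title/critic-reviews",
    "/movie/another-title-2019",
]

# Attempted pattern: only requires one non-slash character after /movie/,
# so anything longer still matches.
loose_pattern = re.compile(r"^/movie/+[^\/]")

# Suggested pattern: the trailing $ forbids any further path segment.
strict_pattern = re.compile(r"^/movie/[a-zA-Z0-9\-]+$")

loose_matches = [h for h in hrefs if loose_pattern.search(h)]
strict_matches = [h for h in hrefs if strict_pattern.search(h)]

print(loose_matches)   # all four hrefs
print(strict_matches)  # ['/movie/example-title', '/movie/another-title-2019']
```

The attempt fails because it has no end anchor. Note that the strict pattern would also reject a movie name containing any other character, so a character class such as [^/]+ may be a more forgiving choice if that turns out to matter.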