
Python BeautifulSoup and <meta> tags


Basically, I am trying to get all the meta tags from a site with bs4:

import urllib.request
from bs4 import BeautifulSoup

# Fetch the page and decode the raw bytes
response = urllib.request.urlopen("https://grab.careers/").read()
response_decode = response.decode('utf-8')

# Parse and collect all <meta> tags
soup = BeautifulSoup(response_decode, "html.parser")
metatags = soup.find_all('meta')

# Write each matched tag to a file
file = open('text.out', 'w')
for x in metatags:
    file.write(str(x))
file.close()
I expect the above code to return only the meta tags. However, as you can see from the following snippet, soup returns meta, link, and script content as well:

<meta content="Grab Careers | Working For A Better Southeast Asia" name="twitter:title" />
<meta content="Working For A Better Southeast Asia on Grab Careers�" name="twitter:description" />
<link href="https://grab.careers/" rel="canonical">
<script
    type="application/ld+json">{"@context":"https://schema.org","@type":"WebSite","url":"https://grab.careers/","name":"Grab Careers","potentialAction":{"@type":"SearchAction","target":"https://grab.careers/search/{search_term_string}","query-input":"required name=search_term_string"}}</script>


I couldn't find any resources addressing this problem. How can I fix it so that only the meta tags are returned?

To be honest, I can't explain exactly why this happens, but the problem goes away if you use the lxml parser instead of html.parser:

soup = BeautifulSoup(response_decode, "lxml")

The lxml package needs to be installed first, e.g.:

pip install lxml
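
For completeness, here is the asker's script with only the parser swapped, a minimal sketch assuming lxml is installed:

import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen("https://grab.careers/").read()
response_decode = response.decode('utf-8')
# lxml recovers cleanly from void tags written without a trailing /
soup = BeautifulSoup(response_decode, "lxml")
metatags = soup.find_all('meta')

with open('text.out', 'w') as file:
    for x in metatags:
        file.write(str(x) + "\n")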

The find_all method gives you each matching tag together with all of its children.
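
Because of that, a parser-independent workaround is to serialize only each tag's own attributes instead of the full subtree. A minimal sketch (the output format here is my own choice, not from the original code):

import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen("https://grab.careers/").read()
soup = BeautifulSoup(response.decode('utf-8'), "html.parser")

with open('text.out', 'w') as f:
    for tag in soup.find_all('meta'):
        # tag.attrs holds only the tag's own attributes, so any children
        # the parser attached to an unclosed <meta> are left out
        f.write(str(tag.attrs) + "\n")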

I think this is caused by the missing closing / on empty (void) HTML tags. A meta tag that is properly self-closed, such as <meta ... /> (note the / at the end), is parsed as a single standalone tag, while a meta tag written as <meta ...> is not parsed as closed (due to the missing /) and therefore contains every following tag as a child until a matching closing tag is encountered. Those child tags are exactly the link and script tags you are seeing.
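
A quick way to check this for yourself is to parse a small malformed snippet with both parsers and inspect what children each one attaches to the meta tag. A minimal sketch (the sample markup is made up, and the exact nesting may vary with your bs4 and parser versions):

from bs4 import BeautifulSoup

# Made-up markup modeled on the question: a <meta> with no trailing /
html = '<meta name="a" content="x"><link href="https://example.com/"><script>var y;</script>'

for parser in ("html.parser", "lxml"):
    soup = BeautifulSoup(html, parser)
    meta = soup.find("meta")
    # If a parser left <meta> open, the link and script tags appear
    # here as its children; an empty list means it was closed properly.
    print(parser, "->", meta.contents)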

This is really helpful, thank you very much. I just learned that some of the tags in the response end with ">" instead of "/>", which breaks html.parser. The lxml parser has no problem with it.