
Python BeautifulSoup and <meta> tags


Basically, I am trying to get all the meta tags from a site with bs4:

import urllib.request
from bs4 import BeautifulSoup

# Fetch the page and decode the raw bytes
response = urllib.request.urlopen("https://grab.careers/").read()
response_decode = response.decode('utf-8')

# Parse and collect all <meta> tags
soup = BeautifulSoup(response_decode, "html.parser")
metatags = soup.find_all('meta')

# Write each matched tag to a file
file = open('text.out', 'w')
for x in metatags:
    file.write(str(x))
file.close()
I expect the above code to return only the meta tags. However, as you can see from the following snippet, soup returns meta, link, and script content as well:

<meta content="Grab Careers | Working For A Better Southeast Asia" name="twitter:title" />
<meta content="Working For A Better Southeast Asia on Grab Careers�" name="twitter:description" />
<link href="https://grab.careers/" rel="canonical">
<script
    type="application/ld+json">{"@context":"https://schema.org","@type":"WebSite","url":"https://grab.careers/","name":"Grab Careers","potentialAction":{"@type":"SearchAction","target":"https://grab.careers/search/{search_term_string}","query-input":"required name=search_term_string"}}</script>


I couldn't find any resources addressing this problem. How can I fix it so that only the meta tags are returned?

To be honest, I can't explain exactly why this happens, but the problem goes away if you use the lxml parser instead of html.parser:

soup = BeautifulSoup(response_decode, "lxml")

The lxml package needs to be installed first, e.g.:

pip install lxml
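
For completeness, here is the asker's script with only the parser swapped, a minimal sketch assuming lxml is installed:

import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen("https://grab.careers/").read()
response_decode = response.decode('utf-8')
# lxml recovers cleanly from void tags written without a trailing /
soup = BeautifulSoup(response_decode, "lxml")
metatags = soup.find_all('meta')

with open('text.out', 'w') as file:
    for x in metatags:
        file.write(str(x) + "\n")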

The find_all method gives you each matching tag together with all of its children.
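
Because of that, a parser-independent workaround is to serialize only each tag's own attributes instead of the full subtree. A minimal sketch (the output format here is my own choice, not from the original code):

import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen("https://grab.careers/").read()
soup = BeautifulSoup(response.decode('utf-8'), "html.parser")

with open('text.out', 'w') as f:
    for tag in soup.find_all('meta'):
        # tag.attrs holds only the tag's own attributes, so any children
        # the parser attached to an unclosed <meta> are left out
        f.write(str(tag.attrs) + "\n")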

I think this is caused by the missing closing / on empty (void) HTML tags. A meta tag that is properly self-closed, such as <meta ... /> (note the / at the end), is parsed as a single standalone tag, while a meta tag written as <meta ...> is not parsed as closed (due to the missing /) and therefore contains every following tag as a child until a matching closing tag is encountered. Those child tags are exactly the link and script tags you are seeing.
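
A quick way to check this for yourself is to parse a small malformed snippet with both parsers and inspect what children each one attaches to the meta tag. A minimal sketch (the sample markup is made up, and the exact nesting may vary with your bs4 and parser versions):

from bs4 import BeautifulSoup

# Made-up markup modeled on the question: a <meta> with no trailing /
html = '<meta name="a" content="x"><link href="https://example.com/"><script>var y;</script>'

for parser in ("html.parser", "lxml"):
    soup = BeautifulSoup(html, parser)
    meta = soup.find("meta")
    # If a parser left <meta> open, the link and script tags appear
    # here as its children; an empty list means it was closed properly.
    print(parser, "->", meta.contents)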

This is really helpful, thank you very much. I just learned that some of the tags in the response end with ">" instead of "/>", which breaks html.parser. The lxml parser has no problem with it.