Python BeautifulSoup parser adds unnecessary closing HTML tags
Suppose you have HTML like this:
<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="keywords" content="HTML,CSS,XML,JavaScript">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
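If you parse it with BeautifulSoup in Python and print the result with prettify(), using a script like this:
from bs4 import BeautifulSoup as bs
import urllib3
URL = 'html file'  # placeholder for the URL of the page
http = urllib3.PoolManager()
page = http.request('GET', URL)
# Parse the response body with Python's built-in HTML parser
soup = bs(page.data, 'html.parser')
print(soup.prettify())
Output: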
<html>
<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="keywords" content="HTML,CSS,XML,JavaScript">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</meta>
</meta>
</meta>
</meta>
</meta>
</head>
But if the meta tags are written in self-closing form, for example:
<meta name="description" content="Free Web tutorials" />
the tag is output as written, and no closing tag is added.
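So how do you stop BeautifulSoup from adding the unnecessary closing tags? To solve this, you only need to switch from the html.parser parser to the lxml parser (lxml is a separate package and must be installed). Your Python script then becomes: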
from bs4 import BeautifulSoup as bs
import urllib3
URL = 'html file'
http = urllib3.PoolManager()
page = http.request('GET', URL)
soup = bs(page.data, 'lxml')
print(soup.prettify())
In short, you only need to change soup = bs(page.data, 'html.parser') to soup = bs(page.data, 'lxml').
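To check the behaviour on your own machine without fetching a page, here is a minimal sketch that parses a shortened copy of the sample markup from a string. The stacked closing meta tags in the output above correspond to meta elements nested inside one another, so the sketch counts nested meta tags for each parser. The names UNCLOSED, SELF_CLOSED and meta_report are made up for this illustration, and the numbers you see may depend on the installed Beautiful Soup version.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened sample markup, inlined so no network request is needed.
UNCLOSED = (
    '<head>\n'
    '<meta charset="UTF-8">\n'
    '<meta name="author" content="John Doe">\n'
    '</head>'
)
# The same markup written with self-closing meta tags.
SELF_CLOSED = (
    '<head>\n'
    '<meta charset="UTF-8" />\n'
    '<meta name="author" content="John Doe" />\n'
    '</head>'
)


def meta_report(markup, parser):
    """Return (total <meta> tags, <meta> tags nested inside another <meta>)."""
    soup = BeautifulSoup(markup, parser)
    metas = soup.find_all("meta")
    # A meta tag with a meta ancestor is what prettify() prints
    # as an extra closing </meta> further down.
    nested = [m for m in metas if m.find_parent("meta") is not None]
    return len(metas), len(nested)


if __name__ == "__main__":
    for parser in ("html.parser", "lxml"):
        for label, markup in (("unclosed", UNCLOSED), ("self-closed", SELF_CLOSED)):
            try:
                total, nested = meta_report(markup, parser)
            except FeatureNotFound:
                # Raised by Beautiful Soup when the requested parser is missing.
                print(f"{parser}: parser not installed, skipping")
                break
            print(f"{parser:12} {label:12} meta tags: {total}, nested: {nested}")
```

A nested count of 0 for a parser means prettify() will not print stray closing meta tags for that markup.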