Python BeautifulSoup parser adds unnecessary closing html tags


Say you have HTML like this:

<head>
  <meta charset="UTF-8">
  <meta name="description" content="Free Web tutorials">
  <meta name="keywords" content="HTML,CSS,XML,JavaScript">
  <meta name="author" content="John Doe">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>

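If you parse it in Python with BeautifulSoup and print it with prettify(), using the script below, the meta tags come out nested inside each other with closing tags added, as in the output that follows the script: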

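from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'  # placeholder for the address of the HTML page

http = urllib3.PoolManager()

# Fetch the page and parse the response body with Python's built-in parser
page = http.request('GET', URL)
soup = bs(page.data, 'html.parser')

print(soup.prettify())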

<html>
<head>
  <meta charset="UTF-8">
    <meta name="description" content="Free Web tutorials">
      <meta name="keywords" content="HTML,CSS,XML,JavaScript">
        <meta name="author" content="John Doe">
          <meta name="viewport" content="width=device-width, initial-scale=1.0">
          </meta>
        </meta>
      </meta>
    </meta>
  </meta>
</head>

But if the meta tags in your HTML are self-closing, for example

<meta name="description" content="Free Web tutorials" />

they are output as they are, and no closing tag is added.
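
To look at the self-closing case in isolation, here is a minimal sketch that parses an inline string instead of fetching a page. The sample markup and the names `SELF_CLOSED_HEAD`, `pretty`, and `meta_parents` are illustrative assumptions and not part of the original script.

```python
from bs4 import BeautifulSoup

# Illustrative sample: the same kind of meta tags, written in self-closing form.
SELF_CLOSED_HEAD = """
<head>
  <meta name="description" content="Free Web tutorials" />
  <meta name="author" content="John Doe" />
</head>
"""

soup = BeautifulSoup(SELF_CLOSED_HEAD, "html.parser")
pretty = soup.prettify()
print(pretty)

# Each <meta ... /> is closed as soon as it is read, so the tags stay siblings
# under <head> and no separate </meta> closing tag is emitted.
meta_parents = [meta.parent.name for meta in soup.find_all("meta")]
print(meta_parents)
```

Checking the parent of each tag is a simple way to tell siblings from nested tags without reading the indentation by eye.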

So how do I stop BeautifulSoup from adding the unnecessary closing tags?

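To fix this, switch from the html.parser parser to the lxml parser. lxml's HTML parser treats <meta> as a void element, so it does not wait for a closing tag. lxml is a third-party package, so it has to be installed first (for example with pip install lxml).

Your Python script then becomes: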

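from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'  # placeholder for the address of the HTML page

http = urllib3.PoolManager()

# Fetch the page and parse the response body with lxml instead of html.parser
page = http.request('GET', URL)
soup = bs(page.data, 'lxml')

print(soup.prettify())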

The only change is replacing

soup = bs(page.data, 'html.parser')

with

soup = bs(page.data, 'lxml')
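
One way to check the result without a network request is to parse the sample markup from a string and test whether any meta tag ends up inside another. The sketch below is only illustrative: `SAMPLE_HEAD`, `pick_parser`, and `has_nested_meta` are hypothetical names, and it assumes lxml was installed separately, falling back to html.parser when it is not available.

```python
import importlib.util

from bs4 import BeautifulSoup

# The unclosed meta tags from the question, as an inline string.
SAMPLE_HEAD = """
<head>
  <meta charset="UTF-8">
  <meta name="description" content="Free Web tutorials">
  <meta name="keywords" content="HTML,CSS,XML,JavaScript">
  <meta name="author" content="John Doe">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
"""


def pick_parser():
    """Return 'lxml' when the lxml package is importable, else 'html.parser'."""
    return "lxml" if importlib.util.find_spec("lxml") else "html.parser"


def has_nested_meta(soup):
    """Return True if any <meta> tag contains another <meta> tag."""
    return any(meta.find("meta") is not None for meta in soup.find_all("meta"))


parser = pick_parser()
soup = BeautifulSoup(SAMPLE_HEAD, parser)
metas = soup.find_all("meta")

print(f"parser: {parser}")
print(f"meta tags found: {len(metas)}")
print(f"nested meta tags: {has_nested_meta(soup)}")
```

If `has_nested_meta` reports True with html.parser on your installation, that is the nesting shown in the question, and passing 'lxml' as the parser name is the change described above.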