Python BeautifulSoup4 does not print correctly (Python 3)


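I am currently learning Python 3 and scraping a site to get some data. That part works fine, but when it comes to printing the p tags I can't get it to behave the way I expect.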

import urllib
import lxml
from urllib import request
from bs4 import BeautifulSoup

# placeholder URL; urlopen needs an explicit scheme such as http://
data = urllib.request.urlopen('http://www.site.com').read()
soup = BeautifulSoup(data, 'lxml')
stat = soup.find('div', {'style': 'padding-left: 10px;'})
dialog = stat.findChildren('p')

childlist = []
for child in dialog:
    childtext = child.get_text()
    # have tried child.string as well (exactly the same result)
    childlist.append(childtext.encode('utf-8', 'ignore'))
    # have tried with str(childtext.encode('utf-8', 'ignore'))

print(childlist)

This all works, but what gets printed is bytes.

Real sample text with ascii encoding:

b"Announcementb'Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"

Note that "Announcement" is in the p tag itself, and the rest is in a strong tag inside that p.

The same sample with utf-8 encoding:

b"Announcement\xc2\xa0\xe2\x80\x93\xc2\xa0b'Firefox users may encounter browser warnings encountering SSL SHA-1 "

I would expect to get:

"Announcement"
(newline / new item in list)
"Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"

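As you can see, the incorrect characters are stripped with "ascii", but some of those characters break some of the line breaks, so I haven't figured out how to print this properly. Also, the b is still there.

I really can't work out how to remove the b and encode or decode correctly. I have tried every "solution" I could find by googling.

HTML content = utf-8

I would rather not change the full data before processing it, because that would mess up my other work, and I don't think it should be needed.

prettify doesn't work.

Any suggestions?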

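First, you are getting output of the form b'stuff' because you are calling .encode(), which returns a bytes object. If you want to print strings for reading, keep them as strings.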
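
To make the difference concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not taken from the asker's page; the string just contains the same non-breaking space and dash characters that show up in the utf-8 sample above.

```python
# Hypothetical sample text: "\xa0" is a non-breaking space, "\u2013" an en dash.
childtext = "Announcement\xa0\u2013\xa0Firefox users"

# Encoding turns the str into bytes, which print with a b'...' prefix
# and show non-ASCII characters as \x.. escapes.
encoded = childtext.encode("utf-8")
print(type(encoded).__name__)
print(encoded)

# Leaving the value as str prints readable text with no b prefix.
print(type(childtext).__name__)
print(childtext)

# If you already hold bytes, decode them back to str before printing.
roundtrip = encoded.decode("utf-8")
print(roundtrip == childtext)
```

In the question's loop this would amount to appending childtext itself instead of childtext.encode(...).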

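As a guess, I assume you want to print the strings from the HTML nicely, the way they would look in a browser. For that you need to decode the HTML escape sequences in the string, which in Python 3.5 means: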

import html
html.unescape(childtext)

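Among other things, this converts any &nbsp; sequence in the HTML string into a '\xa0' character, which prints as a space. However, if you want lines to be able to break at these characters, even though &nbsp; literally means "non-breaking space", you have to replace them with real spaces before printing, for example with x.replace('\xa0', ' ').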
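
Put together, the two steps might look like the sketch below. The input string is a hypothetical fragment chosen to resemble the sample in the question, and clean_text is a helper name introduced here only for illustration.

```python
import html


def clean_text(raw):
    """Unescape HTML entities, then turn non-breaking spaces into plain spaces."""
    unescaped = html.unescape(raw)          # '&nbsp;' -> '\xa0', '&ndash;' -> '\u2013'
    return unescaped.replace("\xa0", " ")   # plain spaces allow normal line wrapping


# Hypothetical input resembling the sample text in the question.
sample = "Announcement&nbsp;&ndash;&nbsp;Firefox users may encounter browser warnings"
cleaned = clean_text(sample)
print(cleaned)
```

The value stays a str throughout, so printing it never shows a b prefix.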
