Python BeautifulSoup Chinese character encoding error


python python-2.7 unicode encoding


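I'm trying to identify and save all of the headings on a specific site, and I keep getting what I believe are encoding errors.

The site is the page requested in the code below.

The current code is: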

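import urllib
from bs4 import BeautifulSoup

holder = {}

# Download the raw page bytes (Python 2.7)
url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
soup = BeautifulSoup(url, 'lxml')

# Collect every h1, h2 and h3 heading on the page
head1 = soup.find_all(['h1','h2','h3'])
print head1

holder["key"] = head1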

The printed output is:

[<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>]

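However, this gave me the same error mentioned in the comments ("AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'"). Removing the second ".BeautifulSoup" leads to a different error ("RuntimeError: maximum recursion depth exceeded while calling a Python object").

I also tried the answer suggested here:

by breaking up the creation of the object: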

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.515fa.com/che_1978.html")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)

But this also produced the recursion error. Any other tips would be greatly appreciated.

Thanks.

This may be a very simple solution, but I'm not sure it does exactly what you need, so please let me know:

import urllib
from bs4 import BeautifulSoup

holder = {}

url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
soup = BeautifulSoup(url, 'lxml')

head1 = soup.find_all(['h1','h2','h3'])

# Convert the result list to a unicode string before printing
print unicode(head1)

holder["key"] = head1

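Reference:

Decode using unicode-escape:

When writing to your CSV, you will need to encode the data to a utf-8 str:

You can do the encoding when you save the data into the dict.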
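As a rough sketch of the CSV step, the snippet below writes a dict of headings out as UTF-8. It assumes Python 3, where the csv module accepts text directly; on Python 2.7, as used in the question, each value would instead have to be encoded to a UTF-8 byte string first, as described above. The contents of `holder`, the column names `tag` and `text`, and the in-memory buffer are illustrative assumptions, not part of the original script.

```python
import csv
import io

# Hypothetical data: heading text already decoded to real Unicode strings.
holder = {
    "h3": "环境污染最小化 资源利用最大化",
    "h1": "天津滨海新区：楼在景中 厂在绿中",
}

# An in-memory text buffer stands in for a real file, which would be opened
# with open(path, "w", newline="", encoding="utf-8").
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["tag", "text"])
writer.writeheader()
for tag, text in holder.items():
    writer.writerow({"tag": tag, "text": text})

csv_text = buffer.getvalue()
csv_bytes = csv_text.encode("utf-8")  # the bytes that would end up on disk
print(csv_text)
```

Keeping the values as Unicode text until the moment of writing means the encoding is applied in exactly one place.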

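Thanks! Unfortunately, this gave me exactly the same output as before, so I still get \u1234 instead of the characters.

Oh man, this is so close! This prints the text, which gives me hope that the data is correct. However, when I try to add it to the dictionary, it reverts to unicode. I broke step 9 down a bit, so

g = soup.h3.text.encode("utf-8").decode("unicode-escape")

and then

print(g)

works fine. But when I try to add g to a dictionary named holder:

holder["key"] = g

and then

print holder

I get the unicode output again. Ultimately, I want to output the dictionary to CSV and make sure the text makes it through the whole chain correctly.

@user5356756, that is just the repr representation. Try printing the value itself from the dict and you should see the same result. Also, as noted at the end of the answer, you should really upgrade to bs4.

Gotcha, thanks! That works. I'm having trouble using DictWriter to convert the dictionary to CSV, but that is well beyond the scope of this question, so I'll do some research and open a new one if needed. As for bs4, the first line of my script (which I didn't copy above) is

from bs4 import BeautifulSoup

Do I need to do anything else to switch from 3 to 4?

Ah, OK, so you are using bs4. I wasn't sure when I saw BeautifulSoup.BeautifulSoup. If you print(bs4.__version__), what do you see?

Oh my god, you are the KING OF PYTHON, thank you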
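The repr point raised in the comments can be illustrated with a small sketch. It is written for Python 3, which is an assumption made so that it runs on a current interpreter; there the built-in `ascii()` is used to imitate how Python 2 displays a container, since Python 2's `print holder` shows the repr of each value with non-ASCII characters escaped.

```python
# Real Unicode text, comparable to what soup.h3.text returns in the answer.
g = "\u73af\u5883\u6c61\u67d3"  # the four characters 环境污染

holder = {"key": g}

# ascii() escapes non-ASCII characters, imitating what Python 2 shows
# when a whole dict or list is printed.
container_view = ascii(holder)
print(container_view)

# Printing the value itself shows the characters; the stored data never changed.
print(holder["key"])
```

The escapes are only a display format for the container, so the value stored in the dict is still the correct text.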

In [6]: from bs4 import BeautifulSoup

In [7]: h = """<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>"""

In [8]: soup = BeautifulSoup(h, 'lxml')

In [9]: print(soup.h3.text.decode("unicode-escape"))
环境污染最小化 资源利用最大化
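The same unicode-escape decoding can be sketched outside IPython. This version assumes Python 3, where `str` has no `.decode()` method, so the text is first turned into ASCII bytes; on Python 2.7 the `.decode("unicode-escape")` call shown above works directly on the string. The variable names are illustrative.

```python
# Text that contains literal backslash-u sequences (12 ASCII characters),
# not the Chinese characters themselves.
escaped = r"\u73af\u5883"

# Python 3 assumption: go through ASCII bytes, then interpret the escapes.
decoded = escaped.encode("ascii").decode("unicode_escape")

print(len(escaped), len(decoded))
print(decoded)
```

This step is only needed when the input really contains literal escape sequences; text that was decoded correctly in the first place does not need it.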

<meta http-equiv="content-language" content="utf-8" />

In [1]: from bs4 import BeautifulSoup

In [2]: import urllib

In [3]: url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()

In [4]: soup = BeautifulSoup(url.decode("utf-8"), 'lxml')

In [5]: print(soup.h3.text)
环境污染最小化 资源利用最大化
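The decode-before-parsing step can also be tried without a network connection. The sketch below assumes Python 3 and swaps BeautifulSoup for the standard library's `html.parser`, purely so that it runs without extra installs; `HeadingCollector` is a hypothetical helper and the HTML bytes are a made-up stand-in for the downloaded page.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of h1, h2 and h3 elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headings = []    # list of (tag, text) pairs
        self._current = None  # heading tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._current = tag
            self.headings.append((tag, ""))

    def handle_data(self, data):
        if self._current is not None:
            tag, text = self.headings[-1]
            self.headings[-1] = (tag, text + data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None


# Stand-in for urlopen(...).read(): raw UTF-8 bytes, not text.
raw = "<h3>环境污染最小化</h3><h1>天津滨海新区</h1><h2></h2>".encode("utf-8")

parser = HeadingCollector()
parser.feed(raw.decode("utf-8"))  # decode explicitly before parsing
parser.close()
headings = parser.headings
print(headings)
```

Decoding the bytes explicitly removes any guesswork about the page encoding before the parser sees the markup.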

 .decode("unicode-escape").encode("utf-8")