Python get_text() raises UnicodeEncodeError

I have the following HTML:

<div class="dialog">
<div class="title title-with-sort-row">
    <h2>Description</h2>
    <div class="dialog-search-sort-bar">
    </div>
</div>
<div class="content"><div style="margin-right: 20px; margin-left: 30px;">
    <span class="description2">
        With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. 
        She is made available under a Creative Commons License that gives endless opportunities for further development. 
        This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
        The result is a figure that has very good bending and morphing behavior.
        <br />
    </span>
</div>
</div>

I get:

<span class="description2">
    With Â“Antonia Polygon Â– StandardÂ”, you have a figure that is unique in the Poser community. 
    She is made available under a Creative Commons License that gives endless opportunities for further development. 
    This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
    The result is a figure that has very good bending and morphing behavior.
    <br/>
</span>

I then get a UnicodeEncodeError: 'ascii' codec can't encode character u'\x93'.

How can I convert this piece of HTML to plain ASCII?

#!/usr/bin/env python
# -*- coding: utf-8 -*-

foo = u'With Â“Antonia Polygon Â– StandardÂ”, you have a figure that is unique in the Poser community.She is made available under a Creative Commons License that gives endless opportunities for further development. This figure was developed by a group of talented members of the Poser community in a thirty-month effort. The result is a figure that has very good bending and morphing behavior.'

# Python 2 syntax: 'ignore' drops every character that cannot be encoded as ASCII
print foo.encode('ascii', 'ignore')

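There are three things to note.

The first is the 'ignore' argument to the encode method. It tells the method to drop any character that is not in the chosen encoding (ascii in this case, to be safe).

Second, we explicitly make foo a unicode string by prefixing the string literal with u.

The third is the explicit file encoding directive:

# -*- coding: utf-8 -*-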

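Also, if you haven't read Daenyth's very good point in the comments on this answer, you're a silly goose: if the output is going to be used in HTML/XML, you can pass 'xmlcharrefreplace' instead of 'ignore' above.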
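To make the difference between the error handlers concrete, here is a minimal sketch. It is written in Python 3 syntax (the answer's own code is Python 2), and the shortened sample string and the variable names are illustrative assumptions, not part of the original answer.

```python
# -*- coding: utf-8 -*-
# Sketch: compare the error handlers accepted by str.encode (Python 3 syntax).
# The sample text uses real curly quotes and an en dash, written as escapes.
text = u'With \u201cAntonia Polygon \u2013 Standard\u201d'

dropped = text.encode('ascii', 'ignore')             # silently drops non-ASCII characters
replaced = text.encode('ascii', 'replace')           # substitutes '?' for each of them
as_refs = text.encode('ascii', 'xmlcharrefreplace')  # emits numeric character references

print(dropped)   # b'With Antonia Polygon  Standard'
print(replaced)  # b'With ?Antonia Polygon ? Standard?'
print(as_refs)   # b'With &#8220;Antonia Polygon &#8211; Standard&#8221;'
```

With 'xmlcharrefreplace' the result is still pure ASCII, but a browser or XML parser renders the original characters, so nothing is lost when the output goes back into HTML.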

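Comments:

The character “ is not an ASCII character. Is your goal to identify the most similar ASCII character (which is hard), or simply to remove all non-ASCII characters? Or is what you really want to output proper Unicode, e.g. UTF-8, rather than ASCII?

Just remove all non-ASCII characters.

Using xmlcharrefreplace as the second argument would be much better in this case, since he is dealing with HTML.

Yes, I agree. I was just being lazy.

Since the OP said in a comment that he just wants to remove all the misbehaving characters. Still, it is worth mentioning in case others with a similar problem come across this.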
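If the real goal is correct Unicode output rather than ASCII, as the first comment suggests, one option is to encode to UTF-8 explicitly. The following is only a sketch in Python 3 syntax; the sample string, the temporary directory, and the file name description.txt are assumptions made for illustration.

```python
import io
import os
import tempfile

# Sample text with real curly quotes and an en dash (illustrative only).
text = u'With \u201cAntonia Polygon \u2013 Standard\u201d'

# Encode explicitly when raw bytes are needed...
utf8_bytes = text.encode('utf-8')

# ...or let io.open handle the encoding when writing to a file.
path = os.path.join(tempfile.mkdtemp(), 'description.txt')  # hypothetical file name
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(text)

# Reading back with the same encoding returns the original text unchanged.
with io.open(path, 'r', encoding='utf-8') as handle:
    round_tripped = handle.read()
```

Nothing is dropped or replaced this way, which avoids the UnicodeEncodeError without losing the quotation marks and dashes.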

The call from the question:

description = description.get_text()
