Python 如何处理utf-8编码字符串和美化组？_Python_Beautifulsoup

Python 如何处理utf-8编码字符串和美化组？

python

Python 如何处理utf-8编码字符串和美化组？,python,beautifulsoup,Python,Beautifulsoup,如何用正确的unicode替换unicode字符串中的HTML实体 u'"HAUS Kleider" - Über das Bekleiden und Entkleiden, das Verh&Yuml;llen und Veredeln' 到编辑事实上，实体是错误的。至少看起来很美，我们把它搞砸了所以问题是：如何处理utf-8编码的字符串和美化组？ from BeautifulSoup import BeautifulSou

如何用正确的unicode替换unicode字符串中的HTML实体

u'&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden und Entkleiden, das Verh&Yuml;llen und Veredeln'

到

编辑
事实上，实体是错误的。至少看起来很美，我们把它搞砸了

所以问题是：如何处理utf-8编码的字符串和美化组？

from BeautifulSoup import BeautifulSoup

f = open('path_to_file','r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
allArticles = []
for row in rows:
    l =[]
    for r in row.findAll('td'):
            l += [r.string] # here things seem to go wrong
    allArticles+=[l]

Ü->&Yuml而不是Ü
>>> soup.originalEncoding
'utf-8'

但是我不能生成一个合适的unicode字符串，我想你需要的是。我认为有一种方法可以将HTML实体翻译成Unicode
请尝试使用音译器idHex/XML任何符合您要求的
。在演示页面上，您可以选择“插入样本：复合”，然后在“复合1”框中输入Hex/XML Any
，在框中添加一些输入数据，然后按“转换”。有帮助吗
有一个Python ICU绑定，但我认为它没有得到很好的处理。
htmlentitydefs.entitydefs[“quot”]返回
“

这是一本将实体转换为其实际字符的词典。
从这一点开始，您应该能够轻松地继续。

好的，我必须承认，这个问题很愚蠢。我正在交互式解释器中处理一个旧版本的

行

。我不知道它的内容有什么问题，但这是正确的代码：

from BeautifulSoup import BeautifulSoup

f = open('path_to_file','r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
rows = soup.findAll('tr')
allArticles = []
for row in rows:
    l =[]
    for r in row.findAll('td'):
        l += [r.string]
    allArticles+=[l]

真可耻！

可能的重复似乎出了问题？BeautifulSoup搞错了？实体是错误的？请尝试提供更精确的细节以回答此问题。BeautifulSoup倾向于处理UTF-8非常好。如果BeautifulSoup会给我正确的实体。请参阅我的编辑

from BeautifulSoup import BeautifulSoup

f = open('path_to_file','r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
rows = soup.findAll('tr')
allArticles = []
for row in rows:
    l =[]
    for r in row.findAll('td'):
        l += [r.string]
    allArticles+=[l]