Python: Convert HTML entities to Unicode and vice versa


How do I convert HTML entities to Unicode, and vice versa, in Python?

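You need to have BeautifulSoup installed (this answer uses the BeautifulSoup 3 API under Python 2).

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;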

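As for the "vice versa" (which I needed myself, and which led me to this question, where I did not find it answered): u'some string'.encode('ascii', 'xmlcharrefreplace') returns a plain string in which every non-ASCII character has been converted to an XML (HTML) numeric character reference.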
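To make that behaviour concrete, here is a minimal sketch, assuming Python 3 (where the result is a bytes object rather than a Python 2 str). The sample string is made up for illustration.

```python
# Sample text chosen for illustration; any str with non-ASCII characters works.
text = 'Curé & co: 5 €'

# The 'xmlcharrefreplace' error handler swaps every character that ASCII
# cannot represent for a decimal character reference (&#NNN;).
encoded = text.encode('ascii', 'xmlcharrefreplace')

# In Python 3 the result is bytes; decode it to get a str back.
as_str = encoded.decode('ascii')
print(as_str)
```

Note that the bare & is left untouched: encode() only deals with characters outside ASCII, so markup-significant characters still need a separate escaping step.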

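As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings, but note that quote escaping is off by default in that function (quote=False), so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, however, the function does not escape single quotes ("'"). Because of these issues the function has been deprecated since version 3.2 (and it was removed in Python 3.8).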

The suggested replacement for cgi.escape is html.escape (new in version 3.2).

In addition, html.unescape has been introduced (in version 3.4).
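The difference in quote handling can be seen in a short sketch, assuming Python 3.2 or later. The sample markup is invented for illustration.

```python
import html

# Sample markup containing both kinds of quotes and an ampersand.
snippet = '<a title="Tom\'s page">R&D</a>'

# Default (quote=True): &, <, > and both quote characters are escaped.
escaped_all = html.escape(snippet)

# quote=False: only &, < and > are escaped; quotes are left alone.
escaped_text_only = html.escape(snippet, quote=False)

print(escaped_all)
print(escaped_text_only)
```

Unlike cgi.escape, html.escape escapes quotes by default and also covers the single quote, which matters when the text ends up inside an attribute value.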

So in Python 3.4 and later you can:

Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.

And
```
html.unescape(text)
```
to convert HTML entities back to their plain-text representation.
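Put together, a round trip might look like the following sketch, assuming Python 3.4 or later. The function names to_html_entities and from_html_entities are hypothetical helpers, not part of any library.

```python
import html

def to_html_entities(text):
    """Escape markup characters, then turn non-ASCII into numeric references."""
    return html.escape(text).encode('ascii', 'xmlcharrefreplace').decode('ascii')

def from_html_entities(text):
    """Resolve named and numeric character references back to characters."""
    return html.unescape(text)

original = '&, ®, ¢, £, ¥, €, §, ©'
escaped = to_html_entities(original)
restored = from_html_entities(escaped)

print(escaped)
print(restored == original)
```

The escaped form is pure ASCII, so it can be stored or transmitted through channels that are not Unicode-safe and decoded again later.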

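Update for Python 2.7 and BeautifulSoup4

Unescape (HTML with entities to Unicode) with HTMLParser (Python 2.7 standard library).

Unescape (HTML with entities to Unicode) with bs4 (BeautifulSoup4).

Escape (Unicode to HTML with entities) with bs4.dammit.EntitySubstitution (BeautifulSoup4).

The corresponding interactive sessions appear, in this order, among the code listings at the end of this page.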

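I used the following function to convert Unicode text taken from an xls file into an HTML file while preserving the special characters found in the xls file: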

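import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary mode
        . if dat is empty it is replaced with a non-breaking space
        . non-ascii characters are translated to xml character references
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # Pure-ASCII data is written unchanged (and therefore unescaped).
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # Escape markup characters, then replace non-ASCII with &#NNN;.
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))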
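A possible variant, which is not the author's code, escapes every non-empty value so that ASCII text containing < or & is also safe to embed. This is a sketch assuming Python 3; the name html_wr_escaped is hypothetical and io.BytesIO stands in for a file opened in binary mode.

```python
import html
import io

def html_wr_escaped(f, dat):
    """Write dat to binary file f, always escaping markup characters."""
    if not dat:
        f.write(b'&nbsp;')  # placeholder for empty cells, written as-is
        return
    f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))

buf = io.BytesIO()  # stands in for a file opened with mode 'wb'
for cell in ['Curé', '', 'a < b']:
    html_wr_escaped(buf, cell)
    buf.write(b'\n')

print(buf.getvalue())
```

Dropping the try/except means the output no longer depends on whether a value happens to be pure ASCII.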

Hope this is useful to somebody.

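If someone like me is wondering why some entity numbers (codes) such as &#153; (for the trademark symbol) and &#128; (for the euro symbol) are not encoded properly, the reason is that ISO-8859-1 does not define printable characters at those positions; they come from Windows-1252, which is often confused with ISO-8859-1.

Also note that the default character set for HTML5 is UTF-8, whereas for HTML4 it was ISO-8859-1.

So we have to work around the problem somehow (by finding and replacing those references first).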
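For what it is worth, on Python 3.4 and later html.unescape seems to handle this remapping on its own, since it follows the HTML5 parsing rules. The sketch below illustrates that, together with the manual equivalent of reading the number as a Windows-1252 byte; the variable names are only illustrative.

```python
import html

# html.unescape follows the HTML5 rules, which remap references in the
# 128-159 range as if they were Windows-1252 bytes.
euro = html.unescape('&#128;')
trademark = html.unescape('&#153;')

# The same mapping done by hand: treat the number as a cp1252 byte.
euro_by_hand = bytes([128]).decode('cp1252')

# Escaping in the other direction yields the real Unicode code points.
reencoded = (euro + trademark).encode('ascii', 'xmlcharrefreplace')

print(euro, trademark, reencoded)
```

When generating HTML, emitting the real code points (&#8364; and &#8482;) avoids the ambiguity altogether.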

Reference (starting point): Mozilla's documentation.

HTML only strictly requires & (ampersand) and < (less-than sign) to be escaped.

For Python 3, use html.unescape():
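@Jarret Hardie: Actually, "show and tell" is perfectly fine in this context. From the first entry of the FAQ: "It's also perfectly fine to ask and answer your own programming question". That said, searching for duplicates is also encouraged.

I post questions that I have answered for myself in the past so that other users searching for similar answers can find them.

This can also be done without external libraries.

+1, he is contributing to the data set.

This question has a wider scope than the one the "duplicate" link points to: it also asks for the "vice versa", that is, from Unicode to HTML entities.

I had forgotten about xmlcharrefreplace; this is very helpful. Whenever I need to store encoded or non-ASCII characters safely in MySQL, I find I need this method.

This does not work for a string containing the Unicode character U+2019, whose HTML entity equivalent is &#8217;.

Isn't this different from what the question asks (this answer converts to ASCII, which is a subset of Unicode)?

text.decode('utf-8').encode('ascii', 'xmlcharrefreplace')

@MikeS It works fine: >>> u'\u2019'.encode('utf-8').decode('utf-8').encode('ascii', 'xmlcharrefreplace') gives '&#8217;'.

The BeautifulSoup API has changed. Please see the latest version.

@hekevintran: Is it possible to print '&cent;, &pound;, &yen;, &euro;, &sect;, &copy;' instead of '&#162;, &#163;, &#165;, &#8364;, &#167;, &#169;'? Any ideas?

This answer badly needs an update for Python 3.

In Python 2.7 you can use HTMLParser.unescape(text).

Upvote for showing a standard-library solution with no dependencies.

Revising: I just saw @bobince's comment on the question. Since htmlparser is now documented, and since that comment is not prominent, I am leaving that part in the answer.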
u'some string'.encode('ascii', 'xmlcharrefreplace')

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
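
One of the comments asks whether named entities can be produced instead of numeric references. A standard-library sketch of that idea, assuming Python 3 and the HTML 4 table in html.entities, is shown below; the function name escape_with_named_entities is hypothetical. Characters with no named entity fall back to a decimal reference.

```python
from html.entities import codepoint2name

def escape_with_named_entities(text):
    """Replace characters with named entities where one exists."""
    pieces = []
    for ch in text:
        cp = ord(ch)
        if cp in codepoint2name:
            # Covers &, <, > and " as well as accented letters and symbols.
            pieces.append('&%s;' % codepoint2name[cp])
        elif cp > 127:
            # No named entity available: use a decimal reference instead.
            pieces.append('&#%d;' % cp)
        else:
            pieces.append(ch)
    return ''.join(pieces)

sample = 'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
escaped = escape_with_named_entities(sample)
symbols = escape_with_named_entities('¢, £, ¥, €, §, ©')
print(escaped)
print(symbols)
```

Because the table also contains &, <, > and the double quote, this helper does the markup escaping in the same pass and should not be combined with html.escape.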

$ python3 -c "
> import html
> print(
>     html.unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python3 -c "
> import html
> print(
>     html.escape('&©—')
> )"
&amp;©—

$ python2 -c "
> from HTMLParser import HTMLParser
> print(
>     HTMLParser().unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python2 -c "
> import cgi
> print(
>     cgi.escape('&©—')
> )"
&amp;©—

#!/usr/bin/env python3
import fileinput
import html

for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))

import html
s = "&amp;"
decoded = html.unescape(s)
# &