Decoding HTML entities in Python

I have a file that contains lines such as:

StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4

Corresponding to this line, I have files on disk that are saved under the decoded form of the name:

StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4

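I need to take the file name from the first list and the correct file name from the second, and rename the file to the second name. To do that, I thought I needed to decode the HTML entities in the file name, so I tried this: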

import os
from html.parser import HTMLParser

fpListDwn = open('listDwn', 'r')

for lineNumberOnList, fileName in enumerate(fpListDwn):
    print(HTMLParser().unescape(fileName))

But this has no effect when I run it. The names are printed unchanged; part of the output of one run:

meysampg@freedom:~/Downloads/Practical Machine Learning$ python3 changeName.py
StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4

StatsLearning_Lect1_2b_111213_v2_%5BLvaTokhYnDw%5D_%5Btag22%5D.mp4

StatsLearning_Lect3_4a_110613_%5BWjyuiK5taS8%5D_%5Btag22%5D.mp4

StatsLearning_Lect3_4b_110613_%5BUvxHOkYQl8g%5D_%5Btag22%5D.mp4

StatsLearning_Lect3_4c_110613_%5BVusKAosxxyk%5D_%5Btag22%5D.mp4

How can I fix this?

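This is actually percent-encoding (URL encoding), not HTML entity encoding. Sequences such as %5B and %5D are URL escapes for "[" and "]", so an HTML entity decoder leaves them untouched.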

Basically, you want to use urllib.parse.unquote instead:

from urllib.parse import unquote
unquote('StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4')

Out[192]: 'StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4'
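
To make the difference concrete, here is a minimal sketch that runs both decoders on the same name. It uses html.unescape, the standard-library function that replaces HTMLParser().unescape (which was removed in Python 3.9). The variable names are only illustrative.

```python
import html
from urllib.parse import unquote

encoded_name = "StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4"

# html.unescape only rewrites HTML entities such as &amp; or &#91;,
# so a percent-encoded name passes through unchanged.
html_result = html.unescape(encoded_name)

# unquote rewrites %XX escapes: %5B becomes "[" and %5D becomes "]".
url_result = unquote(encoded_name)

print(html_result == encoded_name)
print(url_result)
```

An HTML decoder would only help if the names contained entities like &#91; and &#93;, which is not the case in this list.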

I think you should use urllib.parse instead of html.parser:

>>> f="StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4"
>>> import urllib.parse as parse
>>> f
'StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4'
>>> parse.unquote(f)
'StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4'

So your script should look like this:

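import urllib.parse as parse

# Read the list of percent-encoded names and print each one decoded.
with open('listDwn', 'r') as fpListDwn:
    for fileName in fpListDwn:
        # Strip the trailing newline so print() does not emit a blank line.
        print(parse.unquote(fileName.rstrip('\n')))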
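
Since the end goal in the question is renaming files, the decoding step could be extended along the following lines. This is only a sketch under assumptions: the helper names decoded_pairs and rename_encoded_files are hypothetical, and it assumes the files to rename currently carry the encoded name and sit in a single directory. If the rename should go the other way, swap source and target.

```python
import os
from urllib.parse import unquote


def decoded_pairs(lines):
    """Yield (encoded, decoded) name pairs, skipping blank lines."""
    for line in lines:
        encoded = line.strip()
        if encoded:
            yield encoded, unquote(encoded)


def rename_encoded_files(list_path, directory):
    """Rename files in `directory` from their encoded to their decoded name.

    Returns the decoded names of the files that were actually renamed.
    """
    renamed = []
    with open(list_path, "r") as listing:
        for encoded, decoded in decoded_pairs(listing):
            source = os.path.join(directory, encoded)
            target = os.path.join(directory, decoded)
            # Rename only when the source exists and nothing would be overwritten.
            if os.path.isfile(source) and not os.path.exists(target):
                os.rename(source, target)
                renamed.append(decoded)
    return renamed
```

A call such as rename_encoded_files('listDwn', '.') would process the list from the question against the current directory. Names that have no matching file, or whose decoded name already exists, are skipped.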