Decoding HTML entities in Python


I have a file that contains lines like this:

StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4
Corresponding to these lines, I have files on disk, but they are saved under the decoded names:

StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4
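I need to take each file name from the first list, take the corrected file name from the second, and rename the file to the second name. To do that I need to decode the HTML entities in the file name, so I do something like this: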

import os
from html.parser import HTMLParser

fpListDwn = open('listDwn', 'r')

for lineNumberOnList, fileName in enumerate(fpListDwn):
    print(HTMLParser().unescape(fileName))
But this has no effect when I run it; the output of a run looks like this:

meysampg@freedom:~/Downloads/Practical Machine Learning$ python3 changeName.py
StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4

StatsLearning_Lect1_2b_111213_v2_%5BLvaTokhYnDw%5D_%5Btag22%5D.mp4

StatsLearning_Lect3_4a_110613_%5BWjyuiK5taS8%5D_%5Btag22%5D.mp4

StatsLearning_Lect3_4b_110613_%5BUvxHOkYQl8g%5D_%5Btag22%5D.mp4

StatsLearning_Lect3_4c_110613_%5BVusKAosxxyk%5D_%5Btag22%5D.mp4
How can I fix this?

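This is actually "percent-encoding" (URL encoding), not HTML encoding, which is why the HTML parser leaves the names unchanged. See the following question: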

Basically, you want to use urllib.parse.unquote instead:

from urllib.parse import unquote
unquote('StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4')

Out[192]: 'StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4'
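
To make the difference concrete, here is a small sketch that runs both decoders on the same input. The helper name `decode_both` is made up for illustration, and I use `html.unescape` here rather than `HTMLParser().unescape`, since the latter was deprecated and then removed in Python 3.9.

```python
from html import unescape
from urllib.parse import unquote


def decode_both(text):
    """Return (html_decoded, percent_decoded) for the same input string."""
    # unescape() only touches HTML entities such as &#91; or &amp;
    # unquote() only touches percent escapes such as %5B or %20
    return unescape(text), unquote(text)


# A percent-encoded name: only unquote() changes it.
print(decode_both('%5Btag22%5D'))

# An HTML-entity-encoded name: only unescape() changes it.
print(decode_both('&#91;tag22&#93;'))
```

So if the names in your list use `%XX` escapes, the HTML decoder will hand them back untouched, which matches the output you are seeing.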

I think you should use urllib.parse instead of html.parser:

>>> f="StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4"
>>> import urllib.parse as parse
>>> f
'StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4'
>>> parse.unquote(f)
'StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4'
So your script should look like this:

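import os
import urllib.parse as parse

# The with block closes the list file once the loop is done.
with open('listDwn', 'r') as fpListDwn:
    for lineNumberOnList, fileName in enumerate(fpListDwn):
        # Strip the trailing newline so print() does not emit blank lines.
        print(parse.unquote(fileName.strip()))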
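
If the next step is to line the list up with the files on disk, something like the following might help. It is only a sketch: the function name `match_listed_files` and the `directory` argument are my own, and I am assuming each line of `listDwn` holds one percent-encoded name and that the files on disk use the decoded name, as described in the question. It only reports matches; the actual `os.rename` call is left out because the question does not make clear which name should be the final one.

```python
import os
from urllib.parse import unquote


def match_listed_files(list_path, directory):
    """Return (encoded, decoded, exists) for each non-empty line of list_path.

    Hypothetical helper: 'exists' says whether a file with the decoded
    name is present in 'directory'.
    """
    results = []
    with open(list_path, 'r') as fp:
        for line in fp:
            encoded = line.strip()  # drop the trailing newline
            if not encoded:
                continue  # skip blank lines
            decoded = unquote(encoded)
            exists = os.path.isfile(os.path.join(directory, decoded))
            results.append((encoded, decoded, exists))
    return results
```

You could then call `match_listed_files('listDwn', '.')` and loop over the result, renaming only the entries where `exists` is true.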