Python: trying to convert strings to unicode to load into a UTF-8 XML file

python xml unicode utf-8

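I'm building an EPG scraper that creates a UTF-8 encoded XML file. Everything works, except that I'm having difficulty encoding all the pieces of the string I stitch together into a unicode string for loading into my file.

My code is as follows: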

starttime = datetime.strptime(' '.join([str(now.year).encode('UTF-8'), str(e[4].encode('UTF-8')), str(e[0].encode('UTF-8'))]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')
endtime = datetime.strptime(' '.join([str(now.year).encode('UTF-8'), str(e[4].encode('UTF-8')), str(e[1].encode('UTF-8'))]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')

global epg_data

clean_channel = str(channel.encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
clean_e2 = str(e[2].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
clean_e3 = str(e[3].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
div_list3 = div_list2.encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;')
e5 = str(e[5].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))

epg_data = ''.join([u'<programme start="',starttime,u' +0100" stop="',endtime,u' +0100" channel="',clean_channel,u'">\n', \
u'<title lang="eng">',e5,u'</title>\n<desc lang="eng">',clean_e2,' ',clean_e3,u'</desc>\n<icon src="',div_list3,u'" />\n', \
u'<country>UK</country>\n</programme>'])

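I'm having problems trying to parse the following (printed to IDLE):

Hustle
Tiger Troubles: Series 6, Episode 3/6. When a notorious hard man demands £500,000 from Albert by the end of the week, the team try to raise the funds by targeting a playboy who owns a hugely valuable golden tiger. Emma is sent to persuade the owner to loan the item to a major museum in the hope that the gang can steal it, but an impenetrable vault causes trouble. Guest starring former Doctor Who Colin Baker and Lolita Chakrabarti: 8.2
UK

The error generated is as follows: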

Traceback (most recent call last):
  File "G:\Python27\Kodi\Sky TV Guide Scraper.py", line 332, in soup_to_text
    u'<country>UK</country>\n</programme>'])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 75: ordinal not in range(128)

I've got a bit lost trying to sort this out, so I would be really grateful for some help.

Thanks

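Unicode support is rather confusing in Python 2. It's among the top 50 reasons to move to Python 3. Encoding a str or unicode to utf-8 returns a str object that is indistinguishable from a regular ASCII string. You just have to remember that it's encoded. str(channel.encode('utf-8')) is a bit redundant (it's already a str, so the str(...) part is unnecessary).

Calling ''.join([u'...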
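
The 0xc2 byte in the traceback fits this explanation. One plausible source, assuming the scraped description contained a pound sign, is that UTF-8 encodes that character as the two bytes 0xC2 0xA3. In Python 2, joining such a byte str with u'' pieces makes the interpreter decode it implicitly as ASCII, which fails. The sketch below performs that decode explicitly so it behaves the same on Python 2 and 3; the sample text and variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; the real scraped description is not shown in full.
desc = u'demands \u00a3500,000 from Albert'

# Encoding produces bytes; the pound sign becomes the two bytes 0xC2 0xA3.
encoded = desc.encode('utf-8')
assert b'\xc2\xa3' in encoded

# Python 2 performs this ASCII decode implicitly when a byte str is joined
# with unicode pieces; doing it explicitly shows the same failure.
try:
    encoded.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Decoding with the codec that was actually used round-trips cleanly.
assert encoded.decode('utf-8') == desc
```

The printed message names the same codec and byte as the traceback in the question, only at a different position.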
How do you save the file? How do you parse it later? It looks like the parser is trying ascii. Try adding an XML header line ("...\n") to the top of the xml. What is your reason for using Python 2? Python 3 has been around for nearly a decade and has better unicode support.

The correct header is added later in the program, but the problem occurs well before that. The issue is trying to convert ascii strings to unicode, as above. Any ideas on how to fix it?

Actually, I believe adding this line of code has solved my problem: epg_data2 = unicode(epg_data, 'UTF-8')

You are not converting ascii to unicode, you are converting utf-8 encoded binary to unicode. The fix works because it decodes the utf-8 encoded string. But you shouldn't be doing all that work. Remove the utf-8 stuff from the original code and use unicode throughout.
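
Following that last piece of advice, one way to stay in unicode is to build the element from text pieces only and encode once at the very end. This is a sketch, not the asker's code: build_programme and the sample values are hypothetical, and it swaps the hand-written .replace() chain for escape and quoteattr from the standard library's xml.sax.saxutils.

```python
# -*- coding: utf-8 -*-
from xml.sax.saxutils import escape, quoteattr


def build_programme(start, stop, channel, title, desc, icon):
    """Return one <programme> element as text (unicode); no bytes involved."""
    return u''.join([
        # quoteattr escapes the value and adds the surrounding quotes.
        u'<programme start=', quoteattr(start + u' +0100'),
        u' stop=', quoteattr(stop + u' +0100'),
        u' channel=', quoteattr(channel), u'>\n',
        # escape handles &, < and > in element content.
        u'<title lang="eng">', escape(title), u'</title>\n',
        u'<desc lang="eng">', escape(desc), u'</desc>\n',
        u'<icon src=', quoteattr(icon), u' />\n',
        u'<country>UK</country>\n</programme>',
    ])


# Hypothetical sample values standing in for the scraped fields.
xml_text = build_programme(
    start=u'20150612210000',
    stop=u'20150612220000',
    channel=u'Drama & Crime',
    title=u'Hustle',
    desc=u'A hard man demands \u00a3500,000 from Albert.',
    icon=u'http://example.com/icon.png',
)

# Encode exactly once, at the boundary where bytes are needed.
xml_bytes = xml_text.encode('utf-8')
print(xml_text)
```

Because quoteattr supplies the quotes itself, the template pieces end in = rather than =". When writing to disk, io.open(path, 'w', encoding='utf-8') accepts the text directly, so the separate .encode() call is only needed when something downstream expects bytes.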