Python: How do I handle ® in a URL passed to urllib2.urlopen?


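I received a URL, https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions, which came from BeautifulSoup: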

url=u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
I want to feed it back into urllib2.urlopen:

import urllib2
source = urllib2.urlopen(url).read()
The error I get is:

UnicodeEncodeError: 'gbk' codec can't encode character u'\xae' in position 43: illegal multibyte sequence
So I tried:

source = urllib2.urlopen(url.encode("utf-8")).read()
This fetches the page source, but it is not the same as what the original URL returns:

originalUrl = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions'
originalSource = urllib2.urlopen(originalUrl).read()
originalSource == source

The result is False. Is there a way to fix this URL? How do I convert u'\xae' back to the original ®?

The URL must be a valid byte string, with non-ASCII code points properly encoded. You need to encode it to UTF-8 and then URL-quote the path of the URL:

import urllib
import urllib2
import urlparse

originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
encoded_link = parsed_link.geturl()
source = urllib2.urlopen(encoded_link).read()
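
The code above is Python 2. As a rough sketch of the same two steps (UTF-8 encode, then percent-encode only the path), here is how it might look on Python 3, assuming the standard library's urllib.parse module. The helper name quote_url_path is hypothetical and not part of the original answer, and the sketch does not make the network request.

```python
from urllib.parse import quote, urlsplit


def quote_url_path(url):
    """Percent-encode only the path component of `url` (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() encodes non-ASCII characters as UTF-8 by default and keeps '/' as-is.
    return parts._replace(path=quote(parts.path)).geturl()


url = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'

# '\xae' is the same character as '®'; its UTF-8 encoding is the two bytes C2 AE.
registered_bytes = '\xae'.encode('utf-8')

encoded_link = quote_url_path(url)
print(registered_bytes)
print(encoded_link)
```

Note that a path which already contains percent-escapes would be escaped a second time by this sketch, because quote() also encodes '%'.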
Demo (Python 2 interactive session):

>>> import urllib
>>> import urllib2
>>> import urlparse
>>> originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
>>> parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
>>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
>>> encoded_link = parsed_link.geturl()
>>> encoded_link
'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp%C2%AE-75-desktop-virtualization-solutions'
>>> source = urllib2.urlopen(encoded_link).read()
>>> len(source)
68758

Is there another simple way to handle the whole URL, rather than just the URL path?

Not sure what you mean; if you tried to apply urllib.quote to the whole URL, the wrong things would be encoded, such as the colon.

@Martijin, thanks. You have answered my question: just use urllib.quote to encode the URL path.

This doesn't seem right. I was able to pass http://ru.wikipedia.org/wiki/Сачаааааааааааааааааааа1072.

Wikipedia supports being sent URLs encoded as UTF-8 without proper URL quoting. That goes beyond what is required, and you can't count on all servers doing the same.
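
The comment about applying quote to the whole URL can be checked with a small sketch. It uses Python 3's urllib.parse.quote as a stand-in for Python 2's urllib.quote, which is an assumption on my part, and it makes no network request. The widened safe set in the second call is only an illustration, not something the answer recommends.

```python
from urllib.parse import quote

url = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'

# Quoting the whole URL with the default safe='/' also escapes the ':' after the scheme.
whole_quoted = quote(url)
print(whole_quoted)

# Illustrative alternative: also treat ':' as safe. This happens to work for this
# particular URL, but it would also percent-encode a non-ASCII host name, which is
# not how host names are encoded, so quoting only the path is the more careful choice.
loosely_quoted = quote(url, safe=':/')
print(loosely_quoted)
```

The first result starts with https%3A//, which urlopen cannot treat as an absolute https URL; that is the "wrong things get encoded" problem the comment describes.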