Python: How to fetch a non-ASCII URL with urlopen?


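I need to fetch data from a URL that contains non-ASCII characters, but urllib2.urlopen refuses to open the resource and raises:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128)

I know the URL is not standards-compliant, but I have no way to change it.

How can I use Python to access a resource whose URL contains non-ASCII characters?

Edit: in other words, can urlopen open a URL like the following, and if so, how?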

http://example.org/Ñöñ-ÅŞÇİİ/

Encode the unicode string to UTF-8, then URL-encode it.
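As a minimal sketch of those two steps, assuming Python 3 (where `urllib.parse.quote` accepts bytes) rather than the Python 2 `urllib2` setup in the question; the helper name `percent_encode_path` is made up for illustration:

```python
from urllib.parse import quote


def percent_encode_path(path):
    """Encode a text path as UTF-8 bytes, then percent-encode those bytes.

    Hypothetical helper, not part of any library. Slashes are left as-is
    so the path structure survives.
    """
    utf8_bytes = path.encode("utf-8")      # step 1: unicode -> UTF-8 bytes
    return quote(utf8_bytes, safe="/")     # step 2: bytes -> %XX escapes


print(percent_encode_path("/unicod\u00e8/"))
```

Only the path is handled here; a non-ASCII hostname needs different treatment.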

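Strictly speaking, URIs can't contain non-ASCII characters; what you have is an IRI.

To convert an IRI to a plain ASCII URI:

non-ASCII characters in the hostname part of the address have to be encoded using the IDNA algorithm;
non-ASCII characters in the path, and most of the other parts of the address, have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.

So: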
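What follows is a Python 3 sketch of the two rules above, not the answer's own listing. The function name `iri_to_uri` is hypothetical, and the sketch assumes the IRI carries no `user:pass@` prefix (any such prefix is dropped):

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert an IRI to an ASCII URI (illustrative sketch only)."""
    parts = urlsplit(iri)

    # Rule 1: only the hostname is IDNA-encoded; the port is re-attached.
    netloc = parts.netloc
    if parts.hostname:
        host = parts.hostname.encode("idna").decode("ascii")
        netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Rule 2: everything else is UTF-8 + %-encoded. "%" is kept safe so
    # escapes that are already present are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://b\u00fccher.ch/unicod\u00e8"))
```

The query is encoded as UTF-8 here, which is an assumption that does not hold for every server.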

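(Technically this still isn't quite good enough in the general case, because urlparse doesn't split away any user:pass@ prefix or :port suffix on the hostname. Only the hostname part should be IDNA-encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)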
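To illustrate encoding at construction time, here is a small Python 3 sketch. The `build_url` helper and its parameters are hypothetical, and `urllib.parse.quote` stands in for the Python 2 `urllib.quote` mentioned above:

```python
from urllib.parse import quote


def build_url(host, path, scheme="http", port=None):
    """Assemble an ASCII URL from already-separated components.

    Hypothetical helper: because the pieces arrive separately, the
    hostname can be IDNA-encoded without touching the port.
    """
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host if port is None else f"{ascii_host}:{port}"
    # quote() encodes text as UTF-8 by default and keeps "/" unescaped.
    return f"{scheme}://{netloc}{quote(path)}"


print(build_url("b\u00fccher.ch", "/unicod\u00e8", port=8080))
```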

Use the iri2uri method of httplib2. It does the same thing as bobince's code (is he/she the author of that?).

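Python 3 has libraries to handle this situation. Use urllib.parse.urlsplit to split the URL into its components, urllib.parse.quote to properly quote/escape the unicode characters, and urllib.parse.urlunsplit to join it back together: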

>>> import urllib.parse
>>> url = 'http://example.com/unicodè'
>>> url = urllib.parse.urlsplit(url)
>>> url = list(url)
>>> url[2] = urllib.parse.quote(url[2])
>>> url = urllib.parse.urlunsplit(url)
>>> print(url)
http://example.com/unicod%C3%A8

In Python 3, use the urllib.parse.quote function on the non-ASCII string:

>>> from urllib.request import urlopen
>>> from urllib.parse import quote
>>> chinese_wikipedia = 'http://zh.wikipedia.org/wiki/Wikipedia:' + quote('首页')
>>> urlopen(chinese_wikipedia)

For those not strictly dependent on urllib, a practical alternative is requests, which handles IRIs "out of the box".

For example, with http://bücher.ch:

>>> import requests
>>> r = requests.get(u'http://b\u00DCcher.ch')
>>> r.status_code
200

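It is more complex than the accepted answer by @bobince suggests:

the netloc should be encoded using IDNA;
a non-ASCII URL path should be encoded as UTF-8 and then percent-escaped;
non-ASCII query parameters should be encoded using the encoding of the page the URL was extracted from (or the encoding the server uses), and then percent-escaped.

This is how all browsers work; it is specified in the WHATWG URL Standard. A Python implementation can be found in w3lib (the library Scrapy uses); see w3lib.url.safe_url_string: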

from w3lib.url import safe_url_string
url = safe_url_string(u'http://example.org/Ñöñ-ÅŞÇİİ/', encoding="<page encoding>")

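An easy way to tell whether a URL-escaping implementation is incorrect or incomplete is to check whether it provides a "page encoding" argument.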
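To see why the page encoding matters for query parameters, consider this standard-library sketch. The helper name `escape_query_value` is hypothetical, and the two encodings are chosen only for illustration:

```python
from urllib.parse import quote


def escape_query_value(value, page_encoding):
    """Percent-escape one query value using the given page encoding.

    Hypothetical helper: the same character becomes different bytes,
    and therefore different escapes, under different encodings.
    """
    return quote(value, safe="", encoding=page_encoding)


as_utf8 = escape_query_value("unicod\u00e8", "utf-8")
as_latin1 = escape_query_value("unicod\u00e8", "latin-1")
print(as_utf8)    # unicod%C3%A8
print(as_latin1)  # unicod%E8
```

A server that expects Latin-1 would misread the UTF-8 form, which is why an escaping function without an encoding parameter cannot be right in every case.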

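Based on @darkfeline's answer (see the iri2uri function listed further below).

It works! In the end I could not avoid these strange characters, but I finally got through it: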

import urllib.request
import os


url = "http://www.fourtourismblog.it/le-nuove-tendenze-del-marketing-tenere-docchio/"
with urllib.request.urlopen(url) as file:
    html = file.read()
with open("marketingturismo.html", "w", encoding='utf-8') as file:
    file.write(str(html.decode('utf-8')))
os.system("marketingturismo.html")

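Thanks for your reply. Can you be more specific? unicode(url, 'utf-8') raises TypeError: decoding Unicode is not supported. Also, which function do you suggest for encoding the URL? urlencode, for example, is for building query strings, but mine is just a path on the server.

For the first part you need url.encode('utf-8') (assuming url is a unicode object).

@ignacio: Thanks. I still think the problem is that urlopen does not accept non-ASCII characters in URLs (which is correct in a sense, since they are non-standard). Please see my update.

While this seems like a very niche question, it definitely solved a very specific problem of my own. Great answer.

How can this be handled elegantly in Python 3? Any suggestions?

This actually works great for files whose names may contain non-US characters (such as Chinese characters)!

In Python 3, you import urllib.parse instead of urlparse, decode b in urlEncodeNonAscii with b.decode('utf-8'), and leave the idna part out of iriToUri:

return urllib.parse.urlunparse([url_encode_non_ascii(part.encode('utf-8')) for part in parts])

Using UTF-8 for the query is not always correct; the details are in my answer. The web is a weird place.

@user230137 What do you mean it doesn't work? It works perfectly for me.

Note that this does not handle the hostname correctly (IDNA).

Simple and effective! :D Much better than the other answers.

This is a good solution. It solved the problem of using Chinese characters in a URL, and it also works for Japanese characters.

Wow, this is underrated.

Note: this does not handle the hostname correctly (IDNA).

The suggested solution does not work for non-ASCII domain names (IRIs):

urllib2.urlopen(httplib2.iri2uri("http://Бццццц.цф"), timeout=15)

returns urlopen error [Errno -2] Name or service not known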

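The iri2uri function from the answer based on @darkfeline's answer:

from urllib.parse import urlsplit, urlunsplit, quote

def iri2uri(iri):
    """
    Convert an IRI to a URI (Python 3).

    Returns an empty string if `iri` is not a str.
    """
    uri = ''
    if isinstance(iri, str):
        (scheme, netloc, path, query, fragment) = urlsplit(iri)
        scheme = quote(scheme)
        # IDNA-encode the network location (assumes a bare hostname,
        # with no user:pass@ prefix).
        netloc = netloc.encode('idna').decode('ascii')
        # Percent-encode the remaining parts as UTF-8.
        path = quote(path)
        # Keep '=' and '&' so key=value pairs in the query stay intact.
        query = quote(query, safe='=&')
        fragment = quote(fragment)
        uri = urlunsplit((scheme, netloc, path, query, fragment))

    return uri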
