Why does Python insist on using ascii?

python utf-8

When parsing HTML files with Requests and Beautiful Soup, the following line raises an exception on some web pages:

if 'var' in str(tag.string):

Here is the context:

response = requests.get(url)  
soup = bs4.BeautifulSoup(response.text.encode('utf-8'))

for tag in soup.findAll('script'):
    if 'var' in str(tag.string):    # This is the line throwing the exception
        print(tag.string)

The exception is as follows:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)

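I have tried the BeautifulSoup line both with and without the encode('utf-8') call, and it makes no difference. I did notice that the pages that throw the exception have the character Ã in a comment inside the JavaScript, even though the encoding reported by response.encoding is ISO-8859-1. I realize that I could remove the offending character with unicodedata.normalize, but I would prefer to convert the tag variable to utf-8 and keep the character. None of the following methods help to change the variable to utf-8: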

tag.encode('utf-8')
tag.decode('ISO-8859-1').encode('utf-8')
tag.decode(response.encoding).encode('utf-8')

What must be done to this string to convert it into usable utf-8?

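OK, so basically you end up with an HTTP response encoded in Latin-1. The character giving you trouble is indeed Ã, because 0xC3 is exactly that character in Latin-1.

I think you blind-tested every combination of decoding and encoding you could imagine on the request. First of all, if you do this: if 'var' in str(tag.string): then Python will complain whenever tag.string contains non-ASCII characters, because the conversion goes through the ascii codec.

Looking at the code you have shared with us, the right approach IMHO would be: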
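import bs4
import requests

response = requests.get(url)  # url as in the question
# Earlier suggestion, kept for reference. response.text is already unicode,
# so calling .decode() on it forces an implicit ascii encode first:
# soup = bs4.BeautifulSoup(response.text.decode('latin-1'))
# Pass the raw bytes instead and tell BeautifulSoup how to decode them;
# from_encoding is ignored when the markup is already unicode.
soup = bs4.BeautifulSoup(response.content, 'html.parser',
                         from_encoding=response.encoding)

for tag in soup.findAll('script'):
    # The soup holds unicode strings, so compare against unicode directly.
    # tag.string is None when the tag does not contain exactly one string.
    if tag.string and u'var' in tag.string:
        # Encode to utf-8 only at output time.
        print(tag.string.encode('utf-8'))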
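
As a side note on why the str() call in the question fails, here is a minimal sketch of the two ascii failures. It assumes a made-up script string (sample_text) and a hypothetical helper (ascii_errors); neither comes from the original code. The explicit .encode('ascii') and .decode('ascii') calls stand in for the implicit conversions that Python 2 performs when unicode and byte strings are mixed.

```python
# The ascii codec only covers code points 0-127, so both directions fail
# as soon as a character such as U+00C3 (A with tilde) is involved.

sample_text = u"var s = '\u00c3';"           # hypothetical script content
sample_bytes = sample_text.encode("latin-1")  # same text as Latin-1 bytes (0xC3)


def ascii_errors(text, raw):
    """Return the names of the errors raised by ascii encode and decode."""
    names = []
    try:
        text.encode("ascii")   # stands in for an implicit unicode -> bytes step
    except UnicodeEncodeError as exc:
        names.append(type(exc).__name__)
    try:
        raw.decode("ascii")    # stands in for an implicit bytes -> unicode step
    except UnicodeDecodeError as exc:
        names.append(type(exc).__name__)
    return names


print(ascii_errors(sample_text, sample_bytes))
```

Both calls fail on the same character, which is why the error message names the ascii codec even though the page itself is Latin-1.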

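Edit: you can take a look at the relevant part of the documentation.
Basically, the logic is:
You get some bytes encoded in encoding X.
You decode them by doing bytes.decode('X'), which returns a unicode string.
You work with unicode.
You encode the unicode into some encoding Y for output: ubytes.encode('Y').
Hope this sheds some light on the problem.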
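
The decode, work, encode steps above can be sketched with the standard library alone. The function name find_and_reencode and the sample bytes are assumptions made for illustration, not part of the original answer.

```python
def find_and_reencode(raw_bytes, source_encoding, needle, target_encoding="utf-8"):
    """Decode bytes, search in unicode, then encode the result for output."""
    text = raw_bytes.decode(source_encoding)   # step 1: bytes in encoding X -> unicode
    if needle not in text:                     # step 2: work with unicode only
        return None
    return text.encode(target_encoding)        # step 3: unicode -> bytes in encoding Y


# Hypothetical Latin-1 script body; 0xC3 is A with tilde in Latin-1.
latin1_script = b"var s = '\xc3';"
utf8_script = find_and_reencode(latin1_script, "latin-1", u"var")
print(repr(utf8_script))
```

The character survives the round trip: the single Latin-1 byte 0xC3 becomes the two utf-8 bytes 0xC3 0x83, and nothing has to be stripped or normalized away.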
You can also try parsing the page with the Unicode, Dammit library (it is part of BS4). Detailed description here:
You tried those approaches but kept doing: if 'var' in str(tag.string): ??

@PauloBu: No, of course I used the converted output!

Thank you. Instead of response.text.decode('latin-1') I am trying response.text.decode(response.encoding), because this application needs to work with other sites as well. It is that very line that now throws the error message (at a different position, of course). Is there no general way to handle any encoding?

What is the error now? That is the way to work with any encoding: you get the response encoding, decode with it, work with unicode and encode into utf-8. What error is thrown now, and what does response.encoding look like?

The same error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 5837-5838: ordinal not in range(128), now on this line: soup = bs4.BeautifulSoup(response.text.decode(response.encoding)) (all copied from the CLI error message). The page that I am parsing in this case is (not my site, just an example that I stumbled upon).

I edited the code in my answer where the BeautifulSoup object is instantiated. I also gave you a link to the docs, which will be very helpful. I'll check out the page. Let me know if that worked.

Thank you, passing the from_encoding= argument that you mention seems to help quite a bit! I am testing now. Thanks for the link to the relevant part of the documentation.
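
For what it's worth, the UnicodeEncodeError reported in the comments is consistent with calling .decode() on text that is already unicode: in Requests, response.content holds the raw bytes while response.text has already been decoded. Below is a minimal sketch of a guard that decodes only when it is actually handed bytes. The helper name ensure_text is hypothetical and not part of Requests or BS4.

```python
def ensure_text(value, encoding, errors="strict"):
    """Return unicode text, decoding only when the input is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    # Already text: decoding again would be a mistake, so return it unchanged.
    return value


raw_body = b"var s = '\xc3';"           # stands in for response.content
decoded_body = u"var s = '\u00c3';"     # stands in for response.text

print(ensure_text(raw_body, "latin-1") == decoded_body)
print(ensure_text(decoded_body, "latin-1") == decoded_body)
```

With a guard like this, the same code path can accept either form, and the encoding name is used only when there really are bytes to decode.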