Unicode decode error when parsing content with Python


python unicode


I am trying to scrape some information from a table on a website. My code is as follows:

import csv
import bs4 as bs
import requests
from lxml import html


url = "http://resources.afaqs.com/index.html?id={}&category=AD+Agencies&alpha="
data = []

for number in range(1,100):
    soup = url.format(number)
    r = requests.get(soup)
    tree = html.fromstring(r.content)

    legalname = tree.xpath('//h2[@itemprop="legalname"]/text()')
    ownername = tree.xpath('//td[@itemprop="name"]/text()')
    locality = tree.xpath('//td[@itemprop="addressLocality"]/text()')
    pincode = tree.xpath('//span[@itemprop="postalCode"]/text()')
    addressregion = tree.xpath('//span[@itemprop="addressRegion"]/text()')
    telephone = tree.xpath('//span[@itemprop="telephone"]/text()')
    fax = tree.xpath('//span[@itemprop="faxNumber"]/text()')
    email = tree.xpath('//a[starts-with(@href, "mailto")]/text()')

    legalname = [unicode(i) for i in legalname]
    ownername = [unicode(i) for i in ownername]
    locality = [unicode(i) for i in locality]
    pincode = [unicode(i) for i in pincode]
    addressregion = [unicode(i) for i in addressregion]
    telephone = [unicode(i) for i in telephone]
    fax = [unicode(i) for i in fax]
    email = [unicode(i) for i in email]

    data = {"legalname" : [legalname], "ownername" : [ownername], "locality": [locality], "pincode" : [pincode], "addressregion" : [addressregion],  "email": [email]}
    with open('output.csv','a') as file:
        writer=csv.writer(file)
        writer.writerow(['col1', 'col2'])
        for key in sorted(data.keys()):
            writer.writerow([key]+data[key])

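Whenever this code hits a value with problematic Unicode characters, it raises an error. I tried converting the text to unicode, but that did not work. I keep getting the following error: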

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 14: invalid start byte

I tried changing

tree = html.fromstring(r.content)

to

myparser = etree.HTMLParser(encoding="utf-8")
tree = html.fromstring(r.content, parser=myparser)

How can I convert the xpath text values to UTF-8 so that I can extract the data?
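
For reference, the error message itself can be reproduced in isolation. The following is a minimal Python 3 sketch (the question's code is Python 2); the sample bytes are made up for illustration and are not taken from the site. It assumes, without confirmation from the page, that the content is really in a Windows-1252-style encoding, where byte 0x96 is an en dash.

```python
# Hypothetical sample bytes containing the byte named in the traceback.
raw = b"Ad agency \x96 Mumbai"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    # 0x96 is a continuation byte in UTF-8, so it cannot start a character.
    utf8_ok = False
    print(exc)

# Assumption: the data is Windows-1252, which maps 0x96 to U+2013 (en dash).
text = raw.decode("cp1252")
print(text)
```

If this assumption holds for the real page, forcing encoding="utf-8" on the parser would not help, because the bytes are not UTF-8 in the first place.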

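What does the website tell you the correct encoding is? Is there an r.headers['content-type'] header? If that header has a charset component, try using r.text. However, HTML and server headers are always tricky: servers are often misconfigured, the default encoding for text is supposed to be Latin-1 (so r.encoding will be set to Latin-1 when there is no charset parameter), and HTML files can specify their own encoding in a <meta> tag.

@MartijnPieters r.headers['content-type'] returns text/html

try a.encode('utf-8', errors='ignore').strip()

@DaniyalSyed: What if the data was not UTF-8 encoded to begin with?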
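
The approach hinted at in the comments (look for a declared charset, and do not assume UTF-8 when none is given) could be sketched roughly as below. This is a Python 3 sketch, not the method of anyone in the thread: the helper names charset_from_content_type and decode_body are hypothetical, and the cp1252 fallback is an assumption about this particular site.

```python
import re


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type value, or None if absent."""
    match = re.search(r"charset=([^\s;]+)", content_type or "", re.IGNORECASE)
    return match.group(1).strip("\"'") if match else None


def decode_body(raw, content_type, fallback="cp1252"):
    """Decode bytes with the declared charset, then UTF-8, then a fallback."""
    declared = charset_from_content_type(content_type)
    candidates = ([declared] if declared else []) + ["utf-8"]
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # Assumed fallback; undecodable bytes become U+FFFD instead of raising.
    return raw.decode(fallback, errors="replace")


# The header reported in the comments has no charset component.
print(charset_from_content_type("text/html"))
print(decode_body(b"a \x96 b", "text/html"))
```

With requests, this would be called as decode_body(r.content, r.headers.get('content-type')), and the resulting string passed to the parser in place of the raw bytes.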