Unicode decode error when parsing content with Python


I am trying to extract some information from a table on a website. My code is as follows:

import csv
import bs4 as bs
import requests
from lxml import html


url = "http://resources.afaqs.com/index.html?id={}&category=AD+Agencies&alpha="
data = []

for number in range(1,100):
    soup = url.format(number)
    r = requests.get(soup)
    tree = html.fromstring(r.content)

    legalname = tree.xpath('//h2[@itemprop="legalname"]/text()')
    ownername = tree.xpath('//td[@itemprop="name"]/text()')
    locality = tree.xpath('//td[@itemprop="addressLocality"]/text()')
    pincode = tree.xpath('//span[@itemprop="postalCode"]/text()')
    addressregion = tree.xpath('//span[@itemprop="addressRegion"]/text()')
    telephone = tree.xpath('//span[@itemprop="telephone"]/text()')
    fax = tree.xpath('//span[@itemprop="faxNumber"]/text()')
    email = tree.xpath('//a[starts-with(@href, "mailto")]/text()')

    legalname = [unicode(i) for i in legalname]
    ownername = [unicode(i) for i in ownername]
    locality = [unicode(i) for i in locality]
    pincode = [unicode(i) for i in pincode]
    addressregion = [unicode(i) for i in addressregion]
    telephone = [unicode(i) for i in telephone]
    fax = [unicode(i) for i in fax]
    email = [unicode(i) for i in email]

    data = {"legalname" : [legalname], "ownername" : [ownername], "locality": [locality], "pincode" : [pincode], "addressregion" : [addressregion],  "email": [email]}
    with open('output.csv','a') as file:
        writer=csv.writer(file)
        writer.writerow(['col1', 'col2'])
        for key in sorted(data.keys()):
            writer.writerow([key]+data[key])
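
Whenever this code hits a value that cannot be decoded, it raises an error. I tried converting the text to unicode, but that did not help. I keep getting the following error: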

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 14: invalid start byte

I tried changing

tree = html.fromstring(r.content)
to

myparser = etree.HTMLParser(encoding="utf-8")
tree = html.fromstring(r.content, parser=myparser)

How can I convert the XPath text values to UTF-8 so that I can extract the data?

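What does the website tell you the correct encoding is? Is there a r.headers['content-type'] header? If that header has a charset component, try using r.text. However, HTML and server headers are always tricky: servers are frequently misconfigured, the default encoding for text is supposed to be Latin-1 (so r.encoding will be set to Latin-1 when there is no charset parameter), and HTML files can specify their own encoding in <meta> tags.

@MartijnPieters r.headers['content-type'] returns text/html

Try a.encode('utf-8',errors='ignore').strip()

@DaniyalSyed: What if the data isn't UTF-8 encoded in the first place?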
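
One possible way to act on these comments, offered as a sketch and not a confirmed fix: byte 0x96 is not a valid UTF-8 start byte, but in Windows-1252 it is an en dash, so the pages may well be Windows-1252 encoded. That is an assumption, since the server declares no charset. The helper name decode_page, its fallback list, and the sample bytes below are all hypothetical, and the code is written for Python 3 while the question uses Python 2.

```python
def decode_page(raw, declared=None, fallbacks=("utf-8", "cp1252")):
    """Decode raw page bytes, returning (text, encoding_used).

    `declared` is the charset announced by the server or the page, if any.
    The fallback order is an assumption chosen for this example.
    """
    candidates = [declared] if declared else []
    candidates.extend(fallbacks)
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1"), "latin-1"


# Made-up sample containing the byte from the error message.
sample = b"Ogilvy \x96 Mumbai"
text, used = decode_page(sample)
```

With requests, the bytes would come from r.content, and the decoded text (or the detected encoding, via lxml's html.HTMLParser(encoding=...)) could then be handed to html.fromstring. When writing the results out in Python 3, opening the CSV file with an explicit encoding such as open('output.csv', 'a', encoding='utf-8', newline='') avoids a second encoding error at write time.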