Unicode decode error when parsing content with Python
I am trying to scrape some information from a table on a website. My code is as follows:
import csv
import bs4 as bs
import requests
from lxml import html

url = "http://resources.afaqs.com/index.html?id={}&category=AD+Agencies&alpha="
data = []
for number in range(1, 100):
    soup = url.format(number)
    r = requests.get(soup)
    tree = html.fromstring(r.content)
    legalname = tree.xpath('//h2[@itemprop="legalname"]/text()')
    ownername = tree.xpath('//td[@itemprop="name"]/text()')
    locality = tree.xpath('//td[@itemprop="addressLocality"]/text()')
    pincode = tree.xpath('//span[@itemprop="postalCode"]/text()')
    addressregion = tree.xpath('//span[@itemprop="addressRegion"]/text()')
    telephone = tree.xpath('//span[@itemprop="telephone"]/text()')
    fax = tree.xpath('//span[@itemprop="faxNumber"]/text()')
    email = tree.xpath('//a[starts-with(@href, "mailto")]/text()')
    legalname = [unicode(i) for i in legalname]
    ownername = [unicode(i) for i in ownername]
    locality = [unicode(i) for i in locality]
    pincode = [unicode(i) for i in pincode]
    addressregion = [unicode(i) for i in addressregion]
    telephone = [unicode(i) for i in telephone]
    fax = [unicode(i) for i in fax]
    email = [unicode(i) for i in email]
    data = {"legalname": [legalname], "ownername": [ownername], "locality": [locality], "pincode": [pincode], "addressregion": [addressregion], "email": [email]}
    with open('output.csv', 'a') as file:
        writer = csv.writer(file)
        writer.writerow(['col1', 'col2'])
        for key in sorted(data.keys()):
            writer.writerow([key] + data[key])
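Whenever this code encounters a character it cannot decode, it raises an error. I tried converting the text to unicode, but that did not work. I keep getting the following error: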
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 14: invalid start byte
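As a small illustration of what this message means: in UTF-8, 0x96 can only appear as a continuation byte, so it can never start a character. In Windows-1252 the same byte is an en dash (U+2013). The sketch below uses a made-up byte string (`raw` is hypothetical, not taken from the site) and assumes, without having checked, that the page might be Windows-1252 encoded. It is written for Python 3.

```python
# Hypothetical sample bytes containing 0x96; not real data from the site.
raw = b"Ad Agency \x96 Mumbai"


def try_decode(data, encoding):
    """Decode data with the given encoding, reporting failure instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: {}".format(exc.reason)


# 0x96 is not a valid UTF-8 start byte, so strict decoding fails.
utf8_result = try_decode(raw, "utf-8")

# Under Windows-1252 (an assumption about the source), 0x96 is an en dash.
cp1252_result = try_decode(raw, "cp1252")

# Ignoring errors silently drops the byte, which loses the character.
ignored_result = raw.decode("utf-8", errors="ignore")

print(utf8_result)
print(repr(cp1252_result))
print(repr(ignored_result))
```

If the bytes really are Windows-1252, decoding with that codec keeps the dash, whereas ignoring errors only hides the problem.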
I tried changing
tree = html.fromstring(r.content)
to
myparser = etree.HTMLParser(encoding="utf-8")
tree = html.fromstring(r.content, parser=myparser)
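How can I convert the XPath text values to UTF-8 so that I can extract the data?
What does the website tell you the correct encoding is? Is there an r.headers['content-type'] header? If that header has a charset component, try using r.text. However, HTML and server headers are always tricky: servers are often misconfigured, the default encoding for text is supposed to be Latin-1 (so r.encoding will be set to Latin-1 when there is no charset parameter), and HTML files can specify their own encoding in <meta> tags.
@MartijnPieters r.headers['content-type'] returns text/html
Try a.encode('utf-8', errors='ignore').strip()
@DaniyalSyed: And what if the data was not UTF-8 encoded in the first place?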
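The advice in the comments (look for a charset in the Content-Type header, and do not assume UTF-8 when there is none) could be sketched roughly as follows. This is a minimal Python 3 sketch, not a definitive fix: the helper names `charset_from_content_type` and `decode_body` are hypothetical, the `cp1252` fallback is an assumption about this particular site, and the sketch does not inspect `<meta>` tags.

```python
def charset_from_content_type(header_value):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    for part in header_value.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip("\"'").lower()
    return None


def decode_body(raw, content_type, fallback="cp1252"):
    """Decode raw bytes using the declared charset, else an assumed fallback."""
    encoding = charset_from_content_type(content_type) or fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Declared charset was unknown or wrong: use the fallback and
        # substitute U+FFFD for anything that still cannot be decoded.
        return raw.decode(fallback, errors="replace")


# Hypothetical body bytes; the header value matches what the asker reported.
text = decode_body(b"Ad \x96 Agency", "text/html")
print(repr(text))
```

With requests, `r.content` and `r.headers['content-type']` would be passed as the two arguments, and the resulting text could then be handed to `html.fromstring`.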