Scrapy Python: Unicode link error
Link encoding: when crawling a website, Scrapy extracts links containing &amp and throws an exception:

Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which might be wrong)

How can I fix this error?

I had the same problem with the character → being inserted into some links. I found a file link_extractors.py on GitHub that contains:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url

class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            body = u''.join(f for x in self.restrict_xpaths
                            for f in hxs.select(x).extract())
            try:
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                body = body.encode('utf-8')
        else:
            body = response.body
        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links
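The key trick in extract_links above is the try/except around body.encode(): use the response's declared encoding first, and fall back to UTF-8 when that encoding cannot represent a character (such as →). A minimal stdlib sketch of that fallback, with a hypothetical helper name:

```python
def encode_with_fallback(text, declared_encoding):
    # Mirror the fallback in CustomLinkExtractor.extract_links: try the
    # response's declared encoding first; if the text contains characters
    # that encoding cannot represent, fall back to UTF-8.
    try:
        return text.encode(declared_encoding)
    except UnicodeEncodeError:
        return text.encode('utf-8')

# u"a\u2192b" contains the arrow character that latin-1 cannot encode,
# so the helper falls back to UTF-8 bytes.
print(encode_with_fallback(u"a\u2192b", "latin-1"))
```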
Later I used it in my spider.py:
rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*', ),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a", )),
         callback='parse_start_url', follow=True,
    ),
)
Any example would be very helpful!
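Another option, instead of subclassing the link extractor, is to sanitize URLs before Scrapy builds Link objects, by percent-encoding any non-ASCII characters. This is a stdlib sketch (the function name is hypothetical); a callable like this could be plugged in via a link extractor's process_value hook:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def percent_escape_url(url):
    # Hypothetical helper: percent-encode non-ASCII characters in the
    # path and query so the resulting URL is plain ASCII. A character
    # such as → (U+2192) becomes %E2%86%92; already-safe URLs pass
    # through unchanged.
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe="/%"),
        quote(parts.query, safe="=&%"),
        parts.fragment,
    ))

print(percent_escape_url(u"http://example.com/a\u2192b"))
```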