Scrapy Python: Unicode link error
Link encoding: when crawling a website, Scrapy extracts links containing &amp and throws an exception:

Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which might be wrong)

How can I fix this error?

I had the same problem with the character → being inserted into some links. I found a file link_extractors.py on GitHub that contains:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url

class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            body = u''.join(f for x in self.restrict_xpaths
                            for f in hxs.select(x).extract())
            try:
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                body = body.encode('utf-8')
        else:
            body = response.body
        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links
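The key trick in extract_links above is the try/except around body.encode(): use the response's declared encoding first, and fall back to UTF-8 when that encoding cannot represent a character (such as →). A minimal stdlib sketch of that fallback, with a hypothetical helper name:

```python
def encode_with_fallback(text, declared_encoding):
    # Mirror the fallback in CustomLinkExtractor.extract_links: try the
    # response's declared encoding first; if the text contains characters
    # that encoding cannot represent, fall back to UTF-8.
    try:
        return text.encode(declared_encoding)
    except UnicodeEncodeError:
        return text.encode('utf-8')

# u"a\u2192b" contains the arrow character that latin-1 cannot encode,
# so the helper falls back to UTF-8 bytes.
print(encode_with_fallback(u"a\u2192b", "latin-1"))
```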
Later I used it in my spider.py:
rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*', ),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a", )),
         callback='parse_start_url', follow=True,
    ),
)
Any example would be very helpful!
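Another option, instead of subclassing the link extractor, is to sanitize URLs before Scrapy builds Link objects, by percent-encoding any non-ASCII characters. This is a stdlib sketch (the function name is hypothetical); a callable like this could be plugged in via a link extractor's process_value hook:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def percent_escape_url(url):
    # Hypothetical helper: percent-encode non-ASCII characters in the
    # path and query so the resulting URL is plain ASCII. A character
    # such as → (U+2192) becomes %E2%86%92; already-safe URLs pass
    # through unchanged.
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe="/%"),
        quote(parts.query, safe="=&%"),
        parts.fragment,
    ))

print(percent_escape_url(u"http://example.com/a\u2192b"))
```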