Python: added a Scrapy rule, but no more items are scraped

python scrapy

In my Scrapy output file I found that some items were missing, so I manually added the missing pages as a third rule:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class KjvSpider(CrawlSpider):
    name = 'kjv'
    start_urls = ['file:///G:/OEBPS2/bible-toc.xhtml']

    # parse_item is a method of this spider (not shown in the question)
    rules = (
        Rule(LinkExtractor(allow=r'OEBPS'), follow=True),      # 1st rule

        Rule(LinkExtractor(allow=r'\d\.xhtml$'),
             callback='parse_item', follow=False),             # 2nd rule
        Rule(LinkExtractor(allow=[r'2-jn.xhtml$', r'jude.xhtml$', r'obad.xhtml$', r'philem.xhtml$']),
             callback='parse_item', follow=False),             # 3rd rule
    )
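
To see why the 2nd rule alone cannot pick up these four pages, here is a minimal sketch using only Python's `re` module. It assumes the link extractor searches each pattern against the full URL. The file name `gen-1.xhtml` is a hypothetical example of a chapter page; the other four names are taken from the 3rd rule.

```python
import re

# Pattern from the 2nd rule: a digit immediately before ".xhtml" at the end of the URL.
chapter_pattern = re.compile(r'\d\.xhtml$')

base = 'file:///G:/OEBPS2/'
urls = [
    base + 'gen-1.xhtml',   # hypothetical chapter page (assumption, not from the question)
    base + '2-jn.xhtml',    # the four pages listed in the 3rd rule
    base + 'jude.xhtml',
    base + 'obad.xhtml',
    base + 'philem.xhtml',
]

# Split the URLs into those the 2nd rule's pattern matches and those it misses.
matched = [u for u in urls if chapter_pattern.search(u)]
missed = [u for u in urls if not chapter_pattern.search(u)]

print('matched by 2nd rule:', matched)
print('missed by 2nd rule: ', missed)
```

None of the four file names has a digit directly before `.xhtml`, so under this assumption they need a rule of their own.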

If I enable the 1st rule and the 3rd rule (with the 2nd rule commented out), the four missing items are downloaded correctly, but the full set of items (about 2000) is not.

But if I enable all three rules, the missing items are still missing (that is, adding the 3rd rule makes no difference).

I don't know why the rule is not working.

Any suggestions are welcome. Thanks in advance.

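I think I have to deny these missing URLs in the 1st rule, so that when the 3rd rule extracts them they are not filtered out as duplicate requests.
With that change it works as expected, e.g.: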
rules = (
    Rule(LinkExtractor(allow=r'OEBPS',
                       deny=(r'2-jn.xhtml$', r'jude.xhtml$',
                             r'obad.xhtml$', r'philem.xhtml$')),
         follow=True),                                      # 1st rule

    Rule(LinkExtractor(allow=r'\d\.xhtml$'),
         callback='parse_item', follow=False),              # 2nd rule
    Rule(LinkExtractor(allow=[r'2-jn.xhtml$', r'jude.xhtml$', r'obad.xhtml$', r'philem.xhtml$']),
         callback='parse_item', follow=False),              # 3rd rule
)
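
The effect of the `deny` argument can be checked without running the spider. The sketch below is a simplified model, not Scrapy's implementation: `is_extracted` is a hypothetical helper that mimics the documented behaviour that `deny` takes precedence over `allow`. The file name `gen-1.xhtml` is an assumed example of a chapter page, and the dots are escaped here (`\.`) so that they match a literal dot, which is a small adjustment to the patterns above. For context, the CrawlSpider documentation notes that when several rules match the same link, the first one in definition order is used, which fits the idea of keeping these URLs away from the 1st rule.

```python
import re


def is_extracted(url, allow, deny=()):
    """Hypothetical helper: True if url matches an allow pattern and no deny pattern."""
    if any(re.search(pattern, url) for pattern in deny):
        return False  # deny wins over allow
    return any(re.search(pattern, url) for pattern in allow)


# The four pages from the 3rd rule, with the dot escaped.
MISSING_PAGES = (r'2-jn\.xhtml$', r'jude\.xhtml$', r'obad\.xhtml$', r'philem\.xhtml$')

base = 'file:///G:/OEBPS2/'
names = ('gen-1.xhtml', '2-jn.xhtml', 'jude.xhtml', 'obad.xhtml', 'philem.xhtml')
urls = [base + name for name in names]  # 'gen-1.xhtml' is an assumed example

# URLs the 1st rule would pick up once the four pages are denied.
rule1 = [u for u in urls if is_extracted(u, allow=[r'OEBPS'], deny=MISSING_PAGES)]
# URLs left for the 3rd rule.
rule3 = [u for u in urls if is_extracted(u, allow=MISSING_PAGES)]

print('1st rule:', rule1)
print('3rd rule:', rule3)
```

In this model the 1st rule no longer touches the four pages, so the 3rd rule is the only one that extracts them.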



