Python: How do I prevent duplicate links from being parsed?

python python-3.x web-scraping web-crawler


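I've written a script in Python that collects the next-page links from a web page, and it currently runs fine. The only problem with this scraper is that it doesn't get rid of duplicate links. I hope someone can help me fix that. Here is what I tried: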

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"

def nextpage_links(main_link):
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            print(item.attrib["href"])

nextpage_links(page_link)

Here is a partial screenshot of the output I get:

You can use a set for this:

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"

def nextpage_links(main_link):
    links = set()
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            links.add(item.attrib["href"])

    return links

print(nextpage_links(page_link))
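
Note that a set does not keep the order in which the links appear on the page, and the same page may be linked through both a relative and an absolute href. As a minimal sketch, assuming the order matters and that hrefs may be relative, the helper below resolves each href against the page URL and keeps only its first occurrence. The name `unique_links` and the sample hrefs are made up for illustration; they are not taken from the site.

```python
from urllib.parse import urljoin


def unique_links(base_url, hrefs):
    """Return absolute URLs in first-seen order, without duplicates."""
    seen = set()
    ordered = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative hrefs
        if absolute not in seen:
            seen.add(absolute)
            ordered.append(absolute)
    return ordered


# Hypothetical hrefs, standing in for what the pagination selector might yield.
sample_hrefs = [
    "/browse-movies?page=2",
    "/browse-movies?page=3",
    "https://yts.ag/browse-movies?page=2",
    "/browse-movies?page=3",
]

for link in unique_links("https://yts.ag/browse-movies", sample_hrefs):
    print(link)
```

In the scraper above, the hrefs would come from `item.attrib["href"]` instead of the hard-coded list.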

You could also use scrapy, which filters out duplicate requests by default.

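Comments:

Make a set, add every processed link to it, and check whether a link is already in the set before processing it.

Thanks for your answer, Sumit Gupta. It worked.

You should take the print statement out of the script. By the way, scrapy will handle duplicates if the item pipeline is set up correctly.

Thanks, removed the print statement :)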
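
The suggestion from the comments (keep a set of processed links and check it before processing a new one) could look roughly like the sketch below when following pagination across several pages. To keep it runnable without network access, it assumes a hypothetical in-memory `FAKE_SITE` mapping instead of real HTTP requests; `crawl` and `get_links` are illustrative names, not part of any library.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it.
FAKE_SITE = {
    "/page/1": ["/page/2", "/page/3"],
    "/page/2": ["/page/1", "/page/3"],
    "/page/3": ["/page/2", "/page/4"],
    "/page/4": [],
}


def crawl(start, get_links):
    """Visit every reachable page once; return pages in visiting order."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in visited:  # check before processing
                visited.add(link)
                queue.append(link)
    return order


visited_order = crawl("/page/1", lambda page: FAKE_SITE.get(page, []))
print(visited_order)
```

In a real scraper, `get_links` would fetch the page with `requests` and extract the hrefs with the same CSS selector used in the question.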



