Indexing step in a Python web crawler - Fatal编程技术网

python web-crawler

I am writing a web crawler (a focused web crawler) where:
Input: seed URLs
Output: a larger set of seed URLs

  def crawl(seedURL, pageslimit):
      crawled_urls = []  # filled in by the crawling code
      # crawling code ...
      return crawled_urls  # list of URLs crawled

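Now I need to index and store the data for fast and accurate information retrieval (a search engine).

My crawler returns a list of URLs. How do I pass them to the indexing phase? Should I download the content of each page into a text file?

Are there tools or libraries that perform the indexing step, or does it have to be done manually?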

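You should definitely use Scrapy for the web crawling part. I'll give you an example of how to use it and of what your web index should look like. For any other questions, check the Scrapy website.

Using the XPath expressions that Scrapy provides, you can extract the resources you need, including the whole file.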

For example:

<h1>Darwin - The Evolution of an Exhibition</h1>

XPath expression:

//h1/text()
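
To try the same idea without installing Scrapy, here is a minimal sketch using only the standard library. It assumes the page is well-formed markup, and the sample page is made up for illustration. `xml.etree.ElementTree` supports only a subset of XPath, so `.//h1` combined with `.text` stands in for `//h1/text()`. Inside a Scrapy spider you would use the response's own selector instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not taken from a real site).
SAMPLE_PAGE = """<html>
  <head><title>Example</title></head>
  <body>
    <h1>Darwin - The Evolution of an Exhibition</h1>
    <p>Some body text.</p>
  </body>
</html>"""


def extract_headings(markup):
    """Return the text of every <h1> element in a well-formed document."""
    root = ET.fromstring(markup)
    # ".//h1" selects all h1 descendants; .text plays the role of text().
    return [h1.text.strip() for h1 in root.findall(".//h1") if h1.text]


headings = extract_headings(SAMPLE_PAGE)
print(headings)
```

Real-world HTML is often not well-formed, which is one reason to prefer a dedicated HTML parser over this approach.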

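Why do this? You can take the text of the h1 tag and use it as a key in a dictionary. With a dictionary, you can access the file more easily. Like this: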

web_index = { 'Darwin': 'example.html', 'Evolution': 'example.html' }

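It's best to keep your web index in a dictionary: it stores key-value pairs that you can easily "search", instead of relying on positions the way you would with a list.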
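
A rough sketch of how such a dictionary could be built and queried. The file names and headings are hypothetical, and unlike the literal above, each keyword maps to a set of files (an assumption made here so that one word can point to several pages).

```python
import re
from collections import defaultdict

# Hypothetical crawl output: file name -> heading extracted from that page.
PAGE_HEADINGS = {
    "example.html": "Darwin - The Evolution of an Exhibition",
    "other.html": "Evolution of Search Engines",
}


def build_web_index(page_headings):
    """Map each lowercased heading word to the set of files containing it."""
    index = defaultdict(set)
    for filename, heading in page_headings.items():
        for word in re.findall(r"\w+", heading.lower()):
            index[word].add(filename)
    return index


def search(index, keyword):
    """Return the files whose heading contains the keyword, sorted by name."""
    return sorted(index.get(keyword.lower(), set()))


web_index = build_web_index(PAGE_HEADINGS)
print(search(web_index, "Evolution"))
print(search(web_index, "Darwin"))
```

Lookups are case-insensitive because both the index keys and the query are lowercased.
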
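I use Scrapy to extract data from specific websites. But in another module I need to crawl part of the web (a focused crawler) to search for relevant information. I built one that returns a list of URLs, but it is not clear to me how to index the results so they can be searched in a database.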
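
One possible way to connect a URL list to a searchable store, sketched with the standard library's `sqlite3`. The `postings` table, the `index_urls` and `lookup` helpers, and the in-memory `FAKE_PAGES` are all assumptions for illustration. In practice `fetch_text` would download each page and strip its markup.

```python
import re
import sqlite3


def index_urls(urls, fetch_text, conn):
    """Fetch each URL's text and store (term, url) postings in SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        "term TEXT NOT NULL, url TEXT NOT NULL, PRIMARY KEY (term, url))"
    )
    for url in urls:
        terms = set(re.findall(r"\w+", fetch_text(url).lower()))
        conn.executemany(
            "INSERT OR IGNORE INTO postings (term, url) VALUES (?, ?)",
            [(term, url) for term in terms],
        )
    conn.commit()


def lookup(conn, term):
    """Return the URLs whose text contains the given term."""
    rows = conn.execute(
        "SELECT url FROM postings WHERE term = ? ORDER BY url", (term.lower(),)
    )
    return [url for (url,) in rows]


# Stand-in for real downloading: a hypothetical in-memory "web".
FAKE_PAGES = {
    "http://example.com/a": "Darwin and the theory of evolution",
    "http://example.com/b": "Focused crawlers and search engines",
}

conn = sqlite3.connect(":memory:")
index_urls(list(FAKE_PAGES), FAKE_PAGES.__getitem__, conn)
print(lookup(conn, "evolution"))
```

Passing the fetcher in as a parameter keeps the indexing step independent of how the crawler downloads pages, so the list returned by `crawl` can be handed over unchanged.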
