Python: parsing multiple URLs and extracting data (python, regex, beautifulsoup, parse-url) - Fatal编程技术网


Parsing multiple URLs and extracting data

I need to parse an HTML page and get all the URLs that match my criteria.

Tags: python, regex, beautifulsoup, parse-url


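Next, I need to parse each extracted URL to get the data I want if the page title matches certain content, and save the results to multiple files according to their names. I completed part 1 as follows: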

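import re

# soup1 holds the page's HTML as a string; links is an open, writable file.
pattern = re.compile(r'''class="topline"><A href="(.*?)"''')
da = pattern.findall(soup1)  # every captured href, as a list of strings
col_width = max((len(row) for row in da), default=0)  # longest URL length
for row in da:
    # Case-insensitive filter: compare an uppercase literal with row.upper().
    if "SOME STRING" in row.upper():
        print(row.ljust(col_width), file=links)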

First of all, you tagged this question with BeautifulSoup, yet you are still using regular expressions here.

Here is how to get the links, follow them, and check the title:
from urllib.request import urlopen

from bs4 import BeautifulSoup

URL = "url here"

soup = BeautifulSoup(urlopen(URL), "html.parser")
# <a> tags that are direct children of an element with class "topline"
links = soup.select('.topline > a')
for a in links:
    link = a.get('href')
    if link:
        # follow link
        link_soup = BeautifulSoup(urlopen(link), "html.parser")
        title = link_soup.find('title')
        # check title
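The loop above stops at the "check title" comment. One way it could be finished is sketched here, under several assumptions: the title test is a case-insensitive substring match, each matching page is written to a file named after its title, and relative href values are resolved against the start URL. The names fetch, safe_filename, save_matching_pages, and the out_dir parameter are hypothetical helpers introduced for this illustration; they are not part of the original answer.

```python
import re
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def safe_filename(title):
    """Build a file name from a page title by replacing unsafe characters."""
    name = re.sub(r"[^\w.-]+", "_", title.strip()).strip("_")
    return (name or "untitled") + ".html"


def save_matching_pages(start_url, keyword, out_dir, fetch=fetch):
    """Follow '.topline > a' links; save pages whose title contains keyword."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    soup = BeautifulSoup(fetch(start_url), "html.parser")
    saved = []
    for a in soup.select(".topline > a"):
        href = a.get("href")
        if not href:
            continue
        link = urljoin(start_url, href)  # resolve relative links
        page = fetch(link)
        title_tag = BeautifulSoup(page, "html.parser").find("title")
        title = title_tag.get_text(strip=True) if title_tag else ""
        # Assumed matching rule: case-insensitive substring test.
        if keyword.lower() in title.lower():
            target = out_path / safe_filename(title)
            target.write_bytes(page)
            saved.append(target.name)
    return saved
```

A call might look like save_matching_pages("url here", "some string", "pages"). Passing fetch as a parameter lets the function be tried against canned pages without network access. Two pages with the same title would overwrite each other in this sketch, so a real script would probably add a counter or the URL to the file name.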

The selector .topline > a matches every a tag that is a direct child of an element with the class topline.
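To make the child combinator concrete, here is a small sketch run against a made-up HTML fragment (the markup and URLs are assumptions for illustration, not the page from the question). It compares the selector with and without the ">" so the difference between a direct child and any descendant is visible.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = """
<div class="topline"><a href="/direct">direct child</a></div>
<div class="topline"><span><a href="/nested">nested deeper</a></span></div>
<div class="other"><a href="/elsewhere">different class</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# ".topline > a": <a> tags whose parent element has the class "topline".
direct = [a.get("href") for a in soup.select(".topline > a")]

# ".topline a" (no ">"): <a> tags anywhere below a "topline" element.
descendants = [a.get("href") for a in soup.select(".topline a")]

print(direct)
print(descendants)
```

Only the first link is a direct child, so the first list should contain a single entry, while the descendant selector also picks up the link wrapped in a span.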
Hope that helps.
Parse HTML with BeautifulSoup or any other HTML library; don't use regular expressions.

Thanks, it really did help.
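As a small illustration of the advice to prefer a parser over regular expressions: the pattern from the question depends on exact tag case, quote style, attribute order, and spacing, while a parser does not. The HTML fragment below is made up for the comparison, and the variable names are only for this sketch.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment: three equivalent links written in different ways.
html = """
<div class="topline"><A href="/one">one</A></div>
<div class="topline"> <a title="x" href="/two">two</a></div>
<div class='topline'><a href='/three'>three</a></div>
"""

# The pattern from the question matches only the first spelling.
pattern = re.compile(r'''class="topline"><A href="(.*?)"''')
regex_links = pattern.findall(html)

# The parser copes with tag case, quote style, extra attributes, and spaces.
soup = BeautifulSoup(html, "html.parser")
parser_links = [a["href"] for a in soup.select(".topline > a")]

print(regex_links)
print(parser_links)
```

The regular expression could be loosened to cover each variation, but every new variation in the markup needs another adjustment, which is the maintenance cost the comment warns about.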




                                                                Copyright © 2024. All Rights Reserved by  - Fatal编程技术网