Python: How to extract the URLs of links on a web page

Tags: python, web-scraping, beautifulsoup

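I am trying to download all the PDF files linked from the page below.

First, I tried to extract the URLs of all the PDF links (the links circled in red). The script I used and the url.txt file it produced are listed at the end of this post.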


However, the links I want do not appear in the output.

How can I extract them?

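None of the PDF links you are after are present in the HTML source that urlopen returns.

The PDF links are loaded via AJAX.

I think you need to open the URL with a POST request and the "right" parameters and cookies set. For example: "CFID=xxxxxxxx; CFTOKEN=xxxxxxxx; BIGipServerPubsWeb_HTTP=xxxxxxxxx.xxxxx.xxxxx; _ga=GAx.x.xxxxxxxxx.xxxxxxxxx; _gat=1"

The response will be in JSON format. The object will include "result[0].data.has-pdf=true", which you can use to test whether a PDF exists. The links look like: "fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")/article/front/article-meta/abstract/p", so you need to match them to the PDF files.

But I guess they may have some IP check or other security measure in place, so you may not be able to fetch data via POST from any domain other than the origin. Just a guess ;)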
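The steps in this answer can be sketched in Python 3 as follows. This is only an illustration under several assumptions: the form fields simply mirror the query string from the question, the cookie string is a placeholder to be copied from a real browser session, and `link_field="uri"` is a hypothetical name for the JSON key that holds the `fn:doc(...)` string, since the answer does not name it. The request is built but not sent, because the server may reject it, as the answer warns.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "https://www.osapublishing.org/search.cfm"

# Matches the document path inside strings such as
# fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")/article/...
DOC_PATH = re.compile(r'fn:doc\("([^"]+)"\)')


def build_search_request(cookie_header):
    """Prepare (but do not send) a POST request carrying session cookies."""
    # Assumption: the form fields mirror the query string from the question.
    form = urlencode({"q": "comsol", "meta": 1, "cj": 1, "cc": 1}).encode("ascii")
    return Request(SEARCH_URL, data=form,
                   headers={"Cookie": cookie_header}, method="POST")


def docs_with_pdf(payload, link_field="uri"):
    """Return XML document paths for results flagged as having a PDF.

    Assumes payload["result"] is a list of objects whose "data" object has a
    "has-pdf" flag, as the answer describes. link_field is a hypothetical
    name for the key that holds the fn:doc(...) string.
    """
    paths = []
    for item in payload.get("result", []):
        data = item.get("data", {})
        if not data.get("has-pdf"):
            continue
        match = DOC_PATH.search(str(data.get(link_field, "")))
        if match:
            paths.append(match.group(1))
    return paths


# Hand-written payload in the shape described above (not real server output).
sample = {
    "result": [
        {"data": {"has-pdf": True,
                  "uri": 'fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")'
                         '/article/front/article-meta/abstract/p'}},
        {"data": {"has-pdf": False,
                  "uri": 'fn:doc("/hypothetical/no-pdf.xml")/article'}},
    ]
}

request = build_search_request("CFID=xxxxxxxx; CFTOKEN=xxxxxxxx")
print(request.get_method(), request.full_url)
print(docs_with_pdf(sample))
```

To actually send the request you would pass it to `urllib.request.urlopen` and decode the body with `json.load`; whether the site accepts such a request from outside the browser is untested here.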


So your goal is to get at the PDF links, right? The first example I looked at could mean a few things: the links are either generated dynamically or loaded via AJAX. When I click the link, I am taken to a page where I have to log in or purchase, so it does not take you directly to the PDF. How do you get the PDF manually?

Thank you for your answer. Some of these files can be downloaded without logging in. I know the URLs of these links are not in the HTML source. Is there a way to open these links without the URL?

Some links do not require a login. The second one, for example, loads the full PDF in the browser and looks dynamically generated:

https://www.osapublishing.org/view_article.cfm?gotourl=https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org=

Here you can see the direct URL being passed to the CF (ColdFusion) script. I would add a condition to the script to look for "pdf".
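The view_article.cfm link quoted above carries the direct PDF address, percent-encoded, in its `gotourl` query parameter. A minimal Python 3 sketch for pulling it back out is shown here; the helper name `direct_pdf_url` is my own, and whether the decoded address can be downloaded without a session is not something this snippet checks.

```python
from urllib.parse import parse_qs, urlparse


def direct_pdf_url(view_url):
    """Return the decoded value of the gotourl parameter, or None if absent."""
    query = parse_qs(urlparse(view_url).query)
    values = query.get("gotourl")
    return values[0] if values else None


view_url = (
    "https://www.osapublishing.org/view_article.cfm?gotourl="
    "https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F"
    "6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf"
    "%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org="
)

print(direct_pdf_url(view_url))
```

`parse_qs` splits the outer query string first and only then percent-decodes each value, so the `%26` separators inside `gotourl` stay part of the PDF address instead of being treated as separate parameters.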

Script from the question (Python 2):

from bs4 import BeautifulSoup
import urllib2 as ul

# Fetch the search results page and parse it.
resp = ul.urlopen("https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1")
soup = BeautifulSoup(resp, 'lxml')

# Write the href of every <a> tag to url.txt, one per line.
with open('url.txt', 'w') as f:
    for link in soup.find_all('a', href=True):
        f.write(str(link['href']) + '\n')

----------------------------------------------------------------

<url.txt>
http://www.osa.org
#
https://www.osapublishing.org
#
#
#
#
/about.cfm

/aop
/ao
/as
/boe
/col
/jdt
/jlt
/jot
/jocn
/josaa
/josab
/josk
/optica
/ome
/oe
/ol
/prj
/jon
/josa
/on
/aop
/ao
/as
/boe
/col
/jdt
/jlt
/jot
/jocn
/josaa
/josab
/josk
/optica
/ome
/oe
/ol
/prj
/jon
/josa
/on
/conferences.cfm
/conferences.cfm
/conferences.cfm?findby=conference
/conference.cfm?meetingid=5
/conference.cfm?meetingid=124
/conference.cfm?meetingid=56
/conference.cfm?meetingid=144&yr=2015
/conference.cfm?meetingid=153&yr=2015
/conference.cfm?meetingid=131&yr=2015
/conference.cfm?meetingid=174&yr=2015
/conference.cfm?meetingid=109&yr=2015
#global-nav
/books/lasers/lasers.cfm
/oida/reports.cfm
http://www.osa-opn.org
/author/author.cfm
/submit/review/peer_review.cfm
/library/
/osadigitalarchive.cfm
/isp.cfm
http://imagebank.osa.org
/spotlight
/china/
#
/user
#
#
#
https://www.osapublishing.org
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
/
#
#
/user
#
#
/about.cfm
/conferences.cfm
/conferences.cfm
/conferences.cfm?findby=conference
/china/
/author/author.cfm
/submit/review/peer_review.cfm
/library/
/books/lasers/lasers.cfm
/oida/reports.cfm
http://www.osa-opn.org
http://imagebank.osa.org
/spotlight/
/china/
/about.cfm
/benefitslog.cfm
/contactus.cfm
#
/privacy.cfm
/termsofuse.cfm
https://account.osa.org/eweb/dynamicpage.aspx?sso=1&site=osac&webcode=loginrequired&url_success=https%3A%2F%2Fwww%2Eosapublishing%2Eorg%2Fsearch%2Ecfm%3Fq%3Dcomsol%26meta%3D1%26cj%3D1%26cc%3D1%26usertoken%3D%7Btoken%7D
https://account.osa.org/eweb/Dynamicpage.aspx?webcode=forgotpassword*Site=osac
/privacy.cfm
http://www.osa.org/en-us/help/
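
The listing above mixes relative paths, bare "#" anchors and absolute URLs, and one of the comments suggests adding a condition that looks for "pdf". A Python 3 sketch of both ideas follows. To keep it dependency-free it swaps BeautifulSoup for the standard library's `html.parser`, and the sample HTML (including the `/DirectPDFAccess/example.pdf` path) is made up for illustration. On the real search page this filter would likely still find nothing, because the PDF links are not in the static HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative paths such as /about.cfm into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def pdf_links(html, base_url):
    """Return absolute link URLs whose text contains 'pdf' (case-insensitive)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url in collector.links if "pdf" in url.lower()]


# Made-up HTML fragment; only the first two hrefs occur in the real output.
sample_html = """
<a href="/about.cfm">About</a>
<a href="#">Menu</a>
<a href="/DirectPDFAccess/example.pdf">PDF</a>
"""
BASE = "https://www.osapublishing.org/search.cfm"

print(pdf_links(sample_html, BASE))
```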