Python: extract a URL from a web page and save the file to disk
I am trying to write a script that automatically looks up articles by title and saves a PDF copy of each article's full text to my computer under a specific file name. To do this, I wrote the following code:
import requests
from pandas import read_csv

url = "http://sci-hub.io/"
data = read_csv("C:\\Users\\Sangeeta's\\Downloads\\distillersr_export (1).csv")
for index, row in data.iterrows():
    try:
        # Build the lookup URL from the DOI and fetch the page
        print(url + str(row['DOI']))
        res = requests.get(url + str(row['DOI']))
        print(res.content)
    except Exception:
        print('NO DOI: ' + str(row['ref']))
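This opens a CSV file containing a list of DOIs and the file names to save under. For each DOI, it then queries sci-hub.io for the full text. The rendered page embeds the PDF, but I am not sure how to extract the PDF's URL and save the file to disk.
The image below shows an example of the page:
In this image, the desired URL is the one of the embedded PDF.
How can I extract this URL automatically and then save the PDF file to disk?
When I print res.content, I get the following: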
b'<!DOCTYPE html>\n<html>\n <head>\n <title></title>\n <meta charset="UTF-8">\n <meta name="viewport" content="width=device-width">\n </head>\n <body>\n <style type = "text/css">\n body {background-color:#F0F0F0}\n div {overflow: hidden; position: absolute;}\n #top {top:0;left:0;width:100%;height:50px;font-size:14px} /* 40px */\n #content {top:50px;left:0;bottom:0;width:100%}\n p {margin:0;padding:10px}\n a {font-size:12px;font-family:sans-serif}\n a.target {font-weight:normal;color:green;margin-left:10px}\n a.reopen {font-weight:normal;color:blue;text-decoration:none;margin-left:10px}\n iframe {width:100%;height:100%}\n \n p.agitation {padding-top:5px;font-size:20px;text-align:center}\n p.agitation a {font-size:20px;text-decoration:none;color:green}\n\n .banner {position:absolute;z-index:9999;top:400px;left:0px;width:300px;height:225px;\n border: solid 1px #ccc; padding: 5px;\n text-align:center;font-size:18px}\n .banner img {border:0}\n \n p.donate {padding:0;margin:0;padding-top:5px;text-align:center;background:green;height:40px}\n p.donate a {color:white;font-weight:bold;text-decoration:none;font-size:20px}\n\n #save {position:absolute;z-index:9999;top:180px;left:8px;width:210px;height:36px;\n border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n text-align:center;font-size:18px;background:#F0F0F0;color:#333}\n\n #save a {text-decoration:none;color:white;font-size:inherit;color:#666}\n\n #save p { margin: 0; padding: 0; margin-top: 8px}\n\n #reload {position:absolute;z-index:9999;top:240px;left:8px;width:210px;height:36px;\n border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n text-align:center;font-size:18px;background:#F0F0F0;color:#333}\n\n #reload a {text-decoration:none;color:white;font-size:inherit;color:#666}\n\n #reload p { margin: 0; padding: 0; margin-top: 8px}\n\n\n #saveastro {position:absolute;z-index:9999;top:360px;left:8px;width:230px;height:70px;\n border-radius: 4px; border: solid 1px #ccc; background: white; text-align:center}\n 
#saveastro p { margin: 0; padding: 0; margin-top: 16px}\n \n \n #donate {position:absolute;z-index:9999;top:170px;right:16px;width:220px;height:36px;\n border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n text-align:center;font-size:18px;background:white;color:#333}\n \n #donate a {text-decoration:none;color:green;font-size:inherit}\n\n #donatein {position:absolute;z-index:9999;top:220px;right:16px;width:220px;height:36px;\n border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n text-align:center;font-size:18px;background:green;color:#333}\n\n #donatein a {text-decoration:none;color:white;font-size:inherit}\n \n #banner {position:absolute;z-index:9999;top:50%;left:45px;width:250px;height:250px; padding: 0; border: solid 1px white; border-radius: 4px}\n \n </style>\n \n \n \n <script type = "text/javascript">\n window.onload = function() {\n var url = document.getElementById(\'url\');\n if (url.innerHTML.length > 77)\n url.innerHTML = url.innerHTML.substring(0,77) + \'...\';\n };\n </script>\n <div id = "top">\n \n <p class="agitation" style = "padding-top:12px">\n \xd0\xa1\xd1\x82\xd1\x80\xd0\xb0\xd0\xbd\xd0\xb8\xd1\x87\xd0\xba\xd0\xb0 \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb5\xd0\xba\xd1\x82\xd0\xb0 Sci-Hub \xd0\xb2 \xd1\x81\xd0\xbe\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x8b\xd1\x85 \xd1\x81\xd0\xb5\xd1\x82\xd1\x8f\xd1\x85 \xe2\x86\x92 <a target="_blank" href="https://vk.com/sci_hub">vk.com/sci_hub</a>\n </p>\n \n </div>\n \n <div id = "content">\n <iframe src = "http://moscow.sci-hub.io/202d9ebdfbb8c0c56964a31b2fdfe8e9/roerdink2016.pdf" id = "pdf"></iframe>\n </div>\n \n <div id = "donate">\n <p><a target = "_blank" href = "//sci-hub.io/donate">\xd0\xbf\xd0\xbe\xd0\xb4\xd0\xb4\xd0\xb5\xd1\x80\xd0\xb6\xd0\xb0\xd1\x82\xd1\x8c \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb5\xd0\xba\xd1\x82 →</a></p>\n </div>\n <div id = "donatein">\n <p><a target = "_blank" href = "//sci-hub.io/donate">support the project →</a></p>\n </div>\n <div id = "save">\n <p><a href = # 
onclick = "location.href=\'http://moscow.sci-hub.io/202d9ebdfbb8c0c56964a31b2fdfe8e9/roerdink2016.pdf?download=true\'">\xe2\x87\xa3 \xd1\x81\xd0\xbe\xd1\x85\xd1\x80\xd0\xb0\xd0\xbd\xd0\xb8\xd1\x82\xd1\x8c \xd1\x81\xd1\x82\xd0\xb0\xd1\x82\xd1\x8c\xd1\x8e</a></p>\n </div>\n <div id = "reload">\n <p><a href = "//sci-hub.io/reload/10.1016/j.anai.2016.01.022" target = "_blank">↻ \xd1\x81\xd0\xba\xd0\xb0\xd1\x87\xd0\xb0\xd1\x82\xd1\x8c \xd0\xb7\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbe</a></p>\n </div>\n \n \n<!-- Yandex.Metrika counter --> <script type="text/javascript"> (function (d, w, c) { (w[c] = w[c] || []).push(function() { try { w.yaCounter10183018 = new Ya.Metrika({ id:10183018, clickmap:true, trackLinks:true, accurateTrackBounce:true, ut:"noindex" }); } catch(e) { } }); var n = d.getElementsByTagName("script")[0], s = d.createElement("script"), f = function () { n.parentNode.insertBefore(s, n); }; s.type = "text/javascript"; s.async = true; s.src = "https://mc.yandex.ru/metrika/watch.js"; if (w.opera == "[object Opera]") { d.addEventListener("DOMContentLoaded", f, false); } else { f(); } })(document, window, "yandex_metrika_callbacks"); </script> <noscript><div><img src="https://mc.yandex.ru/watch/10183018?ut=noindex" style="position:absolute; left:-9999px;" alt="" /></div></noscript> <!-- /Yandex.Metrika counter -->\n </body>\n</html>\n'
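It does contain the URL, but I am not sure how to extract it.
Update:
I can now extract the URL, but when I try to access the page with the PDF (via urllib.request) I get a 403 response even though the URL is valid. Any ideas why, and how to fix it? (I can access it from a browser, so I am not IP-blocked.)
You can use urllib to access the page's HTML and even download the file, and a regular expression to find the URL of the file to download.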
import re
import urllib.request

site = urllib.request.urlopen(".../index.html")
data = site.read().decode("utf-8", errors="replace")  # page contents as a string
# Match whole http(s) URLs ending in .pdf; no capturing groups, so findall returns full URLs
files = re.findall(r'https?://[\w-]+(?:\.[\w-]+)+[\w.,@?^=%&:/~+#-]*\.pdf', data)
for file in files:
    urllib.request.urlretrieve(file, filepath)  # "filepath" is where you want to save it
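If the PDF address is known to sit in the src attribute of an iframe, as in the output shown in the question, parsing the HTML can be less fragile than a general URL regex. This is a minimal sketch using only the standard library; IframeSrcParser, iframe_sources and the example.com URL are illustrative names and values, not part of the original posts.

```python
from html.parser import HTMLParser


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src.strip())


def iframe_sources(html):
    """Return the src values of all iframes found in the HTML string."""
    parser = IframeSrcParser()
    parser.feed(html)
    return parser.sources


# Hypothetical page fragment shaped like the one in the question
sample = '<div id = "content"><iframe src = "http://example.com/files/paper.pdf" id = "pdf"></iframe></div>'
print(iframe_sources(sample))
```

With a requests response, the function would be called as iframe_sources(res.text), since the parser expects a string rather than bytes.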
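Here is the solution:
import re
import urllib.request
# res.text is the decoded HTML (res.content is bytes and would not match a str pattern)
url = re.search(r'<iframe src = "\s*([^"]+)"', res.text)
pdf_url = url.group(1)
urllib.request.urlretrieve(pdf_url, 'C:/.../Docs/test.pdf')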
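One possible cause of the 403 mentioned in the update (not verified for this site) is that some servers reject the default User-Agent that urllib sends. A minimal sketch, assuming that is the problem: build_request, download and the "Mozilla/5.0" string are illustrative names and values, not taken from the original posts.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, path):
    """Fetch url with the custom header and write the response body to path."""
    with urllib.request.urlopen(build_request(url)) as resp, open(path, "wb") as fh:
        fh.write(resp.read())
```

urlretrieve does not accept headers directly, which is why the sketch goes through urlopen with a Request object. If the server checks something else (cookies, a Referer header), this alone may not be enough.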
You can do this with some clunky code that requires selenium, requests and scrapy.
Use selenium to request the article by title or DOI:
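>>> from selenium import webdriver
>>> from selenium.webdriver.common.by import By
>>> driver = webdriver.Firefox()  # assumption: the original did not show how the driver was created
>>> driver.get("http://sci-hub.io/")
>>> input_box = driver.find_element(By.NAME, 'request')  # find_element_by_name was removed in Selenium 4
>>> input_box.send_keys('amazing scientific results\n')
>>> driver.current_url  # after the title query the URL is unchanged
'http://sci-hub.io/'
>>> driver.get("http://sci-hub.io/")
>>> input_box = driver.find_element(By.NAME, 'request')
>>> input_box.send_keys('DOI: 10.1016/j.anai.2016.01.022\n')
>>> driver.current_url  # after the DOI query the URL points at the article page
'http://sci-hub.io/10.1016/j.anai.2016.01.022'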
>>> import requests
>>> r = requests.get(driver.current_url)
>>> from scrapy.selector import Selector
>>> selector = Selector(text=r.text)
>>> pdf_url = selector.xpath('.//iframe/@src')[0].extract()
>>> r = requests.get(pdf_url).content
>>> open('article_name', 'wb').write(r)
211853
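The page in the question also contains protocol-relative links (href values starting with //), so an extracted src is not guaranteed to be an absolute URL. A small sketch of resolving it against the page address before downloading; absolute_url and the example.com addresses are hypothetical names used for illustration only.

```python
from urllib.parse import urljoin


def absolute_url(page_url, src):
    """Resolve an extracted src (absolute, relative or //host/path) against the page URL."""
    return urljoin(page_url, src.strip())


page = "http://example.com/10.1000/xyz"
print(absolute_url(page, "//files.example.com/a/paper.pdf"))
print(absolute_url(page, "/a/paper.pdf"))
print(absolute_url(page, "http://other.example.org/paper.pdf"))
```

In the session above, the call would be absolute_url(driver.current_url, pdf_url) before passing the result to requests.get.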
import re
import requests
from bs4 import BeautifulSoup

# Runs inside the question's loop, where row is the current CSV row
res = requests.get('http://sci-hub.io/' + str(row['DOI']))
useful = BeautifulSoup(res.content, "html5lib").find_all("iframe")
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(useful[0]))
response = requests.get(urls[0])
with open("C:\\Users\\Sangeeta's\\Downloads\\ref\\" + str(row['ref']) + '.pdf', 'wb') as fw:
    fw.write(response.content)
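Because the file name is built from the ref column, a value containing characters such as : or / would make open() fail or write to an unexpected place on Windows. A hedged sketch of building the target path more defensively; pdf_path and the sample values are hypothetical and not part of the original answer.

```python
import re
from pathlib import Path


def pdf_path(directory, ref):
    """Build <directory>/<ref>.pdf, replacing characters Windows forbids in file names."""
    safe = re.sub(r'[<>:"/\\|?*]', "_", str(ref)).strip()
    return Path(directory) / (safe + ".pdf")


target = pdf_path("downloads", "Smith 2016: a/b")
print(target.name)
```

The returned Path can be passed straight to open(target, 'wb') in place of the concatenated string.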