Extracting a PDF link from a web page where the anchor has href="#", using Python: AJAX POST request does not return the expected result


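I am trying to download a PDF from a website (the goal is to automate the process), and I have tried many different approaches. I am currently using Python with selenium/phantomjs to first find the PDF's href link in the page source, then use something like wget to download the PDF and store it on my local drive.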

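I can find all the href links on the page with `find_elements_by_xpath("//a/@href")`, or narrow the search to a single element with `find_elements_by_link_text('Active Saver')`, and then print the result of the `get_attribute('href')` method, but it does not show the correct link.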

This is the source element, an a tag, whose link I need:

<a href="#" data-ng-mouseup="loadProductSheetPdf($event, download.ProductType)" target="_blank" data-ng-click="$event.preventDefault()" analytics-event="{event_name:'file_download', event_category: 'download', event_label:'product summary'}" class="ng-binding ng-isolate-scope">Active Saver</a>

That is not the link to the PDF. I know this because when I open the page in Firefox and inspect the element, I can see the source as it looks after the JavaScript has executed:

<a href="https://bupaanzstdhtauspub01.blob.core.windows.net/productfiles/J6_ActiveSaver_NSWACT_20180401_000000.pdf" data-ng-mouseup="loadProductSheetPdf($event, download.ProductType)" target="_blank" data-ng-click="$event.preventDefault()" analytics-event="{event_name:'file_download', event_category: 'download', event_label:'product summary'}" class="ng-binding ng-isolate-scope">Active Saver</a>

However, when I call response.json() on the response to the POST request, I keep getting an error saying that no JSON object could be decoded. Looking at the response with response.text shows:
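The difference between the two snippets can be checked without a browser. The sketch below, assuming Python 3 and only the standard library, parses static markup and reports anchors whose href is still the "#" placeholder while a data-ng-mouseup handler is attached. Those are the links that a plain scrape of the page source cannot resolve. The names PlaceholderLinkFinder and find_links, and the shortened sample markup, are illustrative and not taken from the site.

```python
from html.parser import HTMLParser


class PlaceholderLinkFinder(HTMLParser):
    """Collect <a> tags whose href is '#' but which carry an Angular handler."""

    def __init__(self):
        super().__init__()
        self.placeholders = []  # handler expressions of unresolved links
        self.resolved = []      # real href values present in the markup

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href == "#" and "data-ng-mouseup" in attributes:
            self.placeholders.append(attributes["data-ng-mouseup"])
        elif href:
            self.resolved.append(href)


def find_links(html):
    """Return (placeholder handlers, resolved hrefs) found in the markup."""
    finder = PlaceholderLinkFinder()
    finder.feed(html)
    return finder.placeholders, finder.resolved


# Shortened version of the static source element from the question.
STATIC_SOURCE = (
    '<a href="#" data-ng-mouseup="loadProductSheetPdf($event, '
    'download.ProductType)" target="_blank">Active Saver</a>'
)
placeholders, resolved = find_links(STATIC_SOURCE)
print(placeholders)  # the handler that fills in the real link at runtime
print(resolved)      # empty: the static markup holds no usable href
```

A non-empty placeholder list is a hint that the real URL comes from a script or an AJAX call, so the request made by that handler is the thing to reproduce.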

 u'<html>\r\n<head>\r\n<META NAME="robots" CONTENT="noindex,nofollow">\r\n<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3">\r\n</script>\r\n<script>\r\n(function() { \r\nvar z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E6361707375
6C615F5265736F757263653F535748414E45444C3D363634323839373431333131303432323133352C353234303631363938363836323232363836382C393038303935393835353935393539353435312C31303035363336222C66616C7365293B7868722E73656E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval(\'String.fromCharCode(\'+z+\')\'));})();\r\n</script></head>\r\n<body>\r\n<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>\r\n</body></html>'
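The body above is an HTML page served by Incapsula rather than JSON, which is why decoding fails. Its inline script stores JavaScript as a string of two-digit hexadecimal character codes and rebuilds it with String.fromCharCode before evaluating it. A minimal Python 3 sketch of that decoding step, applied only to the first 24 hex digits of the payload shown above; the function name decode_hex_payload is illustrative.

```python
def decode_hex_payload(hex_text):
    """Turn a string of two-digit hex character codes into readable text."""
    # bytes.fromhex reads two hex digits per byte, mirroring the
    # parseInt(b.substring(i, i + 2), 16) loop in the page's script.
    return bytes.fromhex(hex_text).decode("latin-1")


# First 24 hex digits of the variable b in the response above.
prefix = "7472797B766172207868723B"
print(decode_hex_payload(prefix))  # prints: try{var xhr;
```

Decoding the full string in the same way shows what the challenge script does, which helps confirm that the server answered with a bot check instead of the API response.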

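I was able to solve this by making sure that both the header information and the request payload (data) sent with the POST request were complete and accurate (I took them from the web console in the Firefox developer tools). Once I could receive the response data for the POST request, extracting the URL of the PDF file I wanted to download was relatively simple. I then downloaded the PDF with urlretrieve from the urllib module. I modelled my script on an existing script. However, I also ended up using urllib2.Request from the urllib2 module instead of requests.post from the requests module. For some reason, the urllib2 module worked more consistently than the requests module. My working code ended up as the two methods in the last code block of this post (both methods come from my class object, but they show the working code).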


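Comments:

Possible duplicate of a question about why tag names are trimmed from the HTML provided in a question?

I only signed up to Stack Overflow today, so I am still working out how to use it. For some reason, when I copy and paste the full link, it does not post the full link with its tags (only the innerHTML).

OK, welcome to Stack Overflow! As a new member, you may want to read the documentation at some point. Formatting help and a preview are available while you enter a question. Because HTML is a valid formatting option, HTML code disappears from view; to prevent this, use `code ticks` (when inline) or code blocks. I edited your post to use the latter, because it is a rather long line of data, but still a single line.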
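# Code from the question: POST the payload and try to decode the reply as JSON.
import requests
from lxml.etree import fromstring

url = "post_url"  # placeholder for the POST endpoint
data = {}  # placeholder: the data dictionary copied from the browser dev tools
response = requests.post(url, data)
response.json()  # raises the "no JSON object could be decoded" error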
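The JSON decoding error in the question comes from calling response.json() on an HTML body. One way to make that failure easier to diagnose is to inspect the text before decoding it. A minimal Python 3 sketch, assuming the response text is already in hand; the function name parse_json_body and the check for the "_Incapsula_Resource" marker (a string that appears in the response shown in the question) are illustrative choices, not part of the requests library.

```python
import json


def parse_json_body(text):
    """Return the decoded JSON, or raise a ValueError that explains the reply."""
    try:
        return json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError in Python 3.
        if "_Incapsula_Resource" in text:
            raise ValueError("got an Incapsula challenge page instead of JSON")
        raise ValueError("response body is not valid JSON")


print(parse_json_body('{"FileName": "example.pdf"}'))
```

With requests this would be called as parse_json_body(response.text), so a blocked request is reported as such instead of as a generic decoding error.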
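# Working code from the answer (Python 2, because it relies on urllib2).
# The import lines are assumed from the names used in the two methods.
import json
import os
import urllib as ul
import urllib2

# ... (rest of the class omitted; both methods below belong to it)

    def post_request(self, url, data):
        """POST a JSON payload and return the decoded JSON response."""
        self.data = data
        self.url = url
        req = urllib2.Request(self.url)
        req.add_header('Content-Type', 'application/json')
        res = urllib2.urlopen(req, self.data)
        return json.load(res)

    def get_pdf(self):
        link = 'https://www.bupa.com.au/api/cover/datasheets/search'
        directory = '/Users/U1085012/OneDrive/PDS data project/Bupa/PDS Files/'
        excess = [None, 0, 50, 100, 500]

        # singles (get_product_names_singles is defined elsewhere)
        for product in get_product_names_singles():
            self.search_request['PackageEntityName'] = product
            print product
            # product names containing 'extras' use product type 2, all others 1
            if 'extras' in product:
                self.search_request['ProductType'] = 2
            else:
                self.search_request['ProductType'] = 1

            # try each excess value until the API accepts one
            output = None
            for value in excess:
                try:
                    self.search_request['Excess'] = value
                    payload = json.dumps(self.search_request)
                    output = self.post_request(link, payload)
                except urllib2.HTTPError:
                    continue
                else:
                    break
            if output is None:
                # every excess value was rejected, so there is nothing to download
                continue

            path = output['FilePath'].encode('ascii')
            file_name = output['FileName'].encode('ascii')
            # retrieve the file only if it does not already exist locally
            if not os.path.exists(directory + file_name):
                ul.urlretrieve(path, directory + file_name)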
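For readers on Python 3, where urllib2 no longer exists, the same flow can be sketched with urllib.request. This is an untested adaptation under stated assumptions and not the answer's own code: the function names are hypothetical, the endpoint and the FilePath and FileName keys are taken from the code above, and the site may still require the complete headers mentioned in the answer.

```python
import json
import os
import urllib.request

# Endpoint used in the answer's get_pdf method.
SEARCH_URL = "https://www.bupa.com.au/api/cover/datasheets/search"


def build_search_request(url, search_request):
    """Build a JSON POST request, like post_request() in the answer."""
    body = json.dumps(search_request).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    # Passing data makes urllib send the request as a POST.
    return urllib.request.Request(url, data=body, headers=headers)


def fetch_datasheet_info(url, search_request):
    """Send the request and decode the JSON reply (performs network I/O)."""
    request = build_search_request(url, search_request)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def download_if_missing(file_url, directory, file_name):
    """Download file_url into directory unless the file is already there."""
    target = os.path.join(directory, file_name)
    if not os.path.exists(target):
        urllib.request.urlretrieve(file_url, target)
    return target


# Intended use (not executed here because it needs network access):
#   info = fetch_datasheet_info(SEARCH_URL, search_request)
#   download_if_missing(info["FilePath"], directory, info["FileName"])
```

Keeping the request construction separate from the network call makes it possible to compare the headers and payload against what the browser's developer tools show before anything is sent.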