Python 3.x: Getting the full HTML from a website request with Python
Tags: python-3.x, beautifulsoup, python-requests

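I am trying to send an HTTP request to a website (for example, Digikey) and read back the full HTML. For example, I request the search URL for a given part number, as in the code below. However, what I get back is not the full HTML.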

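import requests
from bs4 import BeautifulSoup

# Plain HTTP GET: returns the server's raw response without running any JavaScript.
r = requests.get('https://www.digikey.com/products/en?keywords=511-8002-KIT')
# Name the parser explicitly; bs4 otherwise warns and picks whichever parser is installed.
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())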

Output:

<!DOCTYPE html>
<html>
 <head>
  <script>
   var i10cdone =(function(){ function pingBeacon(msg){ var i10cimg = document.createElement('script'); i10cimg.src='/i10c@p1/botox/file/nv-loaded.js?status='+window.encodeURIComponent(msg); i10cimg.onload = function(){ (document.head || document.documentElement).removeChild(i10cimg) }; i10cimg.onerror = function(){ (document.head || document.documentElement).removeChild(i10cimg) }; ( document.head || document.documentElement).appendChild(i10cimg) }; pingBeacon('loaded'); if(String(document.cookie).indexOf('i10c.bdddb=c2-f0103ZLNqAeI3BH6yYOfG7TZlRtCrMwqUo')>=0) { document.cookie = 'i10c.bdddb=;path=/';}; var error=''; function errorHandler(e) { if (e && e.error && e.error.stack ) { error=e.error.stack; } else if( e && e.message ) { error = e.message; } else { error = 'unknown';}} if(window.addEventListener) { window.addEventListener('error',errorHandler, false); } else { if ( window.attachEvent ){ window.attachEvent('onerror',errorHandler); }} return function(){ if (window.removeEventListener) {window.removeEventListener('error',errorHandler); } else { if (window.detachEvent) { window.detachEvent('onerror',errorHandler); }} if(error) { pingBeacon('error-' + String(error).substring(0,500)); document.cookie='i10c.bdddb=c2-f0103ZLNqAeI3BH6yYOfG7TZlRtCrMwqUo;path=/'; }}; })();
  </script>
  <script src="/i10c@p1/client/latest/auto/instart.js?i10c.nv.bucket=pci&amp;i10c.nv.host=www.digikey.com&amp;i10c.opts=botox&amp;bcb=1" type="text/javascript">
  </script>
  <script type="text/javascript">
   INSTART.Init({"apiDomain":"assets.insnw.net","correlation_id":"1553546232:4907a9bdc85fe4e8","custName":"digikey","devJsExtraFlags":"{\"disableQuerySelectorInterception\" :true,  'rumDataConfigKey':'/instartlogic/clientdatacollector/getconfig/monitorprod.json','custName':'digikey','propName':'northamerica'}","disableInjectionXhr":true,"disableInjectionXhrQueryParam":"instart_disable_injection","iframeCommunicationTimeout":3000,"nanovisorGlobalNameSpace":"I10C","partialImage":false,"propName":"northamerica","rId":"0","release":"latest","rum":false,"serveNanovisorSameDomain":true,"third_party":["IA://www.digikey.com/js/geotargeting.js"],"useIframeRpc":false,"useWrapper":false,"ver":"auto","virtualDomains":4,"virtualizeDomains":["^auth\\.digikey\\.com$","^authtest\\.digikey\\.com$","^blocked\\.digikey\\.com$","^dynatrace\\.digikey\\.com$","^search\\.digikey\\.com$","^www\\.digikey\\.ca$","^www\\.digikey\\.com$","^www\\.digikey\\.com\\.mx$"]}
);
  </script>
  <script>
   typeof i10cdone === 'function' && i10cdone();
  </script>
 </head>
 <body>
  <script>
   setTimeout(function(){document.cookie="i10c.eac23=1";window.location.reload(true);},30);
  </script>
 </body>
</html>

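The reason I need the full HTML is to check whether specific keywords, such as "Lead free" or "Through Hole", appear in the results for a particular part number. I am doing this not only for Digikey but for other websites as well.

Any help would be appreciated.

Thanks

Edit:

Thank you all for the suggestions and answers. Adding more information here for others who are interested in this: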

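The part of the page you are looking for most likely contains content that is generated dynamically with JavaScript.

Open view-source:https://www.digikey.com/products/en?keywords=part_number in your browser and you will see that requests is fetching the full HTML; it just is not executing the JavaScript code.

If you right-click and choose Inspect (Chrome), you will see the final DOM that is created after the JavaScript code has executed.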
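
One way to tell the two cases apart programmatically is to check whether the response body contains any visible text at all. The following is a minimal sketch using only the standard library; the helper name `looks_like_js_stub` is hypothetical, and the heuristic (no text inside `<body>` outside of `<script>`/`<style>`) is an assumption that matches the output shown in the question, not a general bot-protection detector.

```python
from html.parser import HTMLParser


class _BodyTextCollector(HTMLParser):
    """Collect visible text inside <body>, ignoring <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is in <body> and not in a script/style.
        if self.in_body and not self.skip_depth and data.strip():
            self.text_parts.append(data.strip())


def looks_like_js_stub(html):
    """Return True if <body> holds no visible text outside scripts (hypothetical helper)."""
    collector = _BodyTextCollector()
    collector.feed(html)
    collector.close()
    return not collector.text_parts
```

Passing `r.text` from a requests call to `looks_like_js_stub` would then flag a response like the one in the question, whose body consists of a single reload script. A document with no `<body>` tag at all is also reported as a stub under this heuristic.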

To get the rendered content, you need a full web driver that can execute JavaScript and render the entire page.

Here is an example of how to achieve this with Selenium:


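The problem is probably that the page's JavaScript never gets a chance to run, so the necessary HTML elements are never populated. One solution is to use Selenium's webdriver: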

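from selenium import webdriver

# Launch a real Chrome instance so the page's JavaScript actually runs.
chrome = webdriver.Chrome()
chrome.get("https://www.digikey.com/products/en?keywords=511-8002-KIT")
source = chrome.page_source  # HTML of the DOM after scripts have executed
chrome.quit()  # close the browser when done
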
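This is generally much less efficient, because you have to wait for the page to load completely. One way around that is to look for APIs the website provides that give direct access to the data you want. I recommend doing some research into those APIs.

Here are some potential APIs you could use to get the data directly: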


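This is because the website is rendered with JavaScript, which means you need a browser to retrieve everything the scripts render.

Looking at the API, it seems to limit the number of searches you can run per day, and Selenium is very slow when searching thousands of parts. Thanks, though!

Selenium is not necessarily what makes it "slow"; the page running its scripts is. Selenium takes as long as the page takes to render. If you need it to be fast, then as mentioned above you either need to get the data directly (i.e. from an API) or simply wait for the page to render.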
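
Since rendering time is the main cost, one option is to reuse a single browser session for all part numbers and then run the keyword check on the rendered text. The following is a sketch under several assumptions: Selenium 4 with a local Chrome install, a `url_template` parameter supplied by the caller, and hypothetical helper names (`visible_text`, `find_keywords`, `check_parts`). It has not been run against Digikey, whose bot protection may still block an automated browser.

```python
import re
from html.parser import HTMLParser
from urllib.parse import quote


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Return the page text with runs of whitespace collapsed to single spaces."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return re.sub(r"\s+", " ", " ".join(extractor.parts)).strip()


def find_keywords(html, keywords):
    """Map each keyword to True/False for a case-insensitive match in the visible text."""
    text = visible_text(html).casefold()
    return {kw: kw.casefold() in text for kw in keywords}


def check_parts(part_numbers, keywords, url_template, timeout=20):
    """Render each search page in one shared Chrome session and check the keywords.

    url_template is an assumed parameter, e.g.
    "https://www.digikey.com/products/en?keywords={part}".
    """
    # Imported lazily so the pure helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    results = {}
    try:
        for part in part_numbers:
            driver.get(url_template.format(part=quote(part)))
            try:
                # Wait until the rendered DOM contains some visible text.
                WebDriverWait(driver, timeout).until(
                    lambda d: visible_text(d.page_source)
                )
                results[part] = find_keywords(driver.page_source, keywords)
            except TimeoutException:
                results[part] = None  # page never rendered visible text
    finally:
        driver.quit()
    return results
```

A call such as `check_parts(["511-8002-KIT"], ["Lead free", "Through Hole"], "https://www.digikey.com/products/en?keywords={part}")` would return one keyword dictionary per part number, or `None` for a page that never rendered. Sharing the session only saves the browser start-up cost; each page still takes as long as its scripts need.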