Python: requesting an HTML table inside an iframe

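How can I scrape the table at this link using requests? I tried requests, but because the table sits inside an iframe, the HTML that comes back is incomplete. I only need the HTML of the table; once I have it, I think I can handle the rest with BeautifulSoup. Here is the code I am using: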

import requests

url = 'https://www.rad.cvm.gov.br/ENETCONSULTA/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=89180&CodigoTipoInstituicao=2'
# verify=False skips TLS certificate verification for this request.
resp = requests.get(url, verify=False)

The best way to do this is to use Selenium: wait a few seconds until the iframe has loaded, then capture the iframe's content.

Here is an example of how to do that:

from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = 'https://www.rad.cvm.gov.br/ENETCONSULTA/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=89180&CodigoTipoInstituicao=2'
options = Options()
# Uncomment the following two lines to run in headless mode.
# options.add_argument('--headless')
# options.add_argument('--disable-gpu')
options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")
# /usr/bin/chromedriver is the path where I've installed chromedriver.
# Selenium 4 takes the driver path through Service and the options through `options=`.
driver = webdriver.Chrome(service=Service('/usr/bin/chromedriver'), options=options)
driver.get(url)
# Wait until the iframe loads (a fixed delay; an explicit wait would be more robust).
sleep(5)
# The parent page's HTML does not contain the iframe's document, so switch into the frame first.
# Assumption: the table lives in the first <iframe> on the page.
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'iframe'))
html = driver.page_source
# `html` now holds the loaded HTML of the iframe; parse it with a library such as bs4 to extract the table.
driver.quit()

If you don't want to use Selenium, you can use this script to load the table with requests:

Prints:

<table id="ctl00_cphPopUp_tbDados">
 <tr>
  <td style="padding:8px 5px 8px 5px; background:#cccfd1; border-bottom:1px solid #fff !important; text-align:center; color:#ffffff; font:normal normal bold 12px 'Trebuchet MS', sans-serif;">
   Conta
  </td>
  <td style="padding:8px 5px 8px 5px; background:#cccfd1; border-bottom:1px solid #fff !important; text-align:center; color:#ffffff; font:normal normal bold 12px 'Trebuchet MS', sans-serif;">
   Descrição
  </td>
  <td style="padding:8px 5px 8px 5px; background:#cccfd1; border-bottom:1px solid #fff !important; text-align:center; color:#ffffff; font:normal normal bold 12px 'Trebuchet MS', sans-serif;">
   01/07/2019
   <br/>
   a
   <br/>
   30/09/2019
  </td>

... and so on.
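Once the table's HTML is available, its cells can be pulled out row by row. The question plans to use BeautifulSoup for this; the sketch below swaps in the standard library's `html.parser` so it runs without extra packages. The class name `TableTextParser` and the abbreviated `sample` markup are illustrative assumptions; only the table id `ctl00_cphPopUp_tbDados` and the cell texts come from the output above.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of each cell of the table with a given id, row by row."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0     # 0 = outside the target table, 1 = directly inside it
        self.rows = []     # list of rows, each a list of cell strings
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Start counting at the target table; count nested tables too.
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth == 1 and tag == "tr":
            self.rows.append([])
        elif self.depth == 1 and tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears without a <tr>
                self.rows.append([])
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif self.depth == 1 and tag in ("td", "th") and self._cell is not None:
            # Join fragments separated by <br/> with single spaces.
            self.rows[-1].append(" ".join(self._cell))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())


# Abbreviated stand-in for the real response (inline styles shortened).
sample = """
<table id="ctl00_cphPopUp_tbDados">
 <tr>
  <td style="text-align:center;">Conta</td>
  <td style="text-align:center;">Descrição</td>
  <td style="text-align:center;">01/07/2019<br/>a<br/>30/09/2019</td>
 </tr>
</table>
"""

parser = TableTextParser("ctl00_cphPopUp_tbDados")
parser.feed(sample)
print(parser.rows)
```

Cell text split by `<br/>` is joined with single spaces, so the date range in the third header cell comes out as one string.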

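Is the use of requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS and verify=False because you are behind a firewall?

@QHarr Apparently this server, http://www.rad.cvm.gov.br/, uses weak, insecure ciphers, so requests cannot connect without it. I found a recipe on Stack Overflow to work around it.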
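A cipher list can also be set on a single connection instead of through a library-wide default. The sketch below is not the recipe the comment refers to: it uses the standard library's `ssl` and `urllib.request` rather than requests, and the cipher string `HIGH:!DH:!aNULL` along with the helper names `make_legacy_context` and `fetch` are assumptions for illustration. Whether that string suits this particular server is untested here. Disabling certificate checks, like `verify=False`, leaves the connection open to interception, so it only makes sense for public data.

```python
import ssl
import urllib.request

# Assumption: an illustrative OpenSSL cipher string (no DH key exchange,
# no anonymous suites). The string a given server needs may differ.
LEGACY_CIPHERS = "HIGH:!DH:!aNULL"


def make_legacy_context(ciphers=LEGACY_CIPHERS):
    """Build an SSL context that skips certificate checks and uses `ciphers`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be turned off before verify_mode
    ctx.verify_mode = ssl.CERT_NONE  # same effect as verify=False in requests
    ctx.set_ciphers(ciphers)
    return ctx


def fetch(url, timeout=30):
    """Download `url` with the relaxed context and return the raw bytes."""
    ctx = make_legacy_context()
    with urllib.request.urlopen(url, context=ctx, timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch(url)` with the URL from the question would return the page as bytes, which can then be decoded and parsed.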
