Python: scraping a "hidden" table from a web page


I am trying to get the table at this URL: . I tried to read it with requests and BeautifulSoup:

from bs4 import BeautifulSoup as bs
import requests
s = requests.session()
req = s.get('https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2', headers={
"User-Agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
               "Chrome/51.0.2704.103 Safari/537.36"})
soup = bs(req.content, 'html.parser')
table = soup.find('table')
However, all I get back is the table header:

<table class="table">
<caption class="pl8">Ricoverati e posti letto in area non critica e terapia intensiva.</caption>
<thead>
<tr>
<th class="cella-tabella-sm align-middle text-center" scope="col">Regioni</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">Ricoverati in Area Non Critica</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL in Area Non Critica</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">Ricoverati in Terapia intensiva</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL in Terapia Intensiva</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL Terapia Intensiva attivabili</th>
</tr>
</thead>
<tbody id="tab2_body">
</tbody>
</table>
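
Just to confirm it is not a parsing issue, the body of the table really is empty in the raw HTML (a quick check reusing the soup object from the snippet above):

# The <tbody id="tab2_body"> is present but has no rows in the static HTML,
# so the rows are presumably injected later by JavaScript/XHR.
tbody = soup.find('tbody', id='tab2_body')
print(tbody)                 # <tbody id="tab2_body"></tbody>
print(tbody.find_all('tr'))  # [] -> no rows in the raw response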
Is there a way around this? Thanks.

The "secret" headers you need are actually embedded in a <script> tag on the page. So you can fish them out, parse them as JSON, and use them in the request headers.

Here's how:

import json
import re
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/89.0.4389.90 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

with requests.Session() as s:
    end_point = "https://Agenas:tab2-19@www.agenas.gov.it/covid19/web/index.php?r=json%2fta2"
    regular_page = "https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2"
    # Fetch the regular page and grab the last <script> tag, which holds the AJAX config.
    html = s.get(regular_page, headers=headers).text
    soup = BeautifulSoup(html, "html.parser").find_all("script")[-1].string
    # Pull the headers object out of the JavaScript and parse it as JSON.
    hacked_payload = json.loads(
        re.search(r"headers:\s({.*}),", soup, re.S).group(1).strip()
    )
    headers.update(hacked_payload)
    # Query the JSON endpoint with the "secret" headers added.
    print(json.dumps(s.get(end_point, headers=headers).json(), indent=2))
Output:

[
  {
    "regione": "Abruzzo",
    "dato1": "667",
    "dato2": "1495",
    "dato3": "89",
    "dato4": "215",
    "dato5": "0"
  },
  {
    "regione": "Basilicata",
    "dato1": "164",
    "dato2": "426",
    "dato3": "12",
    "dato4": "88",
    "dato5": "13"
  },

and so on ...
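
Since the endpoint returns plain JSON, you could also load it straight into a pandas DataFrame if that is more convenient. A minimal sketch, assuming pandas is installed and run inside the with-block above (reusing s, end_point and headers); mapping dato1..dato5 to the table headers follows the column order in the <thead> and is my assumption:

import pandas as pd

# Capture the JSON instead of printing it, then build a DataFrame from it.
data = s.get(end_point, headers=headers).json()
df = pd.DataFrame(data)
# Optional: rename the generic dato* columns to the on-page headers (assumed mapping).
df = df.rename(columns={
    "regione": "Regioni",
    "dato1": "Ricoverati in Area Non Critica",
    "dato2": "PL in Area Non Critica",
    "dato3": "Ricoverati in Terapia intensiva",
    "dato4": "PL in Terapia Intensiva",
    "dato5": "PL Terapia Intensiva attivabili",
})
print(df.head())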

If the data in the table is loaded dynamically with JavaScript, you may have to use Selenium. @Jortega FYI, this can be done without the heavy artillery of Selenium. @Panzeroto please remember to mark the answer that solved your problem. See
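
For completeness, if you did want the Selenium route mentioned in the comments, a minimal sketch might look like this (it assumes Chrome and a matching chromedriver are available; the selector comes from the HTML shown above):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2")
    # Wait until JavaScript has filled <tbody id="tab2_body"> with rows.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#tab2_body tr"))
    )
    for row in driver.find_elements(By.CSS_SELECTOR, "#tab2_body tr"):
        print([td.text for td in row.find_elements(By.TAG_NAME, "td")])
finally:
    driver.quit()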