Python: How to find all hidden tab hrefs during web scraping?

Tags: python, json, python-3.x, web-scraping


On the right-hand side of the page there are several tabs that contain documents to view.

The underlying code is an <a> tag with a partial href pointing to the document's location. I have been trying to get all of these documents (their URLs usually start with "/documents/"), but without success.

When I scrape the page, I only seem to pick up the first set of documents, the ones found in the tab with the "Hearing Documents" table. Here is the snippet of code I am using to try to get all of the hrefs on this page:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.jud11.flcourts.org/Judge-Details?judgeid=1063&sectionid=2')
soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.find_all("a"):
    if link.has_attr('href'):
        print(link['href'])
The output contains only the documents from the first tab (in this case). Here is a snippet of it:

#collapse1
#collapse2
/documents/judges_forms/1062458802-Ex%20Parte%20Motions%20to%20Compel%20Discovery.pdf
/documents/judges_forms/1062459053-JointCaseMgtReport121.pdf
#collapse4
#collapse6
Does anyone know how to get the following links (listed below), which are also present on the same page? (I would suggest confirming this with the browser's Inspect Element feature, but it does not show them. You have to go to the tab with the "Harding Documents" table and then inspect the element.)

/documents/judges_forms/1422459010-Order%20Granting%20Motion%20to%20Withdraw.docx

/documents/judges_forms/1422459046-ORDER%20ON%20Attorneys%20Fees.docx


Thanks for your help.

You can use this example to get the links to the documents from the other tabs:

import requests
from bs4 import BeautifulSoup


url = 'https://www.jud11.flcourts.org/Judge-Details?judgeid=1063&sectionid=2'
headers = {'X-MicrosoftAjax': 'Delta=true',
           'X-Requested-With': 'XMLHttpRequest'}

with requests.session() as s:

    soup = BeautifulSoup(s.get(url).content, 'html.parser')

    # collect every hidden form field (__VIEWSTATE, __EVENTVALIDATION, ...)
    # so the postback below is accepted by the ASP.NET page
    data = {}
    for i in soup.select('input[name]'):
        data[i['name']] = i.get('value', '')

    for page in range(0, 6):
        print('Tab no.{}..'.format(page))
        # parameters that tell the Ajax tab strip which tab index to render
        data['ScriptManager'] = "ScriptManager|dnn$ctr1843$View$rtSectionHearingTypes"
        data['__EVENTARGUMENT'] = '{"type":0,"index":"' + str(page) + '"}'
        data['__EVENTTARGET'] = "dnn$ctr1843$View$rtSectionHearingTypes"
        data['dnn_ctr1843_View_rtSectionHearingTypes_ClientState'] = '{"selectedIndexes":["' + str(page) + '"],"logEntries":[],"scrollState":{}}'
        data['__ASYNCPOST'] = "true"
        data['RadAJAXControlID'] = "dnn_ctr1843_View_RadAjaxManager1"

        # post back and parse the partial (delta) response for document links
        soup = BeautifulSoup(s.post(url, headers=headers, data=data).content, 'html.parser')
        for a in soup.select('a[href*="documents"]'):
            print('https://www.jud11.flcourts.org' + a['href'])
Prints:

Tab no.0..
https://www.jud11.flcourts.org/documents/judges_forms/1062458802-Ex%20Parte%20Motions%20to%20Compel%20Discovery.pdf
https://www.jud11.flcourts.org/documents/judges_forms/1062459053-JointCaseMgtReport121.pdf
Tab no.1..
Tab no.2..
Tab no.3..
Tab no.4..
https://www.jud11.flcourts.org/documents/judges_forms/1422459010-Order%20Granting%20Motion%20to%20Withdraw.docx
https://www.jud11.flcourts.org/documents/judges_forms/1422459046-ORDER%20ON%20Attorneys%20Fees.docx
Tab no.5..
https://www.jud11.flcourts.org/documents/judges_forms/1512459051-Evidence%20Procedures.docx
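If you also want to save the documents locally once the links are collected, a minimal sketch along these lines should work. The "downloads" folder name is just a placeholder, and the example link is taken from the output above; in practice you would append the URLs printed by the scraper.

import os
import requests

# example link from the output above; replace with the collected URLs
links = [
    'https://www.jud11.flcourts.org/documents/judges_forms/1062459053-JointCaseMgtReport121.pdf',
]

os.makedirs('downloads', exist_ok=True)  # hypothetical output folder
with requests.Session() as s:
    for link in links:
        filename = os.path.join('downloads', link.rsplit('/', 1)[-1])
        r = s.get(link)
        r.raise_for_status()
        with open(filename, 'wb') as f:
            f.write(r.content)
        print('saved', filename)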

What do you mean by tabs? Can you post a screenshot of what you mean?

On the right-hand side of the site there are tabs such as "Motion Calendar", "Foreclosure Calendar", "Special Set", etc. I have edited my original post.

For complex layouts you may want to use selenium.

I looked at the documentation; is that the approach you would recommend for solving this (in the post above)? I have scraped web pages before but have never run into hidden underlying links. Do you know whether I can expose those links to figure out what I need to fetch in my code? Thanks.

@AndrejKesely, could you help me understand your code? I am looking at the data entries "ScriptManager", "__EVENTARGUMENT", and so on. What is it essentially doing?

@Stefano Those are HTTP POST parameters. The page uses Ajax to load the different tabs, and the code sets the correct parameters for each tab.
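As a follow-up to the selenium suggestion in the comments, a minimal sketch of that approach could look like the following. Note the assumptions: the 'a.rtsLink' tab selector is a guess based on how Telerik tab strips are usually rendered and may need adjusting after inspecting the page, and the fixed sleep is a crude stand-in for an explicit wait.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.jud11.flcourts.org/Judge-Details?judgeid=1063&sectionid=2'

driver = webdriver.Chrome()  # assumes a matching chromedriver is available
driver.get(url)

# Assumption: the tab headers are anchors with class 'rtsLink';
# adjust the selector if the page uses something else.
tab_count = len(driver.find_elements(By.CSS_SELECTOR, 'a.rtsLink'))

found = set()
for i in range(tab_count):
    # re-locate the tabs each time, since the Ajax update can replace the DOM
    driver.find_elements(By.CSS_SELECTOR, 'a.rtsLink')[i].click()
    time.sleep(2)  # crude wait for the tab content to load
    for a in driver.find_elements(By.CSS_SELECTOR, 'a[href*="documents"]'):
        found.add(a.get_attribute('href'))

driver.quit()

for link in sorted(found):
    print(link)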