Python - can't keep parsing names from the next pages using requests


I've created a script to parse the different names from a table in that site. The script can scrape the names from the landing page. What I can't do is scrape the names from the next pages as well.

To generate the results on that site manually, just press the Start Search button without changing anything.

This is what I've tried so far:

import requests
from bs4 import BeautifulSoup

link = 'https://hsapps.azdhs.gov/ls/sod/SearchProv.aspx?type=DD'

payload = {
    'ctl00$ContentPlaceHolder1$btnSubmit1': 'Start Search'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload['__VIEWSTATE'] = soup.select_one('#__VIEWSTATE')['value']
    payload['__EVENTVALIDATION'] = soup.select_one('#__EVENTVALIDATION')['value']
    r = s.post(link, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
        item_name = item.select("td")[1].text
        print(item_name)
How can I keep parsing the names from the next pages using requests?


This site uses a different url for the first page and for all the other pages.

I used the Google Chrome console to figure out the format of the data sent in the request for a paginated page:

  • Go to the web page and press Start Search
  • Open the Google Chrome console
  • Click on a page number in the pagination
  • Inspect the pagination request
  • Grab all the POST parameters from the console and find them in the previous page's HTML
  • Below is the final Python code

    
    import requests
    from bs4 import BeautifulSoup
    
    link = 'https://hsapps.azdhs.gov/ls/sod/SearchProv.aspx?type=DD'
    
    payload = {
        'ctl00$ContentPlaceHolder1$btnSubmit1': 'Start Search'
    }
    
    # Number of pages in the site's page dropdown at the time of writing
    PAGE_COUNT = 86
    
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
        r = s.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        # ASP.NET WebForms requires these hidden state fields in every postback
        payload['__VIEWSTATE'] = soup.select_one('#__VIEWSTATE')['value']
        payload['__EVENTVALIDATION'] = soup.select_one('#__EVENTVALIDATION')['value']
        r = s.post(link, data=payload)
        soup = BeautifulSoup(r.text, "lxml")
        for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
            item_name = item.select("td")[1].text
            print(item_name)
    
        # Every page after the first is served from a different url
        link = "https://hsapps.azdhs.gov/ls/sod/Provider.aspx?type=DD"
    
        for page in range(2, PAGE_COUNT + 1):
            # Rebuild the postback payload from the page we just fetched
            payload = {
                '__VIEWSTATE': soup.select_one('#__VIEWSTATE')['value'],
                '__EVENTVALIDATION': soup.select_one('#__EVENTVALIDATION')['value'],
                '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$ddPage',
                '__VIEWSTATEENCRYPTED': '',
                '__EVENTARGUMENT': '',
                '__LASTFOCUS': '',
                '__VIEWSTATEGENERATOR': soup.select_one('#__VIEWSTATEGENERATOR')['value'],
                'ctl00$ContentPlaceHolder1$HiddenField1':
                    soup.select_one('#ctl00_ContentPlaceHolder1_HiddenField1')['value'],
                'ctl00$ContentPlaceHolder1$HiddenField2':
                    soup.select_one('#ctl00_ContentPlaceHolder1_HiddenField2')['value'],
                'ctl00$ContentPlaceHolder1$ddPage': str(page),
            }
            r = s.post(link, data=payload)
            soup = BeautifulSoup(r.text, "lxml")
            for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
                item_name = item.select("td")[1].text
                print(item_name)
    

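Rather than hardcoding `PAGE_COUNT = 86` (which silently breaks when the site adds or removes results), you could count the `<option>` entries in the page dropdown on the first result page. A minimal sketch, assuming the dropdown keeps the id `ctl00_ContentPlaceHolder1_ddPage` seen in the postback payload above; the HTML sample below is a made-up snippet, not the site's actual markup:

```python
from bs4 import BeautifulSoup

def page_count(html):
    """Count the <option> entries in the ddPage page-selector dropdown."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select("select#ctl00_ContentPlaceHolder1_ddPage option"))

# Made-up sample markup mimicking the page dropdown
sample = """
<select id="ctl00_ContentPlaceHolder1_ddPage">
  <option value="1">1</option>
  <option value="2">2</option>
  <option value="3">3</option>
</select>
"""
print(page_count(sample))  # 3
```

In the script above you would then set `PAGE_COUNT = page_count(r.text)` right after the first POST, instead of fixing it to 86.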