Trying to scrape an .aspx site with Python, but the POST won't work
I'm trying to scrape with Python 3's urllib and parse the result with BeautifulSoup. However, even though I set every input field, the POST request just returns the same page. It should redirect the way it does when you fill in the fields and click the Search button in a browser, but it doesn't:
import urllib.request, urllib.parse, urllib.error
import socket, ssl
from bs4 import BeautifulSoup

ssl_context = ssl._create_unverified_context()
page_html = ''
get_req = urllib.request.Request('https://www.idfpr.com/applications/professionprofile/default.aspx',
                                 headers={
                                     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                                     'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
                                     'Content-Type': 'application/x-www-form-urlencoded',
                                     # 'Accept-Encoding': 'gzip,deflate,sdch',
                                     # 'Accept-Language': 'en-US,en;q=0.8',
                                     # 'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
                                 })
try:
    page_html = urllib.request.urlopen(get_req, context=ssl_context).read()
except socket.timeout:
    print("Request timed out. Moving on.")
    exit(1)
except urllib.error.URLError as e:
    print(e)
    exit(1)
except ssl.CertificateError as e:
    print(e)
    exit(1)

soup_dummy = BeautifulSoup(page_html, 'html5lib')

# parse and retrieve the vital hidden form values
lastfocus = soup_dummy.select("#__LASTFOCUS")[0]['value']
viewstate = soup_dummy.select("#__VIEWSTATE")[0]['value']
viewstategen = soup_dummy.select("#__VIEWSTATEGENERATOR")[0]['value']
eventvalidation = soup_dummy.findAll("input", {"type": "hidden", "name": "__EVENTVALIDATION"})[0]['value']
eventargument = soup_dummy.select('#__EVENTARGUMENT')[0]['value']

# build input list of doctors
doctors = [(0, "AKRAMI", "CYRUS")]
for doctor in doctors:  # iterate over doctors and search for them on the IL site
    formData = (
        ('__LASTFOCUS', lastfocus),
        ('__VIEWSTATE', viewstate),
        ('__VIEWSTATEGENERATOR', viewstategen),
        ('__EVENTTARGET', 'ctl00$ctl00$MainContent$MainContentContainer$Search'),
        ('__EVENTARGUMENT', eventargument),
        ('__EVENTVALIDATION', eventvalidation),
        ('ctl00$ctl00$MainContent$MainContentContainer$LastName', doctor[1]),
        ('ctl00$ctl00$MainContent$MainContentContainer$FirstName', doctor[2]),
        ('ctl00$ctl00$MainContent$MainContentContainer$ddlCounty', '0'),
        ('ctl00$ctl00$MainContent$MainContentContainer$City', ''),
        ('ctl00$ctl00$MainContent$MainContentContainer$ddlSpecialty', '0'),
        ('ctl00$ctl00$MainContent$MainContentContainer$SpecialtyKeyword', ''),
        ('ctl00$ctl00$MainContent$MainContentContainer$ddlHospitals', '0'),
        ('ctl00$ctl00$MainContent$MainContentContainer$Search', 'Search'),
        ('ctl00$ctl00$MainContent$MainContentContainer$Clear', 'Clear')
    )
    encodedFields = urllib.parse.urlencode(formData).encode('ascii')
    # second HTTP request with form data
    post_req = urllib.request.Request('https://www.idfpr.com/applications/professionprofile/default.aspx',
                                      data=encodedFields,
                                      headers={
                                          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                                          'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
                                          'Content-Type': 'application/x-www-form-urlencoded',
                                          # 'Accept-Encoding': 'gzip,deflate,sdch',
                                          # 'Accept-Language': 'en-US,en;q=0.8',
                                          # 'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
                                      })
    page_html = urllib.request.urlopen(post_req, data=encodedFields, context=ssl_context).read()
    soup = BeautifulSoup(page_html, "html5lib")
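One way to debug a WebForms POST that silently returns the same page is to dump every hidden input the server sent and compare those names against the fields being posted back. A minimal stdlib-only sketch (HiddenFieldParser is just an illustrative helper, not part of the code above):

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect every <input type="hidden"> name/value pair on a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == 'input' and a.get('type') == 'hidden' and 'name' in a:
            self.fields[a['name']] = a.get('value') or ''

p = HiddenFieldParser()
p.feed('<form><input type="hidden" name="__VIEWSTATE" value="abc">'
       '<input type="hidden" name="__EVENTVALIDATION" value="xyz"></form>')
# p.fields → {'__VIEWSTATE': 'abc', '__EVENTVALIDATION': 'xyz'}
```

Feeding the GET response through this and diffing `p.fields.keys()` against the names in formData shows at a glance whether a hidden field the server expects is missing from the POST.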
What am I missing? My guess is that it has something to do with __EVENTTARGET; I've read that you need to set it to the submit button being "clicked", in this case ctl00$ctl00$MainContent$MainContentContainer$Search, but that doesn't work.

The following works for me. Note that I'm using requests.Session():
import requests
from bs4 import BeautifulSoup as bs
import urllib3; urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

with requests.Session() as s:
    r = s.get('https://www.idfpr.com/applications/professionprofile/default.aspx', verify=False)
    soup = bs(r.content, 'lxml')
    vs = soup.select_one('#__VIEWSTATE')['value']
    ev = soup.select_one('#__EVENTVALIDATION')['value']
    vsg = soup.select_one('#__VIEWSTATEGENERATOR')['value']
    data = {
        '__LASTFOCUS': '',
        '__VIEWSTATE': vs,
        '__VIEWSTATEGENERATOR': vsg,
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__EVENTVALIDATION': ev,
        'ctl00$ctl00$MainContent$MainContentContainer$LastName': 'Alaraj',
        'ctl00$ctl00$MainContent$MainContentContainer$FirstName': 'Ali ',
        'ctl00$ctl00$MainContent$MainContentContainer$ddlCounty': '0',
        'ctl00$ctl00$MainContent$MainContentContainer$City': '',
        'ctl00$ctl00$MainContent$MainContentContainer$ddlSpecialty': '0',
        'ctl00$ctl00$MainContent$MainContentContainer$SpecialtyKeyword': '',
        'ctl00$ctl00$MainContent$MainContentContainer$ddlHospitals': '0',
        'ctl00$ctl00$MainContent$MainContentContainer$Search': 'Search'
    }
    r = s.post('https://www.idfpr.com/applications/professionprofile/default.aspx', data=data)
    soup = bs(r.content, 'lxml')
    print(soup.select_one('#MainContent_MainContentContainer_gvwProfiles').text)
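To run the question's list of doctors through this approach, the hidden fields should be re-scraped before each search, since __VIEWSTATE and __EVENTVALIDATION change between page loads. A sketch along those lines (build_payload and search_doctors are hypothetical helper names, not from the answer above):

```python
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.idfpr.com/applications/professionprofile/default.aspx'
PREFIX = 'ctl00$ctl00$MainContent$MainContentContainer$'

def build_payload(hidden, last_name, first_name):
    """Merge the scraped ASP.NET hidden fields with the search inputs."""
    payload = dict(hidden)  # __VIEWSTATE, __EVENTVALIDATION, ...
    payload.update({
        PREFIX + 'LastName': last_name,
        PREFIX + 'FirstName': first_name,
        PREFIX + 'ddlCounty': '0',
        PREFIX + 'City': '',
        PREFIX + 'ddlSpecialty': '0',
        PREFIX + 'SpecialtyKeyword': '',
        PREFIX + 'ddlHospitals': '0',
        PREFIX + 'Search': 'Search',  # the button being "clicked"; Clear is omitted
    })
    return payload

def search_doctors(doctors):
    """Yield (last_name, result soup) for each (id, last, first) tuple."""
    with requests.Session() as s:
        s.verify = False  # session-wide, so GET and POST both skip cert checks
        for _, last, first in doctors:
            soup = bs(s.get(URL).content, 'lxml')  # fresh hidden fields each time
            hidden = {i['name']: i.get('value', '')
                      for i in soup.select('input[type=hidden]')}
            r = s.post(URL, data=build_payload(hidden, last, first))
            yield last, bs(r.content, 'lxml')
```

Re-fetching the page per iteration is slower, but posting a stale __EVENTVALIDATION token is one of the classic ways a WebForms POST quietly fails.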
I'm not sure whether this is related to your problem, but I'm fairly sure that sending a field for the Clear button, as your current code does, serves no purpose.

When it failed to post, my first assumption was that it was because I hadn't defined a value for every field, so I made sure to scrape the values on page load and send them back, just in case. The error was there before I did that, though. Your code works, but I still don't understand why my earlier code didn't. Is it something about the form fields, how I handled SSL, or something else entirely? I set up an unverified SSL context. I wonder if it's because you didn't set verify=False on the POST request that yours failed and mine succeeded.

I don't think that makes a difference, though you could easily test it.
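On the verify=False point: a keyword passed to an individual s.get() call does not persist to later s.post() calls on the same session; to make it apply to every request, set it on the Session object itself. A small sketch:

```python
import requests

s = requests.Session()
s.verify = False  # every request through this session now skips certificate checks
s.headers.update({'User-Agent': 'Mozilla/5.0'})  # session-level headers persist the same way

# Both of these would inherit verify=False without repeating it per call:
# r = s.get('https://www.idfpr.com/applications/professionprofile/default.aspx')
# r = s.post('https://www.idfpr.com/applications/professionprofile/default.aspx', data={})
```

So in the answer's code the POST did in fact verify the certificate (only the GET passed verify=False), which supports the view that verification wasn't what broke the original version.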