Scraping an aspx web page with Python using BeautifulSoup


I am trying to scrape this page:

The code is as follows:

import urllib
from bs4 import BeautifulSoup

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://www.indiapost.gov.in',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://www.nitt.edu/prm/nitreg/ShowRes.aspx',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'http://www.nitt.edu/prm/nitreg/ShowRes.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)
# parse and retrieve two vital form values
viewstate = soup.findAll("input", {"type": "hidden", "name": "__VIEWSTATE"})
eventvalidation = soup.findAll("input", {"type": "hidden", "name": "__EVENTVALIDATION"})

print viewstate[0]['value']





formData = (
     ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEENCRYPTED',''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)

encodedFields = urllib.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)

try:
    # actually we'd better use BeautifulSoup once again to
    # retrieve results(instead of writing out the whole HTML file)
    # Besides, since the result is split into multipages,
    # we need send more HTTP requests
    fout = open('tmp.html', 'w')
except:
    print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()
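I keep getting a server error:

Source Error:

An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.

Stack Trace: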

[FormatException: Invalid character in a Base-64 string.]
   System.Convert.FromBase64String(String s) +0
   System.Web.UI.LosFormatter.Deserialize(String input) +25
   System.Web.UI.Page.LoadPageStateFromPersistenceMedium() +101

[HttpException (0x80004005): Invalid_Viewstate
    Client IP: 10.0.0.166
    Port: 51915
    User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
    ViewState: [<input name="__VIEWSTATE" type="hidden" value="dDwtMTM3NzI1MDM3O3Q8O2w8aTwxPjs+O2w8dDw7bDxpPDE+O2k8Mj47PjtsPHQ8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+O2w8aTwxPjtpPDM+Oz47bDx0PDtsPGk8Mz47PjtsPHQ8O2w8aTwwPjs+O2w8dDw7bDxpPDE+Oz47bDx0PEAwPDs7Ozs7Ozs7Ozs+Ozs+Oz4+Oz4+Oz4+O3Q8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs+Oz4+O3Q8O2w8aTw5PjtpPDExPjs+O2w8dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDx0PHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Pjs7Pjs7Pjs+Pjs+Pjs+Pjs+zHrNhAd1tTLXbBUyAJRtS6omUc0="/>]
    Http-Referer: 
    Path: /prm/nitreg/ShowRes.aspx.]
   System.Web.UI.Page.LoadPageStateFromPersistenceMedium() +447
   System.Web.UI.Page.LoadPageViewState() +18
   System.Web.UI.Page.ProcessRequestMain() +447
The key message is "Invalid character in a Base-64 string." What is the problem?

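You are sending the ViewState input object instead of its value. The error output shows that the whole tag was submitted as the ViewState: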

ViewState: [<input name="__VIEWSTATE" type="hidden" value="dDwtMTM3NzI1MDM3O3Q8O2w8aTwxPjs+O2w8dDw7bDxpPDE+O2k8Mj47PjtsPHQ8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+O2w8aTwxPjtpPDM+Oz47bDx0PDtsPGk8Mz47PjtsPHQ8O2w8aTwwPjs+O2w8dDw7bDxpPDE+Oz47bDx0PEAwPDs7Ozs7Ozs7Ozs+Ozs+Oz4+Oz4+Oz4+O3Q8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs+Oz4+O3Q8O2w8aTw5PjtpPDExPjs+O2w8dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDx0PHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Pjs7Pjs7Pjs+Pjs+Pjs+Pjs+zHrNhAd1tTLXbBUyAJRtS6omUc0="/>]
Note that your eventvalidation value has the same problem; I fixed that as well.
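
To illustrate why the server rejects the request, here is a minimal sketch (written for Python 3, standard library only). It assumes the server performs a strict Base64 decode of the posted __VIEWSTATE, as the `System.Convert.FromBase64String` frame in the stack trace suggests. The helper `is_valid_base64` and the sample value are hypothetical and exist only for this illustration.

```python
import base64
import binascii


def is_valid_base64(text):
    """Return True if text is strictly valid Base64, False otherwise."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except binascii.Error:
        return False


# Hypothetical stand-in for the real __VIEWSTATE value.
value = base64.b64encode(b"demo viewstate").decode("ascii")

# Roughly what gets sent when the result list itself is posted
# instead of viewstate[0]['value'] (compare the error page above).
tag_list_text = '[<input name="__VIEWSTATE" type="hidden" value="%s"/>]' % value

print(is_valid_base64(value))          # the bare value decodes cleanly
print(is_valid_base64(tag_list_text))  # '[', '<', spaces etc. are not Base64
```

The first call prints True and the second prints False: only the contents of the `value` attribute form a valid Base64 string.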

Edit: __EVENTVALIDATION does not exist on that page. You just need to remove __EVENTVALIDATION from formData:

formData = (
    ('__VIEWSTATE', viewstate[0]['value']),
    ('__VIEWSTATEENCRYPTED',''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)
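
If you would rather not hard-code which hidden fields exist, one option is to collect whatever hidden inputs the page actually contains and post those back. The sketch below is only an illustration for Python 3: it swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra packages, and the names `HiddenInputCollector`, `build_form_data`, and `sample_html` are hypothetical. With BeautifulSoup, the same idea is to check whether a lookup found anything before indexing into the result.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect name -> value for every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            # Store the attribute value (a string), never the tag itself.
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_form_data(html, extra_fields):
    """Merge the page's hidden fields with the fields we fill in ourselves."""
    collector = HiddenInputCollector()
    collector.feed(html)
    form = dict(collector.fields)
    form.update(extra_fields)
    return form


# Hypothetical page that, like the one discussed, has no __EVENTVALIDATION.
sample_html = """
<form method="post" action="ShowRes.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc+/=" />
  <input type="text" name="TextBox1" />
  <input type="submit" name="Button1" value="Show" />
</form>
"""

form = build_form_data(sample_html, {"TextBox1": "106110006", "Button1": "Show"})
encoded = urlencode(form)
print(encoded)
```

Because only fields that are present get collected, a page without __EVENTVALIDATION simply produces form data without that key, and no list index is ever taken on an empty result.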

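Thanks. But I get this error in Python: IndexError: list index out of range.

That is because __EVENTVALIDATION does not exist at this link:

Thanks again. How do I proceed from here? There is an option selector box that uses JavaScript.

That is a separate question... You have to make another POST with the new __VIEWSTATE and a few other values: __EVENTTARGET: Dt1, TextBox1: 106110006, Dt1: 70.

OK. I still cannot get it to work. The code is as follows: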
formData = (
    ('__VIEWSTATE', viewstate[0]['value']),
    ('__VIEWSTATEENCRYPTED',''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)
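
The second POST described in the comments above could be assembled roughly as follows. This is a sketch for Python 3 under stated assumptions, not a verified solution for this site: the field names `Dt1` and `TextBox1` and the value `70` come from the comment, the helper `build_postback_data` is hypothetical, the empty `__EVENTARGUMENT` field is an assumption based on how ASP.NET postbacks are commonly submitted, and the viewstate string is a placeholder for the value parsed from the response to the first POST.

```python
from urllib.parse import urlencode


def build_postback_data(viewstate, event_target, fields, event_argument=""):
    """Build form data for a postback triggered by a control such as a dropdown."""
    data = [
        ("__EVENTTARGET", event_target),      # control that triggers the postback
        ("__EVENTARGUMENT", event_argument),  # assumption: sent empty
        ("__VIEWSTATE", viewstate),           # must be the NEW value, not the first one
    ]
    data.extend(fields.items())
    return data


second_post = build_postback_data(
    viewstate="NEW_VIEWSTATE_FROM_PREVIOUS_RESPONSE",  # placeholder, not a real value
    event_target="Dt1",
    fields={"TextBox1": "106110006", "Dt1": "70"},
)
encoded = urlencode(second_post)
print(encoded)
```

Note that `Button1` is left out here because the comment does not list it; whether the server also expects it is something to check against what a browser actually sends.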