Python: Logging in to a web page using BeautifulSoup and Mechanize


I am trying to log in to a web page programmatically using BeautifulSoup and Mechanize.

Here is my code:

#import urllib2
from mechanize import Browser, _http, urlopen
from BeautifulSoup import BeautifulSoup
import cookielib

data_url = "http://data.theice.com/ViewData/EndOfDay/LdnOptions.aspx?p=AER"

def are_we_logged_on(html):
    soup = BeautifulSoup(html)
    elem = soup.find("input", {"id" : "ctl00_ContentPlaceHolder1_LoginControl_m_userName" } )
    return elem is None


# Browser
br = Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
#br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follow refresh 0, but do not hang on refresh > 0
br.set_handle_refresh(_http.HTTPRefreshProcessor(), max_time=1)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0')]

# The site we will navigate into, handling its session
response = br.open(data_url)
html = response.get_data()

# do we need to log in?
logged_on = are_we_logged_on(html)


if not logged_on :
    print "DEBUG: Attempting to log in ..."
    # Select the first (index zero) form
    br.select_form(nr=0)

    # User credentials
    br.form['ctl00$ContentPlaceHolder1$LoginControl$m_userName'] = 'username'
    br.form['ctl00$ContentPlaceHolder1$LoginControl$m_password'] = 'password'

    # Login
    post_url, post_data, headers =  br.form.click_request_data()
    print post_url
    print post_data
    print headers
    resp = urlopen(post_url, post_data)

    # Check if login successful
    html2 = resp.read()
    logged_on = are_we_logged_on(html2)

    if not logged_on:
        with open("icedump_fail.html","w") as f:
            f.write(html2)        
        print "DEBUG: Failed to logon. Aborting script ...!"
        exit(-1)


# If we got this far, then we are logged in ...

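When I run the script, the execution path always ends with the "Failed to logon" message being printed to the screen.

Can anyone see what I am doing wrong? I have run out of ideas and could use a fresh pair of eyes.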

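Turning on debug mode (br.set_debug_http(True)) helped me inspect the underlying request that mechanize sends when the login form is submitted, and compare it with the actual request sent when logging in with a browser.

This showed that the __EVENTTARGET parameter was being sent empty, when it should not be.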
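The comparison described above can be done by eye, but a small helper makes the differing field easier to spot. The following is a minimal Python 3 sketch (standard library only), assuming both request bodies have been copied out as URL-encoded strings. The function name diff_post_bodies and the sample field values are made up for illustration and are not taken from the ICE site.

```python
from urllib.parse import parse_qs


def diff_post_bodies(script_body, browser_body):
    """Return {field: (script_value, browser_value)} for fields that differ.

    Both arguments are application/x-www-form-urlencoded strings.
    A field missing from one side is reported as None on that side.
    """
    # keep_blank_values=True keeps fields such as "__EVENTTARGET=" that are
    # present but empty; by default parse_qs would silently drop them.
    script_fields = parse_qs(script_body, keep_blank_values=True)
    browser_fields = parse_qs(browser_body, keep_blank_values=True)

    differences = {}
    for name in sorted(set(script_fields) | set(browser_fields)):
        script_value = script_fields.get(name)
        browser_value = browser_fields.get(name)
        if script_value != browser_value:
            differences[name] = (script_value, browser_value)
    return differences


# Hypothetical, shortened bodies; a real ASP.NET form carries more fields.
sent_by_script = "__EVENTTARGET=&user=alice&pass=secret"
sent_by_browser = "__EVENTTARGET=ctl00%24LoginButton&user=alice&pass=secret"

for field, (ours, theirs) in diff_post_bodies(sent_by_script, sent_by_browser).items():
    print(field, "script:", ours, "browser:", theirs)
```

The body sent by the script can be taken from the mechanize debug output, and the body sent by the browser from the browser's developer tools.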

Here is the fixed part of the code that solved the problem for me:

br.select_form(nr=0)
br.form.set_all_readonly(False)

br.form['ctl00$ContentPlaceHolder1$LoginControl$m_userName'] = 'username'
br.form['ctl00$ContentPlaceHolder1$LoginControl$m_password'] = 'password'
br.form['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$LoginControl$LoginButton'

# Login
response = br.submit()
html2 = response.read()
logged_on = are_we_logged_on(html2)
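Some background on why the __EVENTTARGET assignment matters: on ASP.NET Web Forms pages, controls such as link buttons submit the form through a JavaScript function, __doPostBack, which writes the control's name into the hidden __EVENTTARGET field before submitting. mechanize does not run JavaScript, so the field stays empty unless the script fills it in. The sketch below (Python 3, standard library only) mimics that step on a tiny made-up form. HiddenFieldCollector, build_postback_fields and the sample HTML are illustrative names, not part of mechanize or of the ICE site.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            # A missing value attribute is submitted as an empty string.
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_postback_fields(html, event_target, event_argument=""):
    """Return the hidden fields as they would look after __doPostBack ran."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    fields = dict(collector.fields)
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return fields


# Made-up page fragment, for illustration only.
sample_html = """
<form method="post" action="Login.aspx">
  <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4" />
  <input type="text" name="ctl00$user" />
  <a href="javascript:__doPostBack('ctl00$LoginButton','')">Log in</a>
</form>
"""

print(build_postback_fields(sample_html, "ctl00$LoginButton"))
```

In the mechanize code above, the same effect comes from calling br.form.set_all_readonly(False), since hidden controls are read-only by default, and then assigning br.form['__EVENTTARGET'].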

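As a side note, make sure you are not violating the agreement you are "digitally signing":

Scraping:

The scraping of this website for the purpose of extracting data automatically from this website is strictly prohibited. It should be noted that this process can result in consuming ICE's system resources. ICE (or its affiliates, agents or contractors) may monitor usage of this website for scraping and may take all necessary steps to ensure that access to the website is removed from entities carrying out, or reasonably believed to be carrying out, web scraping activities.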

I would use Selenium, since it is full-featured and more powerful. You can also actually see the results:

from selenium import webdriver

chrome = webdriver.Chrome()
chrome.get('http://data.theice.com/ViewData/EndOfDay/LdnOptions.aspx?p=AER')

user = chrome.find_element_by_name('ctl00$ContentPlaceHolder1$LoginControl$m_userName')
pswd = chrome.find_element_by_name('ctl00$ContentPlaceHolder1$LoginControl$m_password')
# This value uses underscores, so it is an element id rather than a name
form = chrome.find_element_by_id('ctl00_ContentPlaceHolder1_LoginControl_LoginButton')

user.send_keys(your_username_string)
pswd.send_keys(your_password_string)
form.click() # hit the login button

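When I enter your URL in a browser, I get redirected to a different page. Have you tried changing your URL?

@Moshe: You need to register an account before you can log in to that page. In the sample snippet I am using a registered username and password, yet despite being registered I am still redirected to the login page. I can log in manually with the same username and password used in the script, so that is clearly not the problem.

1. Which tool did you use to observe the requests the browser sends? 2. What you suggested solves the problem to some extent (after some modifications involving read-only variables), but every time I call br.open(url).get_data() I see a login page again, which is not very practical.

@HomunculusReticulli 1. I used the browser developer tools (the Network tab in Chrome DevTools, to be precise). 2. Well, I am fairly sure the solution in the answer fixes at least one of the problems with logging in. But yes, from what you describe, we now have a different problem: staying logged in for the duration of the session. That is hard for me to reproduce, since I am not registered with ICE.

Thanks. Registration is free. Could you register and see whether you can reproduce the problem? Currently, when you log in you are taken to a "landing page" (different from the requested URL), and when you try to navigate to the page containing the data you are asked to log in again... Could you help resolve this so that I can accept your answer? Thanks

@HomunculusReticulli I put together sample working code that prints all the files from the tree view section of the "My Files" page. It works for me.

Sorry, there is too much going on right now. Do you know how to install Selenium? This should work for you.

This actually has nothing to do with power. Selenium and mechanize are very different tools, which means there are different ways to approach this problem. With Selenium there is a real browser as a dependency, and even if you switch to headless, performance will be much worse than with mechanize. In any case, the question was about mechanize, although I do like the alternative you suggest.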
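On the repeated login page mentioned in the comments: the cause was never confirmed in this thread, but one thing that may be worth checking is whether the cookies set at login actually end up in the cookie jar used for later requests. It is also worth noting that the question's original code posts the login through the module-level urlopen rather than through br, and as far as I can tell that call does not use the cookie jar attached to br. The sketch below (Python 3, standard library only) shows one way to check a jar for expected cookies. missing_cookies and make_cookie are hypothetical helper names, and the cookie names are ASP.NET defaults that the real site may not use.

```python
from http.cookiejar import Cookie, CookieJar


def missing_cookies(jar, expected_names):
    """Return the expected cookie names that are not present in the jar."""
    present = {cookie.name for cookie in jar}
    return [name for name in expected_names if name not in present]


def make_cookie(name, value, domain):
    """Build a simple session cookie for the demonstration below."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# ASP.NET's default cookie names; the real site may be configured differently.
EXPECTED = ["ASP.NET_SessionId", ".ASPXAUTH"]

# Simulate a jar that received a session cookie but no authentication cookie.
jar = CookieJar()
jar.set_cookie(make_cookie("ASP.NET_SessionId", "abc123", "example.com"))

print(missing_cookies(jar, EXPECTED))
```

In the mechanize script, the same kind of check could be run against cj right after br.submit(), before opening the data page.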