Python: Logging in and scraping a web page with Scrapy
I hope someone can help me. I am using Scrapy to log in and scrape data. The code below works for one website, but I have another website where Scrapy cannot log in, because of the following problems:
- There is no form element on the page that wraps the login fields.
- The username and password fields and the submit button are not inside the form element but inside table elements, which makes it a bit messy.
How can I log in to the page with Scrapy using the table/tr elements instead of a form element?
Any help would be greatly appreciated.
import scrapy
from scrapy.http import FormRequest


# BaseSpider is the older name of scrapy.Spider.
class LoginSpider(scrapy.Spider):
    name = 'project'
    allowed_domains = ["domain.com"]
    start_urls = ["theloginURL"]

    # This function looks for the form element and logs in with the username and password.
    def parse(self, response):
        return [FormRequest.from_response(
            response,
            formdata={'user_name': ' username123', 'Password': ' psd123'},
            formxpath='//*[@name="Form1"]',
            callback=self.after_login)]
Here is the HTML of the login page, in case anyone needs it:
<table height="260px" id="loginMainTable" width="100%" cellspacing="0" cellpadding="0">
<tbody><tr>
<td>
<table align="center" class="blueBorder" cellspacing="0" cellpadding="0">
<tbody><tr class="HeaderFooterHide">
<td id="companyTD" width="100%" colspan="3" style="position:relative;">
<span class="header1 header1pos">
Application and Network Access Portal
</span>
<table width="100%" cellpadding="0" cellspacing="0">
<tbody><tr>
<td width="32px">
<img src="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headertopl.gif" align="absmiddle">
</td>
<td style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headertopm.gif'); background-repeat: repeat-x">
</td>
<td width="520px" style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headertopr.gif');">
</td>
</tr>
</tbody></table>
</td>
</tr>
<tr class="HeaderFooterHide">
<td width="100%" colspan="3" style="position:relative;">
<span style="position:absolute;margin-left:20px;">
</span>
<table width="100%" cellpadding="0" cellspacing="0">
<tbody><tr>
<td width="30px">
<img src="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headerbottoml.gif" align="absmiddle">
</td>
<td style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headerbottomm.gif'); background-repeat: repeat-x">
</td>
<td width="30px" style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headerbottomr.gif');">
</td>
</tr>
</tbody></table>
</td>
</tr>
<tr>
<td class="contentleft">
</td>
<td align="center" class="internalTD">
<table width="100%" height="100%" cellspacing="0" cellpadding="0" align="center">
<form id="form1" name="form1" autocomplete="off" method="post" action="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/Validate.asp" onsubmit="return(SubmitForm());"></form>
<tbody><tr>
<td valign="top" height="250px" align="center">
<table border="0" cellspacing="0" cellpadding="0" class="content">
<tbody><tr>
<td class="msgText">Log On</td>
</tr>
</tbody></table>
<table border="0" cellspacing="0" cellpadding="0" class="content">
<tbody><tr>
<td class="paramText">Username</td>
<td><input class="paramTextbox" type="text" id="user_name" name="user_name" maxlength="50" size="11"></td>
</tr>
<tr>
<td class="paramText">Password</td>
<td><input class="paramTextbox" type="password" id="password" name="password" maxlength="20" onkeypress="capsDetect(arguments[0]);" size="11"></td>
</tr>
<input type="hidden" id="repository" name="repository" value="ADLDS" size="11">
<tr height="0px">
<td colspan="2" id="capsLockTD" height="0px"> </td>
</tr>
<tr>
<td> </td>
<td class="EzBiz_Text1"><input name="chkUsername" type="CHECKBOX" onclick="saveUserName()" id="chkUsername">Remember my User ID</td>
</tr>
<tr>
<td colspan="2" align="right">
<input border="0" class="button" type="submit" id="submit_button" value="Log On">
</td>
<td></td>
</tr>
</tbody></table>
<div class="EzBiz_Loginbutton" style="float: left;">
<input border="0" class="EzBiz_Button" type="image" id="submit_button" src="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/CustomUpdate/login_submit.jpg">
</div>
<div class="EzBiz_PasswordForget">
<img src="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/CustomUpdate/arrow.gif">
<a href="hhttps:www.url.vom" target="_blank">Did you forget your password?</a>
</div>
</td>
</tr>
<input type="hidden" name="site_name" id="site_name" value="ezbizportal">
<input type="hidden" name="secure" id="secure" value="1">
<input type="hidden" name="resource_id" id="resource_id" value="49203789E2E14C4A92CEC904C24909CE">
<input type="hidden" name="login_type" id="login_type" value="2">
<tr>
<td>
<table cellspacing="0" cellpadding="0" class="content" width="100%">
<tbody><tr>
<td id="openerExistsTD" class="notification">
For security purposes, when you finish working with this site do one of the following:<li>Click the Logoff button to log off from the site.</li><li>Close all browser windows (including applications that are open in other windows).</li>
</td>
</tr>
<!-- Windows XP Service Pack 2 / 2003 / Vista Message - Start -->
<tr>
<td class="notification">
This site is intended for authorized users only.<br>
If you experience access problems contact the <a href="mailto:">site administrator</a>.
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
<td class="contentright">
</td>
</tr>
<tr class="HeaderFooterHide">
<td width="100%" colspan="3" style="position:relative;">
<span class="bottomText bottomTextPos">
© 2010 Microsoft Corporation. All rights reserved. <a href="javascript:alert('Microsoft Corporation licenses the software and services on this portal to you according to your Microsoft Unified Access Gateway 2010 (the "software") license. You may not use this portal without a license for the software. Contact your IT administrator for the license terms.')">Terms and Conditions.</a>
</span>
<table width="100%" cellpadding="0" cellspacing="0">
<tbody><tr>
<td width="47px">
<img src="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/footerbgl.gif" align="absmiddle">
</td>
<td style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/footerbgm.gif'); background-repeat: repeat-x">
</td>
<td width="47px" style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/footerbgr.gif');">
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
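One possible way to approach the question, offered as a sketch rather than a confirmed fix: if the HTML that Scrapy receives looks like the markup above, the `form` element is empty and the inputs sit in the surrounding table, so there is nothing inside the form for `FormRequest.from_response` to fill in. The fields can instead be collected from the whole page and posted to the form's `action` URL. The class `LoginFieldCollector` and the function `build_login_request` are hypothetical names introduced here; the field names `user_name` and `password` come from the posted HTML. The sketch does not reproduce whatever the page's `SubmitForm()` JavaScript may do before submitting.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFieldCollector(HTMLParser):
    """Record the first <form> action and every named <input>, wherever it sits."""

    # Button-like inputs are only sent when clicked, so they are left out here.
    SKIPPED_TYPES = {"submit", "image", "button", "reset"}

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("name"):
            input_type = (attrs.get("type") or "text").lower()
            if input_type in self.SKIPPED_TYPES:
                return
            # Unchecked checkboxes and radio buttons are not submitted by browsers.
            if input_type in {"checkbox", "radio"} and "checked" not in attrs:
                return
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_login_request(page_url, html, username, password):
    """Return (post_url, formdata) for a login page whose inputs sit outside <form>."""
    collector = LoginFieldCollector()
    collector.feed(html)
    formdata = dict(collector.fields)
    formdata["user_name"] = username  # field names assumed from the posted HTML
    formdata["password"] = password
    # Resolve the (possibly relative) action against the page URL.
    post_url = urljoin(page_url, collector.action or "")
    return post_url, formdata
```

Inside the spider's `parse` method, the result could then be handed to Scrapy, for example `post_url, formdata = build_login_request(response.url, response.text, 'username123', 'psd123')` followed by `return [FormRequest(post_url, formdata=formdata, callback=self.after_login)]`. Hidden fields such as `repository` and `site_name` are carried along automatically because they are named inputs with values.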
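A generic login module may be more useful. Try loginform:

Installation

pip install -i https://pypi.binstar.org/pypi/simple loginform

Code

from loginform import fill_login_form
import requests

url = "https://example.com/login"
r = requests.get(url)
# Pass the page body (r.text) so loginform can locate the form and its fields.
result = fill_login_form(url, r.text, 'username', 'password')

About

loginform is a library for filling in HTML login forms given the login URL, username and password. The form and the fields to fill are inferred automatically.

There seem to be some formatting problems with the code in your post, and you appear to define the class twice. Can you correct the code issues (perhaps by creating a ...)?

Hi, thanks for the reply. I accidentally pasted the class twice. I went ahead and pasted all the code, excluding the website. I also pasted part of the HTML of the login page. Note how the input elements are not inside the form tag, while the code above looks for a form. How can I fix the code so that it can log in to this HTML page?

Hi, still looking for help!

I would try to revise your question so that it contains only the important information. Right now there is a lot of code but not much explanation of what you are trying to do, and I suspect most of the code is not needed.

Thanks for the input. I have revised my question, but I kept the HTML code just in case. I think the code you really need to focus on is the Python code; I only included the important functions.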
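The loginform snippet in the answer stops once the form has been filled in. As a hedged sketch of the remaining step: to the best of my reading of the loginform README, `fill_login_form` returns a tuple of the form values, the action URL and the HTTP method, which can then be submitted through the same `requests` session so that cookies carry over. The helper names `submit_filled_form` and `login` are hypothetical and not part of loginform; whether loginform can infer the fields on a page whose inputs sit outside the form element is not something this sketch verifies.

```python
def submit_filled_form(session, filled):
    """Send the (values, action, method) tuple assumed to come from fill_login_form."""
    values, action, method = filled
    method = method.upper()
    if method == "GET":
        # GET forms send their fields in the query string.
        return session.request("GET", action, params=values)
    return session.request(method, action, data=values)


def login(url, username, password):
    """Fetch the login page, let loginform fill it, and submit it in one session."""
    # Third-party packages, imported lazily so the helper above has no dependencies.
    import requests
    from loginform import fill_login_form

    session = requests.Session()
    page = session.get(url)
    filled = fill_login_form(url, page.text, username, password)
    return session, submit_filled_form(session, filled)
```

A call such as `session, response = login("https://example.com/login", "username", "password")` would return the session for further requests; checking the response for a post-login marker (for example a Logoff link) is left to the caller.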