Python: Logging in and scraping a web page with Scrapy

Tags: python, html, web-scraping, scrapy, scrapy-spider

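I hope someone can help me. I am using Scrapy to log in and scrape data. This particular code works for one website, but I have another website where Scrapy cannot log in because of the following problem: there is no form element on the page.

The username and password fields and the submit button are not inside a form element but inside a table element, which makes things a bit confusing. How can I get Scrapy to log in to a page that uses table/tr elements instead of a form element?

Any help would be greatly appreciated.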

import scrapy
from scrapy.http import FormRequest


class LoginSpider(scrapy.Spider):  # named BaseSpider in older Scrapy releases
    name = 'project'
    allowed_domains = ["domain.com"]
    start_urls = ["theloginURL"]

    # This function looks for the form element and logs in with the username and password.
    def parse(self, response):
        # after_login is defined elsewhere in the spider (omitted from the question)
        return [FormRequest.from_response(
            response,
            formdata={'user_name': 'username123', 'Password': 'psd123'},
            formxpath='//*[@name="Form1"]',
            callback=self.after_login)]

Here is the HTML of the login page, in case anyone needs it:

<table height="260px" id="loginMainTable" width="100%" cellspacing="0" cellpadding="0">
<tbody><tr>
    <td>
        <table align="center" class="blueBorder" cellspacing="0" cellpadding="0">   

<tbody><tr class="HeaderFooterHide">
<td id="companyTD" width="100%" colspan="3" style="position:relative;">
<span class="header1 header1pos">       
    Application and Network Access Portal       
</span>
    <table width="100%" cellpadding="0" cellspacing="0">
        <tbody><tr>
            <td width="32px">                           
                <img src="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headertopl.gif" align="absmiddle">
            </td>
            <td style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headertopm.gif'); background-repeat: repeat-x">             
                &nbsp;
            </td>
            <td width="520px" style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headertopr.gif');">                               
            </td>
        </tr>
     </tbody></table>
    </td>
</tr>
<tr class="HeaderFooterHide">
<td width="100%" colspan="3" style="position:relative;">    
<span style="position:absolute;margin-left:20px;">      

</span>
    <table width="100%" cellpadding="0" cellspacing="0">
        <tbody><tr>
            <td width="30px">                           
                <img src="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headerbottoml.gif" align="absmiddle">
            </td>
            <td style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headerbottomm.gif'); background-repeat: repeat-x">              
            &nbsp;
            </td>
            <td width="30px" style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/headerbottomr.gif');">                             
            </td>
        </tr>
    </tbody></table>    
    </td>
 </tr>

            <tr>
                <td class="contentleft">                                
                </td>
                <td align="center" class="internalTD">
                    <table width="100%" height="100%" cellspacing="0" cellpadding="0" align="center">
                        <form id="form1" name="form1" autocomplete="off" method="post" action="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/Validate.asp" onsubmit="return(SubmitForm());"></form>
                        <tbody><tr>
                            <td valign="top" height="250px" align="center">
                                <table border="0" cellspacing="0" cellpadding="0" class="content">

                                    <tbody><tr>
                                        <td class="msgText">Log On</td>
                                    </tr>
                                </tbody></table>
                                <table border="0" cellspacing="0" cellpadding="0" class="content">

                                        <tbody><tr>
                                            <td class="paramText">Username</td>
                                            <td><input class="paramTextbox" type="text" id="user_name" name="user_name" maxlength="50" size="11"></td>
                                        </tr>
                                        <tr>
                                            <td class="paramText">Password</td>
                                            <td><input class="paramTextbox" type="password" id="password" name="password" maxlength="20" onkeypress="capsDetect(arguments[0]);" size="11"></td>                                             
                                        </tr>

                                                <input type="hidden" id="repository" name="repository" value="ADLDS" size="11">

                                    <tr height="0px">
                                        <td colspan="2" id="capsLockTD" height="0px">&nbsp;</td>
                                    </tr>

                                    <tr>
                                        <td>&nbsp;</td>
                                        <td class="EzBiz_Text1"><input name="chkUsername" type="CHECKBOX" onclick="saveUserName()" id="chkUsername">Remember my User ID</td>
                                    </tr>   
                                    <tr>
                                        <td colspan="2" align="right">
                                            <input border="0" class="button" type="submit" id="submit_button" value="Log On">
                                        </td>
                                        <td></td>
                                    </tr>
                                </tbody></table>
                                <div class="EzBiz_Loginbutton" style="float: left;">
                                    <input border="0" class="EzBiz_Button" type="image" id="submit_button" src="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/CustomUpdate/login_submit.jpg">
                                </div>
                                <div class="EzBiz_PasswordForget">
                                    <img src="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/CustomUpdate/arrow.gif">
                                    <a href="hhttps:www.url.vom" target="_blank">Did you forget your password?</a>
                                </div>
                            </td>
                        </tr>
                        <input type="hidden" name="site_name" id="site_name" value="ezbizportal">
                        <input type="hidden" name="secure" id="secure" value="1">
                        <input type="hidden" name="resource_id" id="resource_id" value="49203789E2E14C4A92CEC904C24909CE">
                        <input type="hidden" name="login_type" id="login_type" value="2">

                        <tr>
                            <td>
                                <table cellspacing="0" cellpadding="0" class="content" width="100%">
                                    <tbody><tr>
                                        <td id="openerExistsTD" class="notification">
                                        For security purposes, when you finish working with this site do one of the following:<li>Click the Logoff button to log off from the site.</li><li>Close all browser windows (including applications that are open in other windows).</li>
                                        </td>
                                    </tr>       
                                    <!-- Windows XP Service Pack 2  / 2003 / Vista Message - Start -->

                                    <tr>
                                        <td class="notification">
                                        This site is intended for authorized users only.<br>
                                        If you experience access problems contact the <a href="mailto:">site administrator</a>.
                                        </td>
                                    </tr>
                                </tbody></table>
                            </td>    
                        </tr>


                    </tbody></table>
                </td>
                <td class="contentright">                       
                </td>
            </tr>

<tr class="HeaderFooterHide">
<td width="100%" colspan="3" style="position:relative;">
<span class="bottomText bottomTextPos">     
    © 2010 Microsoft Corporation. All rights reserved. <a href="javascript:alert('Microsoft Corporation licenses the software and services on this portal to you according to your Microsoft Unified Access Gateway 2010 (the &quot;software&quot;) license. You may not use this portal without a license for the software. Contact your IT administrator for the license terms.')">Terms and Conditions.</a>
    </span>
    <table width="100%" cellpadding="0" cellspacing="0">
        <tbody><tr>
            <td width="47px">                           
                <img src="/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/footerbgl.gif" align="absmiddle">
            </td>
            <td style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/footerbgm.gif'); background-repeat: repeat-x">              
                &nbsp;
            </td>
            <td width="47px" style="background-image: url('/uniquesig6d7a33b352f4c09846f8a6563bae192b/uniquesig0/InternalSite/images/footerbgr.gif');">                             
            </td>
        </tr>
    </tbody></table>            
</td>
</tr>

        </tbody></table>
    </td>
</tr>   
</tbody></table>

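Comments:

The code formatting in your post seems to have some problems, and you appear to define the class twice. Can you fix the code problems (perhaps by creating a …?)

Hi, thanks for the reply. I accidentally pasted the class twice. I went ahead and pasted all of the code, excluding the website. I also pasted part of the HTML for the login page. Notice how the input elements are not inside the form tag, while the code above looks for a form. How can I fix the code so that it can log in to this HTML page?

Hi, still looking for help!

I would try revising your question so that it contains only the important information. Right now there is a lot of code here but not much explanation of what you are trying to do, and I suspect most of the code is not needed.

Thanks for the input. I revised my question, but I kept the HTML code just in case. I think the code you really need to focus on is the Python code, and I included only the important function.

Answer:

A generic login module is more useful. Try loginform.

Install:

    pip install -i https://pypi.binstar.org/pypi/simple loginform

Code:

    from loginform import fill_login_form
    import requests

    url = "https://example.com/login"
    r = requests.get(url)
    # fill_login_form takes the page URL, the page HTML, and the credentials
    result = fill_login_form(url, r.text, 'username', 'password')

About: loginform is a library for filling in HTML login forms given the login URL, username and password. The form and the fields to fill are inferred automatically.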
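
The login page in the question keeps its inputs outside the `<form>` element, so a form-based lookup may find nothing to fill in. One possible workaround, sketched here with only the standard library, is to collect every named input on the page regardless of where it sits and build the POST data by hand. The names `LoginFieldCollector` and `build_login_request` are hypothetical helpers made up for this illustration; the field names `user_name` and `password` are taken from the HTML in the question. This is an assumption-laden sketch, not how Scrapy or loginform work internally.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFieldCollector(HTMLParser):
    """Collect the first form's action/method and every named <input> on the page."""

    # Input types that are skipped: buttons and (unchecked) checkboxes/radios.
    SKIPPED_TYPES = ("checkbox", "radio", "submit", "image", "button")

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "POST"  # assumed default when no <form> is found at all
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").upper()
        elif tag == "input" and attrs.get("name"):
            input_type = (attrs.get("type") or "text").lower()
            if input_type in self.SKIPPED_TYPES:
                return
            # Keep hidden fields and their default values; they are often required.
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_login_request(page_url, html, username, password):
    """Return (url, method, formdata) for a manual login request."""
    collector = LoginFieldCollector()
    collector.feed(html)
    formdata = dict(collector.fields)
    formdata["user_name"] = username  # field name taken from the question's HTML
    formdata["password"] = password   # field name taken from the question's HTML
    # Resolve the (possibly relative) form action against the page URL.
    return urljoin(page_url, collector.action or ""), collector.method, formdata
```

In a spider's `parse` method, the returned values could then be sent with `FormRequest(url, method=method, formdata=formdata, callback=self.after_login)` instead of `FormRequest.from_response`. Whether the server accepts the login may also depend on cookies or on JavaScript run by `SubmitForm()`, which this sketch does not handle.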