Problem logging in to a website with Perl LWP


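I'm new to LWP, so thanks in advance for your help. I'm writing a small Perl script to log in to a website and download a file. The process works fine in a browser, but not through LWP. In the browser, the process goes like this: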

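  • Log in to the website with authentication (username, password)
  • After a successful login, the website loads another page
  • The download page can then be visited and the file downloaded
  • If a user is not logged in and tries to visit the download page, the website loads the registration page to create a login

This process works fine in the browser. The URL and the user/pass are real, so you can try it on the website with the details in the code.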

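Through the script, however, I get a success code, but the website does not allow access to step 2 or step 3. The registration page is downloaded instead of the file. I suspect this means the login is not working from the script.

Any help getting this to work would be much appreciated.

The code is below: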

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    
    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Cookies;
    use HTTP::Request;
    use WWW::Mechanize;
    
    my $base_url = "http://www.eoddata.com/default.aspx";
    my $username = 'xcytt';
    my $password = '321pass';
    
    # create a cookie jar on disk
    my $cookies = HTTP::Cookies->new(
        file     => 'cookies1.txt',
        autosave => 1,
    );
    
    my $http = LWP::UserAgent->new();
    $http->cookie_jar($cookies);
    
    my $login = $http->post(
        'http://www.eoddata.com/default.aspx',
        Content => [
            username => $username,
            password => $password,
        ]
    );
    
    # check if log in succeeded
    
    if ( $login->is_success ) {
        print "The response from server is " . $login->status_line . "\n\n";
        print "The headers in the response are \n" . $login->headers()->as_string() . "\n\n";
        print "Logged in Successfully\n\n";
        print "Printing cookies after successful login\n\n";
        print $http->cookie_jar->as_string() . "\n";
        my $url = "http://www.eoddata.com/Data/symbollist.aspx?e=NYSE";
        print "Now trying to download " . $url . "\n\n";
    
        # make request to download the file
        my $file_req = HTTP::Request->new( 'GET', $url );
        print "Printing cookies before file download request\n\n";
        print $http->cookie_jar->as_string() . "\n";
        my $get_file = $http->request($file_req);
    
        # check request status
        if ( $get_file->is_success ) {
            print "The response from server is " . $get_file->status_line . "\n\n";
            print "The headers in the response are " . $get_file->headers()->as_string() . "\n\n";
            print "Downloaded $url, saving it to file ...\n\n";
            open my $fh, '>', 'tmp_NYSE.txt' or die "ERROR: $!\n";
            print $fh $get_file->decoded_content;
            close $fh;
        } else {
            print "File Download failure\n";
        }
    } else {
        print "Login Error\n";
    }
    
Output from the script:

    The response from server is 200 OK
    
    The headers in the response are 
    Cache-Control: private
    Date: Sun, 12 Oct 2014 17:43:47 GMT
    Server: Microsoft-IIS/7.5
    Content-Length: 39356
    Content-Type: text/html; charset=utf-8
    Client-Date: Sun, 12 Oct 2014 17:43:48 GMT
    Client-Peer: 64.182.238.14:80
    Client-Response-Num: 1
    Link: <styles/jquery-ui-1.10.0.custom.min.css>; rel="stylesheet"; type="text/css"
    Link: <styles/main.css>; rel="stylesheet"; type="text/css"
    Link: <styles/button.css>; rel="stylesheet"; type="text/css"
    Link: <styles/nav.css>; rel="stylesheet"; type="text/css"
    Link: </styles/colorbox.css>; rel="stylesheet"; type="text/css"
    Link: </styles/slides.css>; rel="stylesheet"; type="text/css"
    Set-Cookie: ASP.NET_SessionId=cjgm4oscl1xmlzwnzql4gcns; path=/; HttpOnly
    Title: End of Day Stock Quote Data and Historical Stock Prices
    X-AspNet-Version: 4.0.30319
    X-Meta-Description: Free end of day stock market data and historical quotes for many of the world's top exchanges including NASDAQ, NYSE, AMEX, TSX, OTCBB, FTSE, SGX, HKEX, and FOREX.
    X-Meta-Keywords: metastock eod,free eod,free eod data,eod download,stock,exchange,data,historical stock quotes,free,historical share prices,download,day,end,prices,market,chart,NYSE,NASDAQ,AMEX,FTSE,FOREX,ASX,SGX,NZSE,tsx stock,stock share prices,stock ticker symbol,daily prices,daily stock,historic stock price,stock futures
    X-Meta-Verify-V1: cT9ZK5uSlR3GrcasqgUh7Yh3fnuRGsRY1IRvE85ffa0=
    X-Powered-By: ASP.NET
    
    
    Logged in Successfully
    
    Printing cookies after successful login
    
    Set-Cookie3: ASP.NET_SessionId=cjgm4oscl1xmlzwnzql4gcns; path="/"; domain=www.eoddata.com; path_spec; discard; HttpOnly; version=0
    
    Now trying to download http://www.eoddata.com/Data/symbollist.aspx?e=NYSE
    
    Printing cookies before file download request
    
    Set-Cookie3: ASP.NET_SessionId=cjgm4oscl1xmlzwnzql4gcns; path="/"; domain=www.eoddata.com; path_spec; discard; HttpOnly; version=0
    
    The response from server is 200 OK
    
    The headers in the response are Cache-Control: private
    Date: Sun, 12 Oct 2014 17:43:48 GMT
    Server: Microsoft-IIS/7.5
    Content-Length: 49880
    Content-Type: text/html; charset=utf-8
    Client-Date: Sun, 12 Oct 2014 17:43:49 GMT
    Client-Peer: 64.182.238.14:80
    Client-Response-Num: 1
    Link: <styles/jquery-ui-1.10.0.custom.min.css>; rel="stylesheet"; type="text/css"
    Link: <styles/main.css>; rel="stylesheet"; type="text/css"
    Link: <styles/button.css>; rel="stylesheet"; type="text/css"
    Link: <styles/nav.css>; rel="stylesheet"; type="text/css"
    Title: Member Registration
    X-AspNet-Version: 4.0.30319
    X-Meta-Description: Register now for Free end of day stock market data and historical quotes for many of the world's top exchanges including NASDAQ, NYSE, AMEX, TSX, OTCBB, FTSE, ASX, SGX, HKEX, and FOREX.
    X-Meta-Keywords: metastock eod,free eod,free eod data,eod download,stock,exchange,data,historical stock quotes,free,download,day,end,prices,market,chart,NYSE,NASDAQ,AMEX,FTSE,FOREX,ASX,SGX,NZSE,tsx stock,stock share prices,stock ticker symbol,daily prices,daily stock,historic stock price
    X-Powered-By: ASP.NET
    
    
    Downloaded http://www.eoddata.com/Data/symbollist.aspx?e=NYSE, saving it to file ...
    
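Below is a snippet of the downloaded file, which is not the output I wanted. Note that the title is "Member Registration" rather than the data file I expected.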

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head><link rel="stylesheet" href="styles/jquery-ui-1.10.0.custom.min.css" type="text/css" /><link rel="stylesheet" href="styles/main.css" type="text/css" /><link rel="stylesheet" href="styles/button.css" type="text/css" /><link rel="stylesheet" href="styles/nav.css" type="text/css" />
    <script src="../scripts/jquery-1.9.0.min.js" type="text/javascript"></script>
    <script src="../scripts/jquery-ui-1.10.0.custom.min.js" type="text/javascript"></script>
    <script type="text/javascript">     var _sf_startpt = (new Date()).getTime()</script>
    <meta name="keywords" content="metastock eod,free eod,free eod data,eod download,stock,exchange,data,historical stock quotes,free,download,day,end,prices,market,chart,NYSE,NASDAQ,AMEX,FTSE,FOREX,ASX,SGX,NZSE,tsx stock,stock share prices,stock ticker symbol,daily prices,daily stock,historic stock price" />
    <meta name="description" content="Register now for Free end of day stock market data and historical quotes for many of the world's top exchanges including NASDAQ, NYSE, AMEX, TSX, OTCBB, FTSE, ASX, SGX, HKEX, and FOREX." />
    <title>
    Member Registration
    </title></head>
    
    
    
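Most of your use statements are unnecessary, because LWP generally pulls in any modules that it needs.

If you are using LWP::UserAgent then you certainly don't need LWP::Simple or WWW::Mechanize, and LWP::UserAgent will create an in-memory HTTP::Cookies object for you if you pass cookie_jar => {} to its constructor.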
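
As a minimal sketch of that point, the following loads only LWP::UserAgent and lets it build the cookie jar itself. The variable name $ua is just an illustration, and the script makes no network requests.

```perl
use strict;
use warnings;

use LWP::UserAgent;

# Passing an empty hash reference asks LWP::UserAgent to build an
# in-memory HTTP::Cookies object; nothing is written to disk.
my $ua = LWP::UserAgent->new( cookie_jar => {} );

# Show which class is handling cookies for this user agent.
print ref( $ua->cookie_jar ), "\n";
```

Cookies held this way last only for the lifetime of the script, which is usually enough for a login followed by a download.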

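The problem is most likely that the HTML you are fetching from the web site contains JavaScript code that modifies the page after it has been retrieved. LWP won't emulate this for you, so the page stays as it was sent from the website.

There is no great solution, but there are modules that let you drive an installed Firefox browser from Perl code and do what you need.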

Your login code isn't getting you logged in: the data you are posting bears no resemblance to the inputs that the login form accepts.

Examining http://www.eoddata.com/default.aspx with WWW::Mechanize's mech-dump shows the following:

    POST http://www.eoddata.com/default.aspx [aspnetForm]
      ctl00_tsm_HiddenField=         (hidden readonly)
      __VIEWSTATE=/wEPDwUJNTgzMTIzMjMyD2QWAmYPZBYCAgMPZBYCAgcPZBYCAh0PZBYEAgMPZBYCAgcPDxYCHgRUZXh0ZWRkAgcPDxYCHgdWaXNpYmxlaGRkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBRpjdGwwMCRjcGgxJGxnMSRjaGtSZW1lbWJlcuq72b0jSSSEoSOAcZlLZzWMmsYqjOMTbPl/Op1ToVKf (hidden readonly)
      __VIEWSTATEGENERATOR=CA0B0334  (hidden readonly)
      __PREVIOUSPAGE=72Ep8BrmYqNbOSb65afxljULshovHpRLBJcMC0funBrM2g0qkkpORQb_wqNsu_2SbA5JbxbwNkpXlR_SZWwgPwwbGdBP4YGDoNJCDtPRQS81 (hidden readonly)
      __EVENTVALIDATION=/wEdAAvsaJw1zF2h8PWbp8tJHjaFx+CzKn9gssNaJswg1PWksJd223BvmKj73tdq9M98Zo0JWPh42opnSCw9zAHys7YwDyn98qMl4Da8RNKOYtjmMtj1Nek/A8Dky1WNDflwB7GO1vgbcIR7aON1c4Cm5wJw0r2yvex8d7TohORX6QMo1j8IRvmRE3IYRPV0S4fj4csX1838LMsOJxqMoksh8zNIRuOmXf1pY8AyXSwvWgp1mYRx4mHFI6oep3qpPKhhA22Mc6tB5KOFIqkGgyvucIby (hidden readonly)
      ctl00$Menu1$s1$txtSearch=      (text)
      ctl00$Menu1$s1$btnSearch=Search (submit)
      ctl00$cph1$btns1=CLICK HERE    (submit)
      ctl00$cph1$btns2=CLICK HERE    (submit)
      ctl00$cph1$btns3=CLICK HERE    (submit)
      ctl00$cph1$lg1$txtEmail=       (text)
      ctl00$cph1$lg1$txtPassword=    (password)
      ctl00$cph1$lg1$chkRemember=<UNDEF> (checkbox) [*<UNDEF>/off|on]
      ctl00$cph1$lg1$btnLogin=Login  (submit)
    
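After posting the complete set of form fields, dumping the content of $get_file gives me the expected list of symbols and company names.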

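You can either use WWW::Mechanize to fill in the form fields, or scrape the values from http://www.eoddata.com/default.aspx (in particular the hidden fields, which change on every page load) and then build a POST request from those values and the login credentials.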
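
The second approach can be sketched with HTML::Form, the form parser that LWP-based tools use. To keep the example self-contained it parses a cut-down stand-in for the login page rather than fetching the real one: the field names come from the mech-dump listing, but the hidden values are placeholders, and in a real script the HTML would come from a fresh GET of the page.

```perl
use strict;
use warnings;

use HTML::Form;

# Stand-in for the HTML returned by a GET of the login page.
# Field names are taken from the mech-dump listing; the hidden
# values here are placeholders, not real ASP.NET state.
my $html = <<'HTML';
<form id="aspnetForm" method="post" action="default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="PLACEHOLDER_VIEWSTATE" />
  <input type="hidden" name="__EVENTVALIDATION" value="PLACEHOLDER_VALIDATION" />
  <input type="text" name="ctl00$cph1$lg1$txtEmail" />
  <input type="password" name="ctl00$cph1$lg1$txtPassword" />
  <input type="submit" name="ctl00$cph1$lg1$btnLogin" value="Login" />
</form>
HTML

# Parse the form; the second argument is the base URI used to
# resolve the relative action attribute.
my ($form) = HTML::Form->parse( $html, 'http://www.eoddata.com/default.aspx' );

# Fill in the credentials; the hidden fields keep the values
# that came with the page.
$form->value( 'ctl00$cph1$lg1$txtEmail',    'xcytt' );
$form->value( 'ctl00$cph1$lg1$txtPassword', '321pass' );

# Build the HTTP::Request that pressing the Login button would send.
my $request = $form->click('ctl00$cph1$lg1$btnLogin');

print $request->method, ' ', $request->uri, "\n";
print $request->content, "\n";
```

The resulting $request can be handed to the request method of an LWP::UserAgent object that has a cookie jar, so the session cookie from the login carries over to the download request.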


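Note also that it is entirely possible to get a successful response from the server without having done what you intended (such as logging in). Redirects and pages saying "login failed" will all be treated as success by LWP::UA.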
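
A small sketch of why is_success alone is not enough: it builds a response by hand that mimics the one in the question's output (200 OK, with a Title header of "Member Registration") and checks the title as well as the status. The helper name looks_like_registration_page is hypothetical, and matching on the page title is an assumption about what distinguishes a failed login on this site.

```perl
use strict;
use warnings;

use HTTP::Response;

# Hypothetical helper: report whether the page title (which LWP
# exposes as a Title header for HTML responses) is the registration
# page rather than the expected content.
sub looks_like_registration_page {
    my ($response) = @_;
    my $title = $response->header('Title') // '';
    return $title =~ /Member Registration/ ? 1 : 0;
}

# Hand-built response imitating what the server returned for the
# download URL when the login had not actually worked.
my $response = HTTP::Response->new(
    200, 'OK',
    [ Title => 'Member Registration' ],
    '<html><head><title>Member Registration</title></head></html>',
);

print 'is_success: ',
    ( $response->is_success ? 'yes' : 'no' ), "\n";
print 'registration page: ',
    ( looks_like_registration_page($response) ? 'yes' : 'no' ), "\n";
```

Both lines print "yes" here, which is the situation in the question: the status is fine, but the content is not what was asked for.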

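If anyone is still interested in this problem, I took another look and found that it is quite feasible using LWP alone. However, the facilities of WWW::Mechanize make working with HTML forms much simpler.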

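Here is a program that logs in to the page using the credentials provided. Being an ASP page, it has horrible input names. For example, the username and password fields and the login button are named ctl00$cph1$lg1$txtEmail, ctl00$cph1$lg1$txtPassword, and ctl00$cph1$lg1$btnLogin respectively. I have located these input fields directly with regular expressions, which I think makes the code much clearer.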

I have printed the title of the HTML page reached after logging in, to show that it is working.

    use strict;
    use warnings;
    
    use WWW::Mechanize;
    
    my $base_url = 'http://www.eoddata.com/default.aspx';
    my $username = 'xcytt';
    my $password = '321pass';
    
    my $mech = WWW::Mechanize->new;
    
    $mech->get($base_url);
    
    my $form = $mech->form_id('aspnetForm');
    
    my @inputs  = $form->inputs;
    my ($email) = grep $_->name =~ /Email/,    @inputs;
    my ($pass)  = grep $_->name =~ /Password/, @inputs;
    my ($login) = grep $_->name =~ /Login/,    @inputs;
    
    $email->value($username);
    $pass->value($password);
    $mech->click_button(value => 'Login');
    
    print $mech->title, "\n";
    
Output:

    EODData - My Downloads
    
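This is not a JavaScript problem, because it is possible to log in to the website with JavaScript disabled. Your initial login code isn't working, and that is part of the problem. If you log in with a browser and examine the data transmitted (for example, using Chrome's inspector panel), you will see that the data sent when the form is submitted contains many inputs that do not appear in your login request, and that the cookies set in the browser after a successful login are very different from those in LWP::UA's cookie jar.
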
    my $resp = $http->post(
        'http://www.eoddata.com/default.aspx',
        Content => 'ctl00_tsm_HiddenField=&__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUJNTgzMTIzMjMyD2QWAmYPZBYCAgMPZBYCAgcPZBYCAh0PZBYEAgMPZBYCAgcPDxYCHgRUZXh0ZWRkAgcPDxYCHgdWaXNpYmxlaGRkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBRpjdGwwMCRjcGgxJGxnMSRjaGtSZW1lbWJlcuq72b0jSSSEoSOAcZlLZzWMmsYqjOMTbPl%2FOp1ToVKf&__VIEWSTATEGENERATOR=CA0B0334&__PREVIOUSPAGE=72Ep8BrmYqNbOSb65afxljULshovHpRLBJcMC0funBrM2g0qkkpORQb_wqNsu_2SbA5JbxbwNkpXlR_SZWwgPwwbGdBP4YGDoNJCDtPRQS81&__EVENTVALIDATION=%2FwEdAAvsaJw1zF2h8PWbp8tJHjaFx%2BCzKn9gssNaJswg1PWksJd223BvmKj73tdq9M98Zo0JWPh42opnSCw9zAHys7YwDyn98qMl4Da8RNKOYtjmMtj1Nek%2FA8Dky1WNDflwB7GO1vgbcIR7aON1c4Cm5wJw0r2yvex8d7TohORX6QMo1j8IRvmRE3IYRPV0S4fj4csX1838LMsOJxqMoksh8zNIRuOmXf1pY8AyXSwvWgp1mYRx4mHFI6oep3qpPKhhA22Mc6tB5KOFIqkGgyvucIby&ctl00%24Menu1%24s1%24txtSearch=&ctl00%24cph1%24lg1%24txtEmail=xcytt&ctl00%24cph1%24lg1%24txtPassword=321pass&ctl00%24cph1%24lg1%24btnLogin=Login' );
    
    if ($resp->is_success) {    
        my $get_file = $http->get("http://www.eoddata.com/Data/symbollist.aspx?e=NYSE");
    }
    