PHP: How do I crawl login-protected pages?


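I want to crawl a website for search information. My script works fine on normal pages, but how do I crawl login-protected pages? I have the login details; how can I send them with the URL? My code:

crawl_page("http://www.site.com/v3/search/results?start=1&sortCol=MyDefault");
function crawl_page($url) {
    $html = file_get_contents($url);
    // The pattern between the ~ delimiters did not survive in this copy of the question;
    // it has to capture each link URL in group 1.
    preg_match_all("~~", $html, $matches);
    $str = "http://www.site.com/v3/features/id?Profile=";
    foreach ($matches[1] as $newurl) {
        if (strpos($newurl, $str) !== false) {
            $my_file = 'link.txt';
            $handle = fopen($my_file, 'a') or die('Cannot open file: ' . $my_file);
            $numberNewline = $newurl . PHP_EOL;
            fwrite($handle, $numberNewline);
            fclose($handle);
        }
    }
}

Any help?
Thanks.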

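This depends a lot on the authentication method used. The simplest is HTTP Basic authentication. For that method, you only need to build a context like this: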

$context = stream_context_create(array(
    'http' => array(
        'header'  => "Authorization: Basic " . base64_encode("$username:$password")
    )
));
$data = file_get_contents($url, false, $context);
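This way, file_get_contents will use HTTP Basic authentication.

Other authentication methods may require more work, such as sending the password to a login page via POST and storing the session cookies.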
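For a form-based login, the same stream-context approach can be extended. The following is only a minimal sketch, assuming the login form accepts plain URL-encoded fields and that the site tracks the session with cookies. The helper names `buildPostContext`, `cookieHeaderFrom` and `fetchProtected` are made up for illustration, and the field names you pass in must match the real form.

```php
<?php
// Build a stream context that POSTs form fields, optionally sending a Cookie header.
function buildPostContext(array $fields, string $cookieHeader = '')
{
    $headers = "Content-Type: application/x-www-form-urlencoded\r\n";
    if ($cookieHeader !== '') {
        $headers .= "Cookie: $cookieHeader\r\n";
    }
    return stream_context_create(array(
        'http' => array(
            'method'  => 'POST',
            'header'  => $headers,
            'content' => http_build_query($fields),
        ),
    ));
}

// Turn the Set-Cookie lines of a response into a single Cookie header value.
function cookieHeaderFrom(array $responseHeaders): string
{
    $pairs = array();
    foreach ($responseHeaders as $line) {
        if (stripos($line, 'Set-Cookie:') === 0) {
            // Keep only "name=value"; drop attributes such as Path or HttpOnly.
            $value = substr($line, strlen('Set-Cookie:'));
            $pairs[] = trim(explode(';', $value, 2)[0]);
        }
    }
    return implode('; ', $pairs);
}

// Log in, then request a protected page with the cookies the login returned.
function fetchProtected(string $loginUrl, string $pageUrl, array $credentials)
{
    // file_get_contents() fills $http_response_header in the calling scope.
    if (file_get_contents($loginUrl, false, buildPostContext($credentials)) === false) {
        return false;
    }
    $cookie  = cookieHeaderFrom($http_response_header);
    $context = stream_context_create(array(
        'http' => array('header' => "Cookie: $cookie\r\n"),
    ));
    return file_get_contents($pageUrl, false, $context);
}
```

A call would look like `fetchProtected($loginUrl, $pageUrl, array('username' => $username, 'password' => $password))`, where both URLs and both field names are placeholders. Sites that add a CSRF token to the login form need the token read from the login page first, which this sketch does not do.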

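My answer applies only to form authentication (which is the most common form of authentication).

Basically, when you browse a website, you open a "session" on it. When you log in to the site, your session becomes "authenticated", and you are granted access everywhere based on that.

Your browser identifies the session to the server through a session ID stored in a cookie.

So you must browse the login page first and then the page you want, without forgetting to send the cookie along the way. The cookie is the link between all the pages you browse.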
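The flow just described (login page first, target page second, same cookie throughout) can also be delegated to cURL's built-in cookie engine. This is a minimal sketch, assuming the cURL extension is installed; the function names `newSession` and `request`, the cookie file path, the login URL and the form field names in the usage comment are placeholders, not taken from any real site.

```php
<?php
// Create a cURL handle that remembers cookies between requests.
function newSession(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_COOKIEFILE     => $cookieFile, // enable the cookie engine and load saved cookies
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies back when the handle is closed
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect most logins issue
    ));
    return $ch;
}

// GET the URL, or POST when form fields are given; returns the body or false.
function request($ch, string $url, array $post = array())
{
    curl_setopt($ch, CURLOPT_URL, $url);
    if (!empty($post)) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
    } else {
        curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back to GET after a POST
    }
    return curl_exec($ch);
}

// Usage (hypothetical login URL and field names):
// $ch = newSession(sys_get_temp_dir() . '/crawler_cookies.txt');
// request($ch, 'http://www.site.com/login', array('username' => $user, 'password' => $pass));
// $html = request($ch, 'http://www.site.com/v3/search/results?start=1&sortCol=MyDefault');
```

Because the same handle is reused, every cookie set during the login request is sent automatically with the later requests.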

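I actually ran into the same problem you have a while ago, and wrote a class that does this without my having to keep track of the cookie.

Take a quick look at the class; it is not that important. But look closely at the example below. It lets you submit forms that implement CSRF protection.

The class basically has the following features:
- Complies with CSRF-token-based protection
- Sends a "common" user agent. Some sites reject requests that do not communicate a user agent
- Sends a Referer header. Some sites reject requests that do not communicate a referrer (this is another anti-CSRF protection)
- Stores cookies across calls

File: WebClient.php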

<?php
/**
 * Webclient
 *
 * Helper class to browse the web
 *
 * @author Bgi
 */

class WebClient
{
    private $ch;
    private $cookie = '';
    private $html;

    public function Navigate($url, $post = array()) 
    {
        curl_setopt($this->ch, CURLOPT_URL, $url);
        curl_setopt($this->ch, CURLOPT_COOKIE, $this->cookie);
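        if (!empty($post)) {
            curl_setopt($this->ch, CURLOPT_POST, TRUE);
            curl_setopt($this->ch, CURLOPT_POSTFIELDS, $post);
        } else {
            // Reset to GET; otherwise a handle that has POSTed once keeps POSTing.
            curl_setopt($this->ch, CURLOPT_HTTPGET, TRUE);
        }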
        $response = $this->exec();
        if ($response['Code'] !== 200) {
            return FALSE;
        }
        //echo curl_getinfo($this->ch, CURLINFO_HEADER_OUT);
        return $response['Html'];
    }

    public function getInputs() 
    {
        $return = array();

        $dom = new DOMDocument();
        @$dom->loadHtml($this->html);
        $inputs = $dom->getElementsByTagName('input');
        foreach($inputs as $input)
        {
            if ($input->hasAttributes() && $input->attributes->getNamedItem('name') !== NULL)
            {
                if ($input->attributes->getNamedItem('value') !== NULL)
                    $return[$input->attributes->getNamedItem('name')->value] = $input->attributes->getNamedItem('value')->value;
                else
                    $return[$input->attributes->getNamedItem('name')->value] = NULL;
            }
        }

        return $return;
    }

    public function __construct()
    {
        $this->init();
    }

    public function __destruct()
    {
        $this->close();
    }

    private function init() 
    {
        $this->ch = curl_init();
        curl_setopt($this->ch, CURLOPT_USERAGENT, "Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1");
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($this->ch, CURLOPT_MAXREDIRS, 5);
        curl_setopt($this->ch, CURLINFO_HEADER_OUT, TRUE);
        curl_setopt($this->ch, CURLOPT_HEADER, TRUE);
        curl_setopt($this->ch, CURLOPT_AUTOREFERER, TRUE);
    }

    private function exec() 
    {
        $headers = array();
        $html = '';

        ob_start();
        curl_exec($this->ch);
        $output = ob_get_contents();
        ob_end_clean(); 

        $retcode = curl_getinfo($this->ch, CURLINFO_HTTP_CODE);

        if ($retcode == 200) {
            $separator = strpos($output, "\r\n\r\n");

            $html = substr($output, $separator);

            $h = trim(substr($output,0,$separator));
            $lines = explode("\n", $h);
            foreach($lines as $line) {
                // Limit of 2 keeps header values that contain ':' (URLs, dates) intact.
                $kv = explode(':', $line, 2);

                if (count($kv) == 2) {
                    $k = trim($kv[0]);
                    $v = trim($kv[1]);
                    $headers[$k] = $v;
                }
            }
        }

        // TODO: it would deserve to be tested extensively.
        if (!empty($headers['Set-Cookie']))
            $this->cookie = $headers['Set-Cookie'];

        $this->html = $html;

        return array('Code' => $retcode, 'Headers' => $headers, 'Html' => $html);
    }

    private function close()
    {
        curl_close($this->ch);
    }
}
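
One caveat about the header parsing in exec() above: with CURLOPT_FOLLOWLOCATION enabled, the raw output contains one header block per redirect hop, and a response may carry several Set-Cookie lines, of which the class keeps only one. The following is a minimal sketch of a more tolerant parser, assuming the raw output and the value of `curl_getinfo($ch, CURLINFO_HEADER_SIZE)` are passed in. `splitResponse` is a hypothetical helper, not part of the class.

```php
<?php
// Split raw cURL output (captured with CURLOPT_HEADER) into a Cookie header and the body.
// $headerSize is the total size of all header blocks, as reported by CURLINFO_HEADER_SIZE.
function splitResponse(string $raw, int $headerSize): array
{
    $cookies = array();
    foreach (explode("\r\n", substr($raw, 0, $headerSize)) as $line) {
        // Limit of 2 keeps colons inside the header value intact.
        $kv = explode(':', $line, 2);
        if (count($kv) !== 2 || strcasecmp(trim($kv[0]), 'Set-Cookie') !== 0) {
            continue; // status lines, blank lines and other headers
        }
        // Keep only "name=value"; drop attributes such as Path or Expires.
        $pair = explode('=', trim(explode(';', $kv[1], 2)[0]), 2);
        if (count($pair) === 2) {
            $cookies[$pair[0]] = $pair[1]; // a later hop overrides an earlier one
        }
    }

    $parts = array();
    foreach ($cookies as $name => $value) {
        $parts[] = $name . '=' . $value;
    }

    return array(
        'Cookie' => implode('; ', $parts),
        'Html'   => substr($raw, $headerSize),
    );
}
```

The returned 'Cookie' string is in the form CURLOPT_COOKIE expects, so it could replace the single Set-Cookie value the class stores.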

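Build username and password handling into the crawler. Use very long, random, unusable passwords.

Pretty sure you cannot crawl secured pages without logging in. That would defeat the purpose. Automated flirting?

Sounds like you want to abuse the system.

I do have the username and password.

Sounds creepy, and it is almost certainly a violation of the terms. Read the site's ToS, look for their API, get a key and implement that.

I used your code, but it shows "Failed to post credentials". Is there any way to use the browser's cookies?

Are you sure you identified the input names correctly (username and passwd in my example, but they may be different on your site)? You should try debugging the headers.
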
<?php
    require_once('WebClient.php');
    // $username, $passwd and $file (path of the package to upload) must be set beforehand.
    $url = 'http://example.com/administrator/index.php'; // This is a Joomla admin URL

    $wc = new WebClient();
    $page = $wc->Navigate($url);
    if ($page === FALSE) {
         die('Failed to load login page.');
    }

    echo('Logging in...');

    $post = $wc->getInputs();
    $post['username'] = $username;
    $post['passwd'] = $passwd;

    $page = $wc->Navigate($url, $post);
    if ($page === FALSE) {
        die('Failed to post credentials.');
    }

    echo('Initializing installation...');

    $page = $wc->Navigate($url.'?option=com_installer');
    if ($page === FALSE) {
        die('Failed to access installer.');
    }

    echo('Installing...');

    $post = $wc->getInputs();
    // The old '@'.$file upload syntax no longer works since PHP 7; CURLFile (PHP 5.5+) replaces it.
    $post['install_package'] = new CURLFile($file);

    $page = $wc->Navigate($url.'?option=com_installer&view=install', $post);
    if ($page === FALSE) {
        die('Failed to upload file.');
    }

    echo('Done.');