PHP: Fetching a page from Twitter or Facebook anonymously through cURL

php facebook curl twitter

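I'm trying to build a kind of page parser (more specifically, one that highlights certain words on a page), but I've run into some problems. I use cURL to fetch the whole page from a URL. It works well for most pages, but not for others.

My goal is to get the HTML of any page exactly as a browser gets it, and I want to do it anonymously, just like a browser. I mean that if a page requires a login before it shows its data in the browser, I'm not interested in it. The problem is that I can't fetch Twitter or Facebook pages that an ordinary browser can open anonymously, even when I set all the headers that Firefox or Chrome normally send.

Is there a way to simply simulate a browser to fetch pages from these sites, or do I have to use OAuth (and can someone explain why browsers don't need it)?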

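Edit: I found the solution! If anyone runs into this problem, you should:
-> try switching the protocol from https to http
-> throw out the /#!/ element, if there is one in the URL
-> for my cURL headers, "Accept-Encoding: gzip, deflate" was also causing problems. I don't know why, but everything works now.

My code: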

    // Switch the protocol from https to http
    if (substr($this->url, 0, 5) == 'https')
        $this->url = str_replace('https://', 'http://', $this->url);

    // Throw out the /#!/ element, if the URL has one
    $this->url = str_replace('/#!/', '/', $this->url);

    // Check whether a valid URL was provided
    if (!filter_var($this->url, FILTER_VALIDATE_URL))
        return false;

    $curl = curl_init();

    // Request headers, modelled on what Firefox sends
    $header = array();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    // -> gives an error: $header[] = "Accept-Encoding: gzip, deflate";
    $header[] = "Accept-Language: pl,en-us;q=0.7,en;q=0.3";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Pragma: "; // browsers keep this blank.
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_HEADER, false); // body only, no response headers in the output

    curl_setopt($curl, CURLOPT_URL, $this->url);

    // Keep cookies between requests and follow redirects
    curl_setopt($curl, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($curl, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($curl, CURLOPT_COOKIESESSION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (.NET CLR 3.5.30729)');
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);

    $response = curl_exec($curl);
    curl_close($curl);

    // Return the page on success, false otherwise
    if ($response) return $response;

    return false;

It's all inside a class, but you can easily extract the code. For me, it works fine for both Twitter and Facebook.
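A likely explanation for the `Accept-Encoding` problem: when that header is set by hand through `CURLOPT_HTTPHEADER`, cURL does not decompress the response, so the gzip bytes come back as-is. The following is a minimal sketch, assuming the standard PHP cURL extension; the function names `build_fetch_options` and `fetch_page` are hypothetical and not part of the code above. It lets cURL negotiate and decode compression itself through `CURLOPT_ENCODING`.

```php
<?php
// Hypothetical helper: builds the option array separately so it can be
// inspected without making a network call.
function build_fetch_options(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        // Empty string: advertise every encoding this cURL build supports
        // and decode the response body automatically.
        CURLOPT_ENCODING       => '',
        CURLOPT_HTTPHEADER     => ['Accept-Language: pl,en-us;q=0.7,en;q=0.3'],
    ];
}

// Hypothetical wrapper: returns the decoded body as a string, or false on failure.
function fetch_page(string $url)
{
    $curl = curl_init();
    curl_setopt_array($curl, build_fetch_options($url));
    return curl_exec($curl);
}
```

Calling `fetch_page('http://example.com/')` would then return plain HTML even when the server compresses the response.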

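Yes, it is possible to simulate a browser, but you need to look carefully at all the HTTP headers the browser sends (including cookies), and you also need to handle redirects. Some of this can be "automated" through the cURL functions; the rest you will have to handle manually.

Note: I'm not talking about the HTML headers in the code; these are the HTTP headers that the browser sends and receives.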

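The easiest way to spot these is to monitor the traffic in the browser. Pick a URL, right-click and choose "Inspect element", and you will see the headers sent and the headers received.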

Facebook makes this more complicated through a series of iframes, so I suggest you start with a simpler site.
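The same inspection can be done from PHP itself. Below is a rough sketch, assuming the standard PHP cURL extension; `parse_header_line` and `fetch_with_headers` are hypothetical names introduced for illustration. It records the request headers cURL sent, every response header received (across all redirects), and where the request finally ended up.

```php
<?php
// Hypothetical helper: splits one raw header line into [name, value].
// Returns null for status lines and for the blank line that ends a header block.
function parse_header_line(string $line): ?array
{
    $line = trim($line);
    if ($line === '' || strpos($line, ':') === false) {
        return null;
    }
    [$name, $value] = explode(':', $line, 2);
    return [strtolower(trim($name)), trim($value)];
}

// Hypothetical wrapper: fetches $url and reports what happened on the wire.
function fetch_with_headers(string $url): array
{
    $received = [];
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,
        CURLINFO_HEADER_OUT    => true, // keep the request headers for inspection
        // Called once per response header line, for every hop of a redirect chain.
        CURLOPT_HEADERFUNCTION => function ($curl, string $line) use (&$received) {
            $parsed = parse_header_line($line);
            if ($parsed !== null) {
                $received[] = $parsed;
            }
            return strlen($line); // cURL expects the number of bytes handled
        },
    ]);
    $body = curl_exec($curl);

    return [
        'body'      => $body,
        'sent'      => curl_getinfo($curl, CURLINFO_HEADER_OUT),
        'received'  => $received,
        'final_url' => curl_getinfo($curl, CURLINFO_EFFECTIVE_URL),
        'redirects' => curl_getinfo($curl, CURLINFO_REDIRECT_COUNT),
    ];
}
```

Comparing the `sent` string with what the browser shows for the same URL is a quick way to see which header or cookie is missing.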

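Thanks, but I was already sending HTTP headers to get the HTML page =) It's all solved (see the first post), but thanks for the procedure, it's very useful.

Good stuff: you should post this as an answer (not just as an edit). You can answer your own question, and that makes it clearer for others, because I'm sure other people will run into this too!

Also, if the URL contains "/home.php#!/", you should strip that part as well, because it forces you onto the login page. So the third line should be replaced with something like this:

    $this->url = str_replace('/home.php#!/', '/', $this->url);
    $this->url = str_replace('/#!/', '/', $this->url);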
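The two replacements suggested in this comment can be wrapped in one small helper so the cleanup is easy to check on its own. This is only a sketch; the function name `strip_hashbang` and the example URLs are made up for illustration.

```php
<?php
// Hypothetical helper: rewrites hashbang-style URLs into plain paths.
function strip_hashbang(string $url): string
{
    // '/home.php#!/' needs its own rule: the generic '/#!/' pattern does not
    // match it, because '#!' follows 'home.php' there rather than a slash.
    $url = str_replace('/home.php#!/', '/', $url);

    // Generic case, e.g. 'http://twitter.com/#!/someuser'.
    return str_replace('/#!/', '/', $url);
}
```

URLs without a hashbang segment pass through unchanged, so the helper can safely be applied to every URL before the `filter_var` check.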