如何使用php检测搜索引擎机器人？_Php_Web Crawler_Bots

如何使用php检测搜索引擎机器人？

php web-crawler bots

如何使用php检测搜索引擎机器人？,php,web-crawler,bots,Php,Web Crawler,Bots,如何使用php检测搜索引擎机器人？您可以分析用户代理（$\u服务器['HTTP\u用户代理]]），或者将客户端的IP地址（$\u服务器['REMOTE\u ADDR']）与一个比较。这里是一个示例然后使用$\u SERVER['HTTP\u USER\u AGENT']检查代理是否为蜘蛛 if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot")) { // what to do } 检查$\u服务器['HTTP\u

如何使用php检测搜索引擎机器人？

您可以分析用户代理（

$\u服务器['HTTP\u用户代理]]

），或者将客户端的IP地址（

$\u服务器['REMOTE\u ADDR']

）与一个比较。

这里是一个示例

然后使用

$\u SERVER['HTTP\u USER\u AGENT']

检查代理是否为蜘蛛

if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
    // what to do
}

检查

$\u服务器['HTTP\u USER\u AGENT']

以了解此处列出的一些字符串：

或者更具体地说，对于爬虫：

如果你想记录最常见的搜索引擎爬虫的访问次数，你可以使用

$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) .')/';
$matches = array();
$numMatches = preg_match($pattern, strtolower($_SERVER['HTTP_USER_AGENT']), $matches, 'i');
if($numMatches > 0) // Found a match
{
  // $matches[1] contains an array of all text matches to either 'google' or 'yahoo'
}

这将是蜘蛛理想的隐身方式。它来自一个名为[YACG]的开源脚本

需要做一点工作，但肯定是要做的。

我使用了以下代码，似乎工作正常：

function _bot_detected() {

  return (
    isset($_SERVER['HTTP_USER_AGENT'])
    && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
  );
}

更新16-06-2017

添加了mediapartners

您可以检查它是否是具有此功能的搜索引擎：

<?php
function crawlerDetect($USER_AGENT)
{
$crawlers = array(
'Google' => 'Google',
'MSN' => 'msnbot',
      'Rambler' => 'Rambler',
      'Yahoo' => 'Yahoo',
      'AbachoBOT' => 'AbachoBOT',
      'accoona' => 'Accoona',
      'AcoiRobot' => 'AcoiRobot',
      'ASPSeek' => 'ASPSeek',
      'CrocCrawler' => 'CrocCrawler',
      'Dumbot' => 'Dumbot',
      'FAST-WebCrawler' => 'FAST-WebCrawler',
      'GeonaBot' => 'GeonaBot',
      'Gigabot' => 'Gigabot',
      'Lycos spider' => 'Lycos',
      'MSRBOT' => 'MSRBOT',
      'Altavista robot' => 'Scooter',
      'AltaVista robot' => 'Altavista',
      'ID-Search Bot' => 'IDBot',
      'eStyle Bot' => 'eStyle',
      'Scrubby robot' => 'Scrubby',
      'Facebook' => 'facebookexternalhit',
  );
  // to get crawlers string used in function uncomment it
  // it is better to save it in string than use implode every time
  // global $crawlers
   $crawlers_agents = implode('|',$crawlers);
  if (strpos($crawlers_agents, $USER_AGENT) === false)
      return false;
    else {
    return TRUE;
    }
}
?>

然后你可以像这样使用它：

<?php $USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
  if(crawlerDetect($USER_AGENT)) return "no need to lang redirection";?>

因为任何客户端都可以将用户代理设置为他们想要的，所以查找“谷歌机器人”、“bingbot”等只需要完成一半的工作

第二部分是验证客户端的IP。在过去，这需要维护IP列表。你在网上找到的所有列表都过时了。正如谷歌和必应所解释的那样，顶级搜索引擎正式支持通过DNS进行验证

首先，对客户端IP执行反向DNS查找。对于谷歌来说，这会在googlebot.com下添加一个主机名，而对于必应来说，则会在search.msn.com下添加一个主机名。然后，因为有人可以在他的IP上设置这样一个反向DNS，所以您需要在该主机名上使用正向DNS查找进行验证。如果生成的IP与站点访问者的IP相同，那么您可以确定它是来自该搜索引擎的爬虫

我已经用Java编写了一个库来为您执行这些检查。请随意将其移植到PHP。它在GitHub上：

使用设备检测器开源库，它提供了一个isBot（）函数：

我正在使用这段代码，非常好。你将很容易知道用户代理访问你的网站。此代码正在打开一个文件并将用户代理写入该文件。您可以每天访问

yourdomain.com/useragent.txt

查看此文件，了解新的用户代理，并将其置于if子句的条件中

$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if(!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)){
    // if not meet the conditions then
    // do what you need

    // here open a file and write the user_agent down the file. You can check each day this file useragent.txt and know about new user_agents and put them in your condition of if clause
    if($user_agent!=""){
        $myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
        fwrite($myfile, $user_agent);
        $user_agent = "\n";
        fwrite($myfile, $user_agent);
        fclose($myfile);
    }
}

这是useragent.txt的内容

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174

我用这个来检测机器人：

if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
    // is bot
}

此外，我使用白名单阻止不需要的机器人：

if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
    // allowed bot
}

一个不需要的机器人（=假阳性用户）随后能够解决验证码问题，从而在24小时内解除自己的锁定。由于没有人解决这个验证码，我知道它不会产生误报。所以机器人检测似乎工作得很好

注意：我的白名单基于。

我使用此功能。。。regex的一部分来自prestashop，但我添加了更多的机器人

    public function isBot()
{
    $bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|H�m�h�kki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
    $userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
    $isBot = !$userAgent || preg_match($bot_regex, $userAgent);

    return $isBot;
}

无论如何，要小心一些机器人使用类似浏览器的用户代理来伪造他们的身份
（我的网站上有许多俄罗斯ip具有这种行为）

大多数机器人的一个显著特点是，它们不携带任何cookie，因此没有会话连接到它们。

（我不知道怎么做，但这肯定是跟踪它们的最佳方式）

我为此制作了一个又好又快的函数

function is_bot(){

        if(isset($_SERVER['HTTP_USER_AGENT']))
        {
            return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1.7.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
        }

        return false;
    }

这涵盖了99%的所有可能的机器人、搜索引擎等。

100%的机器人检测器工作。它正在我的网站上成功运行

function isBotDetected() {

    if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
    ) {
        return true; // 'Above given bots detected'
    }

    return false;

} // End :: isBotDetected()

对于谷歌，我使用这种方法

function is_google() {
    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr( $ip );
    if ( strpos( $host, '.google.com' ) !== false || strpos( $host, '.googlebot.com' ) !== false ) {

        $forward_lookup = gethostbyname( $host );

        if ( $forward_lookup == $ip ) {
            return true;
        }

        return false;
    } else {
        return false;
    }

}

var_dump( is_google() );

信用证：

如果你真的需要检测谷歌引擎机器人，你应该永远不要依赖“用户代理”或“IP”地址，因为“用户代理”是可以更改的，根据谷歌在：
验证Googlebot是否为调用方：
1.使用host命令，从日志中对访问IP地址运行反向DNS查找
2.验证域名是否位于googlebot.com或google.com
3.使用主机命令对在步骤1中检索到的域名运行前向DNS查找。验证它是否与日志中的原始访问IP地址相同
以下是我的测试代码：

<?php $remote_add=$_SERVER['REMOTE_ADDR']; $hostname = gethostbyaddr($remote_add); $googlebot = 'googlebot.com'; $google = 'google.com'; if (stripos(strrev($hostname), strrev($googlebot)) === 0 or stripos(strrev($hostname),strrev($google)) === 0 ) { //add your code } ?>

在这段代码中，我们检查“hostname”，它应该在“hostname”的末尾包含“googlebot.com”或“google.com”，这对于检查确切的域而不是子域非常重要。
我希望你喜欢；）
可能会迟到，但隐藏的链接呢。所有机器人都将使用rel属性follow，只有坏机器人才会使用NOFOLLER rel属性

<a style="display:none;" rel="follow" href="javascript:void(0);" onclick="isabot();">.</a> function isabot(){ //define a variable to pass with ajax to php // || send bots info direct to where ever. isabot = true; }

函数isabot（）{ //定义一个用ajax传递给php的变量 //| |将机器人信息直接发送到任何地方。 isabot=真； }
对于坏机器人，您可以使用以下方法：

<a style="display:none;" href="javascript:void(0);" rel="nofollow" onclick="isBadbot();">.</a>

对于特定于PHP，您可以删除onclick属性，并使用指向ip检测器/bot检测器的链接替换href属性，如下所示：

<a style="display:none;" rel="follow" href="https://somedomain.com/botdetector.php">.</a>

或

你可以使用它，也许两者都可以使用，一个检测到一个机器人，而另一个证明它是一个坏机器人

希望您能发现这个有用的
如果（（eregi（“yahoo”、$this->USER_AGENT））&（$this->USER_AGENT”）{$this->Browser=“yahoo！slurp”；$this->Type=“robot”；}这个工作正常吗？因为strop可以返回0（位置），strstrstrstr失败时返回FALSE，如果添加一个！==在结束时进行错误检查。Erm，
strpos
在失败时也返回
false
。它更快、更高效（没有预处理，也没有O（m）存储）。那么假冒的用户代理呢？！如果有人可以用假名更改他的用户代理，并将其命名为“谷歌机器人”呢？我认为检查ip范围更值得信赖！这是否假设机器人会暴露自己？投票否决，用户代理可以在chrome设置中更改，firefox，是的，用户代理可以更改，但如果有人将其更改为包含“机器人”、“爬网”
<a style="display:none;" href="javascript:void(0);" rel="nofollow" onclick="isBadbot();">.</a>

<a style="display:none;" rel="follow" href="https://somedomain.com/botdetector.php">.</a>

<a style="display:none;" rel="nofollow" href="https://somedomain.com/badbotdetector.php">.</a>