Detecting honest web crawlers in C#


I want to detect (on the server side) which requests come from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like "bot", but that seems awkward, incomplete, and unmaintainable. So does anyone have a more solid approach? If not, do you have any resources you use to stay up to date with all the friendly user agents?

In case you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However, if a web crawler is detected, we'd always give it the same version so that the index is consistent.


I use Java, but I imagine the approach would be similar for any server-side technology.

Any visitor whose entry page is /robots.txt is probably a bot.

You can find a very thorough database of known "good" web crawlers at robotstxt.org. Utilizing this data would be far more effective than just matching "bot" in the user agent.
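
A minimal sketch of that first heuristic in C# (to match the rest of this page); the class and dictionary names are made up for illustration, and a real implementation would expire entries instead of growing forever:

using System;
using System.Collections.Concurrent;
using System.Web;

// Sketch: remember which client IPs have fetched /robots.txt.
public static class RobotsTxtTracker
{
    private static readonly ConcurrentDictionary<string, bool> Fetched =
        new ConcurrentDictionary<string, bool>();

    // Call this from Application_BeginRequest (or similar) for every request.
    public static void Record(HttpRequest request)
    {
        if (request.Path.Equals("/robots.txt", StringComparison.OrdinalIgnoreCase))
            Fetched[request.UserHostAddress] = true;
    }

    public static bool IsProbablyBot(HttpRequest request)
    {
        return Fetched.ContainsKey(request.UserHostAddress);
    }
}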

One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users won't see the link, leaving spiders and bots to follow it. For example, an empty anchor tag pointing to a subfolder would record a GET request in your logs:

<a href="dontfollowme.aspx"></a>


Many people use this method while running a honeypot to trap malicious bots that don't follow the robots.txt file. I use the empty anchor method in an article I wrote to trap and block those creepy crawlers…
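
If you go the hidden-anchor route, the trap page itself only needs to record who followed it. Here is a rough code-behind sketch for a page like dontfollowme.aspx; the log path is a placeholder:

using System;
using System.IO;
using System.Web.UI;

// Sketch: anything that requests this page followed a link no human can see.
public partial class DontFollowMe : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string line = string.Format("{0:u}\t{1}\t{2}",
            DateTime.UtcNow,
            Request.UserHostAddress,
            Request.UserAgent ?? "(no user agent)");

        // Append to a simple log; a honeypot might also block the IP here
        // if the URL was disallowed in robots.txt and fetched anyway.
        File.AppendAllText(Server.MapPath("~/App_Data/crawler-trap.log"),
            line + Environment.NewLine);
    }
}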

Something quick and dirty like this might be a good start:

return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i

Note: Rails code, but the regex is generally applicable.

I'm pretty sure a large proportion of bots don't use robots.txt, but that was my first thought.


It seems to me that the best way to detect a bot is with the time between requests: if the time between requests is consistently fast, then it's a bot.
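
That heuristic is easy to prototype: keep the last-request time per IP and count consecutive short gaps. A rough sketch follows; the 500 ms gap and the 10-hit streak are arbitrary assumptions, not tuned values:

using System;
using System.Collections.Concurrent;

// Sketch: flag an IP once its inter-request gaps are consistently short.
// Not thread-safe per entry; good enough to illustrate the idea.
public static class RequestTimingDetector
{
    private class Stats { public DateTime Last = DateTime.MinValue; public int FastHits; }

    private static readonly ConcurrentDictionary<string, Stats> ByIp =
        new ConcurrentDictionary<string, Stats>();

    private static readonly TimeSpan FastGap = TimeSpan.FromMilliseconds(500);
    private const int StreakBeforeFlag = 10;

    public static bool LooksLikeBot(string ip)
    {
        DateTime now = DateTime.UtcNow;
        Stats s = ByIp.GetOrAdd(ip, _ => new Stats());
        if (now - s.Last < FastGap) s.FastHits++;
        else s.FastHits = 0;            // a human-like pause resets the streak
        s.Last = now;
        return s.FastHits >= StreakBeforeFlag;
    }
}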

You said matching the user agent on "bot" may be awkward, but we've found it to be a very good match. Our studies have shown that it covers about 98% of the hits you receive, and we haven't come across any false positive matches with it either. If you want to raise that to 99.9%, you can include a few other well-known matches such as "crawler", "baiduspider", "ia_archiver", "curl", etc. We have tested this on our production systems over millions of hits.

Here are a few C# solutions for you:

1) Simple. Fastest when processing a miss, i.e. traffic from a non-bot (a normal user). Catches 99%+ of crawlers.

bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);
2) Medium. Fastest when processing a hit, i.e. traffic from a bot, and still fast for misses. Catches close to 100% of crawlers. Matches "bot", "crawler", "spider" up front. You can add any other known crawlers to it.

// Substrings found in known crawler user agents; matched case-insensitively below.
List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",            
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
3) Paranoid. Pretty fast, but a little slower than options 1 and 2. It's the most accurate, and it lets you maintain the lists if you want. You can keep a separate list of names containing "bot" if you are afraid of false positives in the future. If we get a short match, we log it and check it for a false positive.

// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};

// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};

string ua = Request.UserAgent.ToLower();
string match = null;

if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));

if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);

bool iscrawler = match != null;
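
One caveat for all three variants: Request.UserAgent can be null (some bots send no user agent at all), so calling ToLower() on it will throw. A null-safe wrapper around the paranoid check might look like the sketch below, assuming Crawlers1 and Crawlers2 have been hoisted to static fields and Log is whatever logging you already have:

using System.Linq;
using System.Web;

public static bool IsCrawler(HttpRequest request)
{
    // A missing user agent is suspicious in itself; treating it as a
    // crawler here is a judgment call, not part of the original answer.
    string ua = request.UserAgent;
    if (string.IsNullOrEmpty(ua)) return true;

    ua = ua.ToLower();
    string match = ua.Contains("bot")
        ? Crawlers1.FirstOrDefault(x => ua.Contains(x))
        : Crawlers2.FirstOrDefault(x => ua.Contains(x));

    // Short matches ("bot", "jbot", ...) are the likeliest false positives.
    if (match != null && match.Length < 5)
        Log("Possible new crawler found: ", ua);

    return match != null;
}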