Detecting honest web crawlers in C#


I want to detect (on the server side) which requests come from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like "bot", but that seems awkward, incomplete, and unmaintainable. So does anyone have a more solid approach? If not, do you have any resources you use to stay up to date with all the friendly user agents?

In case you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However, if a web crawler is detected, we'd always give it the same version so that the index is consistent.


I use Java, but I imagine the approach would be similar for any server-side technology.

Any visitor whose entry page is /robots.txt is probably a bot.

You can find a very thorough database of known "good" web crawlers at robotstxt.org. Utilizing this data would be far more effective than just matching "bot" in the user agent.
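
A minimal sketch of that first heuristic in C# (to match the rest of this page); the class and dictionary names are made up for illustration, and a real implementation would expire entries instead of growing forever:

using System;
using System.Collections.Concurrent;
using System.Web;

// Sketch: remember which client IPs have fetched /robots.txt.
public static class RobotsTxtTracker
{
    private static readonly ConcurrentDictionary<string, bool> Fetched =
        new ConcurrentDictionary<string, bool>();

    // Call this from Application_BeginRequest (or similar) for every request.
    public static void Record(HttpRequest request)
    {
        if (request.Path.Equals("/robots.txt", StringComparison.OrdinalIgnoreCase))
            Fetched[request.UserHostAddress] = true;
    }

    public static bool IsProbablyBot(HttpRequest request)
    {
        return Fetched.ContainsKey(request.UserHostAddress);
    }
}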

One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users won't see the link, leaving spiders and bots to follow it. For example, an empty anchor tag pointing to a subfolder would record a GET request in your logs:

<a href="dontfollowme.aspx"></a>


Many people use this method while running a honeypot to trap malicious bots that don't follow the robots.txt file. I use the empty anchor method in an article I wrote to trap and block those creepy crawlers…
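
If you go the hidden-anchor route, the trap page itself only needs to record who followed it. Here is a rough code-behind sketch for a page like dontfollowme.aspx; the log path is a placeholder:

using System;
using System.IO;
using System.Web.UI;

// Sketch: anything that requests this page followed a link no human can see.
public partial class DontFollowMe : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string line = string.Format("{0:u}\t{1}\t{2}",
            DateTime.UtcNow,
            Request.UserHostAddress,
            Request.UserAgent ?? "(no user agent)");

        // Append to a simple log; a honeypot might also block the IP here
        // if the URL was disallowed in robots.txt and fetched anyway.
        File.AppendAllText(Server.MapPath("~/App_Data/crawler-trap.log"),
            line + Environment.NewLine);
    }
}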

Something quick and dirty like this might be a good start:

return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i

Note: Rails code, but the regex is generally applicable.

I'm pretty sure a large proportion of bots don't use robots.txt, but that was my first thought.


It seems to me that the best way to detect a bot is with the time between requests: if the time between requests is consistently fast, then it's a bot.
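
That heuristic is easy to prototype: keep the last-request time per IP and count consecutive short gaps. A rough sketch follows; the 500 ms gap and the 10-hit streak are arbitrary assumptions, not tuned values:

using System;
using System.Collections.Concurrent;

// Sketch: flag an IP once its inter-request gaps are consistently short.
// Not thread-safe per entry; good enough to illustrate the idea.
public static class RequestTimingDetector
{
    private class Stats { public DateTime Last = DateTime.MinValue; public int FastHits; }

    private static readonly ConcurrentDictionary<string, Stats> ByIp =
        new ConcurrentDictionary<string, Stats>();

    private static readonly TimeSpan FastGap = TimeSpan.FromMilliseconds(500);
    private const int StreakBeforeFlag = 10;

    public static bool LooksLikeBot(string ip)
    {
        DateTime now = DateTime.UtcNow;
        Stats s = ByIp.GetOrAdd(ip, _ => new Stats());
        if (now - s.Last < FastGap) s.FastHits++;
        else s.FastHits = 0;            // a human-like pause resets the streak
        s.Last = now;
        return s.FastHits >= StreakBeforeFlag;
    }
}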

You said matching the user agent on "bot" may be awkward, but we've found it to be a very good match. Our studies have shown that it covers about 98% of the hits you receive, and we haven't come across any false positive matches with it either. If you want to raise that to 99.9%, you can include a few other well-known matches such as "crawler", "baiduspider", "ia_archiver", "curl", etc. We have tested this on our production systems over millions of hits.

Here are a few C# solutions for you:

1) Simple. Fastest when processing a miss, i.e. traffic from a non-bot (a normal user). Catches 99%+ of crawlers.

bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);
2) Medium. Fastest when processing a hit, i.e. traffic from a bot, and still fast for misses. Catches close to 100% of crawlers. Matches "bot", "crawler", "spider" up front. You can add any other known crawlers to it.

// Substrings found in known crawler user agents; matched case-insensitively below.
List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",            
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
3) Paranoid. Pretty fast, but a little slower than options 1 and 2. It's the most accurate, and it lets you maintain the lists if you want. You can keep a separate list of names containing "bot" if you are afraid of false positives in the future. If we get a short match, we log it and check it for a false positive.

// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};

// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};

string ua = Request.UserAgent.ToLower();
string match = null;

if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));

if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);

bool iscrawler = match != null;
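
One caveat for all three variants: Request.UserAgent can be null (some bots send no user agent at all), so calling ToLower() on it will throw. A null-safe wrapper around the paranoid check might look like the sketch below, assuming Crawlers1 and Crawlers2 have been hoisted to static fields and Log is whatever logging you already have:

using System.Linq;
using System.Web;

public static bool IsCrawler(HttpRequest request)
{
    // A missing user agent is suspicious in itself; treating it as a
    // crawler here is a judgment call, not part of the original answer.
    string ua = request.UserAgent;
    if (string.IsNullOrEmpty(ua)) return true;

    ua = ua.ToLower();
    string match = ua.Contains("bot")
        ? Crawlers1.FirstOrDefault(x => ua.Contains(x))
        : Crawlers2.FirstOrDefault(x => ua.Contains(x));

    // Short matches ("bot", "jbot", ...) are the likeliest false positives.
    if (match != null && match.Length < 5)
        Log("Possible new crawler found: ", ua);

    return match != null;
}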