How to stop all bots from crawling my Twitter links


Whenever I post a URL from my own custom URL shortener in a tweet, a lot of crawlers/spiders hit my page; see the access log below.

I recently updated my robots.txt to:

#Code to not allow any search engines!
User-agent: *
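Disallow: /

How can I stop them from accessing my URLs? The problem is that I increment a view count every time someone visits one of my URLs, so these bot hits inflate the count. Would it work to look for these IP addresses and skip the increment when a request comes from one of them?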
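
As a sanity check on the robots.txt rules quoted above, here is a minimal sketch that parses them with Python's standard `urllib.robotparser` and asks whether a given crawler may fetch a short link. The helper name `is_allowed` is hypothetical and the user-agent strings are only examples. Note that this only shows what a well-behaved crawler would conclude; it does nothing to stop a client that never reads robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The rules from the question, copied verbatim.
ROBOTS_TXT = """\
#Code to not allow any search engines!
User-agent: *
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a crawler honoring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Example agents only; any name falls under the "*" group here.
    for agent in ("Twitterbot", "PaperLiBot", "SomeOtherBot"):
        print(agent, is_allowed(agent, "http://awgo.to/378"))
```

With these rules every path should be reported as disallowed for every agent, which matches the intent of the file but relies entirely on the crawler's cooperation.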

Access log:

69.164.197.15 - - [21/Aug/2014:16:43:09 -0400] "GET /378 HTTP/1.1" 302 4 "http://api.twitter.com/1/statuses/show/502556534631854081.json" "InAGist URL Resolver (http://inagist.com)"

46.236.24.53 - - [21/Aug/2014:16:43:10 -0400] "GET /378 HTTP/1.1" 302 30 "-" "-"

107.22.95.116 - - [21/Aug/2014:16:43:10 -0400] "GET /378 HTTP/1.1" 302 4 "-" "help@dataminr.com"

50.18.102.132 - - [21/Aug/2014:16:43:11 -0400] "HEAD /378 HTTP/1.1" 302 - "-" "Google-HTTP-Java-Client/1.17.0-rc (gzip)"

23.23.112.130 - - [21/Aug/2014:16:43:11 -0400] "GET /378 HTTP/1.1" 302 4 "-" "Java/1.7.0_25"

50.18.102.132 - - [21/Aug/2014:16:43:11 -0400] "HEAD /378 HTTP/1.1" 302 - "-" "Google-HTTP-Java-Client/1.17.0-rc (gzip)"

38.88.172.134 - - [21/Aug/2014:16:43:12 -0400] "GET /user/ HTTP/1.1" 200 3880 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

38.88.172.134 - - [21/Aug/2014:16:43:13 -0400] "GET /user/%3C?php%20echo%20$config[ HTTP/1.1" 404 719 "http://awgo.to/user/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

65.52.244.38 - - [21/Aug/2014:16:43:22 -0400] "GET /378 HTTP/1.1" 302 30 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"

38.88.172.134 - - [21/Aug/2014:16:43:23 -0400] "GET /user/ HTTP/1.1" 200 3880 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

38.88.172.134 - - [21/Aug/2014:16:43:23 -0400] "GET /user/%3C?php%20echo%20$config[ HTTP/1.1" 404 719 "http://awgo.to/user/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

199.16.156.126 - - [21/Aug/2014:16:43:24 -0400] "GET /robots.txt HTTP/1.1" 200 82 "-" "Twitterbot/1.0"

37.59.16.161 - - [21/Aug/2014:16:43:28 -0400] "GET /378 HTTP/1.1" 302 30 "-" "Mozilla/5.0 (compatible; PaperLiBot/2.1; http://support.paper.li/entries/20023257-what-is-paper-li)"

38.88.172.134 - - [21/Aug/2014:16:43:31 -0400] "GET /user/ HTTP/1.1" 200 3883 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

38.88.172.134 - - [21/Aug/2014:16:43:32 -0400] "GET /user/%3C?php%20echo%20$config[ HTTP/1.1" 404 719 "http://awgo.to/user/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

107.155.73.2 - - [21/Aug/2014:16:43:33 -0400] "HEAD /378 HTTP/1.1" 302 - "-" "MetaURI API/2.0 +metauri.com"
107.155.73.2 - - [21/Aug/2014:16:43:33 -0400] "HEAD /378 HTTP/1.1" 302 - "-" "MetaURI API/2.0 +metauri.com"

199.16.156.126 - - [21/Aug/2014:16:43:45 -0400] "GET /robots.txt HTTP/1.1" 200 82 "-" "Twitterbot/1.0"

142.4.216.19 - - [21/Aug/2014:16:44:02 -0400] "HEAD /378 HTTP/1.1" 302 - "-" "Mozilla/5.0 (compatible; OpenHoseBot/2.1; +http://www.openhose.org/bot.html)"

142.4.216.19 - - [21/Aug/2014:16:44:03 -0400] "GET /378 HTTP/1.1" 302 30 "-" "Mozilla/5.0 (compatible; OpenHoseBot/2.1; +http://www.openhose.org/bot.html)"

74.112.131.246 - - [21/Aug/2014:16:44:09 -0400] "GET /378 HTTP/1.1" 302 30 "-" "Mozilla/5.0 ()"

199.59.148.210 - - [21/Aug/2014:16:44:09 -0400] "GET /robots.txt HTTP/1.1" 200 82 "-" "Twitterbot/1.0"

38.88.172.134 - - [21/Aug/2014:16:44:27 -0400] "GET /admin/url HTTP/1.1" 200 3051 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"

54.167.203.95 - - [21/Aug/2014:16:44:51 -0400] "HEAD /378 HTTP/1.1" 302 - "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9) Gecko/2008052906 Firefox/3.0"

199.59.148.209 - - [21/Aug/2014:16:45:36 -0400] "GET /robots.txt HTTP/1.1" 200 82 "-" "Twitterbot/1.0"

54.226.56.247 - - [21/Aug/2014:16:45:43 -0400] "GET /378 HTTP/1.1" 302 4 "http://awgo.to/378" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.112 Safari/534.30"

109.68.166.70 - - [21/Aug/2014:16:45:49 -0400] "HEAD /378 HTTP/1.1" 302 - "-" "-"

94.23.27.149 - - [21/Aug/2014:16:47:15 -0400] "GET /robots.txt HTTP/1.1" 200 82 "-" "Mozilla/5.0 (compatible; Kraken/0.1; http://linkfluence.net/; bot@linkfluence.net)"

54.237.153.130 - - [21/Aug/2014:16:47:16 -0400] "HEAD /378 HTTP/1.1" 302 - "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9) Gecko/2008052906 Firefox/3.0"

184.73.210.145 - - [21/Aug/2014:16:48:25 -0400] "GET /378 HTTP/1.1" 302 4 "-" "Python-urllib/2.7"

104.130.132.62 - - [21/Aug/2014:16:48:29 -0400] "HEAD /378 HTTP/1.1" 302 - "-" "Jakarta Commons-HttpClient/3.0.1"

64.95.141.10 - - [21/Aug/2014:16:48:59 -0400] "GET /356 HTTP/1.1" 302 4 "-" "Java/1.7.0_51"

23.94.19.179 - - [21/Aug/2014:16:49:29 -0400] "GET / HTTP/1.0" 200 2588 "http://www.awgo.to/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.63 Safari/537.31"

23.94.19.179 - - [21/Aug/2014:16:49:30 -0400] "POST /?url HTTP/1.1" 200 2664 "http://www.awgo.to/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.63 Safari/537.31"

94.23.30.222 - - [21/Aug/2014:16:49:31 -0400] "GET /nEjjM HTTP/1.1" 302 4 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"

94.23.30.222 - - [21/Aug/2014:16:49:31 -0400] "GET /KVDpK HTTP/1.1" 302 4 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"

117.169.1.86 - - [21/Aug/2014:16:49:55 -0400] "GET /?lang=en HTTP/1.1" 200 2588 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.63 Safari/537.31"

50.28.51.253 - - [21/Aug/2014:16:50:36 -0400] "HEAD /378 HTTP/1.1" 302 - "-" "Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"

189.166.175.145 - - [21/Aug/2014:16:50:46 -0400] "GET /378 HTTP/1.1" 302 30 "-" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"

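You could maintain a list of bad IP addresses and modify your logging code to ignore them, as you suggest. There is no foolproof way to block bots. The robots.txt file is only a suggestion; nothing forces bots or their developers to honor the settings there.

@DWRoelands Thanks! That's what I figured, but I have a feeling those IPs change, so I'm not sure I would ever get it completely right.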
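
One way to act on the suggestion above is to decide, per request, whether a hit should increment the view count at all. The sketch below is a heuristic only, written in Python for illustration (the shortener's actual language is not stated). The function name `should_count`, the `BOT_MARKERS` list, and the `blocked_ips` parameter are all assumptions made for this example; the markers were picked from user agents visible in the access log. Since IPs change and user agents can be spoofed or uninformative (for example `help@dataminr.com` matches no marker), it reduces the skew rather than eliminating it.

```python
# Hypothetical substrings that suggest an automated client; illustrative, not exhaustive.
BOT_MARKERS = (
    "bot", "crawler", "spider", "resolver", "python-urllib",
    "java/", "httpclient", "http-java-client", "metauri", "kraken",
)


def should_count(method, user_agent, ip, blocked_ips=frozenset()):
    """Heuristically decide whether a request should increment the view count."""
    # HEAD requests in the log come from link checkers, not readers.
    if method.upper() != "GET":
        return False
    # Manually maintained list of known-bad addresses (assumed to exist).
    if ip in blocked_ips:
        return False
    ua = (user_agent or "").strip().lower()
    # A missing user agent is logged as "-"; treat it as automated.
    if ua in ("", "-"):
        return False
    return not any(marker in ua for marker in BOT_MARKERS)


if __name__ == "__main__":
    chrome = ("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36")
    print(should_count("GET", chrome, "189.166.175.145"))
    print(should_count("GET", "Twitterbot/1.0", "199.16.156.126"))
```

The redirect itself can still be served to every client; only the counter update is skipped, so a misclassified visitor loses nothing except one uncounted view.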