
Web crawler: What are typical politeness factors for a web crawler?


What are the typical politeness factors for a web crawler, aside from always obeying robots.txt ("Disallow:" and the non-standard "Crawl-delay:")?

And if a site does not specify an explicit crawl delay, what should the default be set to?

The algorithm we use is:

#include <algorithm>   // std::min, std::max
#include <cmath>       // exp, log, pow

// If we are blocked by robots.txt, make sure that is obeyed.
// Our bot's user-agent string contains a link to an HTML page explaining this,
// and an email address site owners can use to opt out, so that we never even
// consider their domain again in the future.

// If we receive more than 5 consecutive responses with an HTTP response
// code of 500+ (or timeouts), then we assume the domain is either under
// heavy load and does not need us adding to it, or the URLs we are
// crawling are completely wrong and causing problems.
// Either way, we suspend crawling of this domain for 4 hours.
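// A minimal sketch of that suspension bookkeeping (the struct name,
// counter, and clock choice here are illustrative, not lifted from
// the production code):
#include <chrono>

struct DomainHealth {
    int consecutiveFailures = 0;
    std::chrono::steady_clock::time_point suspendedUntil{};

    // Call once per completed request: HTTP 500+ or a timeout counts as
    // a failure; any other outcome resets the streak.
    void recordResponse(int httpStatus, bool timedOut) {
        if (timedOut || httpStatus >= 500) {
            if (++consecutiveFailures > 5) {
                suspendedUntil = std::chrono::steady_clock::now()
                               + std::chrono::hours(4);
                consecutiveFailures = 0;
            }
        } else {
            consecutiveFailures = 0;
        }
    }

    bool suspended() const {
        return std::chrono::steady_clock::now() < suspendedUntil;
    }
};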

// There is a non-standard parameter in robots.txt that defines a minimum
// crawl delay. If it exists, obey it.
//
//    see: http://www.searchtools.com/robots/robots-txt-elements.html
double politenessFromRobotsTxt = getRobotPoliteness();


// Work-size politeness:
// Large, popular domains are designed to handle load, so we can use a
// smaller delay on these sites than for smaller domains (thus smaller
// domains, e.g. mom-and-pop sites hosted on the family PC under the desk
// in the office, are crawled slowly).
//
// But the max delay here is 5 seconds:
//
//    domainSize => Range 0 -> 10
//
double workSizeTime = std::min(exp(2.52166863221 - 0.530185027289 * log(domainSize)), 5.0);
//
// You can find out how important we think your site is here:
//      http://www.opensiteexplorer.org
// Look at the Domain Authority and divide by 10.
// Note: this is not exactly the number we use, but the two numbers are
//       highly correlated, so it will usually give you a fair indication.
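// Worked example with these constants (our arithmetic, not from the
// original answer): domainSize = 10 gives
// exp(2.5217 - 0.5302 * ln 10) ≈ 3.7 s, while any domainSize below
// roughly 5.6 hits the 5-second cap.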



// Take into account the response time of the last request.
// If the server is under heavy load and taking a long time to respond,
// then we slow down our requests. (Note: time-outs are handled above.)
double responseTime = pow(0.203137637588 + 0.724386103344 * lastResponseTime, 2);
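// Worked example with these constants (our arithmetic, not from the
// original answer): a 1-second response gives
// (0.2031 + 0.7244)^2 ≈ 0.86 s; a 5-second response gives roughly 14.6 s.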

// Use the slower of the two calculated times.
double result = std::max(workSizeTime, responseTime);

// Never go faster than the crawl-delay directive.
result = std::max(result, politenessFromRobotsTxt);

// Set a minimum delay,
// so we never hit a site more often than once every 10th of a second.
result = std::max(result, 0.1);

// The maximum delay we use is 2 minutes.
result = std::min(result, 120.0);
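Pieced together, the computation above fits in one self-contained function. The sketch below is illustrative only: the function name computeCrawlDelaySeconds and its three parameters are our names for the values used above (domainSize, lastResponseTime, and the Crawl-delay value, passed in here as robotsCrawlDelay), not part of the original code.

#include <algorithm>
#include <cmath>

// domainSize:       importance score in the range 0 -> 10 (see above)
// lastResponseTime: seconds the previous request to this domain took
// robotsCrawlDelay: seconds from a Crawl-delay directive, or 0.0 if absent
double computeCrawlDelaySeconds(double domainSize,
                                double lastResponseTime,
                                double robotsCrawlDelay)
{
    // Large, popular domains get a smaller delay, capped at 5 seconds.
    double workSizeTime =
        std::min(std::exp(2.52166863221 - 0.530185027289 * std::log(domainSize)), 5.0);

    // Slow responses suggest a loaded server, so back off quadratically.
    double responseTime =
        std::pow(0.203137637588 + 0.724386103344 * lastResponseTime, 2.0);

    // Take the slower of the two, never dip below the Crawl-delay value,
    // and clamp the result to the range [0.1 s, 120 s].
    double result = std::max(workSizeTime, responseTime);
    result = std::max(result, robotsCrawlDelay);
    result = std::max(result, 0.1);
    return std::min(result, 120.0);
}

With these constants, a top-tier domain (domainSize = 10) that answered the last request quickly comes out around 3.7 seconds between requests, while an obscure domain that took 5 seconds to respond is throttled to roughly 14.6 seconds.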

Is this related to the politeness numbers, or is it just a coincidence of wording? @DashWinterson Just a coincidence.