Web scraping: workaround for "robots not allowed"
I am trying to create a scraper, but when I run the code below, it says robots are not allowed. This is for my own internal use, but is there a workaround for "robots not allowed"? I am testing it on this site, but the crawl does not go through:
// This is sample code for building a web scraper.
//
// For this sample, we use
// http://www.houzz.com/pro/jeff-halper/exterior-worlds-landscaping-and-design
// as a sample listing we want to scrape.
//
// For the full crawler, we will assume the crawl
// starts from http://www.houzz.com/professionals/
var EightyApp = function() {
    this.processDocument = function(html, url, headers, status, jQuery) {
        // We only want to collect data from listing pages
        if (url.match("/pro/")) {
            // First we construct an HTML object so we can use jQuery
            var app = this;
            var $ = jQuery;
            var $html = app.parseHtml(html, $);
            var object = {};
            // Then we use jQuery to find all the attributes we want
            object.name = $html.find('h1').text();
            object.address = $html.find('span[itemprop="streetAddress"]').text();
            object.city = $html.find('span[itemprop="addressLocality"]').text();
            object.state = $html.find('span[itemprop="addressRegion"]').text();
            object.postalcode = $html.find('span[itemprop="postalCode"]').text();
            object.contact = $html.find('dt:contains("Contact:")').next().text();
            // Finally, we return the object as a string
            return JSON.stringify(object);
        }
    };
    this.parseLinks = function(html, url, headers, status, jQuery) {
        // We construct the HTML object for jQuery again
        var app = this;
        var $ = jQuery;
        var $html = app.parseHtml(html, $);
        var links = [];
        // We add all the pages in the directory
        $html.find('a.pageNumber').each(function(i, obj) {
            var link = app.makeLink(url, $(this).attr('href'));
            if (link != null) {
                links.push(link);
            }
        });
        // We add all the listings in the directory
        $html.find('a.pro-title').each(function(i, obj) {
            var link = app.makeLink(url, $(this).attr('href'));
            if (link != null) {
                links.push(link);
            }
        });
        return links;
    };
};
try {
    // Running as a module: the crawler passes in EightyAppBase
    module.exports = function(EightyAppBase) {
        EightyApp.prototype = new EightyAppBase();
        return new EightyApp();
    };
} catch(e) {
    // No module system: fall back to an EightyAppBase that already exists
    console.log("Eighty app exists.");
    EightyApp.prototype = new EightyAppBase();
}
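Probably not. Even if you change the agent name, there are other techniques (such as IP filtering) for detecting unwanted scrapers.
Have you tried changing the user agent?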
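To see why a crawler might report "robots not allowed" for a given URL, it can help to check the site's robots.txt rules by hand. Below is a minimal sketch of such a check. The function names `parseRobots` and `isPathAllowed`, the agent names, and the sample robots.txt text are all hypothetical illustrations (not the real file of any site and not part of the crawler's API), and the matching is deliberately simplified: plain prefix matching only, with no support for `*` or `$` wildcards inside paths.

```javascript
// Simplified robots.txt check (an illustration, not a full parser).
// Assumption: rule paths are plain prefixes; "*" and "$" wildcards are not handled.

// Split robots.txt text into groups of { agents: [...], rules: [...] }.
function parseRobots(robotsTxt) {
    var groups = [];
    var current = null;
    var lastWasAgent = false;
    robotsTxt.split(/\r?\n/).forEach(function(rawLine) {
        var line = rawLine.replace(/#.*$/, '').trim(); // drop comments
        var sep = line.indexOf(':');
        if (sep === -1) { return; } // blank or malformed line
        var field = line.slice(0, sep).trim().toLowerCase();
        var value = line.slice(sep + 1).trim();
        if (field === 'user-agent') {
            // Consecutive User-agent lines share one group
            if (!lastWasAgent) {
                current = { agents: [], rules: [] };
                groups.push(current);
            }
            current.agents.push(value.toLowerCase());
            lastWasAgent = true;
        } else if ((field === 'allow' || field === 'disallow') && current) {
            // An empty path matches nothing, so it is skipped
            if (value !== '') {
                current.rules.push({ allow: field === 'allow', path: value });
            }
            lastWasAgent = false;
        }
    });
    return groups;
}

// Return true if `path` may be fetched by `userAgent` under these rules.
function isPathAllowed(robotsTxt, userAgent, path) {
    var groups = parseRobots(robotsTxt);
    var ua = userAgent.toLowerCase();
    // Prefer groups that name this agent; otherwise fall back to "*"
    var specific = groups.filter(function(g) {
        return g.agents.some(function(a) {
            return a !== '*' && a.length > 0 && ua.indexOf(a) !== -1;
        });
    });
    var chosen = specific.length > 0 ? specific : groups.filter(function(g) {
        return g.agents.indexOf('*') !== -1;
    });
    var best = null;
    chosen.forEach(function(g) {
        g.rules.forEach(function(r) {
            if (path.indexOf(r.path) !== 0) { return; } // not a prefix match
            // Longest matching path wins; Allow wins a tie
            if (best === null || r.path.length > best.path.length ||
                (r.path.length === best.path.length && r.allow)) {
                best = r;
            }
        });
    });
    return best === null ? true : best.allow;
}

// Hypothetical robots.txt content, for illustration only
var sampleRobots = [
    'User-agent: *',
    'Disallow: /pro/',
    'Allow: /pro/public/',
    '',
    'User-agent: FriendlyBot',
    'Disallow:'
].join('\n');

console.log(isPathAllowed(sampleRobots, 'SomeCrawler', '/pro/jeff-halper/listing'));
console.log(isPathAllowed(sampleRobots, 'SomeCrawler', '/professionals/'));
```

With the made-up rules above, a path under `/pro/` is refused for a generic agent while `/professionals/` is not, which mirrors the situation where listing pages fail but the directory start page would be crawlable. As the answer notes, a site can also block scrapers by other means such as IP filtering, so passing this check does not guarantee a successful crawl.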