JavaScript: Puppeteer scrapes only about 200 pages and won't continue
For some reason I don't understand, my Node app stops scraping after a few minutes without any error; it simply stops. By the way, this is an infinite-scroll site. Here is the code:
const fs = require('fs');
const puppeteer = require('puppeteer');
(async () => {
  // start the browser
  const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
  // open a new page
  const page = await browser.newPage();
  const pageURL = 'http://www.yad4.co.il/dogs//////////////#1';
  try {
    // try to go to URL
    await page.goto(pageURL);
    console.log(`opened the page: ${pageURL}`);
    await page.setViewport({
      width: 1200,
      height: 800
    });
    await autoScroll(page);
  } catch (error) {
    console.log(`failed to open the page: ${pageURL} with the error: ${error}`);
  }
  // Find all links to dogs
  const postsSelector = '.yd-search-page .container .row .col-md-9 .yd-gallery .search-handler-yd .col-xs-12 #dogs_more .col-md-4 .yd-dog-img .yd-mask a';
  await page.waitForSelector(postsSelector);
  const postUrls = await page.$$eval(postsSelector, postLinks => postLinks.map(link => link.href));
  // Visit each page one by one
  for (let postUrl of postUrls) {
    // open the page
    try {
      await page.goto(postUrl);
      console.log('opened the page: ', postUrl);
    } catch (error) {
      console.log(error);
      console.log('failed to open the page: ', postUrl);
    }
    // get the name of the dog
    const dogSelector = '.adopt.yd-amuta .container .yd-dog-cont .col-xs-12 .adopt-head .row .col-sm-6 .adopt-breadcrumb-title h2 span';
    // await page.waitForSelector(dogSelector);
    const dogName = await page.$eval(dogSelector, dogSelector => dogSelector.innerHTML);
    // Writing the news inside a json file
    fs.appendFile("dogtest4.json", JSON.stringify({ dogName }), function(err) {
      if (err) throw err;
      console.log("Saved!");
    });
  }
  // all done, close the browser
  await browser.close();

  async function autoScroll(page) {
    await page.evaluate(async () => {
      await new Promise((resolve, reject) => {
        var totalHeight = 0;
        var distance = 100;
        var timer = setInterval(() => {
          var scrollHeight = document.body.scrollHeight;
          window.scrollBy(0, distance);
          totalHeight += distance;
          if (totalHeight >= scrollHeight) {
            clearInterval(timer);
            resolve();
          }
        }, 100);
      });
    });
  }
  process.exit()
})();
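So it does give me the data, but the amount is random: sometimes I get 115 pages, sometimes 300, sometimes only 90, and I don't understand why.
Please help me.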
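Thank you.
I can't comment, but I suspect this may be related to hitting a memory limit, which would slow things down.
You could try adding "await" in front of fs.appendFile(…); that might work for you.
What is the browser's timeout? Set the timeout to 0 in the launch options and see whether it makes any difference. Also, add a pause of 3-5 seconds between scrolls; the site may be blocking you for suspicious activity.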
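The suggestions above about awaiting the file write and pausing between steps can be sketched roughly as follows. Note that the callback form of fs.appendFile returns nothing, so putting "await" in front of it does not wait for the write; the promise-based fs.promises.appendFile does. The helper names (sleep, saveRecord, processSequentially, demo) are hypothetical and not from the question, and the handler in demo is only a stand-in for the real page.goto / page.$eval calls. It also assumes that one possible cause of the early stop is an error thrown for a single page (page.$eval throws when the selector matches nothing), so each item gets its own try/catch. Treat this as a minimal sketch under those assumptions, not a confirmed fix.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Resolve after `ms` milliseconds; usable as a pause between pages or scroll steps.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Append one record as a line of JSON and wait until the write has finished.
// fs.promises.appendFile returns a promise, so `await` really waits for it.
async function saveRecord(file, record) {
  await fs.promises.appendFile(file, JSON.stringify(record) + '\n');
}

// Hypothetical helper: run `handler` on each item in turn. A failure on one
// item is logged and recorded instead of aborting the whole run.
async function processSequentially(items, handler, pauseMs = 0) {
  const failed = [];
  for (const item of items) {
    try {
      await handler(item);
    } catch (error) {
      console.log(`failed on ${item}: ${error.message}`);
      failed.push(item);
    }
    if (pauseMs > 0) await sleep(pauseMs);
  }
  return failed;
}

// Stand-in demo: the handler simulates scraping; 'bad' simulates a page
// where the selector is missing. No browser is involved here.
async function demo() {
  const file = path.join(os.tmpdir(), `dogtest-${process.pid}.json`);
  await fs.promises.writeFile(file, ''); // start from an empty file
  const failed = await processSequentially(['rex', 'bad', 'luna'], async (name) => {
    if (name === 'bad') throw new Error('selector not found');
    await saveRecord(file, { dogName: name });
  }, 10);
  const lines = (await fs.promises.readFile(file, 'utf8')).trim().split('\n');
  return { failed, lines };
}
```

In the question's script, the loop body (page.goto, page.$eval, and the write) would take the place of the handler, and because every write is awaited, the final process.exit() would no longer be able to cut off pending writes.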