Puppeteer: getting the entire HTML from a page that uses lazy loading

javascript node.js web-scraping

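I am trying to get the entire HTML of a web page that uses lazy loading. I tried scrolling all the way to the bottom and then calling page.content(). I also tried scrolling to the bottom, then back to the top, and then calling page.content(). Both approaches capture some of the rows in the table, but not all of them, which is my main goal. I believe the page uses lazy loading with react.js.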

const puppeteer = require('puppeteer');
const url = 'https://www.torontopearson.com/en/departures';
const fs = require('fs');

puppeteer.launch().then(async browser => {
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitFor(300);

    //scroll to bottom
    await autoScroll(page);
    await page.waitFor(2500);

    //scroll to top of page
    await page.evaluate(() => window.scrollTo(0, 50));

    let html = await page.content();

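    // The callback-based fs.writeFile returns no promise, so awaiting it does not wait.
    // The promise API does, so the file is fully written before the browser closes.
    await fs.promises.writeFile('scrape.html', html);
    console.log("Successfully Written to File.");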
    await browser.close();
});

//method used to scroll to bottom, referenced from user visualxcode on https://github.com/GoogleChrome/puppeteer/issues/305
async function autoScroll(page){ 
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            var totalHeight = 0;
            var distance = 300;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;

                if(totalHeight >= scrollHeight){
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    });
}

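I'm not very good at this, but after searching for a long time I found a solution that gives good results for my requirements. Below is the code snippet I use to handle lazy-loading scenarios: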

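// Assumes this runs inside an async function with `page` already open.
// `wait` was not defined in the original snippet; a minimal helper:
const wait = ms => new Promise(resolve => setTimeout(resolve, ms));

// Measure the full height of the page body.
const bodyHandle = await page.$('body');
const { height } = await bodyHandle.boundingBox();
await bodyHandle.dispose();

// Scroll down one viewport at a time so lazy-loaded content is triggered.
console.log('Handling viewport...');
const viewportHeight = page.viewport().height;
let viewportIncr = 0;
while (viewportIncr + viewportHeight < height) {
    await page.evaluate(_viewportHeight => {
        window.scrollBy(0, _viewportHeight);
    }, viewportHeight);
    await wait(30);
    viewportIncr = viewportIncr + viewportHeight;
}

// Scroll back to the top before taking the full-page screenshot.
console.log('Handling Scroll operations');
await page.evaluate(() => {
    window.scrollTo(0, 0);
});
await wait(100);
await page.screenshot({path: 'GoogleHome.jpg', fullPage: true});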

With this I can even take long screenshots. Hope this helps.

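The problem is that the linked page uses a library that renders only the visible part of the site. Therefore, you cannot get the whole table at once. Scrolling to the bottom of the table only puts the bottom part of the table into the DOM.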
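If you still want to go through the DOM, one possible workaround is to read the rows that are currently rendered after every scroll step and accumulate them, instead of calling page.content() once at the end. The following is only a sketch under assumptions: the function name `collectVirtualizedRows`, the `rowSelector` argument, and the option defaults are made up for illustration, the table is assumed to scroll with the window rather than inside its own container, and rows are deduplicated by their text, so two rows with identical text would be merged.

```javascript
// Sketch: accumulate rows from a virtualized table while scrolling.
// `rowSelector` is hypothetical; inspect the real page to find the right one.
async function collectVirtualizedRows(page, rowSelector, options = {}) {
    const { step = 300, pauseMs = 100, maxSteps = 500, maxIdleSteps = 5 } = options;
    const seen = new Set(); // a Set keeps insertion order and drops duplicates
    let idleSteps = 0;

    for (let i = 0; i < maxSteps && idleSteps < maxIdleSteps; i++) {
        // Read the text of every row that is in the DOM right now.
        const rows = await page.$$eval(rowSelector, els =>
            els.map(el => el.textContent.trim()));

        const before = seen.size;
        for (const row of rows) seen.add(row);

        // Count consecutive steps that produced nothing new; stop after a few.
        idleSteps = seen.size === before ? idleSteps + 1 : 0;

        // Scroll a bit further and give the page time to render the next rows.
        await page.evaluate(distance => window.scrollBy(0, distance), step);
        await new Promise(resolve => setTimeout(resolve, pauseMs));
    }
    return [...seen];
}
```

It would be called after page.goto(url), for example `const rows = await collectVirtualizedRows(page, 'tr');`, where `'tr'` is just a guess at the selector. Keeping `step` smaller than the viewport height reduces the chance of skipping rows between two reads.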

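To see where the page loads its content from, check the Network tab in DevTools. You will notice that the content of the page is loaded from a separate URL, which appears to provide a perfect representation of the DOM in JSON format. So there is really no need to scrape this data from the page. You can just use that URL.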
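The same inspection can be done from Puppeteer itself by listening for the page's `response` events and keeping the ones that carry JSON. This is a rough sketch, not a definitive implementation: the helper name `captureJsonResponses` is made up, and it assumes the data endpoint answers with a `content-type` of `application/json`.

```javascript
// Sketch: record every JSON response the page receives while it loads.
// Call this BEFORE page.goto() so that no response is missed.
function captureJsonResponses(page) {
    const captured = [];
    page.on('response', async response => {
        // Header names returned by response.headers() are lower-case.
        const contentType = response.headers()['content-type'] || '';
        if (!contentType.includes('application/json')) return;
        try {
            captured.push({ url: response.url(), body: await response.json() });
        } catch (err) {
            // The body could not be read or parsed (e.g. a redirect); skip it.
        }
    });
    return captured;
}
```

A typical use would be `const captured = captureJsonResponses(page);` followed by `await page.goto(url, { waitUntil: 'networkidle0' });`, after which `captured` lists the URLs and parsed bodies to look through. Once the right URL is known, it can be requested directly without a browser.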

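Maybe some scraping jobs can be done by inspecting DevTools, but many people think the advantage of pptr is that we don't need to care about DevTools network inspection and can just do things the way a human would :-). So, if it is as you say, pptr can't help here?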