JavaScript: How can I download the images on a page using Puppeteer?


I'm new to web scraping and want to download all the images on a web page using Puppeteer:

const puppeteer = require('puppeteer');

let scrape = async () => {
  // Actual Scraping goes Here...

  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://memeculture69.tumblr.com/');

  //   Right click and save images

};

scrape().then((value) => {
    console.log(value); // Success!
});

I have already looked into this, but couldn't work out how to do it. Thanks for your help.

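I think the logic is simple. You just need a function that takes the URL of an image and saves it to your directory. Puppeteer only has to scrape the image URL and pass it to the downloader function. Here is an example: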

const puppeteer = require('puppeteer');
const fs = require('fs');
const request = require('request');

// Main download function: takes the URL of your image and a target file name.
// Note: the `request` package is deprecated, but it still works for this purpose.
function download(uri, filename) {
  return new Promise((resolve, reject) => {
    request(uri)
      .on('error', reject) // network errors
      .pipe(fs.createWriteStream(filename))
      .on('close', resolve)
      .on('error', reject); // file system errors
  });
}

let main = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://memeculture69.tumblr.com/');
  // Wait until the image is in the DOM instead of sleeping for a fixed time.
  await page.waitForSelector('img.image');
  const imageUrl = await page.evaluate(
    // Read the image URL from the selector. Return the src string:
    // a DOM element cannot be passed back out of page.evaluate().
    () => document.querySelector('img.image').src
  );
  // Now simply pass the image URL
  // to the downloader function to download the image.
  await download(imageUrl, 'image.png');
  await browser.close();
};

main();

Here is another example. It goes to a regular Google search and downloads the Google image at the top left.

const puppeteer = require('puppeteer');
const fs = require('fs');

async function run() {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    await page.setViewport({ width: 1200, height: 1200 });
    await page.goto('https://www.google.com/search?q=.net+core&rlz=1C1GGRV_enUS785US785&oq=.net+core&aqs=chrome..69i57j69i60l3j69i65j69i60.999j0j7&sourceid=chrome&ie=UTF-8');

    const IMAGE_SELECTOR = '#tsf > div:nth-child(2) > div > div.logo > a > img';
    let imageHref = await page.evaluate((sel) => {
        return document.querySelector(sel).getAttribute('src').replace('/', '');
    }, IMAGE_SELECTOR);

    console.log("https://www.google.com/" + imageHref);
    var viewSource = await page.goto("https://www.google.com/" + imageHref);
    fs.writeFile(".googles-20th-birthday-us-5142672481189888-s.png", await viewSource.buffer(), function (err) {
        if (err) {
            return console.log(err);
        }

        console.log("The file was saved!");
    });

    await browser.close();
}

run();

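If you have a list of images to download, you can change the selector programmatically as needed and work through the list, downloading one image at a time.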

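You can use the following to scrape an array of the src attributes of all images on the page:

const images = await page.evaluate(() => Array.from(document.images, e => e.src));

You can then use Node's fs module together with the http or https module to download each image.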
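
One caveat with this array: document.images can list the same URL several times, and a src may be a data: URI, which the http and https modules cannot fetch. A minimal sketch of a pre-filter follows; the helper name downloadableUrls is hypothetical and not part of Puppeteer or Node.

```javascript
// Hypothetical helper: keep only unique http(s) image URLs.
function downloadableUrls(urls) {
  const seen = new Set();
  const result = [];
  for (const raw of urls) {
    let parsed;
    try {
      parsed = new URL(raw); // throws on empty or malformed strings
    } catch (err) {
      continue;
    }
    // data:, blob: and similar schemes cannot be fetched with http/https.
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') continue;
    if (seen.has(parsed.href)) continue; // already queued
    seen.add(parsed.href);
    result.push(parsed.href);
  }
  return result;
}
```

Passing the scraped array through this helper before the download loop should avoid fetching the same file twice.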

Complete example:

'use strict';

const fs = require('fs');
const https = require('https');
const puppeteer = require('puppeteer');

/* ============================================================
  Promise-Based Download Function
============================================================ */

const download = (url, destination) => new Promise((resolve, reject) => {
  const file = fs.createWriteStream(destination);

  https.get(url, response => {
    response.pipe(file);

    file.on('finish', () => {
      // Resolve only after the file descriptor has been closed.
      file.close(() => resolve(true));
    });
  }).on('error', error => {
    // fs.unlink requires a callback; ignore a failure to remove the partial file.
    fs.unlink(destination, () => {});

    reject(error.message);
  });
});

/* ============================================================
  Download All Images
============================================================ */

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  let result;

  await page.goto('https://www.example.com/');

  const images = await page.evaluate(() => Array.from(document.images, e => e.src));

  for (let i = 0; i < images.length; i++) {
    try {
      // download() rejects on failure, so errors arrive in the catch block.
      await download(images[i], `image-${i}.png`);
      console.log('Success:', images[i], 'has been downloaded successfully.');
    } catch (error) {
      console.log('Error:', images[i], 'was not downloaded.');
      console.error(error);
    }
  }

  await browser.close();
})();
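
The example above saves every file as image-N.png, whatever its real type. Below is a small sketch of deriving the name from the URL instead. The helper name fileNameFor is hypothetical, and the 'image' base name and '.png' extension used as fallbacks are assumptions.

```javascript
const path = require('path');

// Hypothetical helper: build a local file name from an image URL.
// The index prefix keeps names unique when two URLs end in the same file name.
function fileNameFor(url, index) {
  const { pathname } = new URL(url);              // drops query string and hash
  let base = path.posix.basename(pathname);       // last path segment
  base = base.replace(/[^\w.-]/g, '_');           // replace characters unsafe in file names
  if (!base) base = 'image';                      // assumed fallback for URLs ending in "/"
  if (!path.posix.extname(base)) base += '.png';  // assumed fallback extension
  return `${index}-${base}`;
}
```

In the loop above, download(images[i], fileNameFor(images[i], i)) would then keep the original extension where the URL has one.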
If you want to skip manual DOM traversal, you can write the images to disk directly from the page responses.

Example:

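const puppeteer = require('puppeteer');
const fs = require('fs');
const path = require('path');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const pending = []; // writes that are still in flight
    page.on('response', response => {
        if (response.request().resourceType() !== 'image') return;
        if (!response.url().startsWith('http')) return; // skip data: URIs
        // Use the last path segment (without any query string) as the file name.
        const fileName = path.basename(new URL(response.url()).pathname);
        if (!fileName) return;
        pending.push(
            response.buffer()
                .then(file => fs.promises.writeFile(path.resolve(__dirname, fileName), file))
                .catch(err => console.error('Skipped', response.url(), err.message))
        );
    });
    await page.goto('https://memeculture69.tumblr.com/', { waitUntil: 'networkidle0' });
    // Make sure every image has been written before the browser is closed.
    await Promise.all(pending);
    await browser.close();
})();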

This code saves all images found on the page into the images folder:

page.on('response', async (response) => {
  // Match URLs that end in a known image extension and capture the file name.
  const matches = /([^\/]+\.(?:jpg|png|svg|gif))$/.exec(response.url());
  if (matches) {
    const buffer = await response.buffer();
    // The images/ directory must already exist.
    fs.writeFileSync(`images/${matches[1]}`, buffer);
  }
});
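
Matching on the URL's extension misses images served from URLs that have none. A hedged alternative is to look at the Content-Type header; in Puppeteer, response.headers() returns an object whose header names are lower-case. The helper name extensionFor and the list of types below are illustrative assumptions, not part of the answer above.

```javascript
// Illustrative mapping from MIME type to file extension (not exhaustive).
const EXTENSIONS = {
  'image/jpeg': 'jpg',
  'image/png': 'png',
  'image/gif': 'gif',
  'image/svg+xml': 'svg',
  'image/webp': 'webp',
};

// Hypothetical helper: returns a file extension, or null when the headers
// do not describe one of the image types listed above.
function extensionFor(headers) {
  const contentType = headers['content-type'] || '';
  // Strip parameters such as "; charset=utf-8" before the lookup.
  const mime = contentType.split(';')[0].trim().toLowerCase();
  return Object.prototype.hasOwnProperty.call(EXTENSIONS, mime)
    ? EXTENSIONS[mime]
    : null;
}
```

Inside the handler, extensionFor(response.headers()) could replace the regular expression test; a null result means the response is skipped.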

To download an image by its selector, I did the following:

  • Got the image's URI using the selector
  • Passed the URI to the download function

    const puppeteer = require('puppeteer');
    const fs = require('fs');
    var request = require('request');
    
    //download function
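    var download = function (uri, filename, callback) {
        // Return a Promise so that the caller's `await` really waits for the file.
        return new Promise(function (resolve, reject) {
            request.head(uri, function (err, res) {
                if (err) return reject(err);
                console.log('content-type:', res.headers['content-type']);
                console.log('content-length:', res.headers['content-length']);
                request(uri)
                    .on('error', reject)
                    .pipe(fs.createWriteStream(filename))
                    .on('close', function () {
                        if (callback) callback();
                        resolve();
                    });
            });
        });
    };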
    
    (async () => {
         const browser = await puppeteer.launch({
          headless: true,
          args: ['--no-sandbox', '--disable-setuid-sandbox'], //for no sandbox
        });
        const page = await browser.newPage();
        await page.goto('http://example.com');// your url here
    
        let imageLink = await page.evaluate(() => {
            const image = document.querySelector('#imageId');
            return image.src;
        })
    
        await download(imageLink, 'myImage.png', function () {
            console.log('done');
        });
    
        ...
    })();
    

  • Resources:

    You can get all the images without visiting each URL separately. You need to listen to all the requests made to the server:

    await page.setRequestInterception(true)
    // page.on() registers a listener synchronously, so it needs no await.
    page.on('request', function (request) {
       request.continue()
    })
    page.on('response', async function (response) {
       // Filter those responses that are interesting
       const data = await response.buffer()
       // data contains the img information
    })
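
    The "filter those responses that are interesting" step could look roughly like the sketch below. The predicate name isSavableImage is hypothetical; it relies only on response.url(), response.ok() and response.request().resourceType() from Puppeteer.

```javascript
// Hypothetical predicate for the 'response' handler above.
function isSavableImage(response) {
  // Redirects and error responses have no usable body.
  if (!response.ok()) return false;
  // data: URIs are already embedded in the page.
  if (!response.url().startsWith('http')) return false;
  return response.request().resourceType() === 'image';
}
```

    With it, the handler can return early when isSavableImage(response) is false and only call response.buffer() for real image responses.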
    

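    Usually you will have a selector or id for the image and can then grab the URL. Then do something like this with the URL.

    Yes, I have seen that question but couldn't make use of it. Can you elaborate on your answer with code, please?

    I posted an answer. This is where I started learning to use Puppeteer. It covers the basics of looping through elements and getting information from them.

    Well, it gets to a page where I need to click Accept to continue. How do I deal with that?

    I just went there manually and didn't get any button to accept anything. I just got an image src. You can wait for the button and, when it appears, click it using page.click(selector), then get the image src from the DOM.

    OK, the consent page does show up for me (perhaps because I'm in Europe?), and then I get "(node:31793) UnhandledPromiseRejectionWarning: Error: options.uri is a required argument" before I can click the Accept button.

    I see. Can you send your current code via a gist, so I can try it locally with a European proxy?

    Hey, just curious, but where does the variable "document" come from?

    This looks interesting, could you elaborate a bit?

    @M4hd1 I believe that, instead of waiting for the page to load and then query-selecting the images like most people here are doing, he intercepts the headers of all the files received and then filters for image formats. I think this would definitely be faster, since it searches through an array rather than through the DOM tree. I think.

    Another point is that when you wait for the page to load, query the images on the page and download them, you download the images twice. If you intercept all requests and write out the ones that respond with images, you only download them once. (I think; I haven't checked.) This answer appears to duplicate another one.

    Wouldn't this download each image twice? Once to render the page and once to save it?

    This is the answer I've been looking for.

    Does it work with bigger files? It only saves 1 KB. How do I save a video?

    Why doesn't it work with bigger files? This doesn't work.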