JavaScript: How do I download the images on a page with Puppeteer?
I'm new to web scraping and want to download all the images on a web page using Puppeteer:
const puppeteer = require('puppeteer');
let scrape = async () => {
// Actual Scraping goes Here...
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto('https://memeculture69.tumblr.com/');
// Right click and save images
};
scrape().then((value) => {
console.log(value); // Success!
});
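I've already looked at this, but I can't work out how to do it. Thanks for your help.

I think the logic is simple. You just need a function that takes the URL of an image and saves it to your directory. Puppeteer only has to scrape the image URL and pass it to the downloader function. Here is an example: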
const puppeteer = require('puppeteer');
const fs = require('fs');
const request = require('request');
// This is main download function which takes the url of your image
function download(uri, filename) {
return new Promise((resolve, reject) => {
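request.head(uri, function (err, res, body) {
// Reject instead of hanging forever when the HEAD request fails
if (err) return reject(err);
request(uri).on('error', reject).pipe(fs.createWriteStream(filename)).on('close', resolve);
});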
});
}
let main = async () => {
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://memeculture69.tumblr.com/');
// page.waitFor() is deprecated; wait for the target image to appear instead
await page.waitForSelector('img.image');
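const imageUrl = await page.evaluate(
// Read the image URL from the matched element. Returning the element itself
// would not work, because DOM nodes cannot be serialized back to Node.js.
() => document.querySelector('img.image').src
);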
// Now just simply pass the image url
// to the downloader function to download the image.
await download(imageUrl, 'image.png');
};
main();
Here is another example. It opens a regular Google search and downloads the Google image at the top left.
const puppeteer = require('puppeteer');
const fs = require('fs');
async function run() {
const browser = await puppeteer.launch({
headless: false
});
const page = await browser.newPage();
await page.setViewport({ width: 1200, height: 1200 });
await page.goto('https://www.google.com/search?q=.net+core&rlz=1C1GGRV_enUS785US785&oq=.net+core&aqs=chrome..69i57j69i60l3j69i65j69i60.999j0j7&sourceid=chrome&ie=UTF-8');
const IMAGE_SELECTOR = '#tsf > div:nth-child(2) > div > div.logo > a > img';
let imageHref = await page.evaluate((sel) => {
return document.querySelector(sel).getAttribute('src').replace('/', '');
}, IMAGE_SELECTOR);
console.log("https://www.google.com/" + imageHref);
var viewSource = await page.goto("https://www.google.com/" + imageHref);
fs.writeFile(".googles-20th-birthday-us-5142672481189888-s.png", await viewSource.buffer(), function (err) {
if (err) {
return console.log(err);
}
console.log("The file was saved!");
});
browser.close();
}
run();
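If you have a list of images to download, you can change the selector programmatically as needed and work through the list, downloading one image at a time.

You can use the following to scrape an array of the src attributes of all images on the page: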
const images = await page.evaluate(() => Array.from(document.images, e => e.src));
You can then use the Node.js fs module together with the http or https module to download each image.
Complete example:
'use strict';
const fs = require('fs');
const https = require('https');
const puppeteer = require('puppeteer');
/* ============================================================
Promise-Based Download Function
============================================================ */
const download = (url, destination) => new Promise((resolve, reject) => {
const file = fs.createWriteStream(destination);
https.get(url, response => {
response.pipe(file);
file.on('finish', () => {
// Resolve only after the file handle has actually been closed
file.close(() => resolve(true));
});
}).on('error', error => {
// Remove the partial file; fs.unlink needs a callback in current Node.js
fs.unlink(destination, () => reject(error.message));
});
});
/* ============================================================
Download All Images
============================================================ */
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
let result;
await page.goto('https://www.example.com/');
const images = await page.evaluate(() => Array.from(document.images, e => e.src));
for (let i = 0; i < images.length; i++) {
result = await download(images[i], `image-${i}.png`);
if (result === true) {
console.log('Success:', images[i], 'has been downloaded successfully.');
} else {
console.log('Error:', images[i], 'was not downloaded.');
console.error(result);
}
}
await browser.close();
})();
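The images array collected above can contain duplicates, empty strings, and data: URIs, and https.get can only fetch https: URLs. A minimal sketch of a pre-filter is shown below. The helper name `downloadableUrls` is hypothetical and is not part of Puppeteer or of the answer above.

```javascript
// Hypothetical helper: keep only unique, absolute https: URLs from a list
// of image sources, so each one can be passed to an https-based downloader.
function downloadableUrls(sources) {
  const seen = new Set();
  const result = [];
  for (const src of sources) {
    let parsed;
    try {
      parsed = new URL(src); // throws on empty or relative strings
    } catch {
      continue;
    }
    if (parsed.protocol !== 'https:') continue; // skips data:, blob:, http:
    if (seen.has(parsed.href)) continue;        // skips duplicates
    seen.add(parsed.href);
    result.push(parsed.href);
  }
  return result;
}
```

In the loop above, this could be applied as `const images = downloadableUrls(await page.evaluate(...))` before any download starts.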
If you want to skip manual DOM traversal, you can write the images to disk directly from the page responses.
For example:
const puppeteer = require('puppeteer');
const fs = require('fs');
const path = require('path');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
page.on('response', async response => {
const url = response.url();
if (response.request().resourceType() === 'image') {
response.buffer().then(file => {
const fileName = url.split('/').pop();
const filePath = path.resolve(__dirname, fileName);
const writeStream = fs.createWriteStream(filePath);
writeStream.write(file);
});
}
});
await page.goto('https://memeculture69.tumblr.com/');
await browser.close();
})();
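Taking the file name with url.split('/').pop() keeps any query string and can produce an empty name for URLs that end in a slash. The sketch below shows one way to derive a safer name. The helper `fileNameFromUrl` and its fallback value are assumptions for illustration, not part of the answer above.

```javascript
const path = require('path');

// Hypothetical helper: derive a local file name from an image URL.
function fileNameFromUrl(imageUrl, fallback = 'image') {
  const { pathname } = new URL(imageUrl);      // drops ?query and #hash
  const base = path.posix.basename(pathname);  // last path segment, '' for '/'
  let name;
  try {
    name = decodeURIComponent(base);           // turn %20 etc. back into characters
  } catch {
    name = base;                               // malformed escape: keep it raw
  }
  // Replace anything outside letters, digits, underscore, dot and hyphen
  name = name.replace(/[^\w.-]/g, '_');
  return name || fallback;
}
```

Inside the response handler, `fileNameFromUrl(response.url())` could stand in for the split/pop expression.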
This code saves all images found on the page into the images folder:
page.on('response', async (response) => {
// Capture the base name and the extension from the end of the URL
const matches = /([^\/]+)\.(jpg|png|svg|gif)$/.exec(response.url());
if (matches && (matches.length === 3)) {
const [, name, extension] = matches;
const buffer = await response.buffer();
// The images/ directory must already exist; a Buffer needs no encoding argument
fs.writeFileSync(`images/${name}.${extension}`, buffer);
}
});
To download an image by its selector, I did the following:
const puppeteer = require('puppeteer');
const fs = require('fs');
var request = require('request');
//download function
var download = function (uri, filename, callback) {
request.head(uri, function (err, res, body) {
console.log('content-type:', res.headers['content-type']);
console.log('content-length:', res.headers['content-length']);
request(uri).pipe(fs.createWriteStream(filename)).on('close', callback);
});
};
(async () => {
const browser = await puppeteer.launch({
headless: true,
args: ['--no-sandbox', '--disable-setuid-sandbox'], //for no sandbox
});
const page = await browser.newPage();
await page.goto('http://example.com');// your url here
let imageLink = await page.evaluate(() => {
const image = document.querySelector('#imageId');
return image.src;
})
await download(imageLink, 'myImage.png', function () {
console.log('done');
});
...
})();
You can get all the images without visiting each URL separately. You need to listen to all the requests made to the server:
await page.setRequestInterception(true)
await page.on('request', function (request) {
request.continue()
})
await page.on('response', async function (response) {
// Filter those responses that are interesting
const data = await response.buffer()
// data contains the img information
})
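The "filter those responses that are interesting" step above is left open. One way to sketch it is to inspect the content-type header, which Puppeteer exposes through response.headers() with lower-cased header names. The helper name `imageExtension` and the type table are assumptions made for illustration.

```javascript
// Hypothetical lookup table: MIME type -> file extension.
const IMAGE_TYPES = new Map([
  ['image/png', 'png'],
  ['image/jpeg', 'jpg'],
  ['image/gif', 'gif'],
  ['image/svg+xml', 'svg'],
  ['image/webp', 'webp'],
]);

// Returns the file extension for an image response, or null for anything else.
// `headers` is a plain object such as the one returned by response.headers().
function imageExtension(headers) {
  const raw = headers['content-type'] || '';
  // Strip parameters such as "; charset=utf-8" and normalize the case
  const mime = raw.split(';')[0].trim().toLowerCase();
  return IMAGE_TYPES.get(mime) || null;
}
```

In the handler, `const ext = imageExtension(response.headers());` followed by `if (ext) { ... }` would limit the response.buffer() call to image responses.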
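Usually you would have a selector or id for the image, and then you can get its URL. Then do something like this with the URL.
Yes, I've seen that question but couldn't make use of it. Could you elaborate on your answer with code?
I've posted an answer.
This is where I started learning Puppeteer. It covers the basics of looping through elements and getting information from them.
It ends up on a page where I need to click Accept to continue. How do I handle that?
I just went there manually and there was no button to accept anything. I simply got the image src. You can wait for the button and, when it appears, click it with page.click(selector), then get the image src from the DOM.
OK, the consent page is shown for me (maybe because I'm in Europe?), and then I get (node:31793) UnhandledPromiseRejectionWarning: Error: options.uri is a required argument before I can click the Accept button.
I see. Could you send your current code via a gist, so that I can try it locally with a European proxy?
Hey, just curious, but where does the variable "document" come from?
This looks interesting, could you elaborate a bit?
@M4hd1 I believe that, instead of waiting for the page to load and then query-selecting the images like ~everyone~ most people here, he intercepts the headers of all files received and then filters by image format. I think this is definitely faster, since it searches through an array rather than through the DOM tree. I think.
Another point is that when you wait for the page to load, query the images on the page and download them, you download the images twice. If you intercept all requests and write out the ones that respond with an image, you only download once. (I think; I haven't checked.)
This answer is the same as another one.
Doesn't this download each image twice? Once to render the page and once to save it?
This is the answer I was looking for.
Can it handle bigger files? It only saves 1 KB. How do I save a video?
Why can't it handle bigger files? This doesn't work.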