Apify: crawling URLs from multiple sitemap.xml files
I am building an Apify actor for a site where all the URLs I need are stored in several different sitemap.xml files. The file names are static, but I cannot figure out how to feed multiple sitemap.xml files into the actor. Below is working code for a single XML file. Somehow it needs to crawl every URL from each sitemap, and since there are roughly 600 sitemap URLs in total, it would be best to pull them all from a CSV and then crawl each one.
const Apify = require('apify');
const cheerio = require('cheerio');
const requestPromised = require('request-promise-native');

Apify.main(async () => {
    const xml = await requestPromised({
        url: 'https://www.website.com/sitemap1.xml', // <- This part needs to accept input of about 600 sitemap.xml URLs in total
        headers: {
            'User-Agent': 'curl/7.54.0'
        }
    });

    // Parse sitemap and create RequestList from it
    const $ = cheerio.load(xml);
    const sources = [];
    $('loc').each(function () {
        const url = $(this).text().trim();
        sources.push({
            url,
            headers: {
                // NOTE: Otherwise the target doesn't allow to download the page!
                'User-Agent': 'curl/7.54.0',
            }
        });
    });

    const requestList = new Apify.RequestList({
        sources,
    });
    await requestList.initialize();

    // Crawl each page from sitemap
    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ $, request }) => {
            await Apify.pushData({
                url: request.url
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
The most reliable way is to use the power of a RequestQueue. There are of course many ways to approach this. The simplest solution would be a single CheerioCrawler with separate logic in the handlePageFunction for the sitemap URLs and the final URLs. Unfortunately, CheerioCrawler cannot parse XML (that will probably be fixed in the near future), so we will have to use two crawlers. For the first part, the XML parsing, we will use a BasicCrawler. It is Apify's most generic crawler, so it can easily reuse the code you already have. We will push the extracted URLs into a RequestQueue and process them in the second crawler, which can stay mostly as it is.
const Apify = require('apify');
const cheerio = require('cheerio');
const requestPromised = require('request-promise-native');

Apify.main(async () => {
    // Here we will push the URLs found in the sitemaps
    const requestQueue = await Apify.openRequestQueue();

    // This would be better passed via INPUT as `const xmlUrls = await Apify.getInput().then((input) => input.xmlUrls)`
    const xmlUrls = [
        'https://www.website.com/sitemap1.xml',
        // ...
    ];

    const xmlRequestList = new Apify.RequestList({
        sources: xmlUrls.map((url) => ({ url })), // We make simple request objects from the URLs
    });
    await xmlRequestList.initialize();

    const xmlCrawler = new Apify.BasicCrawler({
        requestList: xmlRequestList,
        handleRequestFunction: async ({ request }) => {
            // This is basically the same code you have, we just have to push the sources to the queue
            const xml = await requestPromised({
                url: request.url,
                headers: {
                    'User-Agent': 'curl/7.54.0'
                }
            });

            const $ = cheerio.load(xml);
            const sources = [];
            $('loc').each(function () {
                const url = $(this).text().trim();
                sources.push({
                    url,
                    headers: {
                        // NOTE: Otherwise the target doesn't allow to download the page!
                        'User-Agent': 'curl/7.54.0',
                    }
                });
            });

            for (const finalRequest of sources) {
                await requestQueue.addRequest(finalRequest);
            }
        },
    });
    await xmlCrawler.run();

    // Crawl each page from sitemap
    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction: async ({ $, request }) => {
            // Add your logic for final URLs
            await Apify.pushData({
                url: request.url
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
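Since the question mentions that pulling the roughly 600 sitemap URLs from a CSV would be ideal, here is a minimal sketch of how the hardcoded xmlUrls array above could instead be read from a one-URL-per-line file using Node's built-in fs module. The file name sitemaps.csv is only an assumption for illustration:

const fs = require('fs');

// Assumed file: sitemaps.csv, one sitemap URL per line
const xmlUrls = fs.readFileSync('sitemaps.csv', 'utf8')
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);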
Awesome, thank you very much! The first part handles multiple XML files perfectly! How should the input be formatted for this to work? I tried different variants such as {"url": [{"url": ""}, {"url": ""}]}.

Glad to help. Try it like this: INPUT.json can have a structure like { "xmlUrls": ["https://www.website.com/sitemap1.xml", "https://www.website.com/sitemap2.xml"] }. Then you can load it with the code I showed: const input = await Apify.getInput(); const { xmlUrls } = input;
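For reference, a minimal sketch of how the two pieces could fit together; the key name xmlUrls is only a convention and simply has to match what the actor destructures:

// INPUT.json
{
    "xmlUrls": [
        "https://www.website.com/sitemap1.xml",
        "https://www.website.com/sitemap2.xml"
    ]
}

// Inside Apify.main(), replacing the hardcoded array
const input = await Apify.getInput();
const { xmlUrls } = input;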