Node.js: passing arguments to Puppeteer's page.evaluate function for dynamic web scraping
I want to pass arguments to the function given to page.evaluate() so that my scraping is dynamic, but nothing I try works. Can anyone help? I am trying to use page.evaluate with parameters to scrape many sites, starting with FarmaVida. I want to pass each site's main URL as a parameter, extract every section from the page, and then extract the data from each section, but I cannot get the parameters into the function that contains page.evaluate. That is what would let the scraper handle the sections of each page dynamically. I also tried declaring a variable with let outside page.evaluate, holding the selector class of the sections' parent element, and passing it to querySelectorAll(), but it says the variable is not defined. When I hardcode the selector as a string instead of passing it as a parameter, everything works fine, but the whole idea is for the scraping to be dynamic. For example:
const data = await page.evaluate(function (params) {
  const myData = document.querySelectorAll(params.firstElementClass)
  return {
    data: myData
  }
})
console.warn(data) // good data returned
But nothing I have tried works for me. I want to build dynamic web scraping for multiple pages and sections:
// Module loaded as './sites'
const FarmaVidaHome = 'https://drogueriasfarmavida.com'
const FarmaTodoHome = 'https://www.farmatodo.com.ve'
const CruzVerde = 'https://www.cruzverde.com.co'
const LaBotica = 'https://www.tudrogueriavirtual.com/?v=9293'

module.exports = {
  sites: [
    {
      homeUrl: FarmaVidaHome,
      navigationType: 'navbar',
      fatherSectionClass: '.nav-top-link',
      // Selectors for the product data inside each section
      data: {
        productCardClass: '.product-type-simple',
        paginationClass: '.woocommerce-pagination',
        idClass: '.image-fade_in_back a',
        product_nameClass: '.product-title',
        imageClass: '.attachment-woocommerce_thumbnail',
        categoryClass: '.product-cat',
        priceClass: '.woocommerce-Price-amount'
      }
    }
  ]
}
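The `data` selectors in a config like the one above could be handed to page.evaluate as a single argument. What follows is only a minimal sketch: `extractProducts` is a hypothetical helper that is not part of the original code, `page` is assumed to be a Puppeteer Page already showing a product listing, and reading each field through `textContent` is an assumption about the site's markup.

```javascript
// Hypothetical helper; `page` is assumed to be a Puppeteer Page.
const extractProducts = (page, selectors) =>
  page.evaluate((sel) => {
    // Runs in the browser: `sel` is a serialized copy of the selectors object.
    const text = (root, css) => {
      const el = root.querySelector(css)
      return el ? el.textContent.trim() : null
    }
    const cards = document.querySelectorAll(sel.productCardClass)
    // Return plain objects only; DOM nodes cannot be sent back to Node.
    return Array.from(cards, (card) => ({
      name: text(card, sel.product_nameClass),
      category: text(card, sel.categoryClass),
      price: text(card, sel.priceClass)
    }))
  }, selectors)
```

With the config above, it might be called as `await extractProducts(page, site.data)`, where `site` is one entry of `sites`.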
// Entry script
const puppeteer = require('puppeteer')
const {sites} = require('./sites')
const {exploringPages} = require('./src/navigation/index')

const startScraping = async (datas) => {
  console.warn('THIS IS THE SITES-->', datas)
  let dataAgruped = []
  for (let i = 0; i < datas.length; i++) {
    const pageItem = datas[i];
    const response = await exploringPages(pageItem)
    dataAgruped.push(pageItem)
  }
  // await exploringPages(datas)
}

startScraping(sites)
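Two details in the loop above may not be intended: it pushes `pageItem` rather than `response` into `dataAgruped`, and `startScraping` is called without a `.catch()`, so any failure shows up as an unhandled promise rejection. A minimal sketch of one way to collect results and keep going after an error follows. `scrapeAll` and its `explore` parameter are hypothetical names, and the sketch assumes `exploringPages` is changed to return its result.

```javascript
// Hypothetical driver: `explore` stands in for exploringPages from the question.
const scrapeAll = async (sites, explore) => {
  const results = []
  for (const site of sites) {
    try {
      // Store what the scraper returned, not the config object itself.
      const response = await explore(site)
      results.push({ homeUrl: site.homeUrl, response })
    } catch (error) {
      // Record the failure and continue with the next site.
      results.push({ homeUrl: site.homeUrl, error: error.message })
    }
  }
  return results
}
```

It could be started with `scrapeAll(sites, exploringPages).then(console.log)`; because each site is wrapped in try/catch, one slow site would not abort the whole run.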
// Module loaded as './src/navigation/index'
const puppeteer = require('puppeteer')

const exploringPages = async (thePage) => {
  console.warn('QUE VIENE AQUIII-->', thePage)
  let myPage = thePage
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  //await page.type('#selector', 'lo que quieres buscaar')
  await page.goto(thePage.homeUrl)
  let thisItem = thePage
  const dataNavigation = await page.evaluate(({thisItem}) => {
    console.warn('PAGE thisItem EN IVALUATE-->', thisItem)
    const $sections = document.querySelectorAll(thisItem.fatherSectionClass)
    const data = []
    $sections.forEach(($section) => {
      data.push({
        path: $section.getAttribute('href'),
        // data:thisItem.data
      })
    });
    return {
      sections: data
    }
  }, {thisItem})
  console.warn('this is the sections--->', dataNavigation)
  // await exploringSections(dataNavigation.sections)
  //await browser.close()
}

module.exports = {
  exploringPages
}
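page.evaluate serializes any arguments given after the function and hands them to it inside the page, which is what the call above relies on. Variables from the Node scope are not otherwise visible there, which would explain an "undefined" variable when a selector is only declared outside. Also, a console.warn inside page.evaluate runs in the browser, so its output does not reach the terminal unless it is forwarded. Here is a minimal sketch of both points, where `collectSectionLinks` is a hypothetical helper and `page` is assumed to be a Puppeteer Page:

```javascript
// Hypothetical helper; `page` is assumed to be a Puppeteer Page.
const collectSectionLinks = async (page, site) => {
  // Forward the browser's console output to the Node terminal.
  page.on('console', (msg) => console.log('[browser]', msg.text()))

  // Only the selector string crosses into the page; it must be serializable.
  const paths = await page.evaluate((selector) => {
    const links = document.querySelectorAll(selector)
    // Return plain strings: DOM nodes cannot be sent back to Node.
    return Array.from(links, (link) => link.getAttribute('href'))
  }, site.fatherSectionClass)

  return { sections: paths.map((path) => ({ path })) }
}
```

Passing just the string keeps the serialized payload small, although passing the whole config object works as well as long as it contains only plain data.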
(node:22759) UnhandledPromiseRejectionWarning: TimeoutError: Navigation timeout of 30000 ms exceeded
at /Users/devios/Downloads/work/tests/node_modules/puppeteer/lib/cjs/puppeteer/common/LifecycleWatcher.js:106:111
(Use `node --trace-warnings ...` to show where the warning was created)
(node:22759) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:22759) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
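The TimeoutError in this log is raised by page.goto, which by default waits for the page's load event for up to 30000 ms, so the failure happens before page.evaluate is reached. One possible mitigation, sketched here under assumptions, is to wait only for `domcontentloaded`, allow a longer timeout, and catch the rejection. `openPage` is a hypothetical helper and the 60000 ms default is an arbitrary choice.

```javascript
// Hypothetical helper; `page` is assumed to be a Puppeteer Page.
const openPage = async (page, url, timeoutMs = 60000) => {
  try {
    // Resolve once the DOM is parsed instead of waiting for every resource.
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs })
    return true
  } catch (error) {
    // Navigation timeouts reject with an error named TimeoutError, as in the log.
    if (error.name === 'TimeoutError') {
      console.warn(`Timed out loading ${url}`)
      return false
    }
    // Anything else is unexpected, so let the caller see it.
    throw error
  }
}
```

It might replace the bare `await page.goto(thePage.homeUrl)` call, with the caller skipping the site when `openPage` returns false.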