Web scraper is being blocked: how do I rotate IP addresses with Puppeteer?

Tags: javascript, node.js, web-scraping, proxy, puppeteer

In my web scraper function, I have the following lines of code:

let portList = [9050, 9052, 9053, 9054, 9055, 9056, 9057, 9058, 9059, 9060];
let spoofPort = portList[Math.floor(Math.random()*portList.length)];
console.log("The chosen port was " + spoofPort);

const browser = await puppeteerExtra.launch({ headless: true, args: [
  '--no-sandbox', '--disable-setuid-sandbox', '--proxy-server=socks5://127.0.0.1:' + spoofPort
]});

const page = await browser.newPage();

const userAgent = 'Mozilla/5.0 (X11; Linux x86_64)' +           
      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.39 Safari/537.36';

await page.setUserAgent(userAgent);
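I'm trying to rotate the IP address for every request (the function containing this code is called on essentially every client request) so that the site I'm scraping doesn't block me quickly. I get the following error: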

2021-05-17T12:08:19.625349+00:00 app[web.1]: The chosen port was 9050
2021-05-17T12:08:20.042016+00:00 app[web.1]: Error: net::ERR_PROXY_CONNECTION_FAILED at https://expampleDomanPlaceholder.com
2021-05-17T12:08:20.042018+00:00 app[web.1]: at navigate (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
2021-05-17T12:08:20.042018+00:00 app[web.1]: at processTicksAndRejections (internal/process/task_queues.js:93:5)
2021-05-17T12:08:20.042019+00:00 app[web.1]: at async FrameManager.navigateFrame (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
2021-05-17T12:08:20.042020+00:00 app[web.1]: at async Frame.goto (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
2021-05-17T12:08:20.042021+00:00 app[web.1]: at async Page.goto (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:819:16)
2021-05-17T12:08:20.042021+00:00 app[web.1]: at async /app/app.js:174:9
I have already tried the solutions detailed in these posts, but the problem may lie with my userAgent:


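Update: I tried to use this buildpack, but it kept crashing my web dyno (it returned an "H14" error, which means you have to clear the buildpacks and re-add them). I don't know where to go from here, since this seems to be the only solution I can find.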

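So there are a few issues:

  • The posted error message is missing the placeholder.
  • The request fails because of a typo.
  • You have to actually provide the proxy server to the browser object. It has to be initialized.
  • Below is an example that uses a proxy server in Cambodia.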

We will use a SOCKS4 proxy whose IP location is in Cambodia.
The proxy IP address is 96.9.77.192 and the port is 55796 (not sure if it still works).
    
    
    const puppeteer = require('puppeteer');
    
    (async () => {
        let launchOptions = { headless: false, 
                              args: ['--start-maximized',
                                     '--proxy-server=socks4://96.9.77.192:55796'] // this is where we set the proxy
                            };
    
        const browser = await puppeteer.launch(launchOptions);
        const page = await browser.newPage();
    
        // set viewport and user agent (just in case for nice viewing)
        await page.setViewport({width: 1366, height: 768});
        await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36');
    
        // go to whatismycountry.com to see if proxy works (based on geography location)
        await page.goto('https://whatismycountry.com');
    
        // close the browser
        await browser.close();
    })();
    
    
Reference: thanks to @Grant Miller for the Tor test.
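To tie the example above back to the port rotation in the question, here is a minimal sketch of how the random port choice and the `--proxy-server` argument could be kept together. The names `TOR_PORTS`, `pickRandom` and `buildLaunchOptions` are hypothetical helpers, not part of Puppeteer, and the sketch assumes a SOCKS5 proxy (for example a Tor instance) is actually listening on each listed port.

```javascript
// Hypothetical helpers, not part of Puppeteer.
// Port list taken from the question; each port is assumed to have
// a SOCKS5 proxy (e.g. a Tor instance) listening on it.
const TOR_PORTS = [9050, 9052, 9053, 9054, 9055, 9056, 9057, 9058, 9059, 9060];

// Pick one element at random. The random source is injectable so the
// function can be exercised deterministically.
function pickRandom(items, random = Math.random) {
  if (items.length === 0) {
    throw new Error('no proxy ports configured');
  }
  return items[Math.floor(random() * items.length)];
}

// Build the launch options for a given local proxy port.
function buildLaunchOptions(port, host = '127.0.0.1') {
  return {
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      `--proxy-server=socks5://${host}:${port}`,
    ],
  };
}

// Usage inside an async function (requires the puppeteer package):
// const browser = await puppeteer.launch(buildLaunchOptions(pickRandom(TOR_PORTS)));
```

Because the proxy is a launch flag, each new proxy means launching a new browser, which is what the question's code already does on every request.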


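Comments:

The error in the log is correct: net::ERR_PROXY_CONNECTION_FAILED. It seems Tor is not configured or not running.

@Vaviloff For some context, I'm deploying to Heroku and working in a Node.js environment on a Mac. Looking at this link, you seem to be right that I haven't downloaded Tor. But if I'm deploying to Heroku, how do I make sure Tor works? Do I install this package or something else:

@Vaviloff Do you have any suggestions?

I'd suggest searching for something like that and then adjusting your application accordingly.

@Vaviloff So I tried adding the Tor buildpack from your link to my Heroku app, but I still can't get my code to work. I also tried some other "free proxy" packages, but none of them worked (including puppeteer-page-proxy and get-free-https-proxy). Do you know anyone who has deployed Tor to Heroku before that I could get in touch with?

The placeholder is a Google search result that I didn't want to put on Stack. I don't think the IP/port combination you provided is up and running, so I looked up a free US-based IP/port combination and found one (socks4://98.162.25.23:4145). I tried putting it into the --proxy-server flag, but now I get "Error: net::ERR_SSL_PROTOCOL_ERROR" at the .goto() line. I tried multiple proxies and looked into ignoreHTTPSErrors: true. I'm wondering, does Google block the socks5 transport protocol? Do you have any suggestions? Also, how do I actually install Tor through npm (or whatever I have to do)? The documentation on getting Tor onto Heroku/Node.js seems very limited...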
# Proxy Issue
If the proxy host requires authentication, the example below would be more fitting.
    
    
    'use strict';
    
    const puppeteer = require('puppeteer');
    
    (async () => {
      const username = process.env.USER
      const password = process.env.PASS
      const url = 'https://www.google.com'
    
      const browser = await puppeteer.launch({
        // The proxy host must be correct.
        args: [
          '--proxy-server=socks5://proxyhost:8000',
        ],
      });
    
      const page = await browser.newPage();
    
      await page.authenticate({
        username,
        password,
      });
    
      await page.goto(url);
    
      await browser.close();
    })();
    
This worked with Tor, using '--proxy-server=socks5://localhost:9050'.
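Since net::ERR_PROXY_CONNECTION_FAILED generally means the browser could not connect to the proxy address at all, it may help to confirm that something is listening on the chosen port before launching the browser. The following is a minimal sketch using only Node's built-in `net` module; `isPortOpen` and `reachablePorts` are hypothetical helper names, and the check only confirms that a TCP listener exists, not that it is a working SOCKS proxy.

```javascript
const net = require('net');

// Hypothetical helper: resolves to true if a TCP connection to host:port
// succeeds within timeoutMs, and to false on error or timeout.
// It does not verify that the listener actually speaks SOCKS.
function isPortOpen(port, host = '127.0.0.1', timeoutMs = 1000) {
  return new Promise((resolve) => {
    const socket = net.connect({ port, host });
    const finish = (result) => {
      socket.destroy();
      resolve(result);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}

// Hypothetical helper: returns the subset of ports that accept connections.
async function reachablePorts(ports, host = '127.0.0.1') {
  const results = await Promise.all(ports.map((p) => isPortOpen(p, host)));
  return ports.filter((_, i) => results[i]);
}

// Usage inside an async function:
// const usable = await reachablePorts([9050, 9052, 9053]);
// if (usable.length === 0) throw new Error('no local proxy is listening');
```

If this returns an empty list on the deployed app, the problem lies with the proxy process (for example Tor not being installed or started there) rather than with the Puppeteer code.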