Python 3.x: How to fetch URLs asynchronously with pyppeteer (one browser, multiple tabs)

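I want my script to

  1. Open 3 tabs

  2. Fetch a URL asynchronously (the same URL in each tab)

  3. Save the response

  4. Sleep for 4 seconds

  5. Parse the response with a regex (I tried BeautifulSoup, but it is too slow) and return a token

  6. Loop over the 3 tabs several times

My problem is with step 2. I have a sample script, but it fetches the URL synchronously. I would like to make it asynchronous.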

    import asyncio

    from pyppeteer import launch

    urls = ['https://www.example.com']


    async def main():
        browser = await launch(headless=False)
        for url in urls:
            page1 = await browser.newPage()
            page2 = await browser.newPage()
            page3 = await browser.newPage()

            # Each goto() is awaited before the next one starts,
            # so the three tabs load one after another.
            await page1.goto(url)
            await page2.goto(url)
            await page3.goto(url)

            title1 = await page1.title()
            title2 = await page2.title()
            title3 = await page3.title()

            print(title1)
            print(title2)
            print(title3)

        #await browser.close()


    asyncio.get_event_loop().run_until_complete(main())
    
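Also, as you can see, the code is not very concise. How do I make it asynchronous?

If it helps, I have other pyppeteer scripts that don't fit my needs, in case it would be easier to convert those.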

    import asyncio
    from pyppeteer import launch
    
    url = 'http://www.example.com'
    browser = None
    
    async def fetchUrl(url):
        # Define browser as a global variable to ensure that the browser window is only created once in the entire process
        global browser
        if browser is None:
            browser = await launch(headless=False)
    
        page = await browser.newPage()
    
        await page.goto(url)
        #await asyncio.wait([page.waitForNavigation()])
        #str = await page.content()
        #print(str)
    
    # Execute this function multiple times for testing
    asyncio.get_event_loop().run_until_complete(fetchUrl(url))
    asyncio.get_event_loop().run_until_complete(fetchUrl(url))
    
This script is asynchronous, but it runs one `run_until_complete()` call at a time, so it is effectively synchronous.
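
To illustrate why back-to-back `run_until_complete()` calls behave sequentially, here is a minimal sketch that is not part of the original post. It assumes a stand-in coroutine, `fake_fetch`, that only sleeps instead of driving a browser; the helper names and the 0.2 second delay are made up for the comparison.

```python
import asyncio
import time


async def fake_fetch(delay):
    # Stand-in for page.goto(); it only simulates waiting on the network.
    await asyncio.sleep(delay)
    return delay


def run_one_at_a_time(n, delay):
    # One run_until_complete() call per coroutine:
    # each coroutine finishes before the next one is started.
    loop = asyncio.new_event_loop()
    start = time.perf_counter()
    try:
        for _ in range(n):
            loop.run_until_complete(fake_fetch(delay))
    finally:
        loop.close()
    return time.perf_counter() - start


def run_together(n, delay):
    async def runner():
        # gather() schedules all coroutines before waiting on any of them.
        await asyncio.gather(*(fake_fetch(delay) for _ in range(n)))

    start = time.perf_counter()
    asyncio.run(runner())
    return time.perf_counter() - start


sequential = run_one_at_a_time(3, 0.2)
concurrent = run_together(3, 0.2)
print(f"one at a time: {sequential:.2f}s, gathered: {concurrent:.2f}s")
```

With three simulated fetches, the first variant should take roughly three times the delay, while the gathered variant should take roughly one delay.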

    # cat test.py
    import asyncio
    import time
    from pyppeteer import launch
    
    WEBSITE_LIST = [
        'http://envato.com',
        'http://amazon.co.uk',
        'http://example.com',
    ]
    
    start = time.time()
    
    async def fetch(url):
        browser = await launch(headless=False, args=['--no-sandbox'])
        page = await browser.newPage()
        await page.goto(f'{url}', {'waitUntil': 'load'})
        print(f'{url}')
        await asyncio.sleep(1)
        await page.close()
        #await browser.close()
    
    async def run():
        tasks = []
    
        for url in WEBSITE_LIST:
            task = asyncio.ensure_future(fetch(url))
            tasks.append(task)
    
        responses = await asyncio.gather(*tasks)
    
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(run())
    loop.run_until_complete(future)
    
    print(f'It took {time.time()-start} seconds.')
    

This script is asynchronous, but it launches a separate browser for each URL, which takes up too many resources.

This will open each URL in a separate tab:

    import asyncio
    import traceback
    
    from pyppeteer import launch
    
    URLS = [
        "http://envato.com",
        "http://amazon.co.uk",
        "http://example.com",
    ]
    
    
    async def fetch(browser, url):
        # Each call opens its own tab in the shared browser.
        page = await browser.newPage()

        try:
            await page.goto(url, {"waitUntil": "load"})
        except Exception:
            traceback.print_exc()
            # Return a pair even on failure so the caller can still unpack it.
            return (url, None)
        else:
            html = await page.content()
            return (url, html)
        finally:
            await page.close()


    async def main():
        browser = await launch(headless=True, args=["--no-sandbox"])

        try:
            tasks = [asyncio.create_task(fetch(browser, url)) for url in URLS]

            # Handle results in completion order, not in URLS order.
            for coro in asyncio.as_completed(tasks):
                url, html = await coro
                if html is not None:
                    print(f"{url}: ({len(html)})")
        finally:
            await browser.close()


    if __name__ == "__main__":
        asyncio.run(main())
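
Building on the answer above, the asker's full loop (three tabs on the same URL, fetch, wait, regex parse, repeat) could be sketched roughly as follows. This is only a sketch under assumptions: the `TOKEN_RE` pattern is hypothetical because the question never shows the token format, and `extract_token`, `fetch_html`, and `run_rounds` are illustrative names, not part of pyppeteer.

```python
import asyncio
import re

# Hypothetical pattern: the real token format is not given in the question.
TOKEN_RE = re.compile(r'name="token"\s+value="([^"]+)"')


def extract_token(html):
    """Return the first token found in html, or None if nothing matches."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


async def fetch_html(page, url):
    """Navigate one already-open tab to url and return the page source."""
    await page.goto(url, {"waitUntil": "load"})
    return await page.content()


async def run_rounds(browser, url, tabs=3, rounds=2, pause=4.0):
    """Reuse `tabs` tabs for `rounds` rounds; return one list of tokens per round."""
    pages = [await browser.newPage() for _ in range(tabs)]
    results = []
    try:
        for _ in range(rounds):
            # gather() starts every goto() before waiting on any of them.
            htmls = await asyncio.gather(*(fetch_html(p, url) for p in pages))
            await asyncio.sleep(pause)
            results.append([extract_token(h) for h in htmls])
    finally:
        for p in pages:
            await p.close()
    return results


async def main():
    # Imported here so the helpers above can be exercised without a browser.
    from pyppeteer import launch

    browser = await launch(headless=True)
    try:
        print(await run_rounds(browser, "https://www.example.com"))
    finally:
        await browser.close()
```

Calling `asyncio.run(main())` would run two rounds against the example URL. The tabs are opened once and reused across rounds, which avoids the cost of creating new pages on every pass.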
    

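Thanks @HTF. Is there a way to access the nth URL with this code? Say I want to access the 5th tab after all the URLs have been opened.

You can use [...] to get all pages/tabs.

I ask because page.type does not seem to work with this code unless it runs in headless mode. In non-headless mode, only the focused tab works. The other tabs only work once you navigate to them manually.

It even works for typing, although it is not fully asynchronous (for typing), but I will use it as is.

I launched 50 tabs with it using tasks and found that, at any given moment, it was the last task that was being executed.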
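
One way to reach the nth tab, sketched here under assumptions, is to keep the page objects in a list of your own and bring the chosen tab to the front before typing. `open_tabs` and `type_in_tab` are hypothetical helper names. `browser.pages()` in pyppeteer also lists the open tabs, but nothing in this thread confirms the order it returns them in, so the sketch relies on `asyncio.gather()` preserving input order instead. Whether `bringToFront()` resolves the non-headless focus problem described in the comments is untested here.

```python
import asyncio


async def open_tabs(browser, urls):
    """Open one tab per URL concurrently; pages come back in the same order as urls."""
    async def open_one(url):
        page = await browser.newPage()
        await page.goto(url, {"waitUntil": "load"})
        return page

    # gather() preserves input order, so pages[i] corresponds to urls[i].
    return await asyncio.gather(*(open_one(u) for u in urls))


async def type_in_tab(pages, index, selector, text):
    """Focus the tab at `index` (0-based), then type into the element matching selector."""
    page = pages[index]
    # Assumption: focusing the tab first helps page.type() in non-headless mode.
    await page.bringToFront()
    await page.type(selector, text)
    return page
```

For the 5th tab, the call would be `await type_in_tab(pages, 4, selector, text)`, since the list is 0-based.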