Python: Where to put BeautifulSoup code in an asyncio web scraping application

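I need to scrape many news articles (5-10k per day) and extract the raw text of their body paragraphs. I have already written some threaded code, but given how heavily I/O-bound this project is, I am dabbling in asyncio. The code snippet below is no faster than a single-threaded version, and far slower than my threaded version. Could anyone tell me what I am doing wrong? Thank you!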

import aiohttp
from bs4 import BeautifulSoup
from unicodedata import normalize

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_urls(urls):
    results = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            html = await fetch(session, url)
            soup = BeautifulSoup(html, 'html.parser')
            body = soup.find('div', attrs={'class': 'entry-content'})
            paras = [normalize('NFKD', para.get_text()) for para in body.find_all('p')]
            results.append(paras)
    return results

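await means "wait until the result is ready", so when you await the fetch in each loop iteration, you request (and get) sequential execution. To parallelize the fetching, you need to spawn each fetch into a background task, using something like asyncio.create_task, and then await the tasks, much as you would with threads. Or, more simply, you can let a convenience function such as asyncio.gather do this for you.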
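A minimal sketch of the gather-based approach follows; it is an illustration, not the answerer's exact code. To keep it runnable without network access or third-party packages, fake_fetch and parse are hypothetical stand-ins for the aiohttp-based fetch(session, url) and the BeautifulSoup parsing from the question, and timed_scrape is an invented helper for measuring the run. Only asyncio.gather is the real API being illustrated.

```python
import asyncio
import time
from html.parser import HTMLParser


class _ParagraphCollector(HTMLParser):
    """Collects the text of every <p> element (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.paras = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paras.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paras[-1] += data


def parse(html):
    # Hypothetical stand-in for the BeautifulSoup steps in the question.
    collector = _ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paras


async def fake_fetch(url):
    # Hypothetical stand-in for fetch(session, url): simulates 0.1 s of latency.
    await asyncio.sleep(0.1)
    return f"<div class='entry-content'><p>Body of {url}</p></div>"


async def fetch_and_parse(url):
    html = await fake_fetch(url)
    return parse(html)


async def scrape_urls(urls):
    # gather() wraps each coroutine in a task, runs the tasks concurrently,
    # and returns their results in the same order as the inputs.
    return await asyncio.gather(*(fetch_and_parse(url) for url in urls))


def timed_scrape(urls):
    # Invented helper: runs the scrape and reports the elapsed wall-clock time.
    start = time.perf_counter()
    results = asyncio.run(scrape_urls(urls))
    return results, time.perf_counter() - start
```

With ten URLs, the simulated downloads overlap, so the whole run should take roughly as long as one download rather than ten. In real code, fake_fetch would be replaced by the question's fetch with a shared aiohttp.ClientSession passed through.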

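If you find that the concurrent version still runs slower than the multi-threaded one, the HTML parsing may be slowing down the IO-related work. (By default, asyncio runs everything in a single thread.) To keep CPU-bound code from interfering with asyncio, you can move the parsing to a separate thread with loop.run_in_executor, as the thread-based fetch_and_parse further below does.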

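Note that run_in_executor must be awaited, because it returns an awaitable that is "woken up" when the background thread completes the work it was given. Since this version uses asyncio for IO and threads for parsing, it should run about as fast as your threaded version, but scale to a much larger number of parallel downloads.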
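To make the awaiting behaviour concrete, here is a small self-contained sketch. slow_parse is a hypothetical stand-in for a blocking parse (it simply sleeps), and heartbeat is an invented helper that counts how often the event loop gets to run while the parse is in progress.

```python
import asyncio
import time


def slow_parse(html):
    # Hypothetical stand-in for a blocking parse: blocks its thread for 0.2 s.
    time.sleep(0.2)
    return html.upper()


async def heartbeat(ticks):
    # Invented helper: records a tick roughly every 10 ms while the loop is free.
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)


async def main():
    loop = asyncio.get_running_loop()
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    # run_in_executor returns a future immediately; awaiting it suspends this
    # coroutine until the worker thread has finished slow_parse.
    future = loop.run_in_executor(None, slow_parse, "<p>hello</p>")
    result = await future
    beat.cancel()
    return result, len(ticks)
```

Running asyncio.run(main()) returns the parsed result together with the number of heartbeat ticks; several ticks indicate that the loop kept running while the worker thread was busy. Note that time.sleep releases the GIL, so this is the optimistic case: real CPU-bound Python code in a thread keeps the loop responsive but does not run in parallel with it.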

Finally, if you want the parsing to actually run in parallel, using multiple cores, you can use multiprocessing instead:

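import asyncio
import concurrent.futures

_pool = concurrent.futures.ProcessPoolExecutor()

# parse(html) is assumed to wrap the BeautifulSoup steps from the question;
# it must be a module-level function so it can be pickled for the worker process
async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_running_loop()
    # run parse(html) in a separate process, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(_pool, parse, html)
    return paras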

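# Thread-based variant of fetch_and_parse (the run_in_executor version discussed above).
# parse(html) is assumed to wrap the BeautifulSoup steps from the question.
async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_running_loop()
    # run parse(html) in a separate thread, and
    # resume this coroutine when it completes;
    # passing None selects the loop's default thread pool executor
    paras = await loop.run_in_executor(None, parse, html)
    return paras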
