Adding error catching & timeouts to an asynchronous Python script

python asynchronous web-scraping error-handling

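I am writing a script that bulk-downloads HTML and searches it. Since I plan to run it against 1000 URLs, I would like to know the best way to implement a timeout for each site, along with some error handling that skips the sites it cannot scrape. The error I currently get for some sites is: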

socket.gaierror: [Errno 11002] getaddrinfo failed

Here is the code I have so far:

import aiohttp
import asyncio

DomainListFile = 'C:/Users/joe/Documents/test.txt'

with open(DomainListFile) as f:
    DomainList = f.readlines()
DomainList = [x.strip() for x in DomainList]


async def fetch(session, url, sema):
    async with sema, session.get("http://" + url) as response:
        return await response.text(), url


async def main():
    tasks = []
    sema = asyncio.BoundedSemaphore(value=100)
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False)) as session:
        for url in DomainList:
            tasks.append(fetch(session, url, sema))
        htmls = await asyncio.gather(*tasks)
        for html in htmls:
            if "copyright 2017" in html[0]:
                print(html[1])


if __name__ == '__main__':
    import time

    start = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    end = time.time()
    print(end - start)

Any help would be greatly appreciated. Thanks, everyone!
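
One possible approach, as a minimal sketch rather than a definitive answer: wrap each download in `asyncio.wait_for` to get a per-site timeout, and pass `return_exceptions=True` to `asyncio.gather` so that a failing site (a DNS error like the one above, a refused connection, a timeout) is returned as a value instead of aborting the whole batch. The helper names `download` and `scrape`, the `timeout` parameter, and the 10-second default are assumptions made for illustration; they are not part of the original script. The semaphore is acquired outside the timeout so that time spent waiting for a free slot does not count against a site. aiohttp also provides `aiohttp.ClientTimeout(total=...)`, which can be passed as `timeout=` to `ClientSession` as an alternative to `wait_for`.

```python
import asyncio


async def download(session, url):
    # Same request as in the question: fetch the page and read the body.
    async with session.get("http://" + url) as response:
        return await response.text()


async def fetch(session, url, sema, timeout):
    # Take a concurrency slot first, then start the clock, so queueing
    # behind the semaphore is not counted as part of the timeout.
    async with sema:
        # Raises asyncio.TimeoutError if the site takes longer than `timeout` seconds.
        return await asyncio.wait_for(download(session, url), timeout)


async def scrape(session, urls, limit=100, timeout=10.0):
    # Hypothetical helper: returns ({url: html}, {url: exception}).
    sema = asyncio.BoundedSemaphore(limit)
    tasks = [fetch(session, url, sema, timeout) for url in urls]
    # return_exceptions=True turns each failure into a result value,
    # so one bad site does not cancel the other downloads.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    pages, failures = {}, {}
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            failures[url] = result
        else:
            pages[url] = result
    return pages, failures


async def main(urls):
    # aiohttp is third-party; it is imported here only so that the
    # helpers above depend on nothing but the standard library.
    import aiohttp

    connector = aiohttp.TCPConnector(ssl=False)
    async with aiohttp.ClientSession(connector=connector) as session:
        pages, failures = await scrape(session, urls)
    for url, html in pages.items():
        if "copyright 2017" in html:
            print(url)
    print(f"skipped {len(failures)} sites that could not be scraped")
```

With this in place, `asyncio.run(main(DomainList))` could stand in for the `loop.run_until_complete(main())` call. The `failures` dictionary maps each skipped domain to the exception it produced, which makes it easy to log or retry those sites later.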