Python asyncio stops returning responses after a number of requests
I have a list of many URLs and I want to check whether they are still alive. I found some code online and modified it a bit. I'm a beginner with asyncio, so the code may be a bit messy:
    import asyncio
    import re
    import csv
    import sys
    from typing import IO
    import urllib.error
    import urllib.parse
    import json
    from log import logger_setup
    import pandas as pd
    import aiofiles
    import aiohttp
    from aiohttp import ClientSession
    from lxml.html import fromstring

    logger = logger_setup('main')


    async def fetch(url, session):
        """GET request wrapper to fetch page HTML."""
        if not isinstance(url, str) or not url.startswith('http'):
            return None, 'wrong data format', None
        resp = await session.request(method="GET", url=url)
        asyncio.sleep(15)
        resp.raise_for_status()
        logger.info("Got response [{}] for URL: {}".format(resp.status, url))
        html = await resp.text()
        tree = fromstring(html)
        title = tree.findtext('.//title')
        return html, resp.status, title


    async def parse(url, session):
        """Fetch `url` and append the result to the results CSV."""
        try:
            html, status, title = await fetch(url, session)
        except Exception as e:
            logger.error("Exception for {}: {}".format(url, e))
            status = e
            html = None
            title = None
        finally:
            logger.info("{} [{}] {}".format(url, status, title))
            if status != 'wrong data format':
                async with aiofiles.open('~/data/lake/dormant_worker/data/{}/new/results.csv'.format(mode), 'a+') as f:
                    writer = csv.writer(f, delimiter=',')
                    await writer.writerow([url, title, status])
        return html, status


    async def write_one(url, session):
        """Write the fetched HTML for `url` to its own file."""
        html, status = await parse(url, session)
        try:
            async with aiofiles.open('~/data/{}/new/source_codes/{}'.format(mode, url.replace('/', '-')), 'w') as f:
                await f.write("{}".format(html))
            logger.info("Wrote results for source URL: {}".format(url))
        except Exception as e:
            logger.info("{} {}".format(url, e))


    async def crawl_and_write(urls):
        """Crawl & write concurrently for multiple `urls`."""
        async with ClientSession() as session:
            tasks = [write_one(url, session) for url in urls]
            await asyncio.gather(*tasks)


    def main(arg):
        global mode
        mode = arg
        df = pd.read_csv('~/data/{}/clearnet_{}.csv'.format(mode, mode))
        urls = df['url']
        asyncio.run(crawl_and_write(urls=urls))


    main(sys.argv[1])
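One thing worth noting in the code above: `asyncio.sleep(15)` inside `fetch` is called without `await`, so it only creates a coroutine object and discards it; the 15-second pause never happens. A minimal, self-contained sketch of the difference (function names are made up for illustration):

```python
import asyncio
import time

async def broken_pause():
    asyncio.sleep(0.2)        # coroutine created and discarded: no delay;
    return "done"             # Python warns "coroutine ... was never awaited"

async def working_pause():
    await asyncio.sleep(0.2)  # actually suspends this task for 0.2 s
    return "done"

t0 = time.monotonic()
asyncio.run(broken_pause())
broken_elapsed = time.monotonic() - t0   # essentially instant

t0 = time.monotonic()
asyncio.run(working_pause())
working_elapsed = time.monotonic() - t0  # at least 0.2 s
```

The same applies to the commented-out `asyncio.sleep(5)` in `parse`: without `await`, neither call ever throttles anything.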
With roughly 12k URLs, after about 2k sites have been queried nothing comes back any more. A sample of the responses looks like this:
15:57:17,82 main INFO Got response [200] for URL: https://www.zeux.com/
15:57:17,85 main INFO https://www.zeux.com/ [200] Zeux | Zeux - Where money never sleeps
15:57:17,97 main INFO Got response [200] for URL: https://www.spheroiduniverse.io/
15:57:17,103 main INFO https://www.spheroiduniverse.io/ [200] Spheroid Universe
15:57:17,163 main ERROR Exception for https://adenium.io: Cannot connect to host adenium.io:443 ssl:default [Name or service not known]
15:57:17,164 main INFO https://adenium.io [Cannot connect to host adenium.io:443 ssl:default [Name or service not known]] None
15:57:17,164 main INFO Wrote results for source URL: https://www.one.game
15:57:17,165 main INFO Wrote results for source URL: https://mvlchain.io
15:57:17,165 main INFO Wrote results for source URL: https://www.coinzark.com
15:57:17,185 main ERROR Exception for https://treasuredapp.io/: 503, message='Service Temporarily Unavailable', url=URL('https://treasuredapp.io/')
15:57:17,186 main INFO https://treasuredapp.io/ [503, message='Service Temporarily Unavailable', url=URL('https://treasuredapp.io/')] None
15:57:17,197 main INFO Got response [200] for URL: https://swapswop.io/
while after that point the log looks like this:
16:00:57,308 main ERROR Exception for https://etherc.io/#EET-ETH:
16:00:57,308 main INFO https://etherc.io/#EET-ETH [] None
16:00:57,309 main ERROR Exception for http://xrp2019.com/:
16:00:57,309 main INFO http://xrp2019.com/ [] None
16:00:57,309 main ERROR Exception for https://ethx2.io/:
16:00:57,310 main INFO https://ethx2.io/ [] None
16:00:57,310 main ERROR Exception for https://cryptomcafee.com/:
16:00:57,311 main INFO https://cryptomcafee.com/ [] None
16:00:57,311 main ERROR Exception for https://nydig.com:
16:00:57,312 main INFO https://nydig.com [] None
16:00:57,312 main ERROR Exception for http://beznal.pk/:
16:00:57,313 main INFO http://beznal.pk/ [] None
16:00:57,313 main ERROR Exception for https://www.bex500.com/:
16:00:57,313 main INFO https://www.bex500.com/ [] None
16:00:57,314 main ERROR Exception for https://alphawallet.com/:
16:00:57,314 main INFO https://alphawallet.com/ [] None
16:00:57,315 main ERROR Exception for https://mkr.tools/:
I think some anti-bot protection may be triggering this behavior. What do you think? And how could I eventually get around it? Would asyncio.sleep(some amount) make a difference? I tried using it but saw no change, though maybe I'm using it incorrectly.
Any help would be greatly appreciated.

Comment: Nice exception handling xD: the messages in the `16:00:57,309 main ERROR Exception for ...:` lines are empty.
Answer: You need to `await asyncio.sleep(5)` for it to take effect. Checking treasuredapp... it's behind Cloudflare... but it looks like you are hitting it while the site is down. You could also read, for example, about Scrapy's AutoThrottle extension. Are you actually awaiting the call? You need `await asyncio.sleep(15)`.
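Besides awaiting the sleep, scheduling all 12k requests at once with `asyncio.gather` can exhaust sockets and trip rate limits. A common pattern (a sketch, not the poster's code) is to cap concurrency with an `asyncio.Semaphore`; here `fetch_one` is a stand-in for the real HTTP call:

```python
import asyncio

async def fetch_one(url, sem):
    # Acquire the semaphore before doing the (simulated) request,
    # so at most `max_concurrency` fetches are in flight at once.
    async with sem:
        await asyncio.sleep(0)  # stand-in for the real session.request(...)
        return url, 200

async def crawl(urls, max_concurrency=50):
    sem = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_one(u, sem) for u in urls]
    # gather preserves input order in its results
    return await asyncio.gather(*tasks)

results = asyncio.run(crawl(["https://example.com/{}".format(i) for i in range(200)]))
```

With the real `session.request` in place of the `sleep(0)`, lowering `max_concurrency` trades throughput for politeness, much like Scrapy's AutoThrottle does adaptively.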