Python asyncio stops returning responses after a number of requests
I have a list of many URLs and I want to check whether they are still alive. I found some code online and modified it a bit. I'm a beginner with asyncio, so the code may be a bit messy:
    import asyncio
    import re
    import csv
    import sys
    from typing import IO
    import urllib.error
    import urllib.parse
    import json
    from log import logger_setup
    import pandas as pd
    import aiofiles
    import aiohttp
    from aiohttp import ClientSession
    from lxml.html import fromstring

    logger = logger_setup('main')


    async def fetch(url, session):
        """GET request wrapper to fetch page HTML."""
        if not isinstance(url, str) or not url.startswith('http'):
            return None, 'wrong data format', None
        resp = await session.request(method="GET", url=url)
        asyncio.sleep(15)
        resp.raise_for_status()
        logger.info("Got response [{}] for URL: {}".format(resp.status, url))
        html = await resp.text()
        tree = fromstring(html)
        title = tree.findtext('.//title')
        return html, resp.status, title


    async def parse(url, session):
        """Fetch `url` and append the result to the results CSV."""
        try:
            html, status, title = await fetch(url, session)
        except Exception as e:
            logger.error("Exception for {}: {}".format(url, e))
            status = e
            html = None
            title = None
        finally:
            logger.info("{} [{}] {}".format(url, status, title))
            if status != 'wrong data format':
                async with aiofiles.open('~/data/lake/dormant_worker/data/{}/new/results.csv'.format(mode), 'a+') as f:
                    writer = csv.writer(f, delimiter=',')
                    await writer.writerow([url, title, status])
        return html, status


    async def write_one(url, session):
        """Write the fetched HTML for `url` to its own file."""
        html, status = await parse(url, session)
        try:
            async with aiofiles.open('~/data/{}/new/source_codes/{}'.format(mode, url.replace('/', '-')), 'w') as f:
                await f.write("{}".format(html))
            logger.info("Wrote results for source URL: {}".format(url))
        except Exception as e:
            logger.info("{} {}".format(url, e))


    async def crawl_and_write(urls):
        """Crawl & write concurrently for multiple `urls`."""
        async with ClientSession() as session:
            tasks = [write_one(url, session) for url in urls]
            await asyncio.gather(*tasks)


    def main(arg):
        global mode
        mode = arg
        df = pd.read_csv('~/data/{}/clearnet_{}.csv'.format(mode, mode))
        urls = df['url']
        asyncio.run(crawl_and_write(urls=urls))


    main(sys.argv[1])
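One thing worth noting in the code above: `asyncio.sleep(15)` inside `fetch` is called without `await`, so it only creates a coroutine object and discards it; the 15-second pause never happens. A minimal, self-contained sketch of the difference (function names are made up for illustration):

```python
import asyncio
import time

async def broken_pause():
    asyncio.sleep(0.2)        # coroutine created and discarded: no delay;
    return "done"             # Python warns "coroutine ... was never awaited"

async def working_pause():
    await asyncio.sleep(0.2)  # actually suspends this task for 0.2 s
    return "done"

t0 = time.monotonic()
asyncio.run(broken_pause())
broken_elapsed = time.monotonic() - t0   # essentially instant

t0 = time.monotonic()
asyncio.run(working_pause())
working_elapsed = time.monotonic() - t0  # at least 0.2 s
```

The same applies to the commented-out `asyncio.sleep(5)` in `parse`: without `await`, neither call ever throttles anything.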
With roughly 12k URLs, after about 2k sites have been queried nothing comes back any more. A sample of the responses looks like this:
15:57:17,82 main INFO Got response [200] for URL: https://www.zeux.com/
15:57:17,85 main INFO https://www.zeux.com/ [200] Zeux | Zeux - Where money never sleeps
15:57:17,97 main INFO Got response [200] for URL: https://www.spheroiduniverse.io/
15:57:17,103 main INFO https://www.spheroiduniverse.io/ [200] Spheroid Universe
15:57:17,163 main ERROR Exception for https://adenium.io: Cannot connect to host adenium.io:443 ssl:default [Name or service not known]
15:57:17,164 main INFO https://adenium.io [Cannot connect to host adenium.io:443 ssl:default [Name or service not known]] None
15:57:17,164 main INFO Wrote results for source URL: https://www.one.game
15:57:17,165 main INFO Wrote results for source URL: https://mvlchain.io
15:57:17,165 main INFO Wrote results for source URL: https://www.coinzark.com
15:57:17,185 main ERROR Exception for https://treasuredapp.io/: 503, message='Service Temporarily Unavailable', url=URL('https://treasuredapp.io/')
15:57:17,186 main INFO https://treasuredapp.io/ [503, message='Service Temporarily Unavailable', url=URL('https://treasuredapp.io/')] None
15:57:17,197 main INFO Got response [200] for URL: https://swapswop.io/
while after that point the log looks like this:
16:00:57,308 main ERROR Exception for https://etherc.io/#EET-ETH:
16:00:57,308 main INFO https://etherc.io/#EET-ETH [] None
16:00:57,309 main ERROR Exception for http://xrp2019.com/:
16:00:57,309 main INFO http://xrp2019.com/ [] None
16:00:57,309 main ERROR Exception for https://ethx2.io/:
16:00:57,310 main INFO https://ethx2.io/ [] None
16:00:57,310 main ERROR Exception for https://cryptomcafee.com/:
16:00:57,311 main INFO https://cryptomcafee.com/ [] None
16:00:57,311 main ERROR Exception for https://nydig.com:
16:00:57,312 main INFO https://nydig.com [] None
16:00:57,312 main ERROR Exception for http://beznal.pk/:
16:00:57,313 main INFO http://beznal.pk/ [] None
16:00:57,313 main ERROR Exception for https://www.bex500.com/:
16:00:57,313 main INFO https://www.bex500.com/ [] None
16:00:57,314 main ERROR Exception for https://alphawallet.com/:
16:00:57,314 main INFO https://alphawallet.com/ [] None
16:00:57,315 main ERROR Exception for https://mkr.tools/:
I think some anti-bot protection may be triggering this behavior. What do you think? And how could I eventually get around it? Would asyncio.sleep(some amount) make a difference? I tried using it but saw no change, though maybe I'm using it incorrectly.
Any help would be greatly appreciated.

Comment: Nice exception handling xD: the messages in the `16:00:57,309 main ERROR Exception for ...:` lines are empty.
Answer: You need to `await asyncio.sleep(5)` for it to take effect. Checking treasuredapp... it's behind Cloudflare... but it looks like you are hitting it while the site is down. You could also read, for example, about Scrapy's AutoThrottle extension. Are you actually awaiting the call? You need `await asyncio.sleep(15)`.
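Besides awaiting the sleep, scheduling all 12k requests at once with `asyncio.gather` can exhaust sockets and trip rate limits. A common pattern (a sketch, not the poster's code) is to cap concurrency with an `asyncio.Semaphore`; here `fetch_one` is a stand-in for the real HTTP call:

```python
import asyncio

async def fetch_one(url, sem):
    # Acquire the semaphore before doing the (simulated) request,
    # so at most `max_concurrency` fetches are in flight at once.
    async with sem:
        await asyncio.sleep(0)  # stand-in for the real session.request(...)
        return url, 200

async def crawl(urls, max_concurrency=50):
    sem = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_one(u, sem) for u in urls]
    # gather preserves input order in its results
    return await asyncio.gather(*tasks)

results = asyncio.run(crawl(["https://example.com/{}".format(i) for i in range(200)]))
```

With the real `session.request` in place of the `sleep(0)`, lowering `max_concurrency` trades throughput for politeness, much like Scrapy's AutoThrottle does adaptively.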