S3 unload command (Python 3.x wrapper) hangs indefinitely
Tags: python, postgresql, amazon-web-services, amazon-s3, amazon-redshift

I have a process that pulls from a huge Redshift table with more than 4 billion rows. I am trying to unload it to an S3 bucket and then copy it back into another table. The problem is that it sits and churns for about 3 hours and then fails. The strange thing is that when I look in the bucket, I can see 59 slices and a manifest file. But it does not put them there until the process ends (the last error I got, I think, was that the server closed unexpectedly, or something like that). Is there a way to optimize this kind of transaction, or is there a better way to do this type of unload/copy? I would like to know why the process sits and hangs, when the timestamps in my bucket show that it uploaded the files to S3 hours ago. Do I need some kind of code to kill it automatically after a certain amount of time? Here is my code:
from datetime import datetime
import logging

import boto3
import psycopg2 as ppg2

from inst_utils import aws
from inst_config import config3

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] - %(message)s')

if __name__ == '__main__':
    # Unload step
    timestamp = datetime.now()
    month = timestamp.month
    year = timestamp.year
    s3_sesh = boto3.session.Session(**config3.S3_INFO)
    s3 = s3_sesh.resource('s3')
    fname = 'load_{}_{:02d}'.format(year, month)
    bucket_url = ('canvas_logs/agg_canvas_logs_user_agent_types/'
                  '{}/'.format(fname))
    unload_url = ('s3://{}/{}'.format(config3.S3_BUCKET, bucket_url))
    s3.Bucket(config3.S3_BUCKET).put_object(Key=bucket_url)
    table_name = 'requests_{}_{:02d}'.format(year, month - 1)
    logging.info('Starting unload.')
    try:
        with ppg2.connect(**config3.REQUESTS_POSTGRES_INFO) as conn:
            cur = conn.cursor()
            # TODO add sql the sql folder to clean up this program.
            unload = r'''
            unload ('select
                         user_id
                         ,course_id
                         ,request_month
                         ,user_agent_type
                         ,count(session_id)
                         ,\'DEV\' etl_requests_usage
                         ,CONVERT_TIMEZONE(\'MST\', getdate()) etl_datetime_local
                         ,\'agg_canvas_logs_user_agent_types\' etl_transformation_name
                         ,\'N/A\' etl_pdi_version
                         ,\'N/A\' etl_pdi_build_version
                         ,null etl_pdi_hostname
                         ,null etl_pdi_ipaddress
                         ,null etl_checksum_md5
                     from
                         (select distinct
                              user_id
                              ,context_id as course_id
                              ,date_trunc(\'month\', request_timestamp) request_month
                              ,session_id
                              ,case
                                  when user_agent like \'%CanvasAPI%\' then \'api\'
                                  when user_agent like \'%candroid%\' then \'mobile_app_android\'
                                  when user_agent like \'%iCanvas%\' then \'mobile_app_ios\'
                                  when user_agent like \'%CanvasKit%\' then \'mobile_app_ios\'
                                  when user_agent like \'%Windows NT%\' then \'desktop\'
                                  when user_agent like \'%MacBook%\' then \'desktop\'
                                  when user_agent like \'%iPhone%\' then \'mobile\'
                                  when user_agent like \'%iPod Touch%\' then \'mobile\'
                                  when user_agent like \'%iPad%\' then \'mobile\'
                                  when user_agent like \'%iOS%\' then \'mobile\'
                                  when user_agent like \'%CrOS%\' then \'desktop\'
                                  when user_agent like \'%Android%\' then \'mobile\'
                                  when user_agent like \'%Linux%\' then \'desktop\'
                                  when user_agent like \'%Mac OS%\' then \'desktop\'
                                  when user_agent like \'%Macintosh%\' then \'desktop\'
                                  else \'other_unknown\'
                               end as user_agent_type
                          from {}
                          where context_type = \'Course\')
                     group by
                         user_id
                         ,course_id
                         ,request_month
                         ,user_agent_type')
            to '{}'
            credentials 'aws_access_key_id={};aws_secret_access_key={}'
            manifest
            gzip
            delimiter '|'
            '''.format(
                table_name, unload_url, config3.S3_ACCESS, config3.S3_SECRET)
            cur.execute(unload)
            conn.commit()
    except ppg2.Error as e:
        logging.critical('Error occurred during transaction: {}'.format(e))
        raise Exception('{}'.format(e))

    logging.info('Starting copy process.')
    schema_name = 'ods_canvas_logs'
    table_name = 'agg_canvas_logs_user_agent_types'
    manifest_url = unload_url + 'manifest'
    logging.info('Manifest url: {}'.format(manifest_url))
    load = aws.RedshiftLoad(schema_name,
                            table_name,
                            manifest_url,
                            config3.S3_INFO,
                            config3.REDSHIFT_POSTGRES_INFO_PROD,
                            config3.REDSHIFT_POSTGRES_INFO,
                            safe_load=True,
                            truncate=True
                            )
    load.execute()
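The RedshiftLoad object is just a wrapper class I created to simplify copying files from S3, since that is a very common task in my job.

Are you copying the data back to the same cluster? Have you tried using Create Table As Select (CTAS) instead of unload and copy?

No, this is a table in another cluster.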
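On the question of killing the statement automatically after a set time, here is a minimal sketch, not a confirmed fix for this case. It assumes two things the thread does not establish: that the client hangs because the idle TCP connection is silently dropped while the unload runs, and that a server-side limit is acceptable. The helper names (`with_keepalives`, `statement_timeout_sql`, `run_with_timeout`) and the default values are hypothetical. The `keepalives*` keys are standard libpq connection parameters, which `psycopg2.connect` passes through, and `statement_timeout` is a Redshift session setting given in milliseconds.

```python
def with_keepalives(conn_info, idle=60, interval=10, count=5):
    """Return a copy of conn_info with libpq TCP keepalive settings added.

    The values are illustrative assumptions, not tuned recommendations.
    """
    params = dict(conn_info)
    params.update(
        keepalives=1,                  # enable TCP keepalives
        keepalives_idle=idle,          # seconds of idle time before the first probe
        keepalives_interval=interval,  # seconds between probes
        keepalives_count=count,        # failed probes before the connection is dropped
    )
    return params


def statement_timeout_sql(hours):
    """Build the session-level timeout statement (value in milliseconds)."""
    return 'set statement_timeout to {}'.format(int(hours * 60 * 60 * 1000))


def run_with_timeout(conn_info, sql, hours=4):
    """Run one statement with keepalives on and a server-side time limit."""
    import psycopg2  # imported here so the helpers above work without a driver

    conn = psycopg2.connect(**with_keepalives(conn_info))
    try:
        with conn.cursor() as cur:
            # The server cancels any later statement that exceeds the limit.
            cur.execute(statement_timeout_sql(hours))
            cur.execute(sql)
        conn.commit()
    finally:
        conn.close()
```

In the script above this would be called as `run_with_timeout(config3.REQUESTS_POSTGRES_INFO, unload)`, in place of the `with ppg2.connect(...)` block. If the statement exceeds the limit, the server cancels it and psycopg2 raises an error, which the existing `except ppg2.Error` handler would catch.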