Amazon S3: how can I run an Amazon Athena query without exhausting resources?
I am trying to run this query to get some data. The files on S3 under s3://my_datalake/my_table/year=2018/month=9/day=7/ total 1.1 TB, spread across 10,014 snappy.parquet objects:
SELECT array_join(array_agg(distinct endpoint),',') as endpoints_all, count(endpoint) as count_endpoints
FROM my_datalake.my_table
WHERE year=2018 and month=09 and day=07
and ts between timestamp '2018-09-07 00:00:00' and timestamp '2018-09-07 23:59:59'
and status = '2'
GROUP BY domain, size, device_id, ip
But I get an error:
Query exhausted resources at this scale factor
I have year, month, day and hour partitions. How can I run this query? Can I do it with Amazon Athena, or do I need to use another tool?
The schema of my table is:
CREATE EXTERNAL TABLE `ssp_request_prueba`(
`version` string,
`adunit` string,
`adunit_original` string,
`brand` string,
`country` string,
`device_connection_type` string,
`device_density` string,
`device_height` string,
`device_id` string,
`device_type` string,
`device_width` string,
`domain` string,
`endpoint` string,
`endpoint_version` string,
`external_dfp_id` string,
`id_req` string,
`ip` string,
`lang` string,
`lat` string,
`lon` string,
`model` string,
`ncc` string,
`noc` string,
`non` string,
`os` string,
`osv` string,
`scc` string,
`sim_operator_code` string,
`size` string,
`soc` string,
`son` string,
`source` string,
`ts` timestamp,
`user_agent` string,
`status` string,
`delivery_network` string,
`delivery_time` string,
`delivery_status` string,
`delivery_network_key` string,
`delivery_price` string,
`device_id_original` string,
`tracking_limited` string,
`from_cache` string,
`request_price` string)
PARTITIONED BY (
`year` int,
`month` int,
`day` int,
`hour` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://my_datalake/my_table'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1538747353')
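Since the table is also partitioned by hour, one possible workaround (a sketch, not a guaranteed fix) is to restrict each run to a single hour partition, which shrinks the data each query must hold in memory, and then combine the 24 result sets afterwards. Note that once you filter on the hour partition, the ts BETWEEN predicate for the full day becomes redundant:

-- Sketch: query one hour partition at a time to reduce memory pressure.
-- Repeat for hour = 0 .. 23 and merge the result sets afterwards.
SELECT array_join(array_agg(distinct endpoint), ',') AS endpoints_all,
       count(endpoint) AS count_endpoints
FROM my_datalake.my_table
WHERE year = 2018 AND month = 9 AND day = 7 AND hour = 0
  AND status = '2'
GROUP BY domain, size, device_id, ip

Merging per-hour results only gives the same answer as the single-day query if the distinct-endpoint lists are deduplicated again across hours, so treat this as a way to probe where the memory limit is hit rather than a drop-in replacement.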
The problem is probably related to the array_join and array_agg functions. I assume the per-node memory limit of the Athena service was exceeded in this case; Athena may not be able to handle this volume of data with that combination of functions. Can you share more details about your data source? Is it on S3 or in a database? A data sample, and what format the data is in?

It is on S3. The domain, size, device_id, ip and endpoint fields are strings. Example: domain=stackoverflow.com, size=6x6, device_id=a15848sdsd, ip=127.0.0.7, endpoint=vastSDS128.

Can you post the table definition from SHOW CREATE TABLE my_datalake.my_table;?

I just tried, but I hit the same problem: Query exhausted resources at this scale factor (Run time: 6 minutes 57 seconds, Data scanned: 133.91 GB).

What is the size of the files on S3 for the partitions used in the query, i.e. under s3://my_datalake/my_table/year=2018/month=9/day=7/? Also, does this query work, and what is its result? SELECT count(endpoint) as count_endpoints FROM my_datalake.my_table WHERE year=2018 and month=09 and day=07 and ts between timestamp '2018-09-07 00:00:00' and timestamp '2018-09-07 23:59:59' and status = '2' GROUP BY domain, size, device_id, ip

My sample data was no larger than 10 GB, but in your case the per-node memory limit of the Athena service may have been exceeded.

I also tested several variants of the query, using array_join(array_agg(distinct endpoint),','), array_agg(distinct endpoint), and count(distinct endpoint); all variants scan the same amount of data on S3, so it is probably a memory problem.

I mean that, for example, you could run the query in EMR and generate flat files to query later in Athena. You have a fairly large data lake, so I think you will run into many problems executing this particular query. I think you should transform the partitioned data and create a subset of the data before running the query in Athena.