Datetime 配置单元SQL:如何按最大时间对记录组进行子集
我有按ID1和ID2分组的记录,例如:Datetime 配置单元SQL:如何按最大时间对记录组进行子集,datetime,hadoop,hive,hiveql,cloudera,Datetime,Hadoop,Hive,Hiveql,Cloudera,我有按ID1和ID2分组的记录,例如: ID1 ID2 date_time Apple pear 2020-03-09T12:11:25:622Z Apple pear 2020-03-09T12:23:36:825Z Apple lemon 2020-03-08T08:01:16:030Z Apple lemon 2020-03-09T10:11:12:930Z Apple lemon 2020
ID1 ID2 date_time
Apple pear 2020-03-09T12:11:25:622Z
Apple pear 2020-03-09T12:23:36:825Z
Apple lemon 2020-03-08T08:01:16:030Z
Apple lemon 2020-03-09T10:11:12:930Z
Apple lemon 2020-03-09T15:13:02:610Z
Lime peach 2020-03-09T07:34:06:825Z
Lime peach 2020-03-09T07:54:12:220Z
Melon banana 2020-03-09T03:54:11:041Z
Melon banana 2020-03-09T09:22:10:220Z
Orange pear 2020-03-09T11:13:36:217Z
Orange pear 2020-03-09T11:23:26:040Z
Orange pear 2020-03-09T11:43:35:721Z
我正在尝试提取通过某个时间范围的最大记录的记录,这样,如果我想提取最大时间不超过3月9日中午的记录,上述记录将被子集为:
ID1 ID2 date_time
Lime peach 2020-03-09T07:34:06:825Z
Lime peach 2020-03-09T07:54:12:220Z
Melon banana 2020-03-09T03:54:11:041Z
Melon banana 2020-03-09T09:22:10:220Z
Orange pear 2020-03-09T11:13:36:217Z
Orange pear 2020-03-09T11:23:26:040Z
Orange pear 2020-03-09T11:43:35:721Z
我使用了cast(regexp\u replace('2020-03-09T16:05:06:827Z','(.*)T(.*):([^:*?)Z$,'$1$2\\\.$3')作为时间戳)
将日期时间转换为时间戳,这多亏了@leftjoin
谢谢你的帮助
按时间对子设置后添加的顶点问题:
注:我只允许制作临时表格
第一:我通过ID1和ID2对原始数据集进行重复数据消除
create temporary table data1 as
select id1, id2, name, value,
cast(regexp_replace(oldtime,'(.*?)T(.*?):([^:]*?)Z$','$1 $2\\.$3') as timestamp) as date_time
from (
select ID1, ID2,
ROW_NUMBER() OVER (PARTITION BY id1, id2,name, value ORDER BY value) as rn,
case when name like '%_TIME' then value end as date_time
from df.data1
)undup
where undup.rn = 1 AND
value <> '' and value is not null
order by ID1, ID2, date_time ASC
使用max分析函数并通过它进行过滤:
with your_data as (
select stack (12,
'Apple', 'pear' , '2020-03-09T12:11:25:622Z',
'Apple', 'pear' , '2020-03-09T12:23:36:825Z',
'Apple', 'lemon' , '2020-03-08T08:01:16:030Z',
'Apple', 'lemon' , '2020-03-09T10:11:12:930Z',
'Apple', 'lemon' , '2020-03-09T15:13:02:610Z',
'Lime', 'peach' , '2020-03-09T07:34:06:825Z',
'Lime', 'peach' , '2020-03-09T07:54:12:220Z',
'Melon', 'banana' , '2020-03-09T03:54:11:041Z',
'Melon', 'banana' , '2020-03-09T09:22:10:220Z',
'Orange','pear' , '2020-03-09T11:13:36:217Z',
'Orange','pear' , '2020-03-09T11:23:26:040Z',
'Orange','pear' , '2020-03-09T11:43:35:721Z'
) as (ID1,ID2,date_time)
)
select ID1,ID2,date_time
from
(
select max(timestamp(regexp_replace(date_time,'(.*?)T(.*?):([^:]*?)Z$','$1 $2\\.$3'))) over (partition by ID1,ID2) as max_date,
ID1,ID2,timestamp(regexp_replace(date_time,'(.*?)T(.*?):([^:]*?)Z$','$1 $2\\.$3')) as date_time
from your_data
)s
where max_date<='2020-03-09 12:00:00'
Hi@leftjoin,我相信这段代码就是我要寻找的,但是我正在处理的数据集的大小超过了一百万条记录,每当我在子查询中按时间子集id1+id2时,我都会收到一个顶点失败错误。这是我可以通过修改代码来避免的吗?或者我需要和我的数据库管理员一起工作吗?刚刚用vertex fail更新了我的问题problem@lydias您可以删除时间戳转换并与where子句中的“2020-03-09T12:00:00:000Z”进行比较。时间戳格式也可以与max()配合使用。但我不确定这会有多大帮助。我看不出这个问题与代码有什么关系。而且100万张唱片也没什么大不了的。。。可能是别的原因。尝试解决与管理员如果这没有帮助,我会接受答案。非常感谢你。需要与系统管理员讨论内存问题。@lydias这是NullPointerException,不是内存问题。这可能是您的配置单元版本中的一些错误。查看这个jira:HIVE-18786并尝试应用tez.am.container.reuse.enabled=false
ERROR: Execute error: Error while processing statement: FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 2, vertexId=vertex_1583806821840_6890_2_01,
diagnostics=[Task failed, taskId=task_1583806821840_6890_2_01_000036, diagnostics=[TaskAttempt 0 failed, info=[Error: Error
while running task ( failure ) : attempt_1583806821840_6890_2_01_000036_0:java.lang.RuntimeException:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296) at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) at
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) at
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) at
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) at
org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenab
leFutureTask.java:108) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41) at
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at
java.lang.Thread.run(Thread.java:748)Caused by: java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:304) at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:318) at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267) ... 16 moreCaused by:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:378) at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:294) ... 18 moreCaused by:
java.lang.NullPointerException at
org.apache.hadoop.hive.ql.exec.persistence.PTFRowContainer.first(PTFRowContainer.java:115) at
org.apache.hadoop.hive.ql.exec.PTFPartition.iterator(PTFPartition.java:114) at
org.apache.hadoop.hive.ql.udf.ptf.BasePartitionEvaluator.getPartitionAgg(BasePartitionEvaluator.java:200) at
org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.evaluateFunctionOnPartition(WindowingTableFunction.java:155) at
org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.iterator(WindowingTableFunction.java:538) at
org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:349) at
org.apache.hadoop.hive.ql.exec.PTFOperator.process(PTFOperator.java:123) at
org.apache.hadoop.hive.ql.exec.Operator.baseForward(Operator.java:994) at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:940) at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:927) at
org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:363) ... 19 more],
TaskAttempt 1 failed, info=[Error: Error while running task ( failure ) :
attempt_1583806821840_6890_2_01_000036_1:java.lang.RuntimeException: java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296) at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) at
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) at
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) at
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) at
org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenab
leFutureTask.java:108) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41) at
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at
java.lang.Thread.run(Thread.java:748)Caused by: java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:304) at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:318) at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267) ... 16 moreCaused by:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:378) at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:294) ... 18 moreCaused by:
java.lang.NullPointerException at
org.apache.hadoop.hive.ql.exec.persistence.PTFRowContainer.first(PTFRowContainer.java:115) at
org.apache.hadoop.hive.ql.exec.PTFPartition.iterator(PTFPartition.java:114) at
org.apache.hadoop.hive.ql.udf.ptf.BasePartitionEvaluator.getPartitionAgg(BasePartitionEvaluator.java:200) at
org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.evaluateFunctionOnPartition(WindowingTableFunction.java:155) at
org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.iterator(WindowingTableFunction.java:538) at
org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:349) at
org.apache.hadoop.hive.ql.exec.PTFOperator.process(PTFOperator.java:123) at
org.apache.hadoop.hive.ql.exec.Operator.baseForward(Operator.java:994) at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:940) at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:927) at
org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:363) ... 19 more],
TaskAttempt 2 failed, info=[Error: Error while running task ( failure ) :
attempt_1583806821840_6890_2_01_000036_2:java.lang.RuntimeException: java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296) at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) at
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable
with your_data as (
select stack (12,
'Apple', 'pear' , '2020-03-09T12:11:25:622Z',
'Apple', 'pear' , '2020-03-09T12:23:36:825Z',
'Apple', 'lemon' , '2020-03-08T08:01:16:030Z',
'Apple', 'lemon' , '2020-03-09T10:11:12:930Z',
'Apple', 'lemon' , '2020-03-09T15:13:02:610Z',
'Lime', 'peach' , '2020-03-09T07:34:06:825Z',
'Lime', 'peach' , '2020-03-09T07:54:12:220Z',
'Melon', 'banana' , '2020-03-09T03:54:11:041Z',
'Melon', 'banana' , '2020-03-09T09:22:10:220Z',
'Orange','pear' , '2020-03-09T11:13:36:217Z',
'Orange','pear' , '2020-03-09T11:23:26:040Z',
'Orange','pear' , '2020-03-09T11:43:35:721Z'
) as (ID1,ID2,date_time)
)
select ID1,ID2,date_time
from
(
select max(timestamp(regexp_replace(date_time,'(.*?)T(.*?):([^:]*?)Z$','$1 $2\\.$3'))) over (partition by ID1,ID2) as max_date,
ID1,ID2,timestamp(regexp_replace(date_time,'(.*?)T(.*?):([^:]*?)Z$','$1 $2\\.$3')) as date_time
from your_data
)s
where max_date<='2020-03-09 12:00:00'
id1 id2 date_time
Lime peach 2020-03-09 07:54:12.22
Lime peach 2020-03-09 07:34:06.825
Melon banana 2020-03-09 09:22:10.22
Melon banana 2020-03-09 03:54:11.041
Orange pear 2020-03-09 11:43:35.721
Orange pear 2020-03-09 11:23:26.04
Orange pear 2020-03-09 11:13:36.217