Hive: applying window functions to a large dataset (how to optimize?)

Tags: hive, bigdata, presto

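I have to run some data analysis on a table with more than 400 million rows. I got this working on a small sample, but I believe it will run out of memory in production.

The table structure is as follows (for millions of serial numbers):

I need the date on which the current status_1 = 'in transit', together with the earlier date on which status_2 = 'x'. The result should look like this: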

+-----------+---------------+------------+----------+------------+
|  date_1   | serial_number |  status_1  | status_2 |   date_2   |
+-----------+---------------+------------+----------+------------+
| 11/2/2018 |           123 | in transit | x        | 10/20/2018 |
+-----------+---------------+------------+----------+------------+
I got this using two rank functions, but it will probably choke on a large table:

with transit as (
select 
*
from (
    select *,
    rank() over(partition by serial_number order by date desc) rnk
    from sample_t 
    order by serial_number, date asc
    ) 
where rnk=1 and status_1 = 'in transit'
),
x_type as (
select 
*
from (
    select *,
    rank() over(partition by serial_number order by date desc) rnk
    from sample_t 
    order by serial_number, date asc
    ) 
where rnk>1 and status_2 = 'x'
)
select tr.date date_1,
tr.serial_number,
tr.status_1,
x.status_2,
x.date date_2
from transit tr left join x_type x on tr.serial_number = x.serial_number

I don't know how to achieve this with a single rank function. Is there a better, more efficient way?

You can use lag to do this:

select *
from (select t.*
      -- status_2 and date of the previous row (ordered by date) for the same serial number
      ,lag(status_2) over(partition by serial_number order by date) as prev_status_2
      ,lag(date) over(partition by serial_number order by date) as prev_date
      from sample_t t
     ) t
where status_1 = 'in transit' and prev_status_2 = 'x'
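
To sanity-check the lag approach before running it on the full table, here is a minimal sketch that executes the same query shape on a handful of rows. It rests on several assumptions: SQLite (via Python's built-in sqlite3 module, SQLite 3.25 or newer for window functions) stands in for Hive/Presto, the table and column names follow the question (sample_t, serial_number) with the 'in transit' spelling from the expected output, the sample rows and the status values 'received', 'delivered' and 'y' are made up, and dates are stored as ISO strings so that they sort chronologically.

```python
import sqlite3

# Hypothetical sample rows: (date, serial_number, status_1, status_2).
# ISO-formatted date strings are an assumption so that ORDER BY is chronological.
ROWS = [
    ("2018-10-20", 123, "received", "x"),
    ("2018-11-02", 123, "in transit", "y"),
    ("2018-10-25", 456, "in transit", "y"),
    ("2018-11-01", 456, "delivered", "x"),
]

# Same shape as the lag query in the answer, with the output columns renamed
# to match the result layout requested in the question.
QUERY = """
select date as date_1, serial_number, status_1,
       prev_status_2 as status_2, prev_date as date_2
from (select t.*,
             lag(status_2) over (partition by serial_number order by date) as prev_status_2,
             lag(date)     over (partition by serial_number order by date) as prev_date
      from sample_t t) t
where status_1 = 'in transit' and prev_status_2 = 'x'
"""


def run_lag_query(rows):
    """Load rows into an in-memory SQLite table and run the lag query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table sample_t "
        "(date text, serial_number integer, status_1 text, status_2 text)"
    )
    conn.executemany("insert into sample_t values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = run_lag_query(ROWS)
print(result)
```

Serial number 123 is returned because the row immediately before its 'in transit' row has status_2 = 'x'. Serial number 456 is not, since its 'in transit' row has no earlier row. Note that lag only looks at the immediately preceding row, whereas the rnk > 1 condition in the question matches any earlier row, so the two queries can differ when other rows sit between the 'x' row and the 'in transit' row.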
