Hive: Using Hive to transform logs for clickpath analysis

I have a click log like the following:

        userID     time                    URL  
           1       2011-03-1 12:30:01      abc.com
           2       2011-03-1 12:30:04      xyz.com
           1       2011-03-1 12:30:46      abc.com/new
           2       2011-03-1 12:31:02      xyz.com/fun
           2       2011-03-1 12:36:08      xyz.com/funner
           1       2011-03-1 12:45:46      abc.com/newer
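I want to transform it into clickpath data organized by session, where a session is any series of clicks that starts after a gap of more than 10 minutes since the user's previous click, because I want to run a clickpath analysis. Here is the expected result: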

        userID     sessionStart           clicktime               Seconds       fromPage          toPage
          1        2011-03-1 12:30:01     2011-03-1 12:30:01      NULL          NULL              abc.com
          1        2011-03-1 12:30:01     2011-03-1 12:30:46      45            abc.com           abc.com/new
          1        2011-03-1 12:30:01     NULL                    NULL          abc.com/new       NULL
          1        2011-03-1 12:45:46     2011-03-1 12:45:46      NULL          NULL              abc.com/newer
          1        2011-03-1 12:45:46     NULL                    NULL          abc.com/newer     NULL
          2        2011-03-1 12:30:04     2011-03-1 12:30:04      NULL          NULL              xyz.com
          2        2011-03-1 12:30:04     2011-03-1 12:31:02      58            xyz.com           xyz.com/fun
          2        2011-03-1 12:30:04     2011-03-1 12:36:08      306           xyz.com/fun       xyz.com/funner
          2        2011-03-1 12:30:04     NULL                    NULL          xyz.com/funner    NULL
Note that user 1 has two distinct sessions, because the gap between the second and third clicks is longer than 10 minutes.
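
The session rule can be sketched in plain Python to make the 10-minute cut-off concrete. This is only an illustration, not Hive code: the function name `split_sessions` is hypothetical, and the timestamp format is assumed from the sample log.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=10)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed from the sample log


def parse(text):
    """Parse one timestamp string from the click log."""
    return datetime.strptime(text, TIME_FORMAT)


def split_sessions(click_times):
    """Group one user's click timestamps (strings) into sessions.

    A click opens a new session when it comes more than SESSION_GAP
    after that user's previous click.
    """
    sessions = []
    prev = None
    for text in sorted(click_times, key=parse):
        current = parse(text)
        if prev is None or current - prev > SESSION_GAP:
            sessions.append([])  # first click, or gap too long: new session
        sessions[-1].append(text)
        prev = current
    return sessions


user1 = ["2011-03-1 12:30:01", "2011-03-1 12:30:46", "2011-03-1 12:45:46"]
user2 = ["2011-03-1 12:30:04", "2011-03-1 12:31:02", "2011-03-1 12:36:08"]
print(len(split_sessions(user1)))  # user 1: the 15-minute gap splits the clicks
print(len(split_sessions(user2)))  # user 2: every gap is under 10 minutes
```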


I thought I had found a solution that relies on Hive 0.11, but I am running version 0.10, so now I am stuck.

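I think you can use Hive's TRANSFORM feature with a custom reducer script. You must make sure that all rows with the same user id are handled by the same reducer, using DISTRIBUTE BY, and that they arrive sorted by user and, within each user, in ascending time order, using SORT BY.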

ADD FILE hdfs:///path/to/your/scripts/reducer_sessionizer.py;

CREATE TABLE clickStream AS
SELECT TRANSFORM (a.user_id, a.time, a.url)
       USING 'python reducer_sessionizer.py'
       AS (user_id, toPage, sessionStart, fromPage)
FROM (
  SELECT user_id, time, url
  FROM rawData
  DISTRIBUTE BY user_id
  SORT BY user_id, time
) a;
Your script (in Python here, as an example) reads the dataset line by line, and you handle the rows where the key (the user id) changes:

import sys
from datetime import datetime, timedelta

# A new session starts when the gap since the previous click exceeds this.
SESSION_GAP = timedelta(minutes=10)
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'  # assumed from the sample log
NULL = '\\N'  # Hive reads a cell containing only \N as NULL

prev_user = None
prev_time = None
prev_url = None
sess_start = None

for line in sys.stdin:
    # Hive sends each row as tab-separated text: user_id, time, url
    user_id, time_str, url = line.rstrip('\n').split('\t')
    click_time = datetime.strptime(time_str, TIME_FORMAT)

    # Start a new session when the user (the key) changes, or when the gap
    # since the previous click is longer than SESSION_GAP.
    if user_id != prev_user or click_time - prev_time > SESSION_GAP:
        prev_url = NULL
        sess_start = time_str

    # Output columns: user_id, toPage, session start time, fromPage
    print('\t'.join([user_id, url, sess_start, prev_url]))

    # Keep this click's user, time and url for the next iteration.
    prev_user = user_id
    prev_time = click_time
    prev_url = url

(I am not really a developer, so treat this as a rough sketch; the date handling in particular needs adapting to your data.)
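
To get closer to the six-column result in the question, the same idea can be extended to emit the click time, the seconds since the previous click, and a closing row per session. The following is a minimal sketch under stated assumptions, not a tested Hive job: the function name `clickpath_rows` is hypothetical, the timestamp format is taken from the sample log, and NULLs are written as `\N`, which is how Hive's TRANSFORM represents NULL by default.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=10)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed from the sample log
NULL = "\\N"  # a cell containing only \N is read back by Hive as NULL


def clickpath_rows(lines):
    """Yield tab-separated rows: user_id, session_start, click_time,
    seconds, from_page, to_page.

    `lines` must be tab-separated (user_id, time, url) rows, grouped by
    user and sorted by time within each user.
    """
    prev_user = prev_time = prev_url = sess_start = None
    for line in lines:
        user_id, time_str, url = line.rstrip("\n").split("\t")
        click_time = datetime.strptime(time_str, TIME_FORMAT)
        new_session = (user_id != prev_user
                       or click_time - prev_time > SESSION_GAP)
        if new_session:
            if prev_user is not None:
                # Close the previous session with an exit row.
                yield "\t".join([prev_user, sess_start, NULL, NULL, prev_url, NULL])
            sess_start = time_str
            seconds, from_page = NULL, NULL
        else:
            seconds = str(int((click_time - prev_time).total_seconds()))
            from_page = prev_url
        yield "\t".join([user_id, sess_start, time_str, seconds, from_page, url])
        prev_user, prev_time, prev_url = user_id, click_time, url
    if prev_user is not None:
        # Close the last session of the input.
        yield "\t".join([prev_user, sess_start, NULL, NULL, prev_url, NULL])


# Sample rows from the question, in the order the reducer would see them.
sample = [
    "1\t2011-03-1 12:30:01\tabc.com",
    "1\t2011-03-1 12:30:46\tabc.com/new",
    "1\t2011-03-1 12:45:46\tabc.com/newer",
    "2\t2011-03-1 12:30:04\txyz.com",
    "2\t2011-03-1 12:31:02\txyz.com/fun",
    "2\t2011-03-1 12:36:08\txyz.com/funner",
]
for row in clickpath_rows(sample):
    print(row)
```

Used as a TRANSFORM script, the demo loop would iterate over `sys.stdin` instead of `sample`, and the AS clause in the query would need to list six columns in the same order.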