Bit-level queries in Hadoop and Hive

java hadoop hive mapreduce

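We have a use case for bit-level queries in Hadoop. It goes like this:

Given a set of variable-length records, each containing a date/time stamp and one or more 16-bit data words, return a list of the date/time stamps where some arbitrary combination of bits from one or more arbitrary data words is set to the values specified in the query.

For example, given the following data: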

Timestamp             Word 1 bits                Word 2 bits
------------------    ----------------------     ---------------------          
2017-06-16 08:15:05   0010  1101  1111  0000     1011  0010  1111  0010
2017-06-16 08:15:06   0010  1110  1111  0000     ...
2017-06-16 08:15:07   0010  1101  1111  0000     ...
2017-06-16 08:15:08   0010  1110  1111  0000
2017-06-16 08:15:09   0010  1101  1111  0000
2017-06-16 08:15:10   0010  1110  1111  0000

If the query is "return all timestamps where word 1 bit 0 is 0 and word 1 bit 1 is 1", the result would be:

Timestamp             Word 1 bits
------------------    ----------------------
2017-06-16 08:15:06   0010  1110  1111  0000
2017-06-16 08:15:08   0010  1110  1111  0000
2017-06-16 08:15:10   0010  1110  1111  0000
                              ^^

The data is supplied as hexadecimal values in tab-delimited form:

Timestamp             Word1  Word2  Word3  Word4  
------------------    ----   ----   ----   ----
2017-06-16 08:15:05   2DF0  ... a varying number of 16 bit data words continues out here.
2017-06-16 08:15:06   2EF0
2017-06-16 08:15:07   2DF0
2017-06-16 08:15:08   2EF0
2017-06-16 08:15:09   2DF0
2017-06-16 08:15:10   2EF0
...

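We have been discussing how to represent this data in Hive and how to query it. Putting every bit of every data word into its own integer field seems extremely inefficient, but it has the advantage that Hadoop can query it directly, assuming the Hadoop server can accommodate a variable number of columns per record.

To address this, I proposed importing the data into Hive as first-class timestamps and 16-bit unsigned integers, then building a MapReduce job for each query that uses a bit-extraction Java function to construct a temporary table. That table would have a timestamp field plus each bit of interest in its own first-class integer. The Hadoop query needed to get the final result from the temporary table would then, arguably, be trivial.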
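For illustration, the bit-extraction step in that proposal might look roughly like the sketch below. The class name `HexWordBits` and the `mask`/`expected` parameters are hypothetical and not part of any existing job. The sketch assumes bit 0 is the least significant bit of the 16-bit word, so the two bits marked `^^` in the example result correspond to mask `0x0300`.

```java
/** Sketch (hypothetical helper): test selected bits of one 16-bit word given as hex text. */
final class HexWordBits {

    /**
     * Returns true when the bits selected by mask equal the same bits of expected.
     * Assumption: bit 0 is the least significant bit of the 16-bit word.
     */
    static boolean matches(String hexWord, int mask, int expected) {
        int word = Integer.parseInt(hexWord.trim(), 16) & 0xFFFF;
        return (word & mask) == (expected & mask);
    }

    public static void main(String[] args) {
        int mask = 0x0300;      // selects the two bits marked ^^ in the example result
        int expected = 0x0200;  // left marked bit set, right marked bit clear
        for (String hex : new String[] {"2DF0", "2EF0"}) {
            System.out.println(hex + " matches: " + matches(hex, mask, expected));
        }
    }
}
```

A single mask-and-compare handles any combination of bits within one word, so no per-bit column is needed for the test itself.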

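However, the idea currently being put forward is to save the hexadecimal text directly into the data lake. Our data scientists seem to think that such an arrangement would allow direct queries; that is, no temporary table would be needed, and the hex format provides reasonably efficient storage.

How would that work? Is there some way to index such text and then perform some kind of bit-level text search on it, masking out the bits that are not of interest?

(I will consider suggestions for how this problem might be solved in a better way.)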

"Why don't you ask your data scientists?" is a good question. It's a long story; the short answer is that I don't have ready access to the data scientists, and I'm not a Hadoop expert. Please advise.

Demo

data.tsv

2017-06-16 08:15:05 2DF0
2017-06-16 08:15:06 2EF0    0000
2017-06-16 08:15:07 2DF0    AAAA    BBBB    CCCC
2017-06-16 08:15:08 2EF0    1111    2222
2017-06-16 08:15:09 2DF0    
2017-06-16 08:15:10 2EF0    DDDD    EEEE

Here mytable has two columns: ts (timestamp) and words (a string holding the record's tab-separated data words).

select  *
       
from    mytable
;

+----------------------------+---------------------------+
|             ts             |            words          |
+----------------------------+---------------------------+
| 2017-06-16 08:15:05.000000 | 2DF0                      |
| 2017-06-16 08:15:06.000000 | 2EF0 0000                 |
| 2017-06-16 08:15:07.000000 | 2DF0 AAAA    BBBB    CCCC |
| 2017-06-16 08:15:08.000000 | 2EF0 1111    2222         |
| 2017-06-16 08:15:09.000000 | 2DF0                      |
| 2017-06-16 08:15:10.000000 | 2EF0 DDDD    EEEE         |
+----------------------------+---------------------------+

select  ts
       ,split(words,'\\t')  as words
       
from    mytable
;

+----------------------------+-------------------------------+
|             ts             |             words             |
+----------------------------+-------------------------------+
| 2017-06-16 08:15:05.000000 | ["2DF0"]                      |
| 2017-06-16 08:15:06.000000 | ["2EF0","0000"]               |
| 2017-06-16 08:15:07.000000 | ["2DF0","AAAA","BBBB","CCCC"] |
| 2017-06-16 08:15:08.000000 | ["2EF0","1111","2222"]        |
| 2017-06-16 08:15:09.000000 | ["2DF0",""]                   |
| 2017-06-16 08:15:10.000000 | ["2EF0","DDDD","EEEE"]        |
+----------------------------+-------------------------------+

select  ts
       ,lpad(conv(split(words,'\\t')[0],16,2),16,'0')  as word1_bits
       
from    mytable
;

+----------------------------+------------------+
|             ts             |    word1_bits    |
+----------------------------+------------------+
| 2017-06-16 08:15:05.000000 | 0010110111110000 |
| 2017-06-16 08:15:06.000000 | 0010111011110000 |
| 2017-06-16 08:15:07.000000 | 0010110111110000 |
| 2017-06-16 08:15:08.000000 | 0010111011110000 |
| 2017-06-16 08:15:09.000000 | 0010110111110000 |
| 2017-06-16 08:15:10.000000 | 0010111011110000 |
+----------------------------+------------------+
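The conv/lpad expression above expands each hex word into a 16-character bit string. For readers more comfortable with Java, roughly the same transformation can be sketched as follows. `HexToBits` and `toBits16` are hypothetical names, and this is only an illustration of the result, not of how Hive implements `conv`.

```java
/** Sketch (hypothetical helper): plain-Java counterpart of lpad(conv(x,16,2),16,'0'). */
final class HexToBits {

    /** Expands a hex word of at most 16 bits into a zero-padded 16-character bit string. */
    static String toBits16(String hexWord) {
        int word = Integer.parseInt(hexWord.trim(), 16) & 0xFFFF;
        String bits = Integer.toBinaryString(word);
        // Left-pad with spaces to width 16, then turn the padding into zeros.
        return String.format("%16s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        for (String hex : new String[] {"2DF0", "2EF0"}) {
            System.out.println(hex + " -> " + toBits16(hex));
        }
    }
}
```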

select  ts
       
from    mytable

where   substr(lpad(conv(split(words,'\\t')[0],16,2),16,'0'),7,2) = '10'
;

+----------------------------+
|             ts             |
+----------------------------+
| 2017-06-16 08:15:06.000000 |
| 2017-06-16 08:15:08.000000 |
| 2017-06-16 08:15:10.000000 |
+----------------------------+
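The same filter can also be expressed with integer masks instead of substring comparison: `substr(..., 7, 2) = '10'` selects the same two bits as mask `0x0300` with expected value `0x0200`. The sketch below applies such constraints to tab-separated records with a varying number of words. `BitFilter` and `Constraint` are hypothetical names, the tab between the timestamp and the first word is an assumption about the file layout, and treating a record that lacks the requested word as a non-match is a design choice made only for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch (hypothetical helper): filter tab-separated records by bit constraints. */
final class BitFilter {

    /** Requires (word & mask) == expected for the word at wordIndex (0 = first word). */
    static final class Constraint {
        final int wordIndex;
        final int mask;
        final int expected;

        Constraint(int wordIndex, int mask, int expected) {
            this.wordIndex = wordIndex;
            this.mask = mask;
            this.expected = expected & mask;
        }
    }

    /** Returns the timestamps of all lines whose words satisfy every constraint. */
    static List<String> matchingTimestamps(List<String> lines, List<Constraint> constraints) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            // Assumption: field 0 is the timestamp, fields 1..n are the hex words.
            String[] fields = line.split("\t", -1);
            if (satisfiesAll(fields, constraints)) {
                result.add(fields[0]);
            }
        }
        return result;
    }

    private static boolean satisfiesAll(String[] fields, List<Constraint> constraints) {
        for (Constraint c : constraints) {
            int pos = c.wordIndex + 1;
            // Design choice: a record that lacks the requested word does not match.
            if (pos >= fields.length || fields[pos].trim().isEmpty()) {
                return false;
            }
            int word = Integer.parseInt(fields[pos].trim(), 16) & 0xFFFF;
            if ((word & c.mask) != c.expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("2017-06-16 08:15:05\t2DF0");
        lines.add("2017-06-16 08:15:06\t2EF0\t0000");
        lines.add("2017-06-16 08:15:08\t2EF0\t1111\t2222");

        List<Constraint> query = new ArrayList<>();
        query.add(new Constraint(0, 0x0300, 0x0200));
        System.out.println(matchingTimestamps(lines, query));
    }
}
```

Adding further `Constraint` entries for other word indexes extends the query to bits in several words at once.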

Alternative data structure

create external table mytable
(
    ts          timestamp
   ,word1       string
   ,word2       string
   ,word3       string
   ,word4       string
   ,word5       string
   ,word6       string
   ,word7       string
   ,word8       string
   ,word9       string
)
row format delimited
fields terminated by '\t'
stored as textfile
;

select * from mytable
;

+----------------------------+-------+--------+--------+--------+--------+--------+--------+--------+--------+
|             ts             | word1 | word2  | word3  | word4  | word5  | word6  | word7  | word8  | word9  |
+----------------------------+-------+--------+--------+--------+--------+--------+--------+--------+--------+
| 2017-06-16 08:15:05.000000 | 2DF0  | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:06.000000 | 2EF0  | 0000   | (null) | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:07.000000 | 2DF0  | AAAA   | BBBB   | CCCC   | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:08.000000 | 2EF0  | 1111   | 2222   | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:09.000000 | 2DF0  |        | (null) | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:10.000000 | 2EF0  | DDDD   | EEEE   | (null) | (null) | (null) | (null) | (null) | (null) |
+----------------------------+-------+--------+--------+--------+--------+--------+--------+--------+--------+