Hive: combining column values based on a condition

I would like to know whether column values can be combined based on a condition. Let me explain.

Suppose my data looks like this:

Id name offset
1 Jan 100
2 Janssen 104
3 Klaas 150
4 Jan 160
5 Janssen 164
My output should look like this:

Id fullname offsets
1 Jan Janssen [ 100, 160 ]
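I want to merge the name values from two rows when the two names are no more than 1 character apart, judged by their offsets.

My question is whether this kind of data manipulation is possible and, if so, whether someone could share some code and an explanation.

Please be gentle; the code below returns something close to what I want: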

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    public class CombinePersons {
        public static void main(String[] args) {
            List<String> persons = new ArrayList<>();
            String previous = "";

            // Sample lines from entities.txt (fields 3 and 4 are length and offset):
            // USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,10660
            // USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,10685
            File file = new File("entities.txt");

            // try-with-resources closes the Scanner once every line has been read.
            try (Scanner scanner = new Scanner(file)) {
                while (scanner.hasNextLine()) {
                    // Prime "previous" with the first line of the file.
                    if (previous == null || previous.isEmpty()) {
                        previous = scanner.nextLine();
                        if (!scanner.hasNextLine()) {
                            break; // only one line: nothing to compare against
                        }
                    }

                    String current = scanner.nextLine();
                    String[] prev = previous.split(",");
                    String[] cur = current.split(",");

                    // Two names are adjacent when the current offset is exactly one
                    // character past the end of the previous name (length + offset).
                    int x = Integer.parseInt(prev[3]) + Integer.parseInt(prev[4]);
                    int y = Integer.parseInt(cur[4]);
                    if (y - x == 1) {
                        persons.add(prev[1] + " " + cur[1]);
                        if (scanner.hasNextLine()) {
                            current = scanner.nextLine();
                        }
                    } else {
                        persons.add(prev[1]);
                    }
                    previous = current;
                }
            } catch (Exception e) {
                e.printStackTrace();
            }

            for (String person : persons) {
                System.out.println(person);
            }
        }
    }
It produces this output:

Richard Marottoli
Marottoli
Marottoli
Marottoli
Berkowitz
Berkowitz
Marottoli
Lea
Lea
Ken
Marottoli
Berkowitz
Lea
Stephanie Putt

Load the data into a table using the CREATE TABLE statement below:

DROP TABLE IF EXISTS default.stack;
CREATE EXTERNAL TABLE default.stack
(junk STRING,
 name STRING,
 cat  STRING,
 len  INT,
 off  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://nameservice1/....';
Then use the query below to get the desired output:

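-- For each starting offset keep max(name): the single name is a prefix of the
-- combined name, so it sorts lower and the combined name wins.
select max(name), off from (
select CASE when b.name is not null then
            -- a starts one character after b ends, so join the two names
            concat(b.name," ",a.name)
            else
            a.name
       end as name
       -- report the offset of the first name of the pair
       ,Case WHEN b.off1 is not null
             then b.off1
             else a.off
        end as off
from default.stack a
-- b.off is the offset at which a directly following name would start;
-- b.off1 is the row's original offset
left outer join (select name 
                       ,len+off+ 1 as off
                       ,off as off1
                 from default.stack) b
on a.off = b.off ) t
group by off
order by off;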

I have tested this, and it produces the result you want.
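To make the join condition easier to follow, here is a minimal in-memory Java sketch of the same idea, not the answer's actual implementation. It assumes each row carries a length equal to the number of characters in the name (the sample table in the question has no length column), and the names `AdjacentNames`, `Row`, `sampleRows` and `combine` are hypothetical. A row `a` is paired with a row `b` when `a.off == b.off + b.len + 1`, and the greatest name per starting offset is kept, as `max(name)` with `group by off` does in the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AdjacentNames {
    /** One input row: name, length in characters, and start offset. */
    static final class Row {
        final String name;
        final int len;
        final int off;

        Row(String name, int off) {
            this.name = name;
            this.len = name.length(); // assumption: length is the name's character count
            this.off = off;
        }
    }

    /** Sample rows from the question. */
    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row("Jan", 100), new Row("Janssen", 104), new Row("Klaas", 150),
                new Row("Jan", 160), new Row("Janssen", 164));
    }

    /** Pairs each row with the row that ends one character before it starts. */
    static List<String> combine(List<Row> rows) {
        // Index every row by the offset at which a directly following name would start.
        Map<Integer, Row> byNextOffset = new HashMap<>();
        for (Row b : rows) {
            byNextOffset.put(b.off + b.len + 1, b);
        }

        // TreeMap keeps the keys sorted, which plays the role of "order by off".
        Map<Integer, String> bestByOffset = new TreeMap<>();
        for (Row a : rows) {
            Row b = byNextOffset.get(a.off); // null when nothing directly precedes a
            String name = (b != null) ? b.name + " " + a.name : a.name;
            int off = (b != null) ? b.off : a.off;
            // Keep the lexicographically greatest name per offset, like max(name).
            bestByOffset.merge(off, name, (x, y) -> x.compareTo(y) >= 0 ? x : y);
        }

        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> e : bestByOffset.entrySet()) {
            out.add(e.getKey() + " " + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : combine(sampleRows())) {
            System.out.println(line);
        }
    }
}
```

On the question's sample rows this prints `100 Jan Janssen`, `150 Klaas` and `160 Jan Janssen`. Note that with three or more consecutive names, this pairing (like the query) would produce overlapping pairs rather than one long name.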

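I am a little confused about how the output is derived, but I think this is very similar to my answer on using custom map/reduce in Hive.

I have edited my question with a piece of Java code, sample data, and output. I would like to convert the Java code into Hive code. Do you know whether that is feasible?

Sorry, your additional code still does not make clear what you are trying to accomplish. The newer code and data look as if you want to load a table into Hive and extract columns (which is certainly possible), whereas the earlier example combines rows in some way.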