Hive: combining column values based on a condition
I would like to know whether it is possible to combine column values based on a condition. Let me explain. Suppose my data looks like this:
Id name offset
1 Jan 100
2 Janssen 104
3 Klaas 150
4 Jan 160
5 Janssen 164
My output should look like this:
Id fullname offsets
1 Jan Janssen [ 100, 160 ]
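I want to merge the name values of two rows when their offsets are only one character apart, that is, when the second name starts one character after the first name ends.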
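As a minimal sketch of that rule applied to the sample rows above: the class name `AdjacencyRule` and the helper `isAdjacent` are hypothetical, and the length of each name is assumed to be simply the number of characters in the string.

```java
class AdjacencyRule {
    // Hypothetical helper: true when the next name starts exactly one
    // character after the current name ends (offset + length + 1).
    static boolean isAdjacent(String name, int offset, int nextOffset) {
        return offset + name.length() + 1 == nextOffset;
    }

    public static void main(String[] args) {
        // Sample rows from the question (name and offset columns).
        String[] names = {"Jan", "Janssen", "Klaas", "Jan", "Janssen"};
        int[] offsets = {100, 104, 150, 160, 164};
        for (int i = 0; i + 1 < names.length; i++) {
            if (isAdjacent(names[i], offsets[i], offsets[i + 1])) {
                // Prints "Jan Janssen @ 100" and "Jan Janssen @ 160"
                System.out.println(names[i] + " " + names[i + 1] + " @ " + offsets[i]);
            }
        }
    }
}
```

Only the Jan/Janssen pairs satisfy the check (100 + 3 + 1 = 104 and 160 + 3 + 1 = 164); Klaas has no neighbour and stays on its own.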
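My question is whether this kind of data manipulation is possible and, if so, whether someone could share some code and an explanation.
Please be gentle. This code returns some of what I want: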
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class MergePersons {
    public static void main(String[] args) {
        List<String> persons = new ArrayList<>();
        // Sample lines from entities.txt (fields: id, name, category, length, offset):
        //USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,10660
        //USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,10685
        File file = new File("entities.txt");
        // try-with-resources closes the Scanner even if parsing fails.
        try (Scanner scanner = new Scanner(file)) {
            // Read the first line once, and only if the file is not empty.
            String previous = scanner.hasNextLine() ? scanner.nextLine() : null;
            while (previous != null && scanner.hasNextLine()) {
                String current = scanner.nextLine();
                // Split each line once instead of on every field access.
                String[] prev = previous.split(",");
                String[] curr = current.split(",");
                // End of the previous name = its length + its offset.
                int prevEnd = Integer.parseInt(prev[3]) + Integer.parseInt(prev[4]);
                int currStart = Integer.parseInt(curr[4]);
                if (currStart - prevEnd == 1) {
                    // The names are one character apart: merge them.
                    persons.add(prev[1] + " " + curr[1]);
                    if (scanner.hasNextLine()) {
                        current = scanner.nextLine();
                    }
                } else {
                    persons.add(prev[1]);
                }
                previous = current;
            }
            // Note: a line still held in 'previous' here is never added to persons.
        } catch (Exception e) {
            e.printStackTrace();
        }
        for (String person : persons) {
            System.out.println(person);
        }
    }
}
It produces this output:
Richard Marottoli
Marottoli
Marottoli
Marottoli
Berkowitz
Berkowitz
Marottoli
Lea
Lea
Ken
Marottoli
Berkowitz
Lea
Stephanie Putt
Load the table using the CREATE TABLE statement below:
drop table if exists default.stack;
create external table default.stack
(junk string,
name string,
cat string,
len int,
off int
)
ROW FORMAT DELIMITED
FIELDS terminated by ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://nameservice1/....';
Use the query below to get the desired output:
select max(name) as name, off
from (
    select case when b.name is not null
                then concat(b.name, " ", a.name)
                else a.name
           end as name,
           case when b.off1 is not null
                then b.off1
                else a.off
           end as off
    from default.stack a
    left outer join (
        select name,
               len + off + 1 as off,  -- offset at which an adjacent name would start
               off as off1            -- original offset of this row
        from default.stack
    ) b on a.off = b.off
) merged
group by off
order by off;
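I have tested this, and it produces the result you want.

I'm a little confused about how the output is derived, but I think it is very similar to my answer on using custom map/reduce in Hive.

I have edited my question with a piece of Java code, sample data, and output. I want to convert the Java code to Hive. Do you know whether that is possible?

Sorry, your additional code still does not make clear what you are trying to accomplish. The newer code and data look like you want to load a table into Hive and extract columns (which is certainly possible), whereas the earlier part combines rows in some way.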
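To make the connection between the Java code and the Hive query easier to follow, here is a rough in-memory sketch of what the self-join does, run on the sample rows from the question. The class `MergeAdjacentNames`, the `Row` type, and the `merge` method are hypothetical names introduced for illustration. The final step, which collects the offsets per merged name, is not part of the query above; it is an assumption about how the `offsets` column in the desired output might be built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MergeAdjacentNames {
    // Hypothetical row type mirroring the name, len and off columns of default.stack.
    static final class Row {
        final String name;
        final int len;
        final int off;
        Row(String name, int len, int off) {
            this.name = name;
            this.len = len;
            this.off = off;
        }
    }

    static Map<String, List<Integer>> merge(List<Row> rows) {
        // Subquery b: index each row by the offset where an adjacent name would start.
        Map<Integer, Row> byNextOff = new HashMap<>();
        for (Row b : rows) {
            byNextOff.put(b.len + b.off + 1, b);
        }

        // Left outer join on a.off = b.off, then group by off keeping max(name).
        Map<Integer, String> nameByOff = new TreeMap<>();
        for (Row a : rows) {
            Row b = byNextOff.get(a.off);
            String name = (b != null) ? b.name + " " + a.name : a.name;
            int off = (b != null) ? b.off : a.off;
            nameByOff.merge(off, name, (x, y) -> x.compareTo(y) >= 0 ? x : y);
        }

        // Extra step, not in the query: collect the offsets of each merged name.
        Map<String, List<Integer>> offsetsByName = new TreeMap<>();
        for (Map.Entry<Integer, String> e : nameByOff.entrySet()) {
            offsetsByName.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return offsetsByName;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("Jan", 3, 100),
            new Row("Janssen", 7, 104),
            new Row("Klaas", 5, 150),
            new Row("Jan", 3, 160),
            new Row("Janssen", 7, 164));
        // Prints {Jan Janssen=[100, 160], Klaas=[150]}
        System.out.println(merge(rows));
    }
}
```

The `max(name)` step is what discards the bare "Jan" in favour of "Jan Janssen" at the same offset, since the longer string sorts after its own prefix. Note that unpaired names such as Klaas remain in the result, so they would need to be filtered out if only merged names are wanted.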