SQL-like queries on HBase using Spark
I am trying to run SQL-like queries on HBase using Spark. The queries work, but only on very small data sets; as soon as the data set grows, Spark takes a very long time to finish the job. The HBase table has 3 million rows. Please find the code below:
JavaPairRDD<ImmutableBytesWritable, Result> pairRdd = ctx
        .newAPIHadoopRDD(conf, TableInputFormat.class,
                ImmutableBytesWritable.class,
                org.apache.hadoop.hbase.client.Result.class)
        .filter(new Function<Tuple2<ImmutableBytesWritable, Result>, Boolean>() {
            public Boolean call(Tuple2<ImmutableBytesWritable, Result> v1)
                    throws Exception {
                long time = Bytes.toLong(v1._2.getValue(
                        Bytes.toBytes("si"), Bytes.toBytes("at")));
                return time > 1407314522 && time < 1407814522;
            }
        });
JavaRDD<Person> people = pairRdd
        .map(new Function<Tuple2<ImmutableBytesWritable, Result>, Person>() {
            public Person call(Tuple2<ImmutableBytesWritable, Result> v1)
                    throws Exception {
                System.out.println("coming");
                Person person = new Person();
                person.setCalling(Bytes.toLong(v1._2.getRow()));
                person.setCalled(Bytes.toLong(v1._2.getValue(
                        Bytes.toBytes("si"), Bytes.toBytes("called"))));
                person.setTime(Bytes.toLong(v1._2.getValue(
                        Bytes.toBytes("si"), Bytes.toBytes("at"))));
                return person;
            }
        });
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
schemaPeople.registerAsTable("people");
// SQL can be run over RDDs that have been registered as tables.
JavaSchemaRDD teenagers = sqlCtx
.sql("SELECT calling,called FROM people WHERE time >1407314522");
teenagers.printSchema();
List<Map<Long, Long>> teenagerNames = teenagers.map(
        new Function<Row, Map<Long, Long>>() {
            public Map<Long, Long> call(Row row) {
                Map<Long, Long> tmpMap = new HashMap<Long, Long>();
                tmpMap.put(row.getLong(0), row.getLong(1));
                return tmpMap;
            }
        }).collect();
/*
* for (String name: teenagerNames) { System.out.println(name); }
*/
for (Map<Long, Long> teenagerNamestmp : teenagerNames) {
    for (Map.Entry<Long, Long> entry : teenagerNamestmp.entrySet()) {
        System.out.println(entry.getKey() + "/" + entry.getValue());
    }
}
I don't know whether I am missing some configuration settings.
Any pointers would be a great help.
Thanks.

TableInputFormat is Map/Reduce based, so it will be much slower. Look for a solution driven by the native HBase API in a future version of Spark. How many "teenagers" (matching rows) does your HBase table contain? Also, how are you running the Spark job: on a cluster, or with a local master?
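In the meantime, the biggest win with TableInputFormat is usually to push the time-range predicate down to the HBase region servers, so that filtered-out rows are never shipped to Spark at all. Below is a rough sketch of how the `conf` passed to `newAPIHadoopRDD` could be built with a server-side `Scan` filter; it assumes the same `si:at` / `si:called` columns as the question and an HBase 0.94/0.98-era client API, so class and method names should be double-checked against your HBase version. Note that `SingleColumnValueFilter` compares raw bytes, which matches numeric order for the non-negative epoch seconds used here. This is a configuration fragment, not a complete job:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanPushDown {

    // Builds a job configuration whose Scan filters rows on the region servers.
    // The table name "people" is a placeholder for your actual table.
    public static Configuration buildConf() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "people");

        Scan scan = new Scan();
        // Only fetch the columns the job actually reads.
        scan.addColumn(Bytes.toBytes("si"), Bytes.toBytes("at"));
        scan.addColumn(Bytes.toBytes("si"), Bytes.toBytes("called"));
        scan.setCaching(500);        // more rows per RPC round-trip
        scan.setCacheBlocks(false);  // don't pollute the block cache on a full scan

        // Evaluate the time window server-side instead of in Spark's filter().
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(new SingleColumnValueFilter(
                Bytes.toBytes("si"), Bytes.toBytes("at"),
                CompareOp.GREATER, Bytes.toBytes(1407314522L)));
        filters.addFilter(new SingleColumnValueFilter(
                Bytes.toBytes("si"), Bytes.toBytes("at"),
                CompareOp.LESS, Bytes.toBytes(1407814522L)));
        scan.setFilter(filters);

        // Serialize the Scan into the property that TableInputFormat reads.
        conf.set(TableInputFormat.SCAN,
                TableMapReduceUtil.convertScanToString(scan));
        return conf;
    }
}
```

With a conf built like this, the Spark-side `.filter(...)` over the `JavaPairRDD` becomes redundant and can be dropped; only rows inside the time window cross the network.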