SQL-like queries on HBase using Spark
I am trying to run SQL-like queries on HBase using Spark. The queries work, but only on very small data sets; as soon as the data set grows, Spark takes a very long time to finish the job. The HBase table has 3 million rows. Please find the code below:
JavaPairRDD<ImmutableBytesWritable, Result> pairRdd = ctx
        .newAPIHadoopRDD(conf, TableInputFormat.class,
                ImmutableBytesWritable.class,
                org.apache.hadoop.hbase.client.Result.class)
        .filter(new Function<Tuple2<ImmutableBytesWritable, Result>, Boolean>() {
            public Boolean call(Tuple2<ImmutableBytesWritable, Result> v1)
                    throws Exception {
                long time = Bytes.toLong(v1._2.getValue(
                        Bytes.toBytes("si"), Bytes.toBytes("at")));
                return time > 1407314522 && time < 1407814522;
            }
        });
JavaRDD<Person> people = pairRdd
        .map(new Function<Tuple2<ImmutableBytesWritable, Result>, Person>() {
            public Person call(Tuple2<ImmutableBytesWritable, Result> v1)
                    throws Exception {
                System.out.println("coming");
                Person person = new Person();
                person.setCalling(Bytes.toLong(v1._2.getRow()));
                person.setCalled(Bytes.toLong(v1._2.getValue(
                        Bytes.toBytes("si"), Bytes.toBytes("called"))));
                person.setTime(Bytes.toLong(v1._2.getValue(
                        Bytes.toBytes("si"), Bytes.toBytes("at"))));
                return person;
            }
        });
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
schemaPeople.registerAsTable("people");
// SQL can be run over RDDs that have been registered as tables.
JavaSchemaRDD teenagers = sqlCtx
.sql("SELECT calling,called FROM people WHERE time >1407314522");
teenagers.printSchema();
List<Map<Long, Long>> teenagerNames = teenagers.map(
        new Function<Row, Map<Long, Long>>() {
            public Map<Long, Long> call(Row row) {
                Map<Long, Long> tmpMap = new HashMap<Long, Long>();
                tmpMap.put(row.getLong(0), row.getLong(1));
                return tmpMap;
            }
        }).collect();
/*
* for (String name: teenagerNames) { System.out.println(name); }
*/
for (Map<Long, Long> teenagerNamestmp : teenagerNames) {
    for (Map.Entry<Long, Long> entry : teenagerNamestmp.entrySet()) {
        System.out.println(entry.getKey() + "/" + entry.getValue());
    }
}
I don't know whether I am missing some configuration settings.
Any pointers would be a great help.
Thanks.

TableInputFormat is Map/Reduce based, so it will be much slower. Look for a solution driven by the native HBase API in a future version of Spark. How many "teenagers" (matching rows) does your HBase table contain? Also, how are you running the Spark job: on a cluster, or with a local master?
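In the meantime, the biggest win with TableInputFormat is usually to push the time-range predicate down to the HBase region servers, so that filtered-out rows are never shipped to Spark at all. Below is a rough sketch of how the `conf` passed to `newAPIHadoopRDD` could be built with a server-side `Scan` filter; it assumes the same `si:at` / `si:called` columns as the question and an HBase 0.94/0.98-era client API, so class and method names should be double-checked against your HBase version. Note that `SingleColumnValueFilter` compares raw bytes, which matches numeric order for the non-negative epoch seconds used here. This is a configuration fragment, not a complete job:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanPushDown {

    // Builds a job configuration whose Scan filters rows on the region servers.
    // The table name "people" is a placeholder for your actual table.
    public static Configuration buildConf() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "people");

        Scan scan = new Scan();
        // Only fetch the columns the job actually reads.
        scan.addColumn(Bytes.toBytes("si"), Bytes.toBytes("at"));
        scan.addColumn(Bytes.toBytes("si"), Bytes.toBytes("called"));
        scan.setCaching(500);        // more rows per RPC round-trip
        scan.setCacheBlocks(false);  // don't pollute the block cache on a full scan

        // Evaluate the time window server-side instead of in Spark's filter().
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(new SingleColumnValueFilter(
                Bytes.toBytes("si"), Bytes.toBytes("at"),
                CompareOp.GREATER, Bytes.toBytes(1407314522L)));
        filters.addFilter(new SingleColumnValueFilter(
                Bytes.toBytes("si"), Bytes.toBytes("at"),
                CompareOp.LESS, Bytes.toBytes(1407814522L)));
        scan.setFilter(filters);

        // Serialize the Scan into the property that TableInputFormat reads.
        conf.set(TableInputFormat.SCAN,
                TableMapReduceUtil.convertScanToString(scan));
        return conf;
    }
}
```

With a conf built like this, the Spark-side `.filter(...)` over the `JavaPairRDD` becomes redundant and can be dropped; only rows inside the time window cross the network.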