Scala Spark scan HBase: does scanning a column reduce efficiency?


Today I used Spark to scan HBase. My HBase table has one column family named "cf", and "cf" contains 25 columns. I want to scan only one of those columns, e.g. column8, so I set up the HBase conf like this:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes

    val cf = "cf"     // column family name
    val i = "column8" // qualifier of the single column to read

    val myConf = HBaseConfiguration.create()
    myConf.set("hbase.zookeeper.quorum", "compute000,compute001,compute002")
    myConf.set("hbase.master", "10.10.10.10:60000")
    myConf.set("hbase.zookeeper.property.clientPort", "2181")
    myConf.set("hbase.defaults.for.version.skip", "true")
    myConf.set(TableInputFormat.INPUT_TABLE, table)
    myConf.set(TableInputFormat.SCAN_COLUMNS, "cf:column8")

    val hbaseRDD = sc.newAPIHadoopRDD(myConf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    // NOTE: getValue returns null if the row has no cf:column8 cell
    val newHbaseRDD = hbaseRDD.map { case (_, result) =>
      Array(Bytes.toString(result.getValue(cf.getBytes, i.getBytes)).toDouble)
    }

    newHbaseRDD // RDD[Array[Double]]
This takes me 30 minutes. But if I do not set the scan column, it takes only 4 minutes.

What is wrong? Should I not set the "SCAN_COLUMNS" parameter?

Can you help me? Thanks.
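
Setting SCAN_COLUMNS should normally shrink what each RegionServer returns, so a slowdown this large usually points at something else in the scan setup. One common culprit is scanner caching: once each row carries only one small cell, every scanner RPC fetches very little data unless the caching value is raised. Below is a minimal sketch, not the asker's exact job, that builds an explicit Scan, restricts it to the one column, raises caching, and hands the whole Scan to TableInputFormat via its serialized SCAN property (which takes precedence over SCAN_COLUMNS); the caching value of 500 is an arbitrary starting point to tune.

    import org.apache.hadoop.hbase.client.Scan
    import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
    import org.apache.hadoop.hbase.util.Bytes

    val scan = new Scan()
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column8")) // only this qualifier
    scan.setCaching(500)       // rows per RPC; 500 is an assumed starting point, tune it
    scan.setCacheBlocks(false) // commonly disabled for full-table batch scans

    // TableInputFormat also accepts a whole Scan, serialized to a string
    myConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))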


UPDATE: when I use this code and put some columns into the Scan, the application fails with this error:

    ERROR TaskSetManager: Task 22 in stage 0.0 failed 4 times; aborting job
    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 0.0 failed 4 times, most recent failure: Lost task 22.3 in stage 0.0 (TID 234, compute031): java.lang.NullPointerException
    at no1.no1$$anonfun$9.apply(no1.scala:137)
    at no1.no1$$anonfun$9.apply(no1.scala:137)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:201)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
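
The stack trace points at the map step, and a likely cause is this: for a row that has no cf:column8 cell, result.getValue returns null, Bytes.toString(null) returns null, and calling .toDouble on that null String throws exactly this NullPointerException. A hedged defensive rewrite of the map step, which simply skips rows where the cell is missing, could look like this:

    // Sketch: use flatMap + Option so rows without the cell are dropped
    // instead of throwing a NullPointerException.
    val safeRDD = hbaseRDD.flatMap { case (_, result) =>
      Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column8")))
        .map(bytes => Array(Bytes.toString(bytes).toDouble))
    }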

Hi, I tried a few examples but they made no difference for me. I do suggest one thing, though: submit the spark-submit job in debug mode. That is, update log4j.properties under (/conf) so it logs at the DEBUG level; if the file does not exist, copy it from the existing .template file, and make sure you actually see debug messages. Then inspect the job and see whether it gives any clue. I am currently working on an HBase and Spark environment myself, so let's see what it turns up. Also, how big is your table? How many column families, and how many columns?
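
For reference, the log4j change suggested above would look roughly like this; copy conf/log4j.properties.template to conf/log4j.properties first if the file does not exist (the exact default line varies by Spark version):

    # conf/log4j.properties -- raise Spark's root logger from INFO to DEBUG
    log4j.rootCategory=DEBUG, console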