Scala Spark scan of HBase: does setting scan columns reduce efficiency?
Today I used Spark to scan HBase. My HBase table has one column family named "cf", which contains 25 columns. I want to scan only one of those columns, for example column8, so I set up the HBase configuration like this:
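import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// Connection settings for the HBase cluster
val myConf = HBaseConfiguration.create()
myConf.set("hbase.zookeeper.quorum", "compute000,compute001,compute002")
myConf.set("hbase.master", "10.10.10.10:60000")
myConf.set("hbase.zookeeper.property.clientPort", "2181")
myConf.set("hbase.defaults.for.version.skip", "true")
// Table to read, and the single column ("family:qualifier") the scan is restricted to
myConf.set(TableInputFormat.INPUT_TABLE, table)
myConf.set(TableInputFormat.SCAN_COLUMNS, "cf:column8")
val hbaseRDD = sc.newAPIHadoopRDD(myConf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
// cf and i are presumably the column family and qualifier names, defined elsewhere in the program
val newHbaseRDD = hbaseRDD.map { case (_, result) =>
  Array(Bytes.toString(result.getValue(cf.getBytes, i.getBytes)).toDouble)
}
newHbaseRDD // RDD[Array[Double]]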
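This takes 30 minutes, but if I do not set the scan columns it takes only 4 minutes.
What is wrong? Should I not set the SCAN_COLUMNS parameter?
Can you help me? Thanks.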
Update: when I use this code and put several columns into the scan, the application fails with this error:
ERROR TaskSetManager: Task 22 in stage 0.0 failed 4 times;
aborting job
Exception in thread "main" org.apache.spark.SparkException:
Job aborted due to stage failure: Task 22 in stage 0.0 failed 4 times,
most recent failure: Lost task 22.3 in stage 0.0 (TID 234, compute031): java.lang.NullPointerException
at no1.no1$$anonfun$9.apply(no1.scala:137)
at no1.no1$$anonfun$9.apply(no1.scala:137)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:201)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler
$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
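Hi, I tried a few examples, but it made no difference for me. I would suggest one thing, though: submit the Spark job in debug mode. By that I mean update log4j.properties in (/conf) so that it logs at the DEBUG level. If the file does not exist, copy it from the existing .template file, and make sure you actually see debug messages. Then check whether the job's output gives any clues. I am currently working in an HBase and Spark environment myself, so let's see what it does. Also, how big is your table? How many column families and how many columns does it have?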
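On the NullPointerException in the update: one possible explanation, offered as an assumption and not a confirmed diagnosis, is that the map function reads a column that is missing from some rows or was not included in SCAN_COLUMNS. In that case result.getValue returns null, Bytes.toString passes the null through, and calling toDouble on it throws. The sketch below shows a null-safe conversion using only the Scala standard library; cellToDouble and NullSafeCell are hypothetical names for illustration, not part of HBase or Spark.

```scala
import java.nio.charset.StandardCharsets
import scala.util.Try

object NullSafeCell {
  // Hypothetical helper (not an HBase or Spark API): converts the raw bytes of
  // one cell into a Double. Returns None when the cell is absent (null) or
  // when its text is not a valid number.
  def cellToDouble(raw: Array[Byte]): Option[Double] =
    Option(raw) // a null byte array becomes None
      .map(bytes => new String(bytes, StandardCharsets.UTF_8))
      .flatMap(text => Try(text.trim.toDouble).toOption)

  def main(args: Array[String]): Unit = {
    val present = "3.5".getBytes(StandardCharsets.UTF_8)
    val missing: Array[Byte] = null // what a lookup of an absent column yields
    println(cellToDouble(present)) // prints Some(3.5)
    println(cellToDouble(missing)) // prints None
  }
}
```

In the job from the question, the bytes returned by result.getValue(...) could be passed to such a helper, with flatMap used in place of map so that rows lacking the column are skipped instead of failing the task. Whether skipping or substituting a default value is appropriate depends on the data.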