Spark cannot retrieve all HBase data from a specific column

hadoop apache-spark mapreduce hbase

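My HBase table has 30 million records, and each record has a column raw:sample, where raw is the column family and sample is the column qualifier. This column is very large, ranging from a few KB to 50 MB. When I run the following Spark code, it retrieves only 40 thousand records, but I should get 30 million: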

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.rdd.RDD

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "10.1.1.15:2181")
conf.set(TableInputFormat.INPUT_TABLE, "sampleData")
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample")
conf.set("hbase.client.keyvalue.maxsize", "0")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
val arrRdd: RDD[Map[String, Object]] = hBaseRDD.map(tuple => tuple._2).map(...)

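For now, I work around this by first fetching the list of ids and then iterating over it inside a Spark foreach, using the plain HBase Java client to fetch the raw:sample column for each id. Any ideas why I cannot get all of the raw:sample values through Spark? Is it because this column is too large?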

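A few days ago one of my ZooKeeper nodes and some datanodes went down, but I fixed it quickly, and the replication factor is 3. Could that be the cause? Would running hbck -repair help? Thanks a lot.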

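Internally, TableInputFormat creates a Scan object to retrieve data from HBase.

Try creating a Scan yourself (without Spark), configured to retrieve the same column from HBase, and see whether the error reproduces: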

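import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTable {
  public static void main(String[] args) throws Exception {
    // Instantiate the configuration (reads hbase-site.xml from the classpath)
    Configuration config = HBaseConfiguration.create();

    // Restrict the scan to the required columns
    // (substitute your own family and qualifier, e.g. raw and sample)
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
    scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));

    // Open the table (substitute your own, e.g. "sampleData") and the scanner.
    // try-with-resources closes both even if the scan fails part-way.
    // Note: this HTable constructor is deprecated in newer HBase releases.
    try (HTable table = new HTable(config, "emp");
         ResultScanner scanner = table.getScanner(scan)) {
      long rows = 0;
      // Read rows until the scanner is exhausted
      for (Result result = scanner.next(); result != null; result = scanner.next()) {
        System.out.println("Found row : " + result);
        rows++;
      }
      // Compare this count with the number of rows Spark returned
      System.out.println("Total rows: " + rows);
    }
  }
}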

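Also, by default TableInputFormat is configured to request data from the HBase server in very small chunks, which is inefficient and causes a lot of overhead, so the amount fetched per request should be increased.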
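The specific settings for the chunk size are not shown in this answer. As a rough sketch only, and assuming the knobs in question are the scanner caching (rows fetched per request) and the client-side maximum result size, the snippet below picks a row count that keeps the worst-case payload of one request within a memory budget. The class name ScanTuning, its method names, and the 256 MB budget are hypothetical; the two property names are the ones HBase defines as TableInputFormat.SCAN_CACHEDROWS and HConstants.HBASE_CLIENT_SCANNER_MAX_RESULT_SIZE_KEY.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: derives scan settings from a per-request memory budget.
final class ScanTuning {
    // Property names defined by HBase (TableInputFormat.SCAN_CACHEDROWS and
    // HConstants.HBASE_CLIENT_SCANNER_MAX_RESULT_SIZE_KEY).
    static final String SCAN_CACHEDROWS = "hbase.mapreduce.scan.cachedrows";
    static final String MAX_RESULT_SIZE = "hbase.client.scanner.max.result.size";

    /** Rows per request such that even the largest rows fit within budgetBytes. */
    static int rowsPerRpc(long budgetBytes, long maxRowBytes) {
        if (budgetBytes <= 0 || maxRowBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long rows = budgetBytes / maxRowBytes;
        // Always fetch at least one row, and stay within int range.
        return (int) Math.max(1L, Math.min(rows, (long) Integer.MAX_VALUE));
    }

    /** Property name to value, ready to be copied into the job configuration. */
    static Map<String, String> scanSettings(long budgetBytes, long maxRowBytes) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(SCAN_CACHEDROWS, String.valueOf(rowsPerRpc(budgetBytes, maxRowBytes)));
        settings.put(MAX_RESULT_SIZE, String.valueOf(budgetBytes));
        return settings;
    }

    public static void main(String[] args) {
        long budget = 256L * 1024 * 1024; // assumption: 256 MB per request is affordable
        long maxRow = 50L * 1024 * 1024;  // largest raw:sample value from the question
        scanSettings(budget, maxRow).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Each entry would be applied with conf.set(key, value) on the HBase configuration before calling newAPIHadoopRDD. Sizing by the largest row is deliberately conservative: with values of up to 50 MB, a large fixed row count per request could exhaust memory on the region server or the executor.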

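For high throughput like yours, Apache Kafka is the best solution for integrating data flows and keeping the data pipeline alive. Please refer to some of Kafka's use cases.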

The data is already in HBase. Are you saying I should read from HBase into Kafka and then consume from Kafka? I think that is redundant; there is no need to use Kafka as a bridge between HBase and Spark/MR. This is just a batch job that reads data from HBase with Spark.

Kafka is not required, but it will ensure that large data pipelines and data streams are not interrupted. For 30 million records, Kafka makes your system more stable.

I tried everything, but I still cannot retrieve all rows with a scan.

Which rows are not returned? The last ones? Or are rows in the middle being missed?
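One way to answer the last question is to compare the id list that the workaround already fetches against the row keys the scan returned. The sketch below is only an illustration: the class MissingRows and its methods are hypothetical, it assumes both key sets fit in memory as strings, and it assumes the expected list is sorted in row-key order, which is the order in which HBase scans return rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for locating the rows a scan did not return.
final class MissingRows {
    /** Expected keys that are absent from the returned keys, in expected order. */
    static List<String> missingKeys(List<String> expectedSorted, Collection<String> returned) {
        Set<String> seen = new HashSet<>(returned);
        List<String> missing = new ArrayList<>();
        for (String key : expectedSorted) {
            if (!seen.contains(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    /**
     * "none" if nothing is missing, "tail" if the missing keys form a contiguous
     * run at the end of the expected list, "scattered" otherwise.
     */
    static String classify(List<String> expectedSorted, Collection<String> returned) {
        Set<String> seen = new HashSet<>(returned);
        int firstMissing = -1;
        for (int i = 0; i < expectedSorted.size(); i++) {
            if (!seen.contains(expectedSorted.get(i))) {
                firstMissing = i;
                break;
            }
        }
        if (firstMissing < 0) {
            return "none";
        }
        // A returned key after the first gap means rows were skipped in the middle.
        for (int i = firstMissing + 1; i < expectedSorted.size(); i++) {
            if (seen.contains(expectedSorted.get(i))) {
                return "scattered";
            }
        }
        return "tail";
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        List<String> returned = Arrays.asList("r1", "r3", "r5");
        System.out.println(classify(expected, returned) + " " + missingKeys(expected, returned));
    }
}
```

A "tail" result would suggest the scan stopped early, while "scattered" would point at individual rows or regions being skipped. For the full 30 million keys, the same comparison could be done on the cluster, for example with subtract on two RDDs of row keys.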