Java - Writing to HBase from Spark: org.apache.spark.SparkException: Task not serializable

I'm working on a heat-map project for my university. We have to get some data (212 GB) from a txt file (coordinates, height), put it into HBase, and then retrieve it from a web client built with Express.

I practiced with a 144 MB file, and this works:

SparkConf conf = new SparkConf().setAppName("PLE");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> data = context.textFile(args[0]);
Connection co = ConnectionFactory.createConnection(getConf());
createTable(co);
Table table = co.getTable(TableName.valueOf(TABLE_NAME));
Put put = new Put(Bytes.toBytes("KEY"));

for (String s : data.collect()) {
    String[] tmp = s.split(",");
    put.addImmutable(FAMILY,
                    Bytes.toBytes(tmp[2]),
                    Bytes.toBytes(tmp[0]+","+tmp[1]));
}

table.put(put);
But now that I'm using the 212 GB file I get memory errors. I suppose that collect() gathers all the data in memory, so 212 GB is simply too much for it.
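
From what I understand, collect() materializes the whole RDD as a java.util.List in the driver JVM, so the full input would have to fit into driver memory at once. A quick illustration, reusing the `data` RDD from the snippet above:

// Everything is copied into a single List on the driver - this is what blows up
// with a 212 GB input.
List<String> everything = data.collect();

// Actions such as count(), foreach() or foreachPartition() instead run on the
// executors and never gather the whole data set on the driver.
long numberOfLines = data.count();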

So now I'm trying this:

SparkConf conf = new SparkConf().setAppName("PLE");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> data = context.textFile(args[0]);
Connection co = ConnectionFactory.createConnection(getConf());
createTable(co);
Table table = co.getTable(TableName.valueOf(TABLE_NAME));
Put put = new Put(Bytes.toBytes("KEY"));

data.foreach(line ->{
    String[] tmp = line.split(",");
    put.addImmutable(FAMILY,
                    Bytes.toBytes(tmp[2]),
                    Bytes.toBytes(tmp[0]+","+tmp[1]));
});

table.put(put);
Now I get "org.apache.spark.SparkException: Task not serializable". I searched for it, read up on the error, and tried a few of the fixes I found, but without success.

Honestly I don't fully understand the topic; I'm just a student. Maybe the answer to my question is obvious, maybe not. Anyway, thanks in advance.

As a rule of thumb, serializing database connections (of any kind) doesn't make sense. Spark or not, they are simply not designed to be serialized and deserialized.
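
To make that concrete (a sketch reusing the names from the question): the function passed to foreach() is serialized on the driver and shipped to every executor, together with every object it captures, and that is where the HBase client objects get dragged in.

Put put = new Put(Bytes.toBytes("KEY"));   // created on the driver

data.foreach(line -> {
    // This lambda is serialized and sent to the executors. It captures `put`
    // (and, depending on where it is defined, the enclosing instance too).
    // HBase client classes such as Put, Table and Connection are not
    // java.io.Serializable, hence "Task not serializable".
    String[] tmp = line.split(",");
    put.addImmutable(FAMILY, Bytes.toBytes(tmp[2]), Bytes.toBytes(tmp[0] + "," + tmp[1]));
});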

Create the connection per partition instead:

data.foreachPartition(partition -> {
    Connection co = ConnectionFactory.createConnection(getConf());
    ... // All required setup
    Table table = co.getTable(TableName.valueOf(TABLE_NAME));
    Put put = new Put(Bytes.toBytes("KEY"));
    while (partition.hasNext()) {
        String line = partition.next();
        String[] tmp = line.split(",");
        put.addImmutable(FAMILY,
                Bytes.toBytes(tmp[2]),
                Bytes.toBytes(tmp[0]+","+tmp[1]));
    }
    ... // Clean connections
});
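
Filling in the elided parts, a complete version of that pattern could look roughly like the sketch below. It keeps the question's TABLE_NAME, FAMILY and single "KEY" row, writes the accumulated Put once per partition, and closes the HBase resources in a finally block. One swap worth naming: it builds the configuration on the executor with HBaseConfiguration.create() (assuming hbase-site.xml is on the executor classpath), because calling an instance method like getConf() inside the lambda can pull the enclosing driver class back into the closure.

data.foreachPartition(partition -> {
    // HBaseConfiguration.create() picks up hbase-site.xml from the executor
    // classpath, so nothing from the driver has to be captured in the closure.
    Connection co = ConnectionFactory.createConnection(HBaseConfiguration.create());
    Table table = co.getTable(TableName.valueOf(TABLE_NAME));
    try {
        Put put = new Put(Bytes.toBytes("KEY"));   // row key kept from the question
        while (partition.hasNext()) {
            String[] tmp = partition.next().split(",");
            put.addImmutable(FAMILY,
                    Bytes.toBytes(tmp[2]),
                    Bytes.toBytes(tmp[0] + "," + tmp[1]));
        }
        if (!put.isEmpty()) {   // an empty Put (empty partition) cannot be written
            table.put(put);     // one write per partition instead of per line
        }
    } finally {
        table.close();
        co.close();
    }
});
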
I would also suggest reading the official Spark Streaming Programming Guide.
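
The part of that guide that applies here is its design patterns for foreachRDD: the same idea of creating heavyweight clients on the executors, either once per partition as above or lazily once per executor JVM. A per-executor variant could look roughly like this; the HBaseConnectionHolder helper below is hypothetical, not part of Spark or HBase:

// Hypothetical helper: one HBase Connection per executor JVM, created lazily
// on first use and intentionally kept open for the lifetime of the executor.
public class HBaseConnectionHolder {
    private static Connection connection;

    public static synchronized Connection get() throws IOException {
        if (connection == null) {
            connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        }
        return connection;
    }
}

// Usage inside the job: only the Table is opened and closed per partition,
// the Connection is shared by every task running in the same executor.
data.foreachPartition(partition -> {
    Table table = HBaseConnectionHolder.get().getTable(TableName.valueOf(TABLE_NAME));
    try {
        // ... same per-partition loop as above ...
    } finally {
        table.close();
    }
});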