Apache spark Spark作业每天失败一次,java.io.OptionalDataException
我正在使用spark 2.2.0,并在Cloudera上使用Thread运行我的工作。这是一个流式作业,它从卡夫卡获取事件,过滤并丰富它们,将它们存储在ES中,然后将偏移量提交回卡夫卡。以下是配置: 卡夫卡 火花 弹性搜索 我过滤并充实执行器上的事件,然后将它们发送回驱动程序,以便驱动程序可以将它们存储在ES中,一旦收到ES的回复,就可以将偏移量存储在Kafka中。我尝试将事件存储在executors上的ES中,但整个过程变得非常缓慢,我开始遇到巨大的延迟。我必须同步并等待ES发回响应,然后将该响应发送给驱动程序 以下是我的代码片段:Apache spark Spark作业每天失败一次,java.io.OptionalDataException,apache-spark,
elasticsearch,apache-kafka,spark-streaming,spark-streaming-kafka,Apache Spark,
elasticsearch,Apache Kafka,Spark Streaming,Spark Streaming Kafka,我正在使用spark 2.2.0,并在Cloudera上使用Thread运行我的工作。这是一个流式作业,它从卡夫卡获取事件,过滤并丰富它们,将它们存储在ES中,然后将偏移量提交回卡夫卡。以下是配置: 卡夫卡 火花 弹性搜索 我过滤并充实执行器上的事件,然后将它们发送回驱动程序,以便驱动程序可以将它们存储在ES中,一旦收到ES的回复,就可以将偏移量存储在Kafka中。我尝试将事件存储在executors上的ES中,但整个过程变得非常缓慢,我开始遇到巨大的延迟。我必须同步并等待ES发回响应,然后将该
kafkaStream.foreachRDD( // kafka topic
rdd -> { // runs on driver
String batchIdentifier =
Long.toHexString(Double.doubleToLongBits(Math.random()));
LOGGER.info("@@ [" + batchIdentifier + "] Starting batch ...");
Instant batchStart = Instant.now();
List<InsertRequestWrapper> insertRequests =
rdd.mapPartitionsWithIndex( // kafka partition
(index, eventsIterator) -> { // runs on worker
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
LOGGER.info(
"@@ Consuming " + offsetRanges[index].count() + " events" + " partition: " + index
);
if (!eventsIterator.hasNext()) {
return Collections.emptyIterator();
}
// get single ES documents
List<SingleEventBaseDocument> eventList = getSingleEventBaseDocuments(eventsIterator);
// build request wrappers
List<InsertRequestWrapper> requestWrapperList = getRequestsToInsert(eventList, offsetRanges[index]);
LOGGER.info(
"@@ Processed " + offsetRanges[index].count() + " events" + " partition: " + index + " list size: " + eventList.size()
);
return requestWrapperList.iterator();
},
true
).collect();
elasticSearchRepository.addElasticSearchDocuments(insertRequests);
LOGGER.info(
"@@ [" + batchIdentifier + "] Finished batch of " + insertRequests.size() + " messages " +
"in " + (Instant.now().toEpochMilli() - batchStart.toEpochMilli()) + "ms"
);
});
private List<SingleEventBaseDocument> getSingleEventBaseDocuments(final Iterator<ConsumerRecord<String, byte[]>> eventsIterator) {
Iterable<ConsumerRecord<String, byte[]>> iterable = () -> eventsIterator;
return StreamSupport.stream(iterable.spliterator(), true)
.map(this::toEnrichedEvent)
.filter(this::isValidEvent)
.map(this::toEventDocument)
.collect(Collectors.toList());
}
private List<InsertRequestWrapper> getRequestsToInsert(List<SingleEventBaseDocument> list, OffsetRange offsetRange) {
return list.stream()
.map(event -> toInsertRequestWrapper(event, offsetRange))
.collect(Collectors.toList());
}
执行者日志:
需要执行器日志才能真正了解,但它很可能与内存有关,因为您正在使用.collect,或者您正在使用未向我们展示的函数中的迭代器。您不应该使用.collect函数,而应该使用可以直接将rdd写入es的库。起点:@soote我已经编辑了我的问题并添加了执行器堆栈跟踪和函数调用。从堆栈跟踪来看,问题似乎发生在执行器中,而不是驱动程序中。此外,这项工作持续了1.5天。它最初处理500k事件,因为我想进行批量导入,然后数量下降到20k。当驱动程序在10秒内处理500k事件时,它如何不能处理20k事件/
num-executors 5
executor-cores 8
executor-memory 12g
driver-memory 16g
spark.batch.interval=10
es.bulk.action.count=5000
es.bulk.action.bytes=20
es.bulk.action.flush.interval=30
es.bulk.backoff.policy.interval=1
es.bulk.number.of.retries=0
kafkaStream.foreachRDD( // kafka topic
rdd -> { // runs on driver
String batchIdentifier =
Long.toHexString(Double.doubleToLongBits(Math.random()));
LOGGER.info("@@ [" + batchIdentifier + "] Starting batch ...");
Instant batchStart = Instant.now();
List<InsertRequestWrapper> insertRequests =
rdd.mapPartitionsWithIndex( // kafka partition
(index, eventsIterator) -> { // runs on worker
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
LOGGER.info(
"@@ Consuming " + offsetRanges[index].count() + " events" + " partition: " + index
);
if (!eventsIterator.hasNext()) {
return Collections.emptyIterator();
}
// get single ES documents
List<SingleEventBaseDocument> eventList = getSingleEventBaseDocuments(eventsIterator);
// build request wrappers
List<InsertRequestWrapper> requestWrapperList = getRequestsToInsert(eventList, offsetRanges[index]);
LOGGER.info(
"@@ Processed " + offsetRanges[index].count() + " events" + " partition: " + index + " list size: " + eventList.size()
);
return requestWrapperList.iterator();
},
true
).collect();
elasticSearchRepository.addElasticSearchDocuments(insertRequests);
LOGGER.info(
"@@ [" + batchIdentifier + "] Finished batch of " + insertRequests.size() + " messages " +
"in " + (Instant.now().toEpochMilli() - batchStart.toEpochMilli()) + "ms"
);
});
private List<SingleEventBaseDocument> getSingleEventBaseDocuments(final Iterator<ConsumerRecord<String, byte[]>> eventsIterator) {
Iterable<ConsumerRecord<String, byte[]>> iterable = () -> eventsIterator;
return StreamSupport.stream(iterable.spliterator(), true)
.map(this::toEnrichedEvent)
.filter(this::isValidEvent)
.map(this::toEventDocument)
.collect(Collectors.toList());
}
private List<InsertRequestWrapper> getRequestsToInsert(List<SingleEventBaseDocument> list, OffsetRange offsetRange) {
return list.stream()
.map(event -> toInsertRequestWrapper(event, offsetRange))
.collect(Collectors.toList());
}
Driver stacktrace:
2019-10-09 13:16:40,172 WARN org.apache.spark.ExecutorAllocationManager No stages are running, but numRunningTasks != 0
2019-10-09 13:16:40,172 INFO org.apache.spark.scheduler.DAGScheduler Job 7934 failed: collect at AbstractJob.java:133, took 0.116175 s
2019-10-09 13:16:40,177 INFO org.apache.spark.streaming.scheduler.JobScheduler Finished job streaming job 1570627000000 ms.0 from job set of time 1570627000000 ms
2019-10-09 13:16:40,179 ERROR org.apache.spark.streaming.scheduler.JobScheduler Error running job streaming job 1570627000000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 7934.0 failed 4 times, most recent failure: Lost task 2.3 in stage 7934.0 (TID 39681, l-lhr1-hdpwo-806.zanox-live.de, executor 3): java.io.OptionalDataException
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1587)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
at java.util.HashMap.readObject(HashMap.java:1407)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1158)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1966)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1561)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:380)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
ERROR executor.Executor: Exception in task 4.3 in stage 7934.0 (TID 39685)
java.io.OptionalDataException
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1587)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
at java.util.HashMap.readObject(HashMap.java:1407)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1158)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1966)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1561)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:380)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)