Java Spliterator: How to process large stream splits equally?
The code I'm using:
package com.skimmer;
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.LongStream;
import java.util.stream.Stream;
public class App {

  public static void main(String[] args) throws InterruptedException, ExecutionException {
    // Simply creating some 'test' data
    Stream<String> test = LongStream.range(0, 10000000L).mapToObj(i -> i + "-test");
    Spliterator<String> spliterator = test.parallel().spliterator();
    List<Callable<Long>> callableList = new ArrayList<Callable<Long>>();

    // Creating a future for each split to process concurrently
    int totalSplits = 0;
    while ((spliterator = spliterator.trySplit()) != null) {
      totalSplits++;
      callableList.add(new Worker(spliterator, "future-" + totalSplits));
    }

    ExecutorService executor = Executors.newFixedThreadPool(totalSplits);
    List<Future<Long>> futures = executor.invokeAll(callableList);
    AtomicLong counter = new AtomicLong(0);
    for (Future<Long> future : futures)
      counter.getAndAdd(future.get());
    System.out.println("Total processed " + counter.get());
    System.out.println("Total splits " + totalSplits);
    executor.shutdown();
  }

  public static class Worker implements Callable<Long> {
    private Spliterator<String> spliterator;
    private String name;

    public Worker(Spliterator<String> spliterator, String name) {
      this.spliterator = spliterator;
      this.name = name;
    }

    @Override
    public Long call() {
      AtomicLong counter = new AtomicLong(0);
      spliterator.forEachRemaining(s -> {
        // We'll assume busy processing code here
        counter.getAndIncrement();
      });
      System.out.println(name + " Total processed : " + counter.get());
      return counter.get();
    }
  }
}
My problem: the first trySplit (and future task 'future-0') gets exactly n/2 of the total elements to start processing. The first couple of splits take a long time to complete, and this gets worse as n grows. Is there another way to process the stream where each future/callable gets an equal share of the elements to process, e.g. n/splits, i.e. 1000000 / 20 = 50000?
Desired result:
future-11 Total processed : 50000
future-10 Total processed : 50000
future-9 Total processed : 50000
future-12 Total processed : 50000
future-7 Total processed : 50000
future-13 Total processed : 50000
future-8 Total processed : 50000
future-6 Total processed : 50000
future-14 Total processed : 50000
future-5 Total processed : 50000
future-15 Total processed : 50000
future-4 Total processed : 50000
future-17 Total processed : 50000
future-18 Total processed : 50000
future-19 Total processed : 50000
future-16 Total processed : 50000
future-3 Total processed : 50000
future-2 Total processed : 50000
future-1 Total processed : 50000
future-0 Total processed : 50000
Total processed 1000000
Total splits 20
Follow-up question: if a Spliterator cannot do this, what other approach or solution is best for processing large streams concurrently?
Real-world scenario: processing a large (6 GB) CSV file that is too big to hold in memory.

You do get perfectly balanced splits here. The problem is that every time you split the sequence of elements into two halves, represented by two spliterator instances, you create a job for one of the halves without even trying to split it further, and only subdivide the other half.

So right after the first split you create a job covering 500000 elements. Then you call trySplit on the other 500000 elements, splitting them perfectly into two chunks of 250000, create another job covering one chunk of 250000 elements, and only try to subdivide the other. And so on. It is your code that creates the imbalanced jobs.
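This geometric shrinkage is easy to observe by collecting estimateSize() of each chunk such a loop hands out (a minimal sketch on a raw LongStream range; HalvingDemo and jobSizes are illustrative names, not from the original code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.LongStream;

public class HalvingDemo {

  // Collect the size of each "job" produced by the halving strategy:
  // every trySplit() hands one half out as a job and only the
  // remaining half is split further.
  static List<Long> jobSizes(long n) {
    Spliterator.OfLong suffix = LongStream.range(0, n).spliterator();
    Spliterator.OfLong prefix;
    List<Long> sizes = new ArrayList<>();
    while ((prefix = suffix.trySplit()) != null) {
      sizes.add(prefix.estimateSize());
    }
    return sizes;
  }

  public static void main(String[] args) {
    // First entries are 500000, 250000, 125000, ... - each job covers
    // half of what remains, so the first job dominates the runtime.
    System.out.println(jobSizes(1_000_000));
  }
}
```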
When you change the first part to:
// Simply creating some 'test' data
Stream<String> test = LongStream.range(0, 10000000L).mapToObj(i -> i + "-test");

// Creating a future for each split to process concurrently
List<Callable<Long>> callableList = new ArrayList<>();
int workChunkTarget = 5000;
Deque<Spliterator<String>> spliterators = new ArrayDeque<>();
spliterators.add(test.parallel().spliterator());
int totalSplits = 0;

while (!spliterators.isEmpty()) {
  Spliterator<String> spliterator = spliterators.pop();
  Spliterator<String> prefix;
  while (spliterator.estimateSize() > workChunkTarget
      && (prefix = spliterator.trySplit()) != null) {
    spliterators.push(spliterator);
    spliterator = prefix;
  }
  totalSplits++;
  callableList.add(new Worker(spliterator, "future-" + totalSplits));
}
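To check that the deque-based loop really produces even chunks, the same splitting logic can be run over a plain LongStream range and the chunk sizes collected (a minimal sketch; BalancedSplits and chunkSizes are illustrative names, not from the answer):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.LongStream;

public class BalancedSplits {

  // Split into chunks no larger than target: keep splitting the current
  // spliterator, pushing each suffix back on the deque so that no half
  // is handed out before it is small enough.
  static List<Long> chunkSizes(long n, long target) {
    Deque<Spliterator.OfLong> spliterators = new ArrayDeque<>();
    spliterators.add(LongStream.range(0, n).spliterator());
    List<Long> sizes = new ArrayList<>();
    while (!spliterators.isEmpty()) {
      Spliterator.OfLong spliterator = spliterators.pop();
      Spliterator.OfLong prefix;
      while (spliterator.estimateSize() > target
          && (prefix = spliterator.trySplit()) != null) {
        spliterators.push(spliterator);
        spliterator = prefix;
      }
      sizes.add(spliterator.estimateSize());
    }
    return sizes;
  }

  public static void main(String[] args) {
    List<Long> sizes = chunkSizes(1_000_000, 5_000);
    // Every chunk is at most the target size, and together they cover
    // all n elements - no half is lost and none dominates.
    System.out.println(sizes.size() + " chunks, max "
        + sizes.stream().mapToLong(Long::longValue).max().getAsLong());
  }
}
```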
This gets you quite close to the desired target workload size (as close as possible, given that the numbers are not powers of two).

The Spliterator design works much more smoothly with tools like ForkJoinTask, where a new job can be submitted after each successful trySplit, and the job itself decides to split and spawn new jobs concurrently while the worker threads are not saturated (similar to how parallel stream operations are performed in the reference implementation).

Comments:

Why use 20 threads on something that is either I/O-bound (in which case you are wasting more threads than can be served) or CPU-bound (in which case you are wasting more threads than you have cores)?

@chrylis For demonstration purposes only. In the question above (unlike the real world) I know the total number of elements (1000000). In the real scenario I described, I don't (e.g. a CSV file coming over a socket connection). I could shut the threads down properly with more elaborate code, but I didn't want to confuse readers with that complexity, since it is not what my question is about.

Check this question, it should help with managing parallelism, but the mismatch between the resources involved and the thread configuration may be your "problem". I don't think any fairness policy applies to executor jobs.
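The ForkJoinTask style mentioned in the answer can be sketched as follows, with element counting standing in for real processing (CountTask and countParallel are illustrative names; the split condition uses ForkJoinTask.getSurplusQueuedTaskCount(), the same heuristic parallel streams use to split only while workers are unsaturated):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.stream.LongStream;

public class ForkJoinCount {

  static class CountTask extends RecursiveTask<Long> {
    private final Spliterator<Long> spliterator;

    CountTask(Spliterator<Long> spliterator) {
      this.spliterator = spliterator;
    }

    @Override
    protected Long compute() {
      List<CountTask> forked = new ArrayList<>();
      Spliterator<Long> prefix;
      // Fork a new task after each successful trySplit, but only while
      // this worker's queue has little surplus work queued up.
      while (getSurplusQueuedTaskCount() <= 1
          && (prefix = spliterator.trySplit()) != null) {
        CountTask task = new CountTask(prefix);
        task.fork();
        forked.add(task);
      }
      long[] count = {0};
      spliterator.forEachRemaining(s -> count[0]++); // busy processing here
      long total = count[0];
      for (CountTask task : forked) {
        total += task.join();
      }
      return total;
    }
  }

  public static long countParallel(long n) {
    Spliterator<Long> s = LongStream.range(0, n).boxed().parallel().spliterator();
    return ForkJoinPool.commonPool().invoke(new CountTask(s));
  }

  public static void main(String[] args) {
    System.out.println("Total processed " + countParallel(1_000_000));
  }
}
```

Because splitting decisions are made while work is already running, this adapts to uneven processing times without needing to know the element count up front, which also fits the streamed-CSV scenario.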