How to sort a very large file in Java
I have some files that should be sorted according to the id at the beginning of each line. The files are about 2-3 GB.

I tried reading all the data into an ArrayList and sorting it, but the memory is not enough to hold it all, so it doesn't work.

The lines look like this:

0052304 0000004000000000000000000000000000000041 John Teddy 000023
0022024 0000004000000000000000000000000000000041 George Clan 00013

How can I sort the files?

What you need to do is chunk the files up via streams and process them separately. You can then merge the files back together, since they will already be sorted; this is similar to how merge sort works.
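The chunk-and-merge idea can be sketched as follows. This is a minimal, single-threaded sketch, not the asker's actual program: the class name, file names, and chunk size are placeholders, and it assumes plain natural-order line comparison is good enough (the id is at the start of each line).

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {

    // One open run file plus the line we have read but not yet written.
    static class RunCursor {
        final BufferedReader reader;
        String current;
        RunCursor(Path run) throws IOException {
            reader = Files.newBufferedReader(run);
            current = reader.readLine();
        }
        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }
    }

    // Split the input into sorted "runs" of at most linesPerChunk lines each.
    static List<Path> createSortedRuns(Path input, int linesPerChunk) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == linesPerChunk) {
                    runs.add(writeRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) runs.add(writeRun(chunk));
        }
        return runs;
    }

    static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk); // lines begin with the id, so natural order sorts by id
        Path run = Files.createTempFile("run", ".txt");
        return Files.write(run, chunk);
    }

    // K-way merge: repeatedly emit the smallest current line across all runs.
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        PriorityQueue<RunCursor> heap =
            new PriorityQueue<>(Comparator.comparing((RunCursor c) -> c.current));
        for (Path run : runs) {
            RunCursor c = new RunCursor(run);
            if (c.current != null) heap.add(c);
        }
        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                RunCursor c = heap.poll();
                writer.write(c.current);
                writer.newLine();
                if (c.advance()) heap.add(c); // re-insert with its next line
                else c.reader.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        mergeRuns(createSortedRuns(Paths.get(args[0]), 1_000_000), Paths.get(args[1]));
    }
}
```

Only one chunk is ever held in memory at a time; the merge step keeps just one line per run.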
The answers to this SO question are valuable: you need an external merge sort to do this job, and there is a Java implementation of it that can sort very large files. This is not really a Java problem at all. You need to look into an efficient algorithm for sorting data that is not completely read into memory; a few adaptations of merge sort can achieve this.
Basically, the idea here is to break the file into smaller parts, sort them (with merge sort or another method), and then use the merge step of merge sort to create the new, sorted file. Instead of loading all the data into memory at once, you could read just the key of each line and the index of where the line starts (and probably the length as well). This would use about 40 bytes per line. After sorting that array, you can use RandomAccessFile to read the lines in the order they should appear.
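A sketch of that key-plus-offset idea, under the assumption that the sort key is the first whitespace-delimited token of each line (the class and method names here are hypothetical, not from the question):

```java
import java.io.*;
import java.util.*;

public class IndexSort {

    // Key plus the byte offset of the line start: roughly the
    // "40 bytes per line" index described above.
    static class Entry {
        final String key;
        final long offset;
        Entry(String key, long offset) { this.key = key; this.offset = offset; }
    }

    public static void sortByIndex(File input, File output) throws IOException {
        List<Entry> index = new ArrayList<>();
        // First pass: record each line's key and starting byte offset.
        try (RandomAccessFile raf = new RandomAccessFile(input, "r")) {
            long offset = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                String key = line.split("\\s+", 2)[0]; // id at the start of the line
                index.add(new Entry(key, offset));
                offset = raf.getFilePointer();
            }
        }
        index.sort(Comparator.comparing((Entry e) -> e.key));
        // Second pass: seek to each line in sorted key order and copy it out.
        try (RandomAccessFile raf = new RandomAccessFile(input, "r");
             BufferedWriter writer = new BufferedWriter(new FileWriter(output))) {
            for (Entry e : index) {
                raf.seek(e.offset);
                writer.write(raf.readLine());
                writer.newLine();
            }
        }
    }
}
```

The second pass is where the random-access cost described in the next note comes from: each output line may require a separate disk seek.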
Note: since you will be accessing the disk randomly rather than using memory, this could be very slow. A typical disk takes 8 ms to access data randomly, so if you have 10 million lines this will take about a day (this is the absolute worst case); in memory it would take about 10 seconds.

You could use a SQLite file database instead: load the data into the database, then let it sort and return the results.
Advantages: no need to worry about writing the best sorting algorithm.
Disadvantages: you need disk space, and processing is slower.
Operating systems come with a powerful file sorting utility. A simple function that calls a bash script should help:
public static void runScript(final Logger log, final String scriptFile) throws IOException, InterruptedException {
    final String command = scriptFile;
    if (!new File(command).exists() || !new File(command).canRead() || !new File(command).canExecute()) {
        log.log(Level.SEVERE, "Cannot find or read " + command);
        log.log(Level.WARNING, "Make sure the file is executable and you have permissions to execute it. Hint: use \"chmod +x filename\" to make it executable");
        throw new IOException("Cannot find or read " + command);
    }
    final int returncode = Runtime.getRuntime().exec(new String[] {"bash", "-c", command}).waitFor();
    if (returncode != 0) {
        log.log(Level.SEVERE, "The script returned an Error with exit code: " + returncode);
        throw new IOException();
    }
}
Since your records are already in flat text file format, you can pipe them into UNIX sort(1), for example sort -n -t' ' -k1,1 < input > output. It will automatically chunk the data and perform the merge sort using available memory and /tmp. If you need more space than your free memory, add -T /tmpdir to the command.
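If you would rather drive sort(1) from Java than maintain a standalone script, a minimal sketch could look like the following. It assumes a Unix-like system with bash and sort on the PATH, and that the file paths contain no single quotes; the class and method names are placeholders.

```java
import java.io.IOException;

public class UnixSort {
    // Runs: bash -c "sort -n -t' ' -k1,1 < input > output"
    // -n: numeric sort, -t' ': space-delimited fields, -k1,1: sort by the
    // first field only (the id at the start of each line).
    public static void sortFile(String input, String output)
            throws IOException, InterruptedException {
        String cmd = "sort -n -t' ' -k1,1 < '" + input + "' > '" + output + "'";
        Process p = new ProcessBuilder("bash", "-c", cmd)
                .redirectErrorStream(true)
                .start();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("sort exited with code " + exit);
        }
    }
}
```

The redirections are handled by bash, which is why the command is passed through `bash -c` instead of being exec'd directly.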
It's quite funny that everyone tells you to download huge C# or Java libraries, or to implement merge sort yourself, when you can use a tool that is available on every platform and has been around for decades.

You need to perform an external sort. This is more or less the driving idea behind Hadoop/MapReduce, except that it does not take a distributed cluster into account and works on a single node. For better performance you should use Hadoop/Spark.

Change these lines according to your system. fPath is a large input file (tested with 20 GB). The shared path is where the execution log is stored. fdir is where the intermediate files are stored and merged. Change these paths according to your machine.
public static final String fdir = "/tmp/";
public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
public static final String fPath = "/input/data-20GB.in";
public static final String opLog = shared+"Mysort20GB.log";
Then run the following program. The final sorted file will be created in the fdir path with the name op401. The last line, Runtime.getRuntime().exec("valsort " + fdir + "op" + (treeHeight * 100 + 1) + " > " + opLog);, checks whether the output is sorted. Remove this line if you do not have valsort installed or if the input file was not generated using gensort.

Also, don't forget to change int totalLines = 200000000; to the total number of lines in your file. The thread count (int threadCount = 16) should always be a power of 2 and large enough that (total size * 2 / threadCount) amount of data can reside in memory. Changing the thread count changes the name of the final output file: for 16 it will be op401, for 32 op501, for 8 op301, and so on.
Enjoy.
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.stream.Stream;

class SplitFile extends Thread {
    String fileName;
    int startLine, endLine;

    SplitFile(String fileName, int startLine, int endLine) {
        this.fileName = fileName;
        this.startLine = startLine;
        this.endLine = endLine;
    }

    public static void writeToFile(BufferedWriter writer, String line) {
        try {
            writer.write(line + "\r\n");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void run() {
        try {
            BufferedWriter writer = Files.newBufferedWriter(Paths.get(fileName));
            int totalLines = endLine + 1 - startLine;
            Stream<String> chunks =
                Files.lines(Paths.get(Mysort20GB.fPath))
                     .skip(startLine - 1)
                     .limit(totalLines)
                     .sorted(Comparator.naturalOrder());
            chunks.forEach(line -> writeToFile(writer, line));
            System.out.println(" Done Writing " + Thread.currentThread().getName());
            writer.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

class MergeFiles extends Thread {
    String file1, file2, file3;

    MergeFiles(String file1, String file2, String file3) {
        this.file1 = file1;
        this.file2 = file2;
        this.file3 = file3;
    }

    public void run() {
        try {
            System.out.println(file1 + " Started Merging " + file2);
            BufferedReader bufferedReader1 = new BufferedReader(new FileReader(file1));
            BufferedReader bufferedReader2 = new BufferedReader(new FileReader(file2));
            FileWriter writer = new FileWriter(file3);
            String line1 = bufferedReader1.readLine();
            String line2 = bufferedReader2.readLine();
            // Merge the two sorted files, always writing the smaller line first.
            while (line1 != null || line2 != null) {
                if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                    writer.write(line2 + "\r\n");
                    line2 = bufferedReader2.readLine();
                } else {
                    writer.write(line1 + "\r\n");
                    line1 = bufferedReader1.readLine();
                }
            }
            System.out.println(file1 + " Done Merging " + file2);
            new File(file1).delete();
            new File(file2).delete();
            writer.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

public class Mysort20GB {
    //public static final String fdir = "/Users/diesel/Desktop/";
    public static final String fdir = "/tmp/";
    public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
    public static final String fPath = "/input/data-20GB.in";
    public static final String opLog = shared + "Mysort20GB.log";

    public static void main(String[] args) throws Exception {
        long startTime = System.nanoTime();
        int threadCount = 16; // number of threads; must be a power of 2
        int totalLines = 200000000;
        int linesPerFile = totalLines / threadCount;
        ArrayList<Thread> activeThreads = new ArrayList<Thread>();
        for (int i = 1; i <= threadCount; i++) {
            int startLine = (i - 1) * linesPerFile + 1;
            int endLine = i * linesPerFile;
            SplitFile mapThreads = new SplitFile(fdir + "op" + i, startLine, endLine);
            activeThreads.add(mapThreads);
            mapThreads.start();
        }
        activeThreads.stream().forEach(t -> {
            try {
                t.join();
            } catch (Exception e) {
                // ignore interruption
            }
        });
        int treeHeight = (int) (Math.log(threadCount) / Math.log(2));
        for (int i = 0; i < treeHeight; i++) {
            ArrayList<Thread> actvThreads = new ArrayList<Thread>();
            // threadCount >> i files remain at this level; merge them pairwise.
            // (The bound must halve per level; threadCount / (i + 1) is wrong
            // for more than two levels.)
            for (int j = 1, itr = 1; j <= threadCount >> i; j += 2, itr++) {
                int offset = i * 100;
                String tempFile1 = fdir + "op" + (j + offset);
                String tempFile2 = fdir + "op" + ((j + 1) + offset);
                String opFile = fdir + "op" + (itr + ((i + 1) * 100));
                MergeFiles reduceThreads = new MergeFiles(tempFile1, tempFile2, opFile);
                actvThreads.add(reduceThreads);
                reduceThreads.start();
            }
            actvThreads.stream().forEach(t -> {
                try {
                    t.join();
                } catch (Exception e) {
                    // ignore interruption
                }
            });
        }
        long endTime = System.nanoTime();
        double timeTaken = (endTime - startTime) / 1e9;
        System.out.println(timeTaken);
        BufferedWriter logFile = new BufferedWriter(new FileWriter(opLog, true));
        logFile.write("Time Taken in seconds:" + timeTaken);
        // (treeHeight * 100 + 1) must be parenthesized as a whole, otherwise
        // string concatenation yields "op4001" instead of "op401".
        // Remove this line if valsort is not installed.
        Runtime.getRuntime().exec("valsort " + fdir + "op" + (treeHeight * 100 + 1) + " > " + opLog);
        logFile.close();
    }
}