Java: Insert a newline after every 309 characters
Let me start by saying that I am very new to Java. I have a file that contains a single line. The file is about 200 MB in size. I need to insert a newline character after every 309 characters. I believe I have the code to do this correctly, but I keep running into memory errors. I have tried increasing the heap space, to no avail. Is there a less memory-intensive way of handling this?
BufferedReader r = new BufferedReader(new FileReader(fileName));
String line;
while ((line = r.readLine()) != null) {
    System.out.println(line.replaceAll("(.{309})", "$1\n"));
}
Your code has two problems:
read() can return fewer bytes than you asked for even though there are still bytes left to read.
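The point about read() returning fewer items than requested applies to Reader as well as InputStream. One common way to cope with it is to keep calling read() until the buffer is full or the stream ends. The following is only a minimal sketch of that idea; the class name ChunkReads and the method name readFully are illustrative and do not come from the answer.

```java
import java.io.IOException;
import java.io.Reader;

final class ChunkReads {
    private ChunkReads() {}

    /**
     * Keeps reading until buf is full or the stream ends.
     * Returns the number of chars stored in buf; the value is smaller than
     * buf.length only when the end of the stream was reached.
     */
    static int readFully(Reader in, char[] buf) throws IOException {
        int filled = 0;
        while (filled < buf.length) {
            // read() may deliver fewer chars than requested, so ask again
            // for whatever is still missing.
            int n = in.read(buf, filled, buf.length - filled);
            if (n == -1) {
                break; // end of stream
            }
            filled += n;
        }
        return filled;
    }
}
```

With a 309-char buffer, a caller would write out the first readFully(...) chars followed by a newline and stop once the method returns less than 309.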
First, the simple version:
// IN_FILE, CHAR_FILE, COUNT, safeClose() and hash() are members of the
// enclosing class that are not shown in this excerpt.
private static void charRead(boolean verifyHash) {
    Reader in = null;
    Writer out = null;
    long start = System.nanoTime();
    long wrote = 0;
    MessageDigest md = null;
    try {
        if (verifyHash) {
            md = MessageDigest.getInstance("SHA1");
        }
        in = new BufferedReader(new FileReader(IN_FILE));
        out = new BufferedWriter(new FileWriter(CHAR_FILE));
        int count = 0;
        // Copy one character at a time
        for (int c = in.read(); c != -1; c = in.read()) {
            if (verifyHash) {
                md.update((byte) c);
            }
            out.write(c);
            wrote++;
            // After every COUNT characters, write a newline and restart the count
            if (++count >= COUNT) {
                if (verifyHash) {
                    md.update((byte) '\n');
                }
                out.write("\n");
                wrote++;
                count = 0;
            }
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
    } finally {
        safeClose(in);
        safeClose(out);
        long end = System.nanoTime();
        System.out.printf("Created %s size %,d in %,.3f seconds. Hash: %s%n",
                CHAR_FILE, wrote, (end - start) / 1000000000.0d, hash(md, verifyHash));
    }
}
And the "block" version:
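As an illustration of what a block-based approach can look like, here is a minimal sketch. It is not the answer author's listing, and the timings reported in this answer were not produced by it. The class name BlockWrap, the method name wrap and the 8192-char block size are assumptions made for the example, and hash verification is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

final class BlockWrap {
    private BlockWrap() {}

    /**
     * Copies everything from in to out, writing '\n' after every `width` chars.
     * Returns the total number of chars written, newlines included.
     */
    static long wrap(Reader in, Writer out, int width) throws IOException {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        char[] block = new char[8192]; // assumed block size, tune as needed
        long wrote = 0;
        int col = 0; // chars already written on the current output line
        int n;
        while ((n = in.read(block, 0, block.length)) != -1) {
            int pos = 0;
            while (pos < n) {
                // Write as much as fits on the current line in one call.
                int take = Math.min(width - col, n - pos);
                out.write(block, pos, take);
                pos += take;
                col += take;
                wrote += take;
                if (col == width) {
                    out.write('\n');
                    wrote++;
                    col = 0;
                }
            }
        }
        return wrote;
    }
}
```

Because the column position is carried over from one read() call to the next, a short read only changes how the work is split up, not where the newlines land. Callers would typically pass a FileReader and a BufferedWriter and close both afterwards.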
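This gives the following results (test run on an Intel Q9450, Windows 7 64-bit, 8 GB RAM, 7200 rpm 1.5 TB drive):
Conclusion: the SHA1 hash verification is really expensive, which is why I ran versions with and without it. Basically, after warm-up the "efficient" version is only about 2x as fast. I guess by this point the file is effectively in memory.
If I reverse the order of the block and char reads, the result is: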
Created E:\temp\char.dat size 200,647,249 in 8.071 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 8.087 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 4.128 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 3.918 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 18.020 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 17.953 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.879 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 8.016 seconds. Hash: (not calculated)
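Interestingly, the character-by-character version takes a far bigger initial hit on the first read of the file.
So, as usual, it is a choice between efficiency and simplicity.
Open the file, read one character at a time, and write each character to where it needs to go. Keep a counter; each time the counter gets large enough (309 here), write out a newline and reset the counter to zero.
Not sure how good this solution is, but you can always read the file character by character.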
Wrap the FileReader in a BufferedReader and then keep looping, reading 309 characters at a time. Something like (untested):
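Don't use BufferedReader, because it will hold most of the underlying file in memory. Use a FileReader directly, and then use the read() method to get just as much data as you need: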
FileReader reader = new FileReader(fileName);
char[] buffer = new char[309];
int charsRead = 0;
// Caveat: read() may return fewer than buffer.length chars before the end
// of the file, which would end this loop early.
while ((charsRead = reader.read(buffer, 0, buffer.length)) == buffer.length)
{
    System.out.println(new String(buffer));
}
if (charsRead > 0)
{
    // print any trailing chars
    System.out.println(new String(buffer, 0, charsRead));
}
Read into a byte array of length 309 and then write out the bytes that were read:
import java.io.*;

public class Test {
    public static void main(String[] args) throws Exception {
        InputStream in = null;
        byte[] chars = new byte[309];
        try {
            in = new FileInputStream(args[0]);
            int read = 0;
            while ((read = in.read(chars)) != -1) {
                System.out.write(chars, 0, read);
                System.out.println("");
            }
        } finally {
            if (in != null) {
                in.close();
            }
        }
    }
}
You could change your program to:
BufferedReader r = new BufferedReader(new FileReader(fileName));
char[] data = new char[309];
int n;
while ((n = r.read(data, 0, data.length)) > 0) {
    // Use only the n chars actually read; println supplies the line break
    System.out.println(new String(data, 0, n));
}
r.close();
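This is off the top of my head and untested. You can set the size of the BufferedReader to keep it from reading the whole file at once.
-1: You can't guarantee that reader.read() will fill the buffer.
BufferedReader does not read and hold the entire file in memory. The problem is that if the file is a single line, then by definition readLine() will read the whole file.
Just a quick note on the regex part (which is not the best way to solve this problem): group 1 is unnecessary in this case. You can refer to group 0 instead, e.g. replaceAll(".{309}", "$0\n").
There must be a standard Unix utility that does this, no? Something like columnif 309 text > out? Anyway, I think Java is too verbose for something like this.
@poly: I actually took the regex from the sed code I had been using: sed 's/(.\{309\})/\1\n/g' file.txt > file_parsed.txt. We have started using the Talend ETL tool, so I wanted to be able to do it in Java. Also, thanks for the regex tip!
Reading bytes can break the data in multi-byte encodings such as UTF-8 or UTF-16. The encoding was not specified in the original question, but the concern still stands. If byte 309 is the first byte of a multi-byte character, then goodbye.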
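One way to avoid the multi-byte problem raised in the last comment is to decode the bytes into characters with an explicit charset before counting. The following is a sketch only: the question does not state the file's encoding, so the charset is a parameter the caller has to supply, and the names CharsetWrap and wrapFile are made up for the example. It counts Unicode code points, so a surrogate pair is never split by a newline.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

final class CharsetWrap {
    private CharsetWrap() {}

    /** Copies source to target, writing '\n' after every `width` code points. */
    static void wrapFile(Path source, Path target, Charset charset, int width)
            throws IOException {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        try (BufferedReader in = Files.newBufferedReader(source, charset);
             BufferedWriter out = Files.newBufferedWriter(target, charset)) {
            int count = 0;
            for (int c = in.read(); c != -1; c = in.read()) {
                char ch = (char) c;
                out.write(ch);
                if (Character.isHighSurrogate(ch)) {
                    continue; // first half of a pair; count it with the second half
                }
                if (++count == width) {
                    out.write('\n');
                    count = 0;
                }
            }
        }
    }
}
```

A call might look like CharsetWrap.wrapFile(Paths.get("in.txt"), Paths.get("out.txt"), StandardCharsets.UTF_8, 309), where both file names are placeholders. The reader and writer buffer only a small window of the file, so memory use stays flat regardless of file size.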
Output of the first test run (character version first, then block version):
Created E:\temp\char.dat size 200,647,249 in 29.690 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 18.177 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.911 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 7.867 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 8.018 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.949 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 3.958 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 3.909 seconds. Hash: (not calculated)
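// Read one byte at a time and emit a newline after every 309th
FileInputStream fis = new FileInputStream(file);
int current;
int counter = 0;
// read() returns -1 at end of file; available() is not a reliable end test
while ((current = fis.read()) != -1) {
    counter++;
    // output (char) current to file
    if (counter % 309 == 0) {
        // output newline character
    }
}
fis.close();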
// BufferedReader with a small (1024-char) internal buffer, read in 309-char chunks
BufferedReader r = new BufferedReader(new FileReader("yourfile.txt"), 1024);
boolean done = false;
char[] buffer = new char[309];
while (!done)
{
    int read = r.read(buffer, 0, 309);
    if (read > 0)
    {
        // write the first `read` chars of buffer to the destination, appending a newline
    }
    else
    {
        done = true;
    }
}
r.close();