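Java: converting the character encoding of a windows-1252 input file to a UTF-8 output file
Tags: java, file, ms-word, character-encoding, character
I am working with an HTML document that I converted (programmatically) from Word using its save-as-HTML option. This HTML text file is windows-1252 encoded. (Yes, I have read a lot about bytes and Unicode code points, and I know that code points beyond 128 can take 2, 3, up to 6 bytes, etc.) I added a lot of non-printable characters to my Word document template and wrote code to evaluate each character by its decimal equivalent. Certainly, I know I don't want to allow decimal #160, which is how Microsoft Word translates a non-breaking space into HTML. I expect that in the near future people will put more of these "illegal" constructs into the templates, and I will need to trap and handle them, because they cause funny display in the browser (to wit: this is a dump from the Eclipse console, where I put all the document lines into a map). I replaced decimal #160 with #32 (a regular space) and then wrote the characters to a new file using UTF-8 encoding. Is my thinking sound: can I use this technique to replace a specific character, or decide not to write it, based on its decimal equivalent? I want to avoid Strings because I may process multiple documents and don't want to run out of memory, so I do this in files.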
public static void convert1252toUFT8(String fileName) throws IOException {
    File f = new File(fileName);
    Reader r = new BufferedReader(new InputStreamReader(new FileInputStream(f), "windows-1252"));
    OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(fileName + "x"), StandardCharsets.UTF_8);
    List<Character> charsList = new ArrayList<>();
    int count = 0;
    try {
        int intch;
        while ((intch = r.read()) != -1) { //reads a single character and returns integer equivalent
            int ch = (char)intch;
            //System.out.println("intch=" + intch + " ch=" + ch + " isValidCodePoint()=" + Character.isValidCodePoint(ch)
            //+ " isDefined()=" + Character.isDefined(ch) + " charCount()=" + Character.charCount(ch) + " char="
            //+ (char)intch);
            if (Character.isValidCodePoint(ch)) {
                if (intch == 160 ) {
                    intch = 32;
                }
                charsList.add((char)intch);
                count++;
            } else {
                System.out.println("unexpected character found but not dealt with.");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        System.out.println("Chars read in=" + count + " Chars read out=" + charsList.size());
        for(Character item : charsList) {
            writer.write((char)item);
        }
        writer.close();
        r.close();
        charsList = null;
        //check that #160 was replaced
        //File f2 = new File(fileName + "x");
        //Reader r2 = new BufferedReader(new InputStreamReader(new FileInputStream(f2), "UTF-8"));
        //int intch2;
        //while ((intch2 = r2.read()) != -1) { //reads a single character and returns integer equivalent
        //int ch2 = (char)intch2;
        //System.out.println("intch2=" + intch2 + " ch2=" + ch2 + " isValidCodePoint()=" +
        //Character.isValidCodePoint(ch2) + " char=" + (char)intch2);
        //}
    }
}
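First of all, there is nothing wrong with an HTML page being in an encoding other than UTF-8. In fact, the document very likely contains a charset declaration in its header (typically a meta tag naming windows-1252), and changing the file's character encoding without adapting that header line renders the document invalid.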
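A minimal sketch of adapting such a declaration, assuming it appears as literal `charset=windows-1252` text on a single line. The class name `CharsetDeclaration` and the method `toUtf8Declaration` are hypothetical, and a real document may declare its encoding in a different form.

```java
import java.util.regex.Pattern;

class CharsetDeclaration {
    // Assumed form of the declaration: charset=windows-1252, optionally quoted, any letter case.
    private static final Pattern DECLARED =
            Pattern.compile("(charset\\s*=\\s*[\"']?)windows-1252", Pattern.CASE_INSENSITIVE);

    // Returns the line with the declared charset switched to utf-8; other lines pass through unchanged.
    static String toUtf8Declaration(String line) {
        return DECLARED.matcher(line).replaceAll("$1utf-8");
    }

    public static void main(String[] args) {
        String meta = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">";
        System.out.println(toUtf8Declaration(meta));
    }
}
```

This works on whole lines, so it would fit a line-based copy loop rather than a raw character buffer.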
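Further, there is no reason to replace codepoint 160 in the document, as it is Unicode's standard non-breaking space character, which is why &#160; is a valid alternative to &nbsp;. If the document's charset supports this codepoint, using it directly is valid as well.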
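To illustrate that codepoint 160 survives the conversion as an ordinary character, here is a small sketch that prints its byte form in both encodings. The class name `NbspBytes` and the helper `hex` are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NbspBytes {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // NO-BREAK SPACE, decimal 160
        System.out.println(hex(nbsp.getBytes(CP1252)));                 // A0
        System.out.println(hex(nbsp.getBytes(StandardCharsets.UTF_8))); // C2 A0
    }
}
```

The same character is one byte in windows-1252 and two bytes in UTF-8; only the byte representation changes, not the character itself.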
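Your attempt to avoid Strings is a typical case of optimizing without measuring. The lack of actual measurement leads to solutions like ArrayList<Character>, which consumes twice the memory of a String.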
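If you want to copy or convert a file, you shouldn't keep the entire file in memory. Just write the data back before reading the next chunk, but for efficiency use a buffer rather than reading and writing one character at a time. Further, you should use the try-with-resources statement to manage the input and output resources.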
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(BufferedReader br = Files.newBufferedReader(in, Charset.forName("windows-1252"));
        BufferedWriter bw = Files.newBufferedWriter(out, // default UTF-8
            StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break; // end of input
            readCount += count;
            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }
            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}
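Note that for a successful conversion the number of chars read and written is the same, but only for the input encoding, Windows-1252, is the number of chars identical to the number of bytes, i.e. the file size (when the entire file is valid).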
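A short sketch of that relationship, using a made-up sample string that is fully encodable in windows-1252. The class name `CharVsByteCount` and its helper methods are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharVsByteCount {
    // Number of bytes the text occupies when encoded as windows-1252.
    static int cp1252Bytes(String text) {
        return text.getBytes(Charset.forName("windows-1252")).length;
    }

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String sample = "caf\u00E9 \u20AC5"; // "café €5": seven chars
        System.out.println("chars=" + sample.length());
        System.out.println("windows-1252 bytes=" + cp1252Bytes(sample));
        System.out.println("UTF-8 bytes=" + utf8Bytes(sample));
    }
}
```

For this sample the char count and the windows-1252 byte count are both 7, while the UTF-8 output needs 10 bytes, so the converted file can be larger than the original even though the char counts match.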
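This conversion code example is just for completeness. As said at the beginning, converting an HTML page without adapting the header may invalidate the file, and the conversion is not even necessary.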
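(On the "twice the memory" figure: depending on the implementation, even four times.)

Comment: Thanks. There is a lot of "noise" on this topic. I tried your improved solution and it works very well! I had hoped that nulling the Character ArrayList would make it available for gc, rather than taking up more memory with immutable Strings. Your suggestions about the buffer etc. were very helpful, and yes, I did change the charset setting in the "new" file to utf-8.

Comment: Setting the variable to null at the end of the operation is unnecessary, as the list is eligible for gc anyway. The same applies to String objects; immutability does not prevent gc. But the ArrayList<Character> has a much higher memory consumption during the operation, as you have a list of references to Character objects rather than a wrapper around a char[] array (and it would be even worse if the JRE didn't reuse most Character instances). As said in the answer, not having the entire file in memory at once also helps to reduce memory consumption.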
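The difference between the two representations can be sketched as follows. This is only an illustration with hypothetical names (`BoxedVersusPrimitive` and its methods); it does not measure memory, it only shows what each structure holds.

```java
import java.util.ArrayList;
import java.util.List;

class BoxedVersusPrimitive {
    // One object reference per element; each add autoboxes the char to a Character.
    static List<Character> asBoxedList(String text) {
        List<Character> list = new ArrayList<>(text.length());
        for (int i = 0; i < text.length(); i++) {
            list.add(text.charAt(i));
        }
        return list;
    }

    // A single array of primitive chars, with no per-element objects.
    static char[] asPrimitiveArray(String text) {
        return text.toCharArray();
    }

    // True when boxing the same char twice yields the same cached instance.
    static boolean boxedInstanceIsShared(char c) {
        return Character.valueOf(c) == Character.valueOf(c);
    }

    public static void main(String[] args) {
        System.out.println(asBoxedList("abc").size() + " references vs one char[] of length "
                + asPrimitiveArray("abc").length);
        System.out.println("'a' shared: " + boxedInstanceIsShared('a'));
    }
}
```

`Character.valueOf` is documented to cache values in the range '\u0000' to '\u007F', which is the instance reuse the comment refers to; for chars outside that range a new object per element may be allocated.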
// Variant using channels. Assumes: import static java.nio.file.StandardOpenOption.*;
// The Channels.newWriter overload taking a Charset requires Java 10 or later.
public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(Reader br = Channels.newReader(Files.newByteChannel(in), "windows-1252");
        Writer bw = Channels.newWriter(
            Files.newByteChannel(out, WRITE, CREATE, TRUNCATE_EXISTING),
            StandardCharsets.UTF_8)) {
        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break; // end of input
            readCount += count;
            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }
            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}
public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    CharsetDecoder dec = Charset.forName("windows-1252")
        .newDecoder().onUnmappableCharacter(CodingErrorAction.IGNORE);
    try(Reader br = Channels.newReader(Files.newByteChannel(in), dec, -1);
        …