Java: converting the character encoding of a windows-1252 input file to a UTF-8 output file


java file ms-word character-encoding


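I am working with an HTML document that I produced (programmatically) from Word's "save as HTML" option. This HTML text file is windows-1252 encoded. (Yes, I have read a lot about bytes and Unicode code points; I know that code points beyond 128 can take 2, 3, up to 6 bytes, etc.) I added a lot of non-printable characters to my Word document template and wrote code to evaluate each character by its decimal equivalent. I certainly know that I do not want to allow decimal #160, the non-breaking space that Microsoft Word writes into the HTML. I expect that in the near future people will put more of these "illegal" constructs into the templates, and I will need to catch them and deal with them, because they cause funny rendering in the browser. (I checked this with a dump to the Eclipse console, after putting all document lines into a map.)

I replaced decimal #160 with #32 (a regular space) and then wrote the characters to a new file using UTF-8 encoding. Is this a sound approach, and can I use the same technique to replace, or decide not to write, particular characters based on their decimal equivalents? I wanted to avoid Strings because I may process multiple documents and do not want to run out of memory, so I am doing this on files.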

 public static void convert1252toUFT8(String fileName) throws IOException {   
    File f = new File(fileName);
    Reader r = new BufferedReader(new InputStreamReader(new FileInputStream(f), "windows-1252"));
    OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(fileName + "x"), StandardCharsets.UTF_8); 
    List<Character> charsList = new ArrayList<>(); 
    int count = 0;

    try {
        int intch;
        while ((intch = r.read()) != -1) {   //reads a single character and returns integer equivalent
            int ch = (char)intch;
            //System.out.println("intch=" + intch + " ch=" + ch + " isValidCodePoint()=" + Character.isValidCodePoint(ch) 
            //+ " isDefined()=" + Character.isDefined(ch) + " charCount()=" + Character.charCount(ch) + " char=" 
            //+ (char)intch);

            if (Character.isValidCodePoint(ch)) {
                if (intch == 160 ) {
                    intch = 32;
                }
                charsList.add((char)intch);
                count++;
            } else {
                System.out.println("unexpected character found but not dealt with.");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        System.out.println("Chars read in=" + count + " Chars read out=" + charsList.size());
        for(Character item : charsList) {
            writer.write((char)item);
        }
        writer.close();
        r.close();
        charsList = null;

        //check that #160 was replaced
        //File f2 = new File(fileName + "x"); 
        //Reader r2 = new BufferedReader(new InputStreamReader(new FileInputStream(f2), "UTF-8")); 
        //int intch2;
        //while ((intch2 = r2.read()) != -1) { //reads a single character and returns integer equivalent 
        //int ch2 = (char)intch2; 
        //System.out.println("intch2=" + intch2 + " ch2=" + ch2 + " isValidCodePoint()=" +
        //Character.isValidCodePoint(ch2) + " char=" + (char)intch2); 
        //}

    }   
}


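First of all, there is nothing wrong with an HTML page being in an encoding other than UTF-8. In fact, the document very likely contains a charset declaration in its header (for a Word export, typically a meta tag naming windows-1252), and changing the file's character encoding without adjusting that header line makes the document invalid.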
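If the file is converted anyway, the declaration has to be rewritten along with the bytes. The following is only a minimal sketch of that idea: the class name, the helper fixCharsetDeclaration, and the exact form of the meta tag are assumptions for illustration, not taken from the question, and a real document may declare its charset differently.

```java
import java.util.regex.Pattern;

public class MetaCharsetFix {
    // Matches "charset=windows-1252" (optionally quoted), ignoring case.
    // The pattern is an assumption about how the header declares its charset.
    private static final Pattern CHARSET = Pattern.compile(
            "(charset\\s*=\\s*[\"']?)windows-1252", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: rewrites the declared charset in one line of HTML.
    static String fixCharsetDeclaration(String line) {
        return CHARSET.matcher(line).replaceAll("$1utf-8");
    }

    public static void main(String[] args) {
        String before =
            "<meta http-equiv=Content-Type content=\"text/html; charset=windows-1252\">";
        System.out.println(fixCharsetDeclaration(before));
        // prints: <meta http-equiv=Content-Type content="text/html; charset=utf-8">
    }
}
```

Lines without a matching declaration pass through unchanged, so the helper could be applied to every line while copying.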

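Further, there is no reason to replace codepoint 160 in the document. It is Unicode's standard non-breaking space character, which is why &nbsp; is a valid alternative to &#160;, and if the document's charset supports this codepoint, using it directly is valid as well.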
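To make the point concrete, the short sketch below shows that the single windows-1252 byte 0xA0 decodes to U+00A0 (decimal 160) and that the same character is encoded as the two bytes C2 A0 in UTF-8. The class name and the hex helper are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NbspDemo {
    // Illustrative helper: formats bytes as upper-case hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // Byte 0xA0 in windows-1252 decodes to U+00A0 (NO-BREAK SPACE).
        String nbsp = new String(new byte[] {(byte) 0xA0}, cp1252);
        System.out.println((int) nbsp.charAt(0));                        // 160
        System.out.println(hex(nbsp.getBytes(cp1252)));                  // A0
        System.out.println(hex(nbsp.getBytes(StandardCharsets.UTF_8)));  // C2 A0
    }
}
```

So the character survives the conversion unchanged; only its byte representation differs between the two encodings.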

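Your attempt to avoid Strings is a typical case of optimizing without measuring. The lack of actual measurement leads to a solution like ArrayList<Character>, which consumes twice the memory of a String.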

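If you want to copy or convert a file, you should not keep the entire file in memory. Just write the data back before reading the next chunk, but for efficiency use a buffer rather than reading and writing one character at a time. Also, use the try-with-resources statement to manage the input and output resources.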

public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(BufferedReader br = Files.newBufferedReader(in, Charset.forName("windows-1252"));
        BufferedWriter bw = Files.newBufferedWriter(out, // default UTF-8
            StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}

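Note that for a successful conversion the number of characters read and written is the same, but only for the input encoding Windows-1252 is the number of characters identical to the number of bytes, i.e. the file size (when the entire file is valid).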
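A small sketch of that relationship, using a made-up six-character sample string: in windows-1252 every character occupies one byte, while in UTF-8 the non-ASCII characters need two or three bytes each. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CountDemo {
    // Illustrative helper: number of bytes the text occupies in the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "caf" + e-acute + no-break space + euro sign: six chars, all in windows-1252.
        String text = "caf\u00E9\u00A0\u20AC";
        System.out.println(text.length());                                    // 6
        System.out.println(byteCount(text, Charset.forName("windows-1252"))); // 6
        System.out.println(byteCount(text, StandardCharsets.UTF_8));          // 10
    }
}
```

The converted file is therefore usually larger than the input, even though the character counts printed by the conversion method match.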

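This conversion code example is included only for completeness. As said at the beginning, converting an HTML page without adapting the header may invalidate the file, and the conversion is not even necessary.

Note on the memory estimate: depending on the implementation, the ArrayList<Character> may even need four times the memory of the String.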

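Thanks. There is a lot of "noise" on this subject. I tried your improved solution and it works great! I had hoped that nulling the Character ArrayList would make it available for GC rather than tying up more immutable String memory. Your suggestions about buffers etc. were very helpful. And yes, I did change the charset setting in the "new" file to utf-8.

Setting the variable to null at the end of the operation is unnecessary, because the list becomes eligible for GC anyway. The same applies to String objects; being immutable does not prevent GC. But the ArrayList<Character> has a much higher memory consumption during the operation, because you have a list of references to Character objects rather than a wrapper around a char[] array (and it would be even worse if the JRE did not reuse most of the Character instances). As stated in the answer, not holding the whole file in memory at once also helps reduce memory consumption.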


// Variant using channels; assumes: import static java.nio.file.StandardOpenOption.*;
public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(Reader br = Channels.newReader(Files.newByteChannel(in), "windows-1252");
        Writer bw = Channels.newWriter(
            Files.newByteChannel(out, WRITE, CREATE, TRUNCATE_EXISTING),
            StandardCharsets.UTF_8)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}

public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    CharsetDecoder dec = Charset.forName("windows-1252")
            .newDecoder().onUnmappableCharacter(CodingErrorAction.IGNORE);
    try(Reader br = Channels.newReader(Files.newByteChannel(in), dec, -1);
…