Java: converting the character encoding of a windows-1252 input file to a UTF-8 output file

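I am processing an HTML document that I converted (programmatically) from Word via its save-as-HTML option. This HTML text file is windows-1252 encoded. (Yes, I have read a lot about bytes and Unicode code points, and I know that code points beyond 128 can take 2, 3, up to 6 bytes, etc.) I added quite a few non-printable characters to my Word document template and wrote code to evaluate each character (by its decimal equivalent). For sure, I know I don't want to allow decimal #160, which is how Microsoft Word translates a non-breaking space into HTML. I anticipate that in the near future people will put more of these "illegal" constructs into the templates, and I will need to trap them and deal with them, because they cause funny rendering in the browser. To wit (this is a dump to the Eclipse console; I put all the document lines into a map):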

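I replaced decimal #160 with #32 (a regular space) and then wrote the characters to a new file using UTF-8 encoding. Is my thinking sound? Can I use this technique to replace a specific character, or decide not to write it, based on its decimal equivalent? I wanted to avoid Strings, since I may be handling multiple documents and don't want to run out of memory, so I'm doing it in files: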

 public static void convert1252toUFT8(String fileName) throws IOException {   
    File f = new File(fileName);
    Reader r = new BufferedReader(new InputStreamReader(new FileInputStream(f), "windows-1252"));
    OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(fileName + "x"), StandardCharsets.UTF_8); 
    List<Character> charsList = new ArrayList<>(); 
    int count = 0;

    try {
        int intch;
        while ((intch = r.read()) != -1) {   //reads a single character and returns integer equivalent
            int ch = (char)intch;
            //System.out.println("intch=" + intch + " ch=" + ch + " isValidCodePoint()=" + Character.isValidCodePoint(ch) 
            //+ " isDefined()=" + Character.isDefined(ch) + " charCount()=" + Character.charCount(ch) + " char=" 
            //+ (char)intch);

            if (Character.isValidCodePoint(ch)) {
                if (intch == 160 ) {
                    intch = 32;
                }
                charsList.add((char)intch);
                count++;
            } else {
                System.out.println("unexpected character found but not dealt with.");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        System.out.println("Chars read in=" + count + " Chars read out=" + charsList.size());
        for(Character item : charsList) {
            writer.write((char)item);
        }
        writer.close();
        r.close();
        charsList = null;

        //check that #160 was replaced
        //File f2 = new File(fileName + "x"); 
        //Reader r2 = new BufferedReader(new InputStreamReader(new FileInputStream(f2), "UTF-8")); 
        //int intch2;
        //while ((intch2 = r2.read()) != -1) { //reads a single character and returns integer equivalent 
        //int ch2 = (char)intch2; 
        //System.out.println("intch2=" + intch2 + " ch2=" + ch2 + " isValidCodePoint()=" +
        //Character.isValidCodePoint(ch2) + " char=" + (char)intch2); 
        //}

    }   
}

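First of all, there is nothing wrong with an HTML page using an encoding other than UTF-8. In fact, the document very likely contains a line like

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

in its header, and changing the file's character encoding without adapting that header line renders the document invalid.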
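To make the point about the header concrete, here is a minimal sketch of adapting the charset declaration as part of a conversion. The class name `CharsetDeclFix`, the method `fixCharsetDeclaration` and the regular expression are assumptions made for illustration; real documents may declare their charset in other forms that this pattern does not cover.

```java
import java.util.regex.Pattern;

class CharsetDeclFix {
    // Hypothetical pattern: "charset=windows-1252", optionally quoted, case-insensitive.
    private static final Pattern DECL =
        Pattern.compile("(charset\\s*=\\s*[\"']?)windows-1252", Pattern.CASE_INSENSITIVE);

    // Returns the given HTML text with the windows-1252 declaration replaced by utf-8.
    public static String fixCharsetDeclaration(String html) {
        return DECL.matcher(html).replaceAll("$1utf-8");
    }

    public static void main(String[] args) {
        String head = "<meta http-equiv=\"Content-Type\" "
                    + "content=\"text/html; charset=windows-1252\">";
        System.out.println(fixCharsetDeclaration(head));
    }
}
```

Since the declaration sits in the document head, applying such a helper to the first lines of the file would usually be enough; it does not require holding the whole document in memory.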

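Further, there is no reason to replace codepoint 160 in the document: it is Unicode's standard non-breaking space character (U+00A0), which is why &#160; is a valid alternative to &nbsp;. If the document's charset supports this codepoint, using it directly is valid too.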
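As a small illustration of why the non-breaking space survives the conversion unharmed, the following sketch decodes a few windows-1252 bytes and re-encodes them as UTF-8. The class name `NbspDemo` and the method `recode` are hypothetical names for this example, and it uses a String only because the input is three bytes long.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NbspDemo {
    // Decodes the bytes as windows-1252 and re-encodes the text as UTF-8.
    // U+00A0 (decimal 160) is passed through without any replacement.
    public static byte[] recode(byte[] cp1252Bytes) {
        String text = new String(cp1252Bytes, Charset.forName("windows-1252"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] in = { 'a', (byte) 0xA0, 'b' };   // "a", non-breaking space, "b"
        for (byte b : recode(in)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();                     // prints: 61 C2 A0 62
    }
}
```

The single byte 0xA0 becomes the two-byte UTF-8 sequence C2 A0, which is still the same character, so a browser reading the file as UTF-8 renders it correctly.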

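Your attempt to avoid strings is a typical case of premature optimization. Without an actual measurement, you ended up with a solution like ArrayList<Character>, which consumes twice the memory of a String.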

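If you want to copy or convert a file, you should not keep the entire file in memory. Just write the data back before reading the next chunk, but for efficiency use a buffer rather than reading and writing one character at a time. You should also use the try-with-resources statement to manage the input and output resources.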

public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(BufferedReader br = Files.newBufferedReader(in, Charset.forName("windows-1252"));
        BufferedWriter bw = Files.newBufferedWriter(out, // default UTF-8
            StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}
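Note that for a successful conversion, the number of characters read and written is the same, but only for the input encoding Windows-1252 is the number of characters identical to the number of bytes, i.e. the file size (when the whole file is valid).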
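The relation between byte and character counts can be checked with a short sketch like the one below. The class name `CountDemo` and the method `counts` are hypothetical, and the sample bytes are just an illustrative input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CountDemo {
    // Returns { input byte count, decoded char count, UTF-8 byte count }.
    public static int[] counts(byte[] cp1252Bytes) {
        String text = new String(cp1252Bytes, Charset.forName("windows-1252"));
        int utf8Bytes = text.getBytes(StandardCharsets.UTF_8).length;
        return new int[] { cp1252Bytes.length, text.length(), utf8Bytes };
    }

    public static void main(String[] args) {
        // "café €" in windows-1252: é is 0xE9, € is 0x80
        byte[] in = { 'c', 'a', 'f', (byte) 0xE9, ' ', (byte) 0x80 };
        int[] c = counts(in);
        System.out.println("input bytes=" + c[0] + " chars=" + c[1] + " UTF-8 bytes=" + c[2]);
        // prints: input bytes=6 chars=6 UTF-8 bytes=9
    }
}
```

The input has as many bytes as characters, while the UTF-8 output is larger because é needs two bytes and € needs three.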

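This conversion code example is included only for completeness. As said at the beginning, converting an HTML page without adapting the header may invalidate the file, and the conversion isn't even necessary.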


Note on the memory comparison with String above: depending on the implementation, it can even be four times.

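Thanks - there is a lot of "noise" out there on this topic. I tried your improved solution and it works great! I was hoping that nulling the Character ArrayList would make it available for gc, as opposed to taking up more immutable String memory. Your advice about buffers etc. was very useful, and yes, I did change the charset in the "new" file to utf-8.

Setting the variable to null at the end of the operation is unnecessary, as the list is eligible for gc anyway. The same applies to String objects; being immutable does not prevent gc. But the ArrayList has a much higher memory consumption during the operation, because you have a list of references to Character objects instead of a wrapper around a char[] array (it would be even worse if the JRE did not reuse most of the Character instances). As said in the answer, not holding the entire file in memory at once also helps reduce memory consumption.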
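
// Variant: the same conversion using NIO channels
// (assumes static imports of StandardOpenOption.WRITE, CREATE and TRUNCATE_EXISTING)
public static void convert1252toUFT8(String fileName) throws IOException {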
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(Reader br = Channels.newReader(Files.newByteChannel(in), "windows-1252");
        Writer bw = Channels.newWriter(
            Files.newByteChannel(out, WRITE, CREATE, TRUNCATE_EXISTING),
            StandardCharsets.UTF_8)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}

// Variant: an explicit CharsetDecoder configured to ignore unmappable characters
public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    CharsetDecoder dec = Charset.forName("windows-1252")
            .newDecoder().onUnmappableCharacter(CodingErrorAction.IGNORE);
    try(Reader br = Channels.newReader(Files.newByteChannel(in), dec, -1);
…