Java file.encoding when reading UTF-8 files and processing UTF-8 strings

java utf-8 character-encoding

I am trying to read a UTF-8 encoded XML file and pass UTF-8 strings to native code (a C++ DLL).

My problem is best explained with a sample program:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;

public class UniCodeTest {

    private static void testByteConversion(String input) throws UnsupportedEncodingException  {    

        byte[] utf_8 = input.getBytes("UTF-8");  // convert unicode string to UTF-8
        String test = new String(utf_8);         // Build a String with the UTF-8 equivalent chars
        byte[] utf_8_converted = test.getBytes();// Get the bytes: in effect this is called in the JNI wrapper on the C++ side to read them into a char*

        // simple workaround to print hex values (mask with 0xFF so negative bytes print as two hex digits)
        String utfString = "";
        for (int i = 0; i < utf_8.length; i++) {
            utfString += " " + Integer.toHexString(utf_8[i] & 0xFF);
        }

        String convertedUtfString = "";
        for (int i = 0; i < utf_8_converted.length; i++) {
            convertedUtfString += " " + Integer.toHexString(utf_8_converted[i] & 0xFF);
        }
        if (utfString.equals(convertedUtfString))   {
            System.out.println("Success" ); 
        }
        else {
            System.out.println("Failure" ); 
        }
    }

    public static void main(String[] args) {
        try {
              File inFile = new File("c:/test.txt");
              BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF8"));
              String str;
              while ((str = in.readLine()) != null) {
                  testByteConversion(str);
              }
              in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

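I ran the following experiments:

With the file.encoding property set to "UTF-8", I get "Success" for both inputs.

With file.encoding set to "CP-1252", I get "Success" for the first input and "Failure" for the second.

This is what I get in the failure case: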

utf_8           :  e0 ae a8 e0 ae a9 e0 af 8d e0 ae ae e0 af 88
utf_8_converted :  e0 ae a8 e0 ae a9 e0 af 3f e0 ae ae e0 af 88

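I don't understand why 8d is converted to 3f when file.encoding is set to CP-1252. Can anyone explain this to me?

I am missing the link between file.encoding and String operations.

Thanks in advance :)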

I only skimmed your post, but this is a strange step:

byte[] utf_8 = input.getBytes("UTF-8");  // convert unicode string to UTF-8
String test = new String(utf_8);

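You take a String in Java (which is an encoding-agnostic list of Unicode code points) and convert it to bytes with a given encoding (UTF-8), but then you construct a new String without specifying an encoding. In effect, test now contains the UTF-8 bytes decoded with the system encoding, which may or may not be a valid result, depending on what is in the string and which system encoding you use.

In the next step, you again get the bytes, in the default encoding, from the horrible entity that is "test". Assuming it even works (that is, the bytes of the original UTF-8 string form a valid byte array in whatever the system encoding is), this next step is essentially a no-op, because it uses the same system encoding that was used to build test: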

byte[] utf_8_converted = test.getBytes();
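
The round trip described above can be reproduced without touching file.encoding by naming the charset explicitly. What follows is only a minimal sketch, not the asker's code: the class DefaultCharsetRoundTrip and its helpers roundTrip and toHex are hypothetical names, and it assumes the JDK ships the windows-1252 charset (common, but not one of the charsets every Java platform is required to support). It pushes the bytes from the failure case through the same decode-then-encode sequence.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetRoundTrip {

    // Mimics new String(bytes).getBytes(), with 'charset' standing in for the platform default.
    static byte[] roundTrip(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset); // bytes the charset cannot decode become U+FFFD
        return decoded.getBytes(charset);            // chars the charset cannot encode become its replacement bytes
    }

    // Formats bytes as space-separated two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format(" %02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF-8 bytes from the failure case in the question.
        byte[] utf8 = {
            (byte) 0xe0, (byte) 0xae, (byte) 0xa8, (byte) 0xe0, (byte) 0xae, (byte) 0xa9,
            (byte) 0xe0, (byte) 0xaf, (byte) 0x8d, (byte) 0xe0, (byte) 0xae, (byte) 0xae,
            (byte) 0xe0, (byte) 0xaf, (byte) 0x88
        };
        System.out.println("original    :" + toHex(utf8));
        System.out.println("UTF-8       :" + toHex(roundTrip(utf8, StandardCharsets.UTF_8)));
        System.out.println("windows-1252:" + toHex(roundTrip(utf8, Charset.forName("windows-1252"))));
    }
}
```

On JDKs whose windows-1252 table leaves byte 0x8d unmapped, the last line should show 3f in its place, which matches the output in the question. The UTF-8 round trip is lossless for valid UTF-8 input.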

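I think this statement is the root of the problem:

byte[] utf_8_converted = test.getBytes();

From the API documentation:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array. The behavior of this method when this string cannot be encoded in the default charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.

Note that the default charset used for the conversion here is not UTF-8.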
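
The CharsetEncoder mentioned in the documentation can be used to make an encoding failure visible instead of having a replacement byte substituted silently. The following is a minimal sketch under assumptions: StrictEncoding and encodeStrict are hypothetical names, and ISO-8859-1 is used here only as a stand-in for a charset that cannot represent the input.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncoding {

    // Encodes 's' with 'charset' and throws rather than substituting a replacement byte.
    static byte[] encodeStrict(String s, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(s));
        byte[] result = new byte[buffer.remaining()];
        buffer.get(result);
        return result;
    }

    public static void main(String[] args) {
        // The five code points that the UTF-8 bytes in the question decode to.
        String text = "\u0BA8\u0BA9\u0BCD\u0BAE\u0BC8";
        try {
            System.out.println(encodeStrict(text, StandardCharsets.UTF_8).length + " bytes as UTF-8");
            encodeStrict(text, StandardCharsets.ISO_8859_1); // this charset cannot represent the text
        } catch (CharacterCodingException e) {
            System.out.println("Not encodable: " + e);
        }
    }
}
```

With CodingErrorAction.REPORT the caller gets a CharacterCodingException at the point of failure, which is usually easier to diagnose than a stray 3f in the output.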

Try this:
byte[] utf_8_converted = test.getBytes("UTF-8");
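
One caveat when applying this to the test program: new String(utf_8) would still decode with the platform default, so the charset probably needs to be named on both calls. A minimal sketch of that idea follows; ExplicitUtf8, toUtf8 and fromUtf8 are hypothetical names, and it uses StandardCharsets.UTF_8 (available since Java 7) rather than the "UTF-8" string so that no UnsupportedEncodingException has to be handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitUtf8 {

    // Bytes to hand to native code: encode once, naming the charset.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding names the charset as well, so file.encoding plays no role.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "\u0BA8\u0BA9\u0BCD\u0BAE\u0BC8";
        byte[] utf8 = toUtf8(input);
        byte[] again = toUtf8(fromUtf8(utf8));
        System.out.println(Arrays.equals(utf8, again) ? "Success" : "Failure");
    }
}
```

If the goal is only to give the JNI side a UTF-8 char*, the intermediate String is arguably unnecessary: the byte[] returned by toUtf8 can be passed to the native method directly.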
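I'm not sure, but looking at the code page layouts of the two encodings, I can see that 8d (141 decimal) is unassigned in CP-1252, while it is a valid byte in UTF-8. Maybe that is your problem.

The test program above is a stripped-down version of code written 10 years ago, so usability is not the issue here. The idea behind the code is to construct a UTF-8 encoded string so that the native code, which supports UTF-8, can access it.

By the way, what you are calling Unicode code points are UTF-16 encoded.

I am more interested in understanding how and why file.encoding affects the behavior. Any pointers on that would be more helpful. Thanks.

I am trying to understand how file.encoding affects the platform charset. Any pointers on this are much appreciated.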
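
As for the link between file.encoding and String operations: the no-argument String constructors and getBytes() use Charset.defaultCharset(), and on the JDKs I am aware of that default is derived from file.encoding when the JVM starts (on JDK 18 and later it defaults to UTF-8 unless overridden). The sketch below only inspects these values; ShowDefaultCharset and noArgUsesDefault are hypothetical names.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class ShowDefaultCharset {

    // True when the no-argument getBytes() matches encoding with Charset.defaultCharset().
    static boolean noArgUsesDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        System.out.println("no-arg getBytes() uses it: "
                + noArgUsesDefault("\u0BA8\u0BA9\u0BCD\u0BAE\u0BC8"));
    }
}
```

Running it once with -Dfile.encoding=UTF-8 and once with -Dfile.encoding=Cp1252 on the java command line should show whether the reported default charset follows the property on a given JDK.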