Java: How to determine the correct charset encoding of a stream

java file encoding stream character-encoding

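Following on from an earlier thread:

What is the best way to programmatically determine the correct charset encoding of an InputStream or file?

I have tried the following: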

File in =  new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());

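However, for a file that I know is encoded as ISO8859_1, the code above reports ASCII, which is wrong and does not let me render the file's content correctly back to the console.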

Can you choose the appropriate charset from the following list:

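You cannot determine the encoding of an arbitrary byte stream. That is the nature of encodings: an encoding is a mapping between byte values and what they represent, so every encoding "could" be the right one.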

The getEncoding() method returns the encoding that was set up for the stream. It will not guess the encoding for you.
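To make that concrete, here is a minimal sketch (the class and method names are made up for illustration). The same bytes are wrapped in two readers, and each one simply reports the charset it was constructed with:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GetEncodingDemo {
    /** Returns the charset the reader was configured with, not one detected from the data. */
    static Charset configuredCharset(byte[] data, Charset chosen) {
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(data), chosen)) {
            // getEncoding() may return a historical name such as "ISO8859_1";
            // Charset.forName resolves that alias to the canonical charset.
            return Charset.forName(reader.getEncoding());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1 (0xE9 is é)
        byte[] latin1Bytes = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println(configuredCharset(latin1Bytes, StandardCharsets.ISO_8859_1));
        System.out.println(configuredCharset(latin1Bytes, StandardCharsets.UTF_8));
    }
}
```

The second call reports UTF-8 even though the bytes are not valid UTF-8, because nothing is inspected.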

Some streams tell you which encoding was used to create them: XML and HTML, for example. An arbitrary byte stream does not.
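For XML, the declared encoding can be read with the JDK's StAX API. This is only a sketch under the assumption that the input is well-formed XML; the class and method names are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class DeclaredXmlEncoding {
    /** Returns the encoding named in the XML declaration (documented as null if none is declared). */
    static String declaredEncoding(byte[] xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xml));
            try {
                return reader.getCharacterEncodingScheme();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(xml));
    }
}
```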

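Anyway, if you have to, you can try to guess the encoding yourself. Every language has a typical frequency for each character. In English the character e appears very often, but ê appears very rarely. An ISO-8859-1 stream usually contains no 0x00 bytes, whereas a UTF-16 stream contains a lot of them.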
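The 0x00 observation can be turned into a crude check. The sketch below is only an illustration: the names are hypothetical and the 20% threshold is an arbitrary assumption, not a tested value.

```java
import java.nio.charset.StandardCharsets;

final class NullByteHeuristic {
    /** Fraction of bytes equal to 0x00; 0.0 for empty input. */
    static double zeroByteRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int zeros = 0;
        for (byte b : data) {
            if (b == 0) {
                zeros++;
            }
        }
        return (double) zeros / data.length;
    }

    /** Crude guess: many NUL bytes suggest UTF-16 rather than a single-byte encoding. */
    static boolean looksLikeUtf16(byte[] data) {
        return zeroByteRatio(data) > 0.2; // threshold is an assumption
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUtf16("hello".getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(looksLikeUtf16("hello".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

This only works well for text that is mostly Latin script; UTF-16 text in other scripts can contain few or no zero bytes.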

Or: you can ask the user. I have seen applications that show you a snippet of the file in different encodings and ask you to select the "correct" one.
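A rough sketch of building such previews, assuming you already have the first bytes of the file in memory (the names here are made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class EncodingPreview {
    /** Decodes the first maxBytes bytes with each candidate charset, keyed by charset name. */
    static Map<String, String> previews(byte[] data, int maxBytes, Charset... candidates) {
        int length = Math.min(data.length, maxBytes);
        Map<String, String> result = new LinkedHashMap<>();
        for (Charset candidate : candidates) {
            // new String(...) never throws on bad input; it substitutes replacement characters
            result.put(candidate.name(), new String(data, 0, length, candidate));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = {'c', 'a', 'f', (byte) 0xE9}; // "café" in ISO-8859-1
        previews(data, 64, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)
                .forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

The user then picks whichever line reads correctly.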

If you don't know the encoding of your data, it is hard to determine, but you can try a library that guesses it.

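You can certainly validate the file against a particular charset by decoding it with that charset and watching for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded result: for example, do you know beforehand whether the characters are restricted to some subset, or whether the text follows some strict format? The bottom line is that charset detection is guesswork without any guarantees.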
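A minimal sketch of that validation using java.nio's CharsetDecoder configured to report errors instead of silently replacing them (the class and method names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CharsetValidator {
    /** True if the bytes decode without malformed-input or unmappable-character errors. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false; // the charset is definitely wrong for this data
        }
    }

    public static void main(String[] args) {
        byte[] data = {'c', 'a', 'f', (byte) 0xE9}; // "café" in ISO-8859-1
        System.out.println("US-ASCII:   " + decodesCleanly(data, StandardCharsets.US_ASCII));
        System.out.println("UTF-8:      " + decodesCleanly(data, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + decodesCleanly(data, StandardCharsets.ISO_8859_1));
    }
}
```

A `true` result only means the charset was not ruled out. ISO-8859-1 maps every byte value, so it passes for any input.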

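For ISO8859_1 files, there is no easy way to distinguish them from ASCII. For Unicode files, however, the encoding can often be detected from the first few bytes of the file.

UTF-8 and UTF-16 files may begin with a byte order mark (BOM). The BOM is the character U+FEFF, ZERO WIDTH NO-BREAK SPACE.

Unfortunately, for historical reasons, Java does not detect this automatically. Programs like Notepad check for the BOM and use the appropriate encoding. On Unix or Cygwin, you can check for the BOM with the file command. For example: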

$ file sample2.sql 
sample2.sql: Unicode text, UTF-16, big-endian

For Java, I suggest you look at code that detects the common file formats and selects the correct encoding.
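As a rough idea of what BOM-based detection involves, here is a minimal sketch. It is not the code recommended above, and the class and method names are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class BomSniffer {
    /** Maps a leading byte order mark to a charset; empty if no known BOM is present. */
    static Optional<Charset> charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) { // mask to compare as unsigned
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xFE, (byte) 0xFF, 0x00, 'h'};
        System.out.println(charsetFromBom(head).map(Charset::name).orElse("no BOM"));
    }
}
```

Note that this sketch ignores UTF-32, whose little-endian BOM (FF FE 00 00) starts with the same two bytes as the UTF-16LE one, and that the caller still has to skip the BOM bytes before decoding.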

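I found a nice third-party library that can detect the actual encoding.

I haven't tested it extensively, but it seems to work well.

The libraries above are simple BOM detectors, which of course only work if there is a BOM at the beginning of the file. Look for one that actually scans the text.

Check this out: ICU4J. It has classes for detecting the charset of an input stream, and it can be as simple as this: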

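import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

// setText(InputStream) needs a stream that supports mark/reset, hence the buffer
BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();

if (cm != null) {
    reader = cm.getReader();
    charset = cm.getName();
} else {
    // UnsupportedCharsetException has no no-arg constructor; it takes the charset name
    throw new UnsupportedCharsetException("unknown");
}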

I have used this library, which is similar to jchardet, to detect encodings in Java.

If you use ICU4J, here is my code:

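String charset = "ISO-8859-1"; // default charset, put whatever you want

byte[] fileContent = null;
FileInputStream fin = null;

// create the FileInputStream object
fin = new FileInputStream(file.getPath());

/*
 * Create a byte array large enough to hold the content of the file.
 * Use File.length to determine the size of the file in bytes.
 */
fileContent = new byte[(int) file.length()];

/*
 * To read the content of the file into the byte array, use the
 * int read(byte[] byteArray) method of the FileInputStream class.
 * Note: a single read() call is not guaranteed to fill the array.
 */
fin.read(fileContent);
fin.close();

byte[] data = fileContent;

CharsetDetector detector = new CharsetDetector();
detector.setText(data);

CharsetMatch cm = detector.detect();

if (cm != null) {
    int confidence = cm.getConfidence();
    System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
    // Here you have the encoding name and the confidence.
    // In my case, if the confidence is > 50 I return the encoding, otherwise the default value.
    if (confidence > 50) {
        charset = cm.getName();
    }
}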

Remember to add everything it needs (imports and exception handling).

I hope this works for you.

Here are my favorites:

Dependencies:

<dependency>
  <groupId>org.apache.any23</groupId>
  <artifactId>apache-any23-encoding</artifactId>
  <version>1.1</version>
</dependency>

<dependency>
  <groupId>org.codehaus.guessencoding</groupId>
  <artifactId>guessencoding</artifactId>
  <version>1.4</version>
  <type>jar</type>
</dependency>

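An alternative to TikaEncodingDetector is to use Tika's AutoDetectReader.

Which library to use? As of this writing, three libraries stand out: GuessEncoding, ICU4j and juniversalchardet.

One further library is left out because it uses ICU4j 3.4 under the hood.

How can you tell which one detected the right charset (or something as close as possible)? It is impossible to verify the charset detected by each of the libraries above. However, you can ask each of them in turn and score the responses they return.

How do you score the responses? Each response is assigned one point. The more points a charset collects, the more confidence you can have in it. This is a simple scoring method; you can devise more elaborate ones.

Is there any sample code? Here is a full snippet implementing the strategy described above.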

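import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import com.glaforge.i18n.io.CharsetToolkit;
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import org.mozilla.universalchardet.UniversalDetector;

public static String guessEncoding(InputStream input) throws IOException {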
    // Load input data
    long count = 0;
    int n = 0, EOF = -1;
    byte[] buffer = new byte[4096];
    ByteArrayOutputStream output = new ByteArrayOutputStream();

    while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
        output.write(buffer, 0, n);
        count += n;
    }
    
    if (count > Integer.MAX_VALUE) {
        throw new RuntimeException("Inputstream too large.");
    }

    byte[] data = output.toByteArray();

    // Detect encoding
    Map<String, int[]> encodingsScores = new HashMap<>();

    // * GuessEncoding
    updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());

    // * ICU4j
    CharsetDetector charsetDetector = new CharsetDetector();
    charsetDetector.setText(data);
    charsetDetector.enableInputFilter(true);
    CharsetMatch cm = charsetDetector.detect();
    if (cm != null) {
        updateEncodingsScores(encodingsScores, cm.getName());
    }

    // * juniversalchardet
    UniversalDetector universalDetector = new UniversalDetector(null);
    universalDetector.handleData(data, 0, data.length);
    universalDetector.dataEnd();
    String encodingName = universalDetector.getDetectedCharset();
    if (encodingName != null) {
        updateEncodingsScores(encodingsScores, encodingName);
    }

    // Find winning encoding
    Map.Entry<String, int[]> maxEntry = null;
    for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
        if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
            maxEntry = e;
        }
    }

    String winningEncoding = maxEntry.getKey();
    //dumpEncodingsScores(encodingsScores);
    return winningEncoding;
}

private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
    String encodingName = encoding.toLowerCase();
    int[] encodingScore = encodingsScores.get(encodingName);

    if (encodingScore == null) {
        encodingsScores.put(encodingName, new int[] { 1 });
    } else {
        encodingScore[0]++;
    }
}    

private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
    System.out.println(toString(encodingsScores));
}

private static String toString(Map<String, int[]> encodingsScores) {
    String GLUE = ", ";
    StringBuilder sb = new StringBuilder();

    for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
        sb.append(e.getKey() + ":" + e.getValue()[0] + GLUE);
    }
    int len = sb.length();
    sb.delete(len - GLUE.length(), len);

    return "{ " + sb.toString() + " }";
}

// Apache Tika: AutoDetectReader guesses the charset of the stream it wraps.
Charset charset = new AutoDetectReader(new FileInputStream(file)).getCharset();

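// Note: ISO-8859-1 maps every byte value, so decoding with it never reports
// malformed input; the encodings listed after it are never tried.
final String[] encodings = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16" };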

List<String> lines;

for (String encoding : encodings) {
    try {
        lines = Files.readAllLines(path, Charset.forName(encoding));
        for (String line : lines) {
            // do something...
        }
        break;
    } catch (IOException ioe) {
        System.out.println(encoding + " failed, trying next.");
    }
}

...    
import org.xml.sax.InputSource;
...

InputSource inputSource = new InputSource(inputStream);
inputStreamReader = new InputStreamReader(
    inputSource.getByteStream(), inputSource.getEncoding()
  );

<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...