jsoup and character encoding


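I have a bunch of questions about jsoup's charset support, most of them backed by quotes from the API docs:

1. jsoup.Jsoup:

public static Document parse(File in, String charsetName)

... set to null to determine from the http-equiv meta tag, if present, or fall back to UTF-8 ...

Does this mean the "charset" meta tag is not used to detect the encoding?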

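2. jsoup.nodes.Document:

public void charset(Charset charset)

... This method is equivalent to OutputSettings.charset(Charset), but in addition ...

public Charset charset()

... This method is equivalent to Document.OutputSettings.charset() ...

Does this mean there is no separate "input charset" and "output charset", and that they are really the same setting?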

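3. jsoup.nodes.Document:

public void charset(Charset charset)

... Obsolete charset / encoding definitions are removed ...

Does this remove the "http-equiv" meta tag rather than the "charset" meta tag? For backward compatibility, is there a way to keep both?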

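4. jsoup.nodes.Document.OutputSettings:

public Charset charset()

Where possible (when parsing from a URL or File), the document's output charset is automatically set to the input charset. Otherwise, it defaults to UTF-8.

I need to know whether the document did not specify an encoding*. Does this mean jsoup cannot provide that information?

*Instead of defaulting to UTF-8, I would run juniversalchardet.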

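1. The docs are out of date / incomplete. Jsoup does use the charset meta tag, as well as the http-equiv tag, to detect the charset. From the source, we can see that this method looks like this: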

public static Document parse(File in, String charsetName) throws IOException {
    return DataUtil.load(in, charsetName, in.getAbsolutePath());
}

DataUtil.load in turn calls parseByteData(...), which looks like this (the parseByteData excerpt is reproduced at the end of this page):
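The http-equiv branch of that excerpt depends on pulling the charset parameter out of a Content-Type value. As a rough illustration of that step, here is a minimal standalone sketch. It is not jsoup's own code: the class name `ContentTypeCharset`, the method `find`, and the regular expression are all assumptions made for this example.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of jsoup): extracts the charset parameter from a
// Content-Type value such as "text/html; charset=UTF-8".
final class ContentTypeCharset {
    // Matches charset=<name>, case-insensitively, tolerating spaces and optional quotes.
    private static final Pattern CHARSET =
            Pattern.compile("(?i)\\bcharset\\s*=\\s*[\"']?([^\\s,;\"']+)");

    static Optional<String> find(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return Optional.empty();
        }
        String name = m.group(1);
        try {
            // Only report names that the running JVM can actually decode.
            return Charset.isSupported(name) ? Optional.of(name) : Optional.empty();
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException (a subclass) signals a malformed name.
            return Optional.empty();
        }
    }
}
```

Calling `ContentTypeCharset.find("text/html; charset=UTF-8")` yields an Optional holding `UTF-8`, while a value without a charset parameter yields an empty Optional. Returning an Optional rather than a default keeps "nothing declared" distinguishable from "UTF-8 declared".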

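2. I'm not quite sure what you mean here, but no: the output charset setting controls which characters are escaped when the document's HTML/XML is printed to a string, whereas the input charset determines how the file is read.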
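To make the input/output distinction concrete, here is a small sketch that uses only the JDK, not jsoup. The class `EscapeForCharset` and its methods are hypothetical names for this example, and the numeric hex entity format is an assumption; the entities jsoup actually emits depend on its own escape settings.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical illustration (not jsoup code) of the two roles a charset can play.
final class EscapeForCharset {
    // Output side: characters the target charset cannot encode are written as
    // numeric character references instead of being lost.
    static String escape(String text, Charset outputCharset) {
        CharsetEncoder encoder = outputCharset.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    // Input side: the same bytes yield different text depending on the charset
    // used to decode them.
    static String decode(byte[] bytes, Charset inputCharset) {
        return new String(bytes, inputCharset);
    }
}
```

With US-ASCII as the output charset, an accented letter comes out as a numeric entity, while with UTF-8 the text passes through unchanged. On the input side, the two bytes 0xC3 0xA9 decode to one character as UTF-8 but to two characters as ISO-8859-1, which is why the input charset has to be determined before parsing.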

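3. It will only remove meta[name=charset] elements. From the source, this is the method that updates/removes charset definitions in the document (ensureMetaCharsetElement(), reproduced at the end of this page):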

Essentially, if you call charset(...) and the document has no charset meta tag, one is added; otherwise the existing one is updated. The http-equiv tag is not touched.

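4. If you want to know whether the document specifies an encoding, just look for an http-equiv charset or meta charset tag; if there is no such tag, the document does not specify an encoding.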
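One way to run that check on the raw bytes, before any parser has picked a default, is sketched below. It uses only the JDK and is not jsoup's implementation: the class `DeclaredEncoding`, the 1024-byte scan window, and the regular expression are assumptions for illustration, and a regex prescan is much cruder than a real HTML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not jsoup code): reports the encoding a document declares
// via a BOM or a meta tag, or empty if it declares none.
final class DeclaredEncoding {
    // Covers <meta charset="..."> and the charset= parameter inside the content
    // attribute of a <meta http-equiv="Content-Type" ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?([^\\s\"'/>;]+)");

    static Optional<String> find(byte[] data) {
        Optional<String> bom = fromBom(data);
        if (bom.isPresent()) {
            return bom;
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        int limit = Math.min(data.length, 1024); // assumed scan window
        String head = new String(data, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // UTF-32 BOMs are ignored here for brevity.
    static Optional<String> fromBom(byte[] d) {
        if (startsWith(d, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(d, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(d, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty();
    }

    private static boolean startsWith(byte[] d, int... prefix) {
        if (d.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((d[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

An empty result is the signal to hand the bytes to a statistical detector such as juniversalchardet instead of assuming UTF-8; a non-empty result can be passed on as the charset name when parsing.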

Jsoup is open source, so you can read the source yourself to see exactly how it works (and you could also modify it to do exactly what you want!).

I'll update this answer with more details when I have time. Let me know if you have any other questions.

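Regarding [4], I opened an issue about it. Since most of the methods are package-private, reimplementing this myself would require a lot of duplicated code (checking the tags, parsing the content type, checking the BOM). The check cannot be done within Jsoup alone (i.e. no header, but charset is UTF-8), because that would ignore the BOM. Regarding [2], your wonderfully detailed answer already covered it.

Re. your first comment and the issue you opened: in my opinion this is unlikely to be improved/fixed, since Jsoup's author no longer seems to spend that much time on it. Glad my answer helped you, @user19087.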

// reads bytes first into a buffer, then decodes with the appropriate charset. done this way to support
// switching the chartset midstream when a meta http-equiv tag defines the charset.
// todo - this is getting gnarly. needs a rewrite.
static Document parseByteData(ByteBuffer byteData, String charsetName, String baseUri, Parser parser) {
    String docData;
    Document doc = null;

    if (charsetName == null) { // determine from meta. safe parse as UTF-8
        // look for <meta http-equiv="Content-Type" content="text/html;charset=gb2312"> or HTML5 <meta charset="gb2312">
        docData = Charset.forName(defaultCharset).decode(byteData).toString();
        doc = parser.parseInput(docData, baseUri);
        Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();
        if (meta != null) { // if not found, will keep utf-8 as best attempt
            String foundCharset = null;
            if (meta.hasAttr("http-equiv")) {
                foundCharset = getCharsetFromContentType(meta.attr("content"));
            }
            if (foundCharset == null && meta.hasAttr("charset")) {
                try {
                    if (Charset.isSupported(meta.attr("charset"))) {
                        foundCharset = meta.attr("charset");
                    }
                } catch (IllegalCharsetNameException e) {
                    foundCharset = null;
                }
            }

            (Snip...)

The line in that excerpt showing that Jsoup also checks meta[charset] is:

Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();

private void ensureMetaCharsetElement() {
    if (updateMetaCharset) {
        OutputSettings.Syntax syntax = outputSettings().syntax();

        if (syntax == OutputSettings.Syntax.html) {
            Element metaCharset = select("meta[charset]").first();

            if (metaCharset != null) {
                metaCharset.attr("charset", charset().displayName());
            } else {
                Element head = head();

                if (head != null) {
                    head.appendElement("meta").attr("charset", charset().displayName());
                }
            }

            // Remove obsolete elements
            select("meta[name=charset]").remove();
        } else if (syntax == OutputSettings.Syntax.xml) {
            (Snip..)