Parsing XML files that contain HTML entities in Java without changing the XML

java xml xml-parsing
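I have to parse a bunch of XML files in Java that sometimes (and invalidly) contain HTML entities such as &mdash; and so on. I understand that the correct way to deal with this is to add suitable entity declarations to the XML file before parsing. However, I cannot do that, as I have no control over those XML files.
Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I have not been able to find one in the API.
I would like to use: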
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

DocumentBuilder parser = dbf.newDocumentBuilder();
Document        doc    = parser.parse( stream );

I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?
Here is a full example:
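import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class Main {
    public static void main( String [] args ) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder parser = dbf.newDocumentBuilder();
        Document        doc    = parser.parse( new FileInputStream( "test.xml" ));
    }
}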
with test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>
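
One way to check the resolveEntity idea against this input is to register an EntityResolver on the DocumentBuilder and count how often it is consulted. The following is only a sketch, assuming the JDK's default parser; the class name ResolverProbe and the method resolverCalls are made up for illustration. Because test.xml has no DOCTYPE and references no external entity, the resolver is not expected to be asked about &nbsp; or &mdash;, and the parse fails with a SAXParseException instead.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class ResolverProbe {
    /** Parses the given XML and returns how many times the entity resolver was called. */
    static int resolverCalls(String xml) {
        AtomicInteger calls = new AtomicInteger();
        try {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Returning null tells the parser to fall back to its default resolution.
            parser.setEntityResolver((publicId, systemId) -> {
                calls.incrementAndGet();
                return null;
            });
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Expected for undeclared entities such as &nbsp; in a document without a DTD.
            System.err.println("parse failed: " + e.getMessage());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return calls.get();
    }

    public static void main(String[] args) {
        String xml = "<foo><bar>Some&nbsp;text &mdash; invalid!</bar></foo>";
        System.out.println("resolver calls: " + resolverCalls(xml));
    }
}
```

This matches the observation in the comments that a breakpoint in the callback is never hit: the resolver deals with external entities, not with undeclared entity references in the text.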

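Update: I have been stepping through the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is, or whether there is one. Just how many layers of an onion can one stack on top of each other?
The key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add things to it before it is used, or that attempts to resolve entities without going through that class.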
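Just to throw in a different approach:
You could wrap your input stream in a stream implementation that replaces the entities with something legal.
While this is certainly a hack, it should be a quick and easy solution (or better: workaround).

It is not as elegant and clean as a solution inside the XML framework, though.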
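
A minimal sketch of this workaround, under some loud assumptions: it works on a String that has already been decoded rather than on a true streaming wrapper, the entity table is a hypothetical two-entry stand-in for a full HTML entity list, and the class and method names (EntityRewrite, rewriteEntities) are invented here. Known HTML entity names are rewritten to numeric character references, which any XML parser accepts; unknown names and the XML predefined entities are left untouched.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityRewrite {
    // Hypothetical, deliberately tiny table; a real one would list every HTML entity name.
    private static final Map<String, Integer> HTML_ENTITIES = Map.of(
            "nbsp", 0x00A0,
            "mdash", 0x2014);

    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Rewrites known HTML entities to numeric character references. */
    static String rewriteEntities(String xml) {
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            // Names not in the table (including &amp; and &lt;) are copied through unchanged.
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Parses the rewritten text with the usual DOM API. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(rewriteEntities(xml))));
    }
}
```

Note that a plain text substitution like this also touches entity-like text inside CDATA sections and comments, and that reading the file into a String requires knowing its encoding up front, which is the concern raised in the comments.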
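Question 1: I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;
XML has only five predefined entities, and &mdash; is not among them. It is valid only in plain HTML or in legacy JSP. So SAX will not help. It can be done using StAX, which has a high-level iterator-based API. (Collected from this link.)
Question 2: I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?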
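The Streaming API for XML, called StAX, is an API for reading and writing XML documents.

StAX is a pull-parsing model. The application takes control of parsing the XML document by pulling (taking) events from the parser.
The core StAX API falls into two categories:

Cursor-based API: the low-level API. It lets the application process XML as a stream of tokens, or events.
Iterator-based API: the higher-level API. It lets the application process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.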
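
To make the difference between the two styles concrete, here is a small sketch that collects element names from a well-formed document, once with the cursor-based XMLStreamReader and once with the iterator-based XMLEventReader. The class and method names (StaxStyles, elementsViaCursor, elementsViaIterator) are invented for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

public class StaxStyles {
    /** Cursor style: the reader itself is the cursor, and next() returns an int event code. */
    static List<String> elementsViaCursor(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    /** Iterator style: every step hands back a self-contained XMLEvent object. */
    static List<String> elementsViaIterator(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    names.add(event.asStartElement().getName().getLocalPart());
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }
}
```

The event objects of the iterator style can be stored, filtered, or passed on to a writer, which is why it is usually the more convenient starting point for rewriting a document.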


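The StAX API supports the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:
"Requires the parser to replace internal entity references with their replacement text and report them as characters."
This can be set on an XMLInputFactory, which is then in turn used to construct an XMLEventReader or XMLStreamReader.

However, the API is careful to say that this property is intended only to force the implementation to perform the replacement, not to force it to leave entity references unreplaced.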
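
As a rough sketch of how one might try this property: the code below switches replacement off and records what the parser reports. Whether an undeclared entity such as &nbsp; then surfaces as an entity-reference event or is still rejected with an error depends on the StAX implementation, in line with the caveat above, so the sketch handles both outcomes and makes no claim about which one occurs. The class and method names (StaxEntityProbe, describe) are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;

public class StaxEntityProbe {
    /** Returns a description of the text and entity events seen, or of the parse error. */
    static List<String> describe(String xml) {
        List<String> seen = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser not to expand entity references itself.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
        try {
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isEntityReference()) {
                    // A hook like this is where a replacement for the entity could be chosen.
                    seen.add("entity: " + ((EntityReference) event).getName());
                } else if (event.isCharacters()) {
                    seen.add("text: " + event.asCharacters().getData());
                }
            }
        } catch (XMLStreamException e) {
            // Some implementations may still treat the undeclared entity as a fatal error.
            seen.add("rejected: " + e.getMessage());
        }
        return seen;
    }

    public static void main(String[] args) {
        describe("<foo><bar>Some&nbsp;text &mdash; invalid!</bar></foo>")
                .forEach(System.out::println);
    }
}
```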
You can give it a try. I hope it solves your problem. For your case:
Main.java
test.xml:
CompactTokenizer.java
For more information, you can follow the tutorial.




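Another approach, since you are not using a strict OXM approach:
you might want to try a less strict parser, such as JSoup.
This will stop the immediate problems with invalid XML schemas and the like, but it only moves the problem into your code.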
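I would use a library like Jsoup for this purpose. I tested the following and it works. I don't know if this helps. It can be found here:
public static void main(String args[]){
    String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
                  "<bar>Some&nbsp;text &mdash; invalid!</bar></foo>";
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

    for (Element e : doc.select("bar")) {
        System.out.println(e);
    }
}

Result:
<bar>
 Some&nbsp;text — invalid!
</bar>

Loading from a file can be found here: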
I did something similar yesterday, where I needed to add values from an unzipped XML stream to a database.
// imports (I'm not sure all of them are necessary)
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// I have not tested this code right now because I am at work. It should work,
// but you may need to make small changes.
InputSource is = new InputSource(new FileInputStream("test.xml"));

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(is);
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
String words= xpath.evaluate("/foo/bar", doc.getDocumentElement());
ParsingHexToChar.parseToChar(words);

// library used: commons-lang3.jar
// method that decodes HTML entities
public static String parseToChar( String words){

    String decode= org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(words);

        return decode;
 }

Try the following, using the org.apache.commons packages:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();

InputStream in = new FileInputStream(xmlfile);    
String unescapeHtml4 = IOUtils.toString(in);

CharSequenceTranslator obj = new AggregateTranslator(new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
          new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE())    
         );

unescapeHtml4 = obj.translate(unescapeHtml4);
StringReader readerInput= new StringReader(unescapeHtml4);

InputSource is = new InputSource(readerInput);
Document doc    = parser.parse(is);    

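Can you provide some sample data? Does it mix XML and HTML at the same time?
@jtahlborn: This callback does not seem to be invoked; I set a breakpoint in there and it is never hit.
I use the JSoup API for parsing HTML files. It is open source and has all sorts of utility methods needed to parse HTML.
Indeed a hack :-) How would I deal with character sets? To look for &...;, for example, I have to know the character set, but the XML file specifies it only in its first line, and you have to know it beforehand.
Sure, you could push the hack further and read the XML header to work out the charset and treat the input accordingly. A solution intrinsic to the XML stack would still be preferable.
My existing code processes Documents, not events. Is there a way to use StAX to "filter" the entities (e.g. replace them with something else) and still produce a Document at the end of the process, so that I don't have to redo all of my code? (Preferably without parsing the XML twice.)
@johannesenst StAX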
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

import java.util.Arrays;

public class StAXExpand {   
    static XMLStreamWriter xmlsw = null;
    public static void main(String[] argv) {
        try {
            xmlsw = XMLOutputFactory.newInstance()
                          .createXMLStreamWriter(System.out);
            CompactTokenizer tok = new CompactTokenizer(
                          new FileReader(argv[0]));

            String rootName = "dummyRoot";
            // ignore everything preceding the word before the first "["
            while(!tok.nextToken().equals("[")){
                rootName=tok.getToken();
            }
            // start creating new document
            xmlsw.writeStartDocument();
            ignorableSpacing(0);
            xmlsw.writeStartElement(rootName);
            expand(tok,3);
            ignorableSpacing(0);
            xmlsw.writeEndDocument();

            xmlsw.flush();
            xmlsw.close();
        } catch (XMLStreamException e){
            System.out.println(e.getMessage());
        } catch (IOException ex) {
            System.out.println("IOException"+ex);
            ex.printStackTrace();
        }
    }

    public static void expand(CompactTokenizer tok, int indent) 
        throws IOException,XMLStreamException {
        tok.skip("["); 
        while(tok.getToken().equals("@")) {// add attributes
            String attName = tok.nextToken();
            tok.nextToken();
            xmlsw.writeAttribute(attName,tok.skip("["));
            tok.nextToken();
            tok.skip("]");
        }
        boolean lastWasElement=true; // for controlling the output of newlines 
        while(!tok.getToken().equals("]")){ // process content 
            String s = tok.getToken().trim();
            tok.nextToken();
            if(tok.getToken().equals("[")){
                if(lastWasElement)ignorableSpacing(indent);
                xmlsw.writeStartElement(s);
                expand(tok,indent+3);
                lastWasElement=true;
            } else {
                xmlsw.writeCharacters(s);
                lastWasElement=false;
            }
        }
        tok.skip("]");
        if(lastWasElement)ignorableSpacing(indent-3);
        xmlsw.writeEndElement();
   }

    private static char[] blanks = "\n".toCharArray();
    private static void ignorableSpacing(int nb) 
        throws XMLStreamException {
        if(nb>blanks.length){// extend the length of space array 
            blanks = new char[nb+1];
            blanks[0]='\n';
            Arrays.fill(blanks,1,blanks.length,' ');
        }
        xmlsw.writeCharacters(blanks, 0, nb+1);
    }

}

import java.io.Reader;
import java.io.IOException;
import java.io.StreamTokenizer;

public class CompactTokenizer {
    private StreamTokenizer st;

    CompactTokenizer(Reader r){
        st = new StreamTokenizer(r);
        st.resetSyntax(); // remove parsing of numbers...
        st.wordChars('\u0000','\u00FF'); // everything is part of a word
                                         // except the following...
        st.ordinaryChar('\n');
        st.ordinaryChar('[');
        st.ordinaryChar(']');
        st.ordinaryChar('@');
    }

    public String nextToken() throws IOException{
        st.nextToken();
        while(st.ttype=='\n'|| 
              (st.ttype==StreamTokenizer.TT_WORD && 
               st.sval.trim().length()==0))
            st.nextToken();
        return getToken();
    }

    public String getToken(){
        return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
    }

    public String skip(String sym) throws IOException {
        if(getToken().equals(sym))
            return nextToken();
        else
            // Report the token that was actually found, not the expected symbol twice.
            throw new IllegalArgumentException("skip: "+sym+" expected but "+
                                               getToken()+" found ");
    }
}

