Java: parsing HTML content from an XML file



    <xbrli:xbrl xmlns:aoi="http://www.aointl.com/20160331" xmlns:country="http://xbrl.sec.gov/country/2016-01-31" xmlns:currency="http://xbrl.sec.gov/currency/2016-01-31" xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31" xmlns:exch="http://xbrl.sec.gov/exch/2016-01-31" xmlns:invest="http://xbrl.sec.gov/invest/2013-01-31" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:naics="http://xbrl.sec.gov/naics/2011-01-31" xmlns:nonnum="http://www.xbrl.org/dtr/type/non-numeric" xmlns:num="http://www.xbrl.org/dtr/type/numeric" xmlns:ref="http://www.xbrl.org/2006/ref" xmlns:sic="http://xbrl.sec.gov/sic/2011-01-31" xmlns:stpr="http://xbrl.sec.gov/stpr/2011-01-31" xmlns:us-gaap="http://fasb.org/us-gaap/2016-01-31" xmlns:us-roles="http://fasb.org/us-roles/2016-01-31" xmlns:us-types="http://fasb.org/us-types/2016-01-31" xmlns:utreg="http://www.xbrl.org/2009/utr" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:xbrldt="http://xbrl.org/2005/xbrldt" xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <link:schemaRef xlink:href="aoi-20160331.xsd" xlink:type="simple"/>
    <xbrli:context id="FD2016Q4YTD">
    <xbrli:entity>
    <xbrli:identifier scheme="http://www.sec.gov/CIK">0000939930</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
    <xbrli:startDate>2015-04-01</xbrli:startDate>
    <xbrli:endDate>2016-03-31</xbrli:endDate>
    </xbrli:period>
    </xbrli:context>

    <aoi:OtherIncomeAndExpensePolicyTextBlock contextRef="FD2016Q4YTD" id="Fact-F51C7616E17E5B8B0B770D410BBF5A3E">
    <div style="font-family:Times New Roman;font-size:10pt;"><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">Other Income (Expense)</font></div><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"></font></div></div>
    </aoi:OtherIncomeAndExpensePolicyTextBlock>
    </xbrli:xbrl>

This is my XML (XBRL), and I need to parse it. This XML is my input. I don't know whether it is valid or not, but I need output like this:

    <div style="font-family:Times New Roman;font-size:10pt;"><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">Other Income (Expense)</font></div><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"></font></div></div>

Could someone please share how to solve this? I have been stuck on this problem for the last two weeks.

This is the code I am using:

    File fXmlFile = new File("/home/devteam-user1/Desktop/ky/UnitTesting.xml");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);

    XPath xPath = XPathFactory.newInstance().newXPath();
    final String DIV_UNDER_ROOT = "/*/aoi";
    NodeList divList = (NodeList) xPath.compile(DIV_UNDER_ROOT)
            .evaluate(doc, XPathConstants.NODESET);
    System.out.println(divList.getLength());
    for (int i = 0; i < divList.getLength(); i++) {  // just in case there is more than one
        Node divNode = divList.item(i);
        System.out.println(nodeToString(divNode));
    }

The nodeToString method is below:

    private static String nodeToString(Node node) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StreamResult result = new StreamResult(new StringWriter());
        transformer.transform(new DOMSource(node), result);
        return result.getWriter().toString();
    }

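Your main problem is with

    final String DIV_UNDER_ROOT = "/*/aoi";

which is an XPath expression that matches any node at the second level (a child of the root element) whose local name is aoi and which has no namespace. That is not what you want.

You want to match anything under a second-level node whose name uses the aoi namespace alias, which means the node belongs to the namespace bound to that alias (http://www.aointl.com/20160331 in your XML), and whose local name is OtherIncomeAndExpensePolicyTextBlock.

Matching namespaces in XPath in Java is quite cumbersome, but long story short, you can try the following:

    final String DIV_UNDER_ROOT = "//*[local-name()='OtherIncomeAndExpensePolicyTextBlock' and namespace-uri()='http://www.aointl.com/20160331']/*";

This only works if the DocumentBuilderFactory is namespace-aware, so you should make sure it is by configuring it as follows:

    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    dbFactory.setNamespaceAware(true);

This worked for me:

    public static void main(String[] args) throws IOException {
        FileInputStream fis = new FileInputStream("yourfile.xml");
        // Utils.streamToString is the answerer's own helper and is not shown here
        // Document is the jsoup Document type, not org.w3c.dom.Document
        Document doc = Jsoup.parse(Utils.streamToString(fis));
        System.out.println(doc.select("aoi|OtherIncomeAndExpensePolicyTextBlock").html());
    }

I don't quite understand, but if you need to embed HTML in XML, the characters should be escaped. For example, Hello World would be output as "Hello World", or use a block.

@marco I don't need to insert HTML into XML. It is already in the XML. I need to use any Java API to get the HTML content. In my question I clearly stated my input and output.

Use an XML parser to extract the XML information via the XML tags, and keep the HTML. But is your XML document as a whole well-formed? Is the HTML part missing any end tags?

This is part 2 of another question by the same person. I gave a full answer there, so he copied and pasted my answer into the new question. Is this proper behavior on this forum?

@Sharon: I hope we can gain knowledge from knowledge giants like you. You told me the XML is not well-formed. Also, I am new to this forum. As far as I know, if my question is correct then I will easily get a solution, and nothing else. Thanks

@JohnAdam - Do not edit the answer into the question! Do not open a new question by copy-pasting the answer from the previous one!! This is not how to treat people who are trying to help you!! - Sharonb

When you copy and paste from someone else's answer, that is really just basic courtesy.

@wutzebaer, Utils? doc.select? Can you explain?

@wutzebaer, thank you very much for your code. It's working fine.

OP should take the time to learn how to use XPath tools and syntax. That's all.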
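For reference, the pieces of the XPath-based answer (namespace-aware factory, the local-name()/namespace-uri() expression, and the asker's nodeToString helper) can be put together roughly as below. This is only a sketch under some assumptions: the class name XbrlHtmlExtractor and the method extractHtml are made up for illustration, the XML is passed in as a String rather than read from the asker's file, and OMIT_XML_DECLARATION is set (the original helper did not do this) so that the result starts with the div rather than an XML header.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class XbrlHtmlExtractor {

    // XPath from the answer: child elements of the text block, matched by
    // local name and namespace URI instead of by prefix.
    static final String DIV_UNDER_TEXT_BLOCK =
            "//*[local-name()='OtherIncomeAndExpensePolicyTextBlock'"
            + " and namespace-uri()='http://www.aointl.com/20160331']/*";

    // Hypothetical helper: returns the serialized child elements of the text block.
    static String extractHtml(String xml) throws Exception {
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        dbFactory.setNamespaceAware(true); // required for namespace-uri() to match
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.parse(new InputSource(new StringReader(xml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList divList = (NodeList) xPath.evaluate(
                DIV_UNDER_TEXT_BLOCK, doc, XPathConstants.NODESET);

        StringBuilder html = new StringBuilder();
        for (int i = 0; i < divList.getLength(); i++) { // there may be more than one
            html.append(nodeToString(divList.item(i)));
        }
        return html.toString();
    }

    // Same idea as the asker's helper, but without the XML declaration.
    static String nodeToString(Node node) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the XBRL input shown in the question.
        String xml =
                "<xbrli:xbrl xmlns:xbrli=\"http://www.xbrl.org/2003/instance\""
                + " xmlns:aoi=\"http://www.aointl.com/20160331\">"
                + "<aoi:OtherIncomeAndExpensePolicyTextBlock contextRef=\"FD2016Q4YTD\">"
                + "<div style=\"font-size:10pt;\"><font style=\"font-weight:bold;\">"
                + "Other Income (Expense)</font></div>"
                + "</aoi:OtherIncomeAndExpensePolicyTextBlock>"
                + "</xbrli:xbrl>";
        System.out.println(extractHtml(xml));
    }
}
```

One thing to be aware of: the Transformer writes XML, not HTML, so an empty element such as `<font ...></font>` in the input may come back self-closed as `<font .../>`. If the exact original markup matters, that difference would need separate handling.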