Extracting nested HTML tags in Java?
I have the following HTML snippet:
String source = "<p>dsdds</p>"
+ "<ul class=\"some-class-name\">"
+ "<li>data</li>"
+ "<li><div><ul><li>data</li></ul></div></li>"
+ "</ul>"
+ "<p>data</p>"
+ "<ul>data</ul><div>data</div>";
The result I want to achieve is:
<ul class="some-class-name">
<li>data</li>
<li><div><ul><li>data</li></ul></div></li>
</ul>
What I have tried so far:
String endTag = "</ul>";
int origin = source.indexOf("<ul class=\"some-class-name\">");
int currentFrom = origin;
int to = source.indexOf(endTag, currentFrom);
while (true) {
int curIndex = source.indexOf("<ul", currentFrom + 1);
if (curIndex > -1) {
currentFrom = curIndex;
to = source.indexOf(endTag, currentFrom);
} else {
to = source.indexOf(endTag, to);
break;
}
}
System.out.println(source.substring(origin, to + endTag.length()));
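The loop above keeps advancing to the last `<ul` in the whole string, so `to` ends up at the final `</ul>` of the source rather than at the one that closes the starting element. If plain string scanning is still preferred, one possible fix is to count nesting depth. This is only a minimal sketch: the class name `UlExtractor` and the method `extract` are made up for illustration, and it assumes balanced tags with no `<ul` text inside attributes, comments, or longer tag names.

```java
public class UlExtractor {

    /**
     * Returns the <ul> element that starts at the first occurrence of openTag,
     * including any nested <ul> elements, or null if the tag is missing or
     * the tags are unbalanced. Hypothetical helper, not a general HTML parser.
     */
    static String extract(String source, String openTag) {
        final String open = "<ul";
        final String close = "</ul>";

        int origin = source.indexOf(openTag);
        if (origin < 0) {
            return null;
        }

        int depth = 0;
        int pos = origin;
        while (true) {
            int nextOpen = source.indexOf(open, pos);
            int nextClose = source.indexOf(close, pos);
            if (nextClose < 0) {
                return null; // ran out of closing tags: unbalanced input
            }
            if (nextOpen >= 0 && nextOpen < nextClose) {
                // An opening tag comes first: go one level deeper.
                depth++;
                pos = nextOpen + open.length();
            } else {
                // A closing tag comes first: go one level up.
                depth--;
                pos = nextClose + close.length();
                if (depth == 0) {
                    return source.substring(origin, pos);
                }
            }
        }
    }

    public static void main(String[] args) {
        String source = "<p>dsdds</p>"
                + "<ul class=\"some-class-name\">"
                + "<li>data</li>"
                + "<li><div><ul><li>data</li></ul></div></li>"
                + "</ul>"
                + "<p>data</p>"
                + "<ul>data</ul><div>data</div>";
        System.out.println(extract(source, "<ul class=\"some-class-name\">"));
    }
}
```

The scan stops as soon as the depth returns to zero, so the trailing `<ul>data</ul>` is never reached. It remains fragile compared with a real parser.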
Fortunately, your snippet is valid XHTML, which means it is valid XML.
XPath is designed specifically for extracting nodes from XML:
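import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Must have a single root in order to parse.
String input = "<div>" + source + "</div>";
XPath xpath = XPathFactory.newInstance().newXPath();
Node node = (Node)
xpath.evaluate("//ul[@class='some-class-name']",
new InputSource(new StringReader(input)),
XPathConstants.NODE);
StringWriter result = new StringWriter();
Transformer transformer =
TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.transform(new DOMSource(node), new StreamResult(result));
String fragment = result.toString();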
You should use jsoup, like this:
Document doc = Jsoup.parse(source);
Element e = doc.select("ul.some-class-name").first();
System.out.println(e);
Result:
<ul class="some-class-name">
<li>data</li>
<li>
<div>
<ul>
<li>data</li>
</ul>
</div></li>
</ul>
Don't reinvent the wheel; use an HTML parser such as jsoup.