Java 如何通过文本内容获取HTMLDOM路径?
HTML文件:Java 如何通过文本内容获取HTMLDOM路径?,java,html,Java,Html,HTML文件: <html> <body> <div class="main"> <p id="tID">content</p> </div> </body> </html> chrome开发者工具有这个功能(Elements标签,底部栏),我想知道如何在java中实现它 谢谢你的帮助:)玩得开心:) JAVA代码 import
<html>
<body>
<div class="main">
<p id="tID">content</p>
</div>
</body>
</html>
chrome开发者工具有这个功能(Elements标签,底部栏),我想知道如何在java中实现它
谢谢你的帮助:)玩得开心:)
JAVA代码
import java.io.File;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
public class Teste {
public static void main(String[] args) {
try {
// read and clean document
TagNode tagNode = new HtmlCleaner().clean(new File("test.xml"));
Document document = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
// use XPath to find target node
XPath xpath = XPathFactory.newInstance().newXPath();
Node node = (Node) xpath.evaluate("//*[text()='content']", document, XPathConstants.NODE);
// assembles jquery/css selector
String result = "";
while (node != null && node.getParentNode() != null) {
result = readPath(node) + " " + result;
node = node.getParentNode();
}
System.out.println(result);
// returns html body div#myDiv.foo.bar p#tID
} catch (Exception e) {
e.printStackTrace();
}
}
// Gets id and class attributes of this node
private static String readPath(Node node) {
NamedNodeMap attributes = node.getAttributes();
String id = readAttribute(attributes.getNamedItem("id"), "#");
String clazz = readAttribute(attributes.getNamedItem("class"), ".");
return node.getNodeName() + id + clazz;
}
// Read attribute
private static String readAttribute(Node node, String token) {
String result = "";
if(node != null) {
result = token + node.getTextContent().replace(" ", token);
}
return result;
}
}
<html>
<body>
<br>
<div id="myDiv" class="foo bar">
<p id="tID">content</p>
</div>
</body>
</html>
XML示例
import java.io.File;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
public class Teste {
public static void main(String[] args) {
try {
// read and clean document
TagNode tagNode = new HtmlCleaner().clean(new File("test.xml"));
Document document = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
// use XPath to find target node
XPath xpath = XPathFactory.newInstance().newXPath();
Node node = (Node) xpath.evaluate("//*[text()='content']", document, XPathConstants.NODE);
// assembles jquery/css selector
String result = "";
while (node != null && node.getParentNode() != null) {
result = readPath(node) + " " + result;
node = node.getParentNode();
}
System.out.println(result);
// returns html body div#myDiv.foo.bar p#tID
} catch (Exception e) {
e.printStackTrace();
}
}
// Gets id and class attributes of this node
private static String readPath(Node node) {
NamedNodeMap attributes = node.getAttributes();
String id = readAttribute(attributes.getNamedItem("id"), "#");
String clazz = readAttribute(attributes.getNamedItem("class"), ".");
return node.getNodeName() + id + clazz;
}
// Read attribute
private static String readAttribute(Node node, String token) {
String result = "";
if(node != null) {
result = token + node.getTextContent().replace(" ", token);
}
return result;
}
}
<html>
<body>
<br>
<div id="myDiv" class="foo bar">
<p id="tID">content</p>
</div>
</body>
</html>
内容
解释
文档
指向已评估的XML/*[text()='content']
查找text='content'的所有内容,并查找节点while
循环到第一个节点,获取当前元素的id和类更多解释
,cleaner将替换为
你是说java,还是javascript?但它不是XML文档,如果有
或其他标记没有结束标记,它将无法解析。org.XML.sax.SAXParseException:元素类型“br”必须由匹配的结束标记“”终止。
我编辑了我的答案以处理格式不正确的XML。看一看。