Java: Extracting data between tags from an HTML page using HtmlUnit
I am trying to extract data from a web page using HtmlUnit. I have already done this by converting the HTML page to text and then pulling the data out with regular expressions. I have also managed to extract data from an HTML table by using the class attribute in the HTML.

Now I want to redo the whole extraction with HtmlUnit alone, to meet the same requirements I covered with regular expressions. I cannot work out how to extract the data inside the tags as key-value pairs.

Below is the sample HTML data:
<div class="top_red_bar">
<div id="site-breadcrumbs">
<a href="/admin/index.jsp" title="Home">Home</a>
|
<a href="/admin/queues.jsp" title="Queues">Queues</a>
|
<a href="/admin/topics.jsp" title="Topics">Topics</a>
|
<a href="/admin/subscribers.jsp" title="Subscribers">Subscribers</a>
|
<a href="/admin/connections.jsp" title="Connections">Connections</a>
|
<a href="/admin/network.jsp" title="Network">Network</a>
|
<a href="/admin/scheduled.jsp" title="Scheduled">Scheduled</a>
|
<a href="/admin/send.jsp"
title="Send">Send</a>
</div>
<div id="site-quicklinks"><P>
<a href="http://activemq.apache.org/support.html"
title="Get help and support using Apache ActiveMQ">Support</a></p>
</div>
</div>
<table border="0">
<tbody>
<tr>
<td valign="top" width="100%" style="overflow:hidden;">
<div class="body-content">
<h2>Welcome!</h2>
<p>
Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)
</p>
<p>
You can find more information about Apache ActiveMQ on the <a href="http://activemq.apache.org/">Apache ActiveMQ Site</a>
</p>
<h2>Broker</h2>
<table>
<tr>
<td>Name</td>
<td><b>localhost</b></td>
</tr>
<tr>
<td>Version</td>
<td><b>5.13.3</b></td>
</tr>
<tr>
<td>ID</td>
<td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>
</tr>
<tr>
<td>Uptime</td>
<td><b>17 days 13 hours</b></td>
</tr>
<tr>
<td>Store percent used</td>
<td><b>19</b></td>
</tr>
<tr>
<td>Memory percent used</td>
<td><b>0</b></td>
</tr>
<tr>
<td>Temp percent used</td>
<td><b>0</b></td>
</tr>
</table>
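Expected output:
Name:localhost
Version:5.13.3
ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
Uptime:17 days 13 hours
Store percent used:19
Memory percent used:0
Temp percent used:0

How can I achieve this? I would like to know which HtmlUnit methods to use for it.

Comments:
@Rcordoval, don't overthink it... I'm not here to have code written for me, but to get a concrete idea of how to extract the tags with HtmlUnit. If you look at my question, I have already done it another way (regex), but I could not find or understand how to do the same with HtmlUnit.
You can find plenty of examples here.
HtmlUnit has many methods (on HtmlPage) for getting elements (by tag name, ID, XPath, and so on). If the element is a table, what you get back is an HtmlTable. HtmlTable has methods for getting its rows, and each row has methods for getting its cells. The Javadoc is your friend; read it.
@Rcordoval, thanks for sharing! Really appreciated. I'll try it on my own now.

Answer:
These are the steps I followed (not the only solution):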
import java.net.URL;
import com.gargoylesoftware.htmlunit.StringWebResponse;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HTMLParser;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow.CellIterator;
public class ExtractTableData {
public static void main(String[] args) throws Exception {
String html = "<div class=\"top_red_bar\">\n" + " <div id=\"site-breadcrumbs\">\n"
+ " <a href=\"/admin/index.jsp\" title=\"Home\">Home</a>\n"
+ " |\n"
+ " <a href=\"/admin/queues.jsp\" title=\"Queues\">Queues</a>\n"
+ " |\n"
+ " <a href=\"/admin/topics.jsp\" title=\"Topics\">Topics</a>\n"
+ " |\n"
+ " <a href=\"/admin/subscribers.jsp\" title=\"Subscribers\">Subscribers</a>\n"
+ " |\n"
+ " <a href=\"/admin/connections.jsp\" title=\"Connections\">Connections</a>\n"
+ " |\n"
+ " <a href=\"/admin/network.jsp\" title=\"Network\">Network</a>\n"
+ " |\n"
+ " <a href=\"/admin/scheduled.jsp\" title=\"Scheduled\">Scheduled</a>\n"
+ " |\n" + " <a href=\"/admin/send.jsp\"\n"
+ " title=\"Send\">Send</a>\n" + " </div>\n"
+ " <div id=\"site-quicklinks\"><P>\n"
+ " <a href=\"http://activemq.apache.org/support.html\"\n"
+ " title=\"Get help and support using Apache ActiveMQ\">Support</a></p>\n"
+ " </div>\n" + " </div>\n" + "\n"
+ " <table border=\"0\">\n" + " <tbody>\n"
+ " <tr>\n"
+ " <td valign=\"top\" width=\"100%\" style=\"overflow:hidden;\">\n"
+ " <div class=\"body-content\">\n" + "\n" + "\n"
+ "<h2>Welcome!</h2>\n" + "\n" + "<p>\n"
+ "Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)\n"
+ "</p>\n" + "\n" + "<p>\n"
+ "You can find more information about Apache ActiveMQ on the <a href=\"http://activemq.apache.org/\">Apache ActiveMQ Site</a>\n"
+ "</p>\n" + "\n" + "<h2>Broker</h2>\n" + "\n" + "\n" + "<table>\n" + " <tr>\n"
+ " <td>Name</td>\n" + " <td><b>localhost</b></td>\n" + " </tr>\n" + " <tr>\n"
+ " <td>Version</td>\n" + " <td><b>5.13.3</b></td>\n" + " </tr>\n" + " <tr>\n"
+ " <td>ID</td>\n" + " <td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>\n"
+ " </tr>\n" + " <tr>\n" + " <td>Uptime</td>\n"
+ " <td><b>17 days 13 hours</b></td>\n" + " </tr>\n" + " <tr>\n"
+ " <td>Store percent used</td>\n" + " <td><b>19</b></td>\n" + " </tr>\n"
+ " <tr>\n" + " <td>Memory percent used</td>\n" + " <td><b>0</b></td>\n"
+ " </tr>\n" + " <tr>\n" + " <td>Temp percent used</td>\n" + " <td><b>0</b></td>\n"
+ " </tr>\n" + "</table>";
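// Parse the HTML string into an HtmlPage. The URL is only a placeholder base URL.
// The static HTMLParser.parseHtml call is the API of the HtmlUnit version this answer
// was written against; other releases may expose the parser differently.
// try-with-resources closes the WebClient once the extraction is done.
try (WebClient webClient = new WebClient()) {
    HtmlPage page = HTMLParser.parseHtml(new StringWebResponse(html, new URL("http://dummy.url.for.parsing.com/")),
            webClient.getCurrentWindow());
    // "//table" matches two tables in document order: index 0 is the outer
    // layout table, index 1 is the nested broker table we want.
    final HtmlTable table = (HtmlTable) page.getByXPath("//table").get(1);
    for (final HtmlTableRow row : table.getRows()) {
        // Print the cells of each row joined by ':' (label:value).
        CellIterator cellIterator = row.getCellIterator();
        if (cellIterator.hasNext()) {
            System.out.print(cellIterator.next().asText());
            while (cellIterator.hasNext()) {
                System.out.print(":" + cellIterator.next().asText());
            }
        }
        System.out.println();
    }
}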
}
}
Output:
Name:localhost
Version:5.13.3
ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
Uptime:17 days 13 hours
Store percent used:19
Memory percent used:0
Temp percent used:0
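
The question asks for key-value pairs, while the code above prints `label:value` lines. If those printed lines are all you have, one possible way to turn them into a map is sketched below. It uses only the JDK, and the class name `KeyValueLines` and the method `parse` are made up for this illustration; they are not part of HtmlUnit. The point to watch is the ID row: its value contains colons itself, so the sketch splits on the first colon only. With HtmlUnit you could also read the two cells of each row separately, which would avoid the splitting step altogether.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of HtmlUnit.
public class KeyValueLines {

    // Splits each "label:value" line on the FIRST colon only, so values that
    // contain colons themselves (such as the broker ID) stay intact.
    // Lines without any colon are skipped.
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps the row order
        for (String line : lines) {
            String[] parts = line.split(":", 2);
            if (parts.length == 2) {
                result.put(parts[0].trim(), parts[1].trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Name:localhost",
                "Version:5.13.3",
                "ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3",
                "Uptime:17 days 13 hours");
        // Print each entry as "key -> value" to make the split visible.
        parse(lines).forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

A `LinkedHashMap` is used here so the entries come back in the same order as the table rows; a plain `HashMap` would work too if the order does not matter.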