Java: Using HtmlUnit to extract data between tags from an HTML page

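I am trying to extract data from a web page using HtmlUnit. I have already done this by converting the HTML page to text and then using regular expressions to pull the data out of it. I have also managed to extract data from an HTML table by using the class attribute in the HTML.

I now want to redo all of the extraction purely with HtmlUnit, to learn how to meet the same requirement that I met with regular expressions. I cannot figure out how to extract the data inside the tags as key-value pairs.

Here is the sample HTML data: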

<div class="top_red_bar">
    <div id="site-breadcrumbs">
        <a href="/admin/index.jsp" title="Home">Home</a>
        &#124;
        <a href="/admin/queues.jsp" title="Queues">Queues</a>
        &#124;
        <a href="/admin/topics.jsp" title="Topics">Topics</a>
        &#124;
        <a href="/admin/subscribers.jsp" title="Subscribers">Subscribers</a>
        &#124;
        <a href="/admin/connections.jsp" title="Connections">Connections</a>
        &#124;
        <a href="/admin/network.jsp" title="Network">Network</a>
        &#124;
         <a href="/admin/scheduled.jsp" title="Scheduled">Scheduled</a>
        &#124;
        <a href="/admin/send.jsp"
           title="Send">Send</a>
    </div>
    <div id="site-quicklinks"><P>
        <a href="http://activemq.apache.org/support.html"
           title="Get help and support using Apache ActiveMQ">Support</a></p>
    </div>
</div>

<table border="0">
<tbody>
    <tr>
        <td valign="top" width="100%" style="overflow:hidden;">
            <div class="body-content">


<h2>Welcome!</h2>

<p>
Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)
</p>

<p>
You can find more information about Apache ActiveMQ on the <a href="http://activemq.apache.org/">Apache ActiveMQ Site</a>
</p>

<h2>Broker</h2>


<table>
    <tr>
        <td>Name</td>
        <td><b>localhost</b></td>
    </tr>
    <tr>
        <td>Version</td>
        <td><b>5.13.3</b></td>
    </tr>
    <tr>
        <td>ID</td>
        <td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>
    </tr>
    <tr>
        <td>Uptime</td>
        <td><b>17 days 13 hours</b></td>
    </tr>
    <tr>
        <td>Store percent used</td>
        <td><b>19</b></td>
    </tr>
    <tr>
        <td>Memory percent used</td>
        <td><b>0</b></td>
    </tr>
    <tr>
        <td>Temp percent used</td>
        <td><b>0</b></td>
    </tr>
</table>

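How can I achieve this? I would like to know which HtmlUnit methods to use for it.

@Rcordoval, don't overthink it… I'm not here to have code written for me; I'm after a concrete idea of how to extract the tags with HtmlUnit. As you can see from my question, I have already done it another way (regex), but I could not find or understand how to do it with HtmlUnit.

You can find plenty of examples here. HtmlUnit has many methods (in HtmlPage) for getting elements (by tag name, ID, path, etc.). If the element is a table, what you get back is an HtmlTable. HtmlTable has methods for getting its rows, and rows have methods for getting their cells. The javadoc is your friend; read it.

@Rcordoval, thanks for sharing! Really appreciated. I'll try it on my own now.

These are the steps I followed (this is not the only solution):

  • Parse the string with the parseHtml method, using a dummy URL
  • Get the second table via XPath
  • Iterate with a doubly nested loop (a for loop over the rows and an iterator over the cells, taking care to append the delimiter correctly)
  • Extracted data:
    Name:localhost
    Version:5.13.3
    ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
    Uptime:17 days 13 hours
    Store percent used:19
    Memory percent used:0
    Temp percent used:0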
    
    import java.net.URL;
    
    import com.gargoylesoftware.htmlunit.StringWebResponse;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HTMLParser;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlTable;
    import com.gargoylesoftware.htmlunit.html.HtmlTableRow;
    import com.gargoylesoftware.htmlunit.html.HtmlTableRow.CellIterator;
    
    
    public class ExtractTableData {
    
        public static void main(String[] args) throws Exception {
    
            String html = "<div class=\"top_red_bar\">\n" + "                        <div id=\"site-breadcrumbs\">\n"
                    + "                            <a href=\"/admin/index.jsp\" title=\"Home\">Home</a>\n"
                    + "                            &#124;\n"
                    + "                            <a href=\"/admin/queues.jsp\" title=\"Queues\">Queues</a>\n"
                    + "                            &#124;\n"
                    + "                            <a href=\"/admin/topics.jsp\" title=\"Topics\">Topics</a>\n"
                    + "                            &#124;\n"
                    + "                            <a href=\"/admin/subscribers.jsp\" title=\"Subscribers\">Subscribers</a>\n"
                    + "                            &#124;\n"
                    + "                            <a href=\"/admin/connections.jsp\" title=\"Connections\">Connections</a>\n"
                    + "                            &#124;\n"
                    + "                            <a href=\"/admin/network.jsp\" title=\"Network\">Network</a>\n"
                    + "                            &#124;\n"
                    + "                             <a href=\"/admin/scheduled.jsp\" title=\"Scheduled\">Scheduled</a>\n"
                    + "                            &#124;\n" + "                            <a href=\"/admin/send.jsp\"\n"
                    + "                               title=\"Send\">Send</a>\n" + "                        </div>\n"
                    + "                        <div id=\"site-quicklinks\"><P>\n"
                    + "                            <a href=\"http://activemq.apache.org/support.html\"\n"
                    + "                               title=\"Get help and support using Apache ActiveMQ\">Support</a></p>\n"
                    + "                        </div>\n" + "                    </div>\n" + "\n"
                    + "                    <table border=\"0\">\n" + "                        <tbody>\n"
                    + "                            <tr>\n"
                    + "                                <td valign=\"top\" width=\"100%\" style=\"overflow:hidden;\">\n"
                    + "                                    <div class=\"body-content\">\n" + "\n" + "\n"
                    + "<h2>Welcome!</h2>\n" + "\n" + "<p>\n"
                    + "Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)\n"
                    + "</p>\n" + "\n" + "<p>\n"
                    + "You can find more information about Apache ActiveMQ on the <a href=\"http://activemq.apache.org/\">Apache ActiveMQ Site</a>\n"
                    + "</p>\n" + "\n" + "<h2>Broker</h2>\n" + "\n" + "\n" + "<table>\n" + "    <tr>\n"
                    + "        <td>Name</td>\n" + "        <td><b>localhost</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                    + "        <td>Version</td>\n" + "        <td><b>5.13.3</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                    + "        <td>ID</td>\n" + "        <td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>\n"
                    + "    </tr>\n" + "    <tr>\n" + "        <td>Uptime</td>\n"
                    + "        <td><b>17 days 13 hours</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                    + "        <td>Store percent used</td>\n" + "        <td><b>19</b></td>\n" + "    </tr>\n"
                    + "    <tr>\n" + "        <td>Memory percent used</td>\n" + "        <td><b>0</b></td>\n"
                    + "    </tr>\n" + "    <tr>\n" + "        <td>Temp percent used</td>\n" + "        <td><b>0</b></td>\n"
                    + "    </tr>\n" + "</table>";
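            // Parse the in-memory string; the dummy URL only gives the page a base address.
            // try-with-resources closes the WebClient when parsing and printing are done.
            try (WebClient webClient = new WebClient()) {
                HtmlPage page = HTMLParser.parseHtml(
                        new StringWebResponse(html, new URL("http://dummy.url.for.parsing.com/")),
                        webClient.getCurrentWindow());

                // The sample contains two tables; index 1 is the nested "Broker" table.
                final HtmlTable table = (HtmlTable) page.getByXPath("//table").get(1);

                for (final HtmlTableRow row : table.getRows()) {
                    CellIterator cellIterator = row.getCellIterator();

                    // Print the first cell as the key, then prefix every further cell with ':'.
                    if (cellIterator.hasNext()) {
                        System.out.print(cellIterator.next().asText());
                        while (cellIterator.hasNext()) {
                            System.out.print(":" + cellIterator.next().asText());
                        }
                    }
                    System.out.println();
                }
            }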
    
        }
    
    }
    
    Name:localhost
    Version:5.13.3
    ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
    Uptime:17 days 13 hours
    Store percent used:19
    Memory percent used:0
    Temp percent used:0
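
Since the question asks for key-value pairs, the `key:value` lines printed above could also be collected into a map. The following is only a minimal sketch using the standard library, not part of the original answer: the class name `BrokerInfoMap` and the helper `toMap` are hypothetical, and it assumes each line has the form `key:value`. It splits on the first colon only, because the ID value itself contains colons.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BrokerInfoMap {

    // Hypothetical helper: turns "key:value" lines into an ordered map.
    // The split limit of 2 cuts at the FIRST colon only, so a value such as
    // "ID:TOOLCONTROLPJX526-524666-65544585445-2:3" is kept in one piece.
    static Map<String, String> toMap(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>(); // preserves row order
        for (String line : lines) {
            String[] parts = line.split(":", 2);
            if (parts.length == 2) { // skip lines without a delimiter
                result.put(parts[0].trim(), parts[1].trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample lines copied from the output shown above.
        List<String> lines = Arrays.asList(
                "Name:localhost",
                "Version:5.13.3",
                "ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3",
                "Uptime:17 days 13 hours");

        Map<String, String> broker = toMap(lines);
        broker.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

In the HtmlUnit program itself, the same map could presumably be filled directly inside the row loop, using the first cell's text as the key and the second cell's text as the value, which avoids the string round trip altogether.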