Java: XPath expressions with Jsoup


I need help with this expression:

"//tr[td[normalize-space(font) = '"+params[1]+"']]/td/font/text()"

I'm trying to extract information from this HTML document:

<table width="575" border="0" cellspacing="1" cellpadding="0">
    <tr> 
      <td width="39" class="back1"><b class="texto4">CRN</b></td>
      <td width="60" class="back1"><b class="texto4">Materia</b></td>
      <td width="53" class="back1"><b class="texto4">Secci&oacute;n</b></td>
      <td width="55" class="back1"><b class="texto4">Cr&eacute;ditos</b></td>
      <td width="156" class="back1"><b class="texto4">T&iacute;tulo</b></td>
      <td width="69" class="back1"><b class="texto4">Cupo</b></td>
      <td width="57" class="back1"><b class="texto4">Inscritos</b></td>
      <td width="77" class="back1"><b class="texto4">Disponible</b></td>
    </tr>
    <tr> 
      <td width="39"><font class="texto4"> 
        10110                        </font></td>
      <td width="60"><font class="texto4"> 
        IIND1000                        </font></td>
      <td width="53"><font class="texto4"> 
      <div align="center">
        1                        </div></font></td>
      <td width="55"><font class="texto4"> 
        <div align="center">
        3                       </div>
        </font></td>
      <td width="156"><font class="texto4"> 
        INTROD. INGEN. INDUSTRIAL                        </font></td>
      <td width="69"><font class="texto4"> 
        100                        </font></td>
      <td width="57"><font class="texto4"> 
        100                        </font></td>
      <td width="77"><font class="texto4"> 
        0                        </font></td>
    </tr>
</table>



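If I look up params[1] = 10110, I want to get every td element in that tr (10110, IIND1000, 1, 3, INTROD. INGEN. INDUSTRIAL, 100, 100, 0).

JTidy didn't really do the job, so I decided to switch to Jsoup. Does anyone know how to convert the XPath expression at the top so it can be used with Jsoup?

So far I've managed to come up with this expression:

font.texto4:contains(10110)

It only returns "10110". I haven't found a way to get the text of each sibling node at the same level.

EDIT: I'm a Jsoup noob, but I've been trying more expressions and checking the results. I found that if I try this expression:

tr>td:contains(10110) font.texto4

I get the text of every element in the table. I just want to narrow it down to the set of nodes inside the same tr.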

It can be done both ways, with XPath and with Jsoup. Consider this example.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;


public class SibilingParse {

    public static void main(String[] args) {
        try {
                String html = "<table width='575' border='0' cellspacing='1' cellpadding='0'>"
                                + "<tr>"
                                    + "<td width='39'><font class='texto4'>10110</font></td>"
                                    + "<td width='60'><font class='texto4'>IIND1000</font></td>"
                                    + "<td width='53'><font class='texto4'><div align='center'>1</div></font></td>"
                                    + "<td width='55'><font class='texto4'><div align='center'>3</div></font></td>"
                                    + "<td width='156'><font class='texto4'>INTROD. INGEN. INDUSTRIAL</font></td>"
                                    + "<td width='69'><font class='texto4'>100</font></td>"
                                    + "<td width='57'><font class='texto4'>100</font></td>"
                                    + "<td width='77'><font class='texto4'>0</font></td>"
                                + "</tr>"
                            + "</table>";

                //Xpath way
                System.out.println("XPATH");
                InputStream xmlStream = new ByteArrayInputStream(html.getBytes());
                DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
                DocumentBuilder builder = builderFactory.newDocumentBuilder();
                Document xmlDocument = builder.parse(xmlStream);
                XPath xPath =  XPathFactory.newInstance().newXPath();

                String expression = "/table/tr/td//*[text()='10110']//../following-sibling::td";
                NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);
                for (int i = 0; i < nodeList.getLength(); i++) {
                    System.out.println(nodeList.item(i).getFirstChild().getTextContent()); 
                }
                System.out.println();

                // Jsoup way
                org.jsoup.nodes.Document doc = Jsoup.parse(html);
                Elements tds = doc.select("td:contains(10110)");
                if(tds != null && tds.size() > 0){
                    for(Element td : tds.first().siblingElements()){
                        System.out.println(td.text());
                    }
                }
            } catch (ParserConfigurationException e) {
                e.printStackTrace();
            } catch (SAXException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (XPathExpressionException e) {
                e.printStackTrace();
            }
        }

}
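
Both approaches above print only the siblings of the matched cell, so the "10110" cell itself is left out. As a hedged sketch of one way to get the whole row with the JDK's XPath alone, the expression from the question can select the td elements rather than their text nodes, which also picks up the cells whose value sits inside a div. The class name RowCellsXPath, the helper cellsForCrn, and the $crn variable are assumptions made for illustration, and the input must be well-formed XML, as in the example above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RowCellsXPath {

    // Hypothetical helper: returns the trimmed text of every td in the row
    // that has a td whose font text, after whitespace normalization, equals crn.
    static List<String> cellsForCrn(String xml, String crn) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            // Bind the search value to $crn instead of concatenating it into the expression.
            xPath.setXPathVariableResolver(
                    name -> "crn".equals(name.getLocalPart()) ? crn : null);
            String expression = "//tr[td[normalize-space(font) = $crn]]/td";
            NodeList tds = (NodeList) xPath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> cells = new ArrayList<>();
            for (int i = 0; i < tds.getLength(); i++) {
                // getTextContent() includes text nested inside font and div.
                cells.add(tds.item(i).getTextContent().trim());
            }
            return cells;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<table>"
                + "<tr><td><font class='texto4'> 10110 </font></td>"
                + "<td><font class='texto4'> IIND1000 </font></td>"
                + "<td><font class='texto4'><div align='center'> 1 </div></font></td></tr>"
                + "<tr><td><font class='texto4'>20220</font></td>"
                + "<td><font class='texto4'>OTHER</font></td></tr>"
                + "</table>";
        for (String cell : cellsForCrn(xml, "10110")) {
            System.out.println(cell);
        }
    }
}
```

Selecting td and reading its text content avoids the problem with /td/font/text(), which returns only whitespace for the cells that wrap their value in a div.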

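Great idea! However, I get an NPE on this line: doc.select("td:contains(10110)").first().siblingElements(), more specifically on .siblingElements(). This time I'm parsing the document directly from the URL instead of from an HTML string as you did. Could that be the problem? (The URL loads fine in my browser.)

That is because there is no td with 10110, so first() returns null and siblingElements() is called on null. I should have added some null or length checks; I was only giving an example :) I've updated my answer. It should work now.

Done. That happens because of the parent tds. When we say td:contains, it also matches tds that contain the text through nested child tds. There is containsOwn, which checks only an element's own text. Since the td itself has no id and the text sits in the font elements, we need to apply contains or containsOwn to the font elements. I've updated my answer.

Yes. The server is slow even in the browser. That is why I set a timeout.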
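
The difference between matching on descendant text and matching on an element's own text can be reproduced with the JDK's XPath, without a network call. This is only an analogy to Jsoup's :contains and :containsOwn selectors, not a run of Jsoup itself. The nested-table layout and the names OwnTextVsDescendantText and count are assumptions made for illustration.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class OwnTextVsDescendantText {

    // Hypothetical helper: counts the nodes an XPath expression selects in the markup.
    static int count(String xml, String expression) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed layout: an outer td wraps the table that holds the data row.
        String xml = "<table><tr><td>"
                + "<table><tr>"
                + "<td><font>10110</font></td>"
                + "<td><font>IIND1000</font></td>"
                + "</tr></table>"
                + "</td></tr></table>";

        // Descendant text counts, so the outer td matches as well as the inner one.
        System.out.println(count(xml, "//td[contains(., '10110')]"));

        // Only elements with a matching text node as a direct child: just the font.
        System.out.println(count(xml, "//*[text()[contains(., '10110')]]"));
    }
}
```

With this input the first expression selects two td elements and the second selects a single font element, which is why calling first() on the broader match can land on an outer cell rather than the one in the data row.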

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SiblingJsoup {

    public static void main(String[] args) {
        try {
            Document doc = Jsoup
                    .connect("http://registroapps.uniandes.edu.co/scripts/adm_con_horario1_joomla.php?depto=IIND")
                    .timeout(20000)
                    .get();

            Elements tds = doc.select("font:containsOwn(10110)");
            if (tds != null && tds.size() > 0) {
                for (Element td : tds.parents().first().siblingElements()) {
                    System.out.println(td.text());
                }
            }
            System.out.println("Done");
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

}