Java: getting a page's HTML source for XPath queries


java android xpath


I'm trying to do something simple (at least I think it's simple): pull the HTML of a web page and build a DOM from it so that I can run XPath queries against it.

I've found plenty of examples of using XPath in Java on local XML files, but none that show how to use it on source fetched from a website.

I already know how to do this in PHP, using the following code:

    $url = 'pagehtmlhere';
    $output = file_get_contents($url);
    $doc = new DOMDocument();

    libxml_use_internal_errors(true);  // Suppress warnings caused by HTML5 markup
    $doc->loadHTML($output);
    libxml_use_internal_errors(false); // Start showing errors again

    $xpath = new DOMXPath($doc);

    $TitleString = "//h2[@class='title']/text()";
    $BodyString = "//section[@id='body']/text()";
    $ImageString = "//img[@id='iwi']/@src";

    $titleQuery = $xpath->query($TitleString);
    $title = $titleQuery->item(1)->nodeValue; // item() is zero-based, so this is the second match

    // Concatenate every text node of the body section, separated by spaces
    $bodyText = "";
    $textQuery = $xpath->query($BodyString);

    foreach ($textQuery as $text) {
        $bodyText .= $text->nodeValue . " ";
    }

    $imageQuery = $xpath->query($ImageString);
    $imageSrc = $imageQuery->item(0)->nodeValue;

But I have no idea how to do the same thing in Java.

I tried the following code:

    URL url = new URL(PageURL);
    URLConnection conn = url.openConnection();

    // For a local file instead:
    // FileInputStream file = new FileInputStream(new File("c:/employees.xml"));
    InputStream file = conn.getInputStream();

    DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = builderFactory.newDocumentBuilder();
    Document xmlDocument = builder.parse(file);

    XPath xPath = XPathFactory.newInstance().newXPath();

    // src attribute of an <img> under a <div> whose class contains "carousel";
    // evaluate() returns the string value of the first match
    String expression = "//div[contains(@class,'carousel')]/descendant-or-self::*[img]/img/@src";
    String imageSrc = xPath.compile(expression).evaluate(xmlDocument);

    Log.d("imageSrc", imageSrc);

Of course, I get an error on the line [InputStream file = conn.getInputStream();], because this is obviously not the right way to do it.

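Can someone help me with a working example? And please, absolutely no HTML parsers such as HTMLCleaner or any similar nonsense. I spent hours trying to get HTMLCleaner to allow "assets" XPath searches and it was a nightmare. I really don't want to deal with it, and I don't want to depend on someone else's library at all.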

After a long search I found the answer. I just needed to open an HTTP connection and read from its input stream, like this:

    URL url = new URL(PageURL);

    HttpURLConnection c = (HttpURLConnection) url.openConnection();
    c.setConnectTimeout(8000);  // milliseconds
    c.setReadTimeout(15000);    // milliseconds

    // Read the whole response into memory, line by line
    BufferedReader inn = new BufferedReader(new InputStreamReader(c.getInputStream()));
    Log.d("TAG", "-----> Got response on Thread" + String.valueOf(j));
    StringBuilder sb = new StringBuilder();
    String l;
    while ((l = inn.readLine()) != null) {
        sb.append(l);
    }
    inn.close();

    Document xmlDocument = builder.parse(sb.toString());

    XPath xPath = XPathFactory.newInstance().newXPath();
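
For comparison with the PHP in the question, here is a minimal sketch of how the same kind of queries might look with the JDK's built-in javax.xml.xpath API. The class name PageQueries, its helper methods, and the SAMPLE markup are made up for illustration; in real use the stream would come from c.getInputStream(). Keep in mind that the built-in parser only accepts well-formed markup, so real-world HTML may be rejected.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class PageQueries {
    // Hypothetical stand-in for a fetched page; it is well-formed on purpose.
    static final String SAMPLE =
            "<html><body>"
            + "<h2 class='title'>First</h2>"
            + "<section id='body'>Hello<br/>world</section>"
            + "<img id='iwi' src='a.png'/>"
            + "</body></html>";

    /** Parses well-formed markup from a stream; anything else is rejected. */
    static Document parse(InputStream in) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input could not be parsed as XML", e);
        }
    }

    /** Roughly $xpath->query($expr)->item(0)->nodeValue: string value of the first match. */
    static String firstValue(Document doc, String expression) {
        try {
            return (String) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    /** Roughly the PHP foreach: appends every match followed by a space. */
    static String joinedValues(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getTextContent()).append(' ');
            }
            return sb.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8)));
        System.out.println(firstValue(doc, "//h2[@class='title']/text()"));
        System.out.println(joinedValues(doc, "//section[@id='body']/text()"));
        System.out.println(firstValue(doc, "//img[@id='iwi']/@src"));
    }
}
```

XPathConstants.STRING yields the string value of the first matching node (or an empty string when nothing matches), while XPathConstants.NODESET returns every match so they can be looped over like the PHP DOMNodeList.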

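From the documentation, it looks like you need an InputSource here: builder.parse(String) treats its argument as a URI, not as the markup itself.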
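
The InputSource approach can be sketched as follows. This is only an illustration under assumptions: the class name MarkupParser, its methods, and the sample markup are hypothetical, and the input must be well-formed for the JDK parser to accept it. Wrapping a Reader in an InputSource makes the parser read the characters themselves, so the downloaded text (or the connection's reader directly) can be handed over without being mistaken for a URI.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class MarkupParser {
    /** Parses character data from any Reader, e.g. an InputStreamReader over a connection. */
    static Document parse(Reader reader) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("markup could not be parsed as XML", e);
        }
    }

    /** Convenience overload for markup that is already held in a String. */
    static Document parse(String markup) {
        return parse(new StringReader(markup));
    }

    /** Evaluates an XPath expression and returns the string value of the first match. */
    static String evaluate(Document doc, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, well-formed sample standing in for sb.toString()
        Document doc = parse(
                "<html><body><div class='carousel'><a><img src='one.png'/></a></div></body></html>");
        System.out.println(evaluate(doc,
                "//div[contains(@class,'carousel')]/descendant-or-self::*[img]/img/@src"));
    }
}
```

With the answer's code, this would correspond to calling builder.parse(new InputSource(new StringReader(sb.toString()))) in place of builder.parse(sb.toString()).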