What is the fastest way to scrape an HTML web page in Android?


android html web-scraping


I need to extract information from an unstructured web page in Android. The information I want is embedded in a table that has no id.

<table> 
<tr><td>Description</td><td></td><td>I want this field next to the description cell</td></tr> 
</table>



Should I

use pattern matching, or
use a BufferedReader to extract the information?

Or is there a faster way to get this information?

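Why not create a script that does the scraping with cURL and then grabs the value you need from that page? These tools work with PHP, but equivalent tools exist for whatever language you need.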

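One way is to put the HTML into a String and then search and parse that String by hand. If you know the tags will appear in a specific order, you should be able to crawl through it and find the data. It is kind of sloppy, though, so the question is: do you want it to work now, or do you want it to work well?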

int position = html.indexOf("<table>"); // html is the String holding the HTML code
String field = html.substring(html.indexOf("<td>", html.indexOf("<td>", position)) + 4, html.indexOf("</td>", html.indexOf("</td>", position)));

Like I said... really sloppy. But if you are only doing this once and you just need it to work, this might do the trick.

Why don't you just write

int start = data.indexOf("Description");

After that, take the substring you need.
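
To make the "find the label, then take the substring" idea concrete, here is a minimal sketch for the table in the question. The class name `DescriptionCell` and the helper `cellAfter` are made up for illustration, and it assumes the markup uses lowercase `<td>` tags and that the label occurs only once on the page.

```java
final class DescriptionCell {

    // Returns the text of the n-th <td> that follows the first occurrence of `label`,
    // or null if the label or that cell cannot be found. Hypothetical helper.
    static String cellAfter(String html, String label, int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        int pos = html.indexOf(label);
        if (pos < 0) return null;
        int open = -1;
        for (int i = 0; i < n; i++) {
            open = html.indexOf("<td", pos);   // next opening tag after the current position
            if (open < 0) return null;
            pos = open + 3;
        }
        int contentStart = html.indexOf('>', open) + 1; // skip attributes, if any
        int contentEnd = html.indexOf("</td>", contentStart);
        if (contentStart == 0 || contentEnd < 0) return null;
        return html.substring(contentStart, contentEnd).trim();
    }

    public static void main(String[] args) {
        String html = "<table>"
                + "<tr><td>Description</td><td></td><td>I want this field next to the description cell</td></tr>"
                + "</table>";
        // The wanted value is the second cell after the label (the first one is empty).
        System.out.println(cellAfter(html, "Description", 2));
        // prints: I want this field next to the description cell
    }
}
```

This only works while the page keeps exactly this layout; an extra cell or an uppercase tag shifts or loses the result.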

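The fastest way is to parse the specific information yourself. You seem to know the HTML structure beforehand, so plain stream-reading and string methods should suffice. Here is a kickoff example that displays the first paragraph of your own question: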

public static void main(String... args) throws Exception {
    URL url = new URL("http://stackoverflow.com/questions/2971155");
    BufferedReader reader = null;
    StringBuilder builder = new StringBuilder();
    try {
        reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        for (String line; (line = reader.readLine()) != null;) {
            builder.append(line.trim());
        }
    } finally {
        if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
    }

    String start = "<div class=\"post-text\"><p>";
    String end = "</p>";
    String part = builder.substring(builder.indexOf(start) + start.length());
    String question = part.substring(0, part.indexOf(end));
    System.out.println(question);
}

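It is unclear which web page you are talking about, so I can't give a more detailed example of how to select the specific information from that page with Jsoup. If you still can't figure it out on your own using Jsoup, feel free to post the URL in a comment and I'll suggest how to do it.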

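When you scrape an HTML web page, there are two things you can do. The first is to use regular expressions. The other is to use an HTML parser.

Not everyone likes using regular expressions, because they can cause logic errors at runtime.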
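
For reference, the regular-expression route for the table in the question could look roughly like this. It is only a sketch: the class name `RegexCell` and the pattern are illustrative, and the pattern assumes the row always has the shape label cell, one cell in between, value cell.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexCell {

    // Matches the "Description" cell, skips one further cell, and captures the cell after it.
    private static final Pattern DESCRIPTION_ROW = Pattern.compile(
            "<td[^>]*>\\s*Description\\s*</td>\\s*"  // label cell
          + "<td[^>]*>.*?</td>\\s*"                   // the (empty) cell in between
          + "<td[^>]*>(.*?)</td>",                    // the cell we want
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the captured cell text, or null when the row does not match.
    static String extract(String html) {
        Matcher m = DESCRIPTION_ROW.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<table>\n"
                + "<tr><td>Description</td><td></td><td>I want this field next to the description cell</td></tr>\n"
                + "</table>";
        System.out.println(extract(html));
    }
}
```

Note how the failure mode matches the complaint above: if the page layout changes, nothing throws, the method just returns null or the wrong cell.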

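Using an HTML parser is far more complicated. You can't be sure it will give you the right output, and in my experience it also threw some exceptions at runtime.

So it is better to turn the URL response into an XML file. Doing that is very easy and effective.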

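I think that in this case it makes no sense to look for a fast way to extract the information, because compared with the time it takes to download the HTML there is virtually no performance difference between the methods already suggested in the other answers.

So, assuming that by "fastest" you mean the most convenient, readable and maintainable code, I suggest you use a DocumentBuilder to parse the relevant HTML and XPathExpressions to extract the data (see the DocumentBuilder/XPath snippet further down).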

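If you happen to retrieve invalid HTML, I recommend isolating the relevant portion (for example, using substring(indexOf(...))).

Comments:

jsoup depends on the Apache Commons Lang library.
@Josef: I don't see that as a valid reason for a downvote.
You shouldn't parse HTML with regular expressions.
Hi Josef, how can we iterate over all the tables in the table?
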

The same extraction with Jsoup:

public static void main(String... args) throws Exception {
    Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155").get();
    String question = document.select("#question .post-text p").first().text();
    System.out.println(question);
}

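The DocumentBuilder and XPathExpression approach, applied to the table in the question (html is the String holding the HTML code):

Document doc = DocumentBuilderFactory.newInstance()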
  .newDocumentBuilder().parse(new InputSource(new StringReader(html)));

XPathExpression xpath = XPathFactory.newInstance()
  .newXPath().compile("//td[text()=\"Description\"]/following-sibling::td[2]");

String result = (String) xpath.evaluate(doc, XPathConstants.STRING);
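
Putting the XPath snippet together with the advice about invalid HTML, a self-contained version might look like the sketch below. The class name `XPathCell` and the helper `isolateTable` are hypothetical, and it assumes the table fragment itself is well-formed XML even if the rest of the page is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathCell {

    // Cuts out the first <table>...</table> fragment so the XML parser never sees the rest of the page.
    static String isolateTable(String html) {
        int start = html.indexOf("<table");
        int end = html.indexOf("</table>", Math.max(start, 0));
        if (start < 0 || end < 0) throw new IllegalArgumentException("no table found");
        return html.substring(start, end + "</table>".length());
    }

    // Returns the text of the second cell after the "Description" cell.
    static String descriptionValue(String html) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(isolateTable(html))));
            return XPathFactory.newInstance().newXPath()
                    .evaluate("//td[text()='Description']/following-sibling::td[2]", doc);
        } catch (Exception e) {
            // Parser configuration, I/O, SAX and XPath failures all end up here.
            throw new IllegalStateException("could not extract the description cell", e);
        }
    }

    public static void main(String[] args) {
        // The unclosed <br> would break an XML parser, but it sits outside the table.
        String html = "<html><body><p>intro<br>"
                + "<table><tr><td>Description</td><td></td>"
                + "<td>I want this field next to the description cell</td></tr></table>"
                + "</body></html>";
        System.out.println(descriptionValue(html));
    }
}
```

If the table itself contains unclosed tags or undeclared entities such as `&nbsp;`, the parse step still fails, and those would have to be patched with String operations first.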