Java Jsoup: getting the HTML between two tags


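On sites like this one, sections such as "Districts", "Understand", and "Get in" are not actually wrapped in an element that contains the whole section. Each section is really just a span with a class inside a heading. Because of that, we can't simply grab a part of the wiki document by selecting an id.

Still, is it possible to collect all of the HTML between two tags? Say I want the "Get around" section. How would I write a selector that gets everything between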

<h2><span class="editsection">[<a href="/wiki/en/index.php?title=San_Francisco&amp;action=edit&amp;section=15" title="Edit section: Get around">edit</a>]</span> <span class="mw-headline" id="Get_around">Get around</span></h2>

(the "Get around" heading) and the "See" heading?

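Ouch. This kind of HTML isn't easy to work with. I assume you're scraping, so I understand that sometimes this is what we have to deal with. Going by the tags on the question, I'll give Jsoup a try. In general, there is no selector that handles unstructured HTML like this. What you can do is select all next siblings of the first h2, then remove all next siblings of the second h2. To add to the pain, the section headings can only be identified by their text content, so we need the :contains selector. Like this: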

Document doc = Jsoup.connect("http://wikitravel.org/en/San_Francisco").get();
//select all "next siblings" of the "Get around" h2
Elements section = doc.select("h2:contains(Get around) ~ *");
//select all "next siblings" of the "See" h2 and remove them
section.select("h2:contains(See) ~ *").remove();
//remove the second h2
section.select("h2").remove();
//section now contains the elements between "Get around" and "See"
String sectionHtml = section.html();
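
As a complement to the selector-based snippet above, the same "everything between two headings" idea can be written as an explicit sibling walk, which does not depend on how a sibling selector behaves when it is applied to an existing list of elements. This is only a minimal sketch, not the answerer's code: the class name, the helper sectionHtml, and the miniature page in main are made up for illustration, and it assumes the span.mw-headline is a direct child of its h2, as in the HTML quoted in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SectionBetweenHeadings {

    // Hypothetical helper (not part of Jsoup): collects the outer HTML of every
    // element that follows the heading owning `headlineId`, up to the next <h2>.
    static String sectionHtml(Document doc, String headlineId) {
        Element headline = doc.getElementById(headlineId);
        if (headline == null || headline.parent() == null) {
            return "";
        }
        // Assumption: the span.mw-headline is a direct child of its <h2>.
        Element heading = headline.parent();
        StringBuilder html = new StringBuilder();
        for (Element el = heading.nextElementSibling();
             el != null && !el.tagName().equals("h2");
             el = el.nextElementSibling()) {
            html.append(el.outerHtml()).append('\n');
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Made-up miniature page mirroring the heading structure in the question.
        String page =
            "<h2><span class=\"mw-headline\" id=\"Get_around\">Get around</span></h2>"
          + "<h3>Navigating</h3><p>On foot</p>"
          + "<h2><span class=\"mw-headline\" id=\"See\">See</span></h2><p>Sights</p>";
        Document doc = Jsoup.parse(page);
        System.out.println(sectionHtml(doc, "Get_around"));
    }
}
```

Because the walk stops at whatever h2 comes next, it does not need to know that the following section is called "See". Looking the heading up by the headline id also avoids matching on heading text.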

After doing the same thing with jQuery, here is some Firebug output. The first selector (h2:contains(Get around) ~ *) returns an Elements object containing these elements (a very long list, abridged in the middle here):

[h3, p, p, p, ..., ul, ul, ul, h2, p, p, p, ul, p, ul, ul, ...]

where the first h3 is the "Navigating" one and the last p has odd contents (weird HTML, yes). The second select and remove reduces it to:

[h3, p, p, p, p, h3, p, p, p, p, p, p, p, p, p, p, p, p, p, ul, ul, ul, ul, p, ul, p, ul, ul, h3, p, p, p, p, p, p, p, p, p, p, p, p, p, p, h2]

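where the first h3 is still the "Navigating" one and the last h2 is the "See" one you quoted. The select("h2") and remove then results in:

[h3, p, p, p, p, h3, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, ul, ul, ul, ul, ul, p, ul, ul, h3, p, p, p, p, p, p, p, p, p, p, p]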

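which contains all of the elements between the "Get around" h2 and the "See" h2.

select("h2 ~ *").remove(); Would this solution work if "See" is not guaranteed to be the next section, as long as there are no other h2 elements inside the "Get around" section?

Hmm... what am I doing wrong in this code? I can't isolate the Climate section.

I think the :contains pseudo-selector is case-sensitive per the CSS standard. Try capitalizing "Climate" and "Literature":

With that code, nothing changed. I still get the same result.

Admittedly, it has been a while since I last used Jsoup. You may need to assign the result of each select() and remove() back to the section object. We should probably use not() so that section still contains the originally matched elements. Like this:

Elements section = doc.select("h2:contains(Get around) ~ *");
section = section.not("h2");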
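
The not() suggestion in the last comment can be tried in isolation. The following is a small sketch under stated assumptions: the class name, the helper followingWithoutHeadings, and the four-element test page are invented for illustration. It only shows that Elements.not() returns a new, filtered list and leaves the document alone; it does not by itself cut the list off at the next heading, so content from later sections is still present.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class NotVersusRemove {

    // Hypothetical helper: following siblings of the h2 whose text contains
    // `title`, with any <h2> entries filtered out of the returned list.
    // Assumption: `title` holds no characters that are special in a selector.
    static Elements followingWithoutHeadings(Document doc, String title) {
        Elements following = doc.select("h2:contains(" + title + ") ~ *");
        // not() returns a new list; `following` and the document are unchanged.
        return following.not("h2");
    }

    public static void main(String[] args) {
        // Made-up miniature page with two sections.
        Document doc = Jsoup.parse(
            "<h2>Get around</h2><p>On foot</p><h2>See</h2><p>Sights</p>");
        Elements filtered = followingWithoutHeadings(doc, "Get around");
        System.out.println(filtered.size());         // count of non-h2 siblings
        System.out.println(doc.select("h2").size()); // headings remain in the document
    }
}
```

By contrast, calling remove() on an Elements list detaches the matched elements from the document, which is why the two approaches are not interchangeable.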