Java: How do I extract specific links from a Wikipedia article using jsoup?


java hyperlink


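I'm working on an NLP project, and I need to know how to extract the links in the "Introduction" and "Geography" sections of this Wikipedia page:

Can you help me?

Wikipedia doesn't make this easy. I don't think this is elegant, or even very reusable: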

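    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class WikiLinks {
        public static void main(String[] args) throws IOException {
            Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000).get();

            // Intro: start at the first <p> and walk its siblings while they are paragraphs.
            Element intro = doc.body().select("p").first();
            // The null check avoids a NullPointerException when there is no next sibling.
            while (intro != null && intro.tagName().equals("p")) {
                // here you will get an Elements object which you can
                // iterate through to get the links in the intro
                System.out.println(intro.select("a"));
                intro = intro.nextElementSibling();
            }

            // Geography: find the <h2> whose second <span> holds the section title.
            for (Element h2 : doc.body().select("h2")) {
                if (h2.select("span").size() == 2) {
                    if (h2.select("span").get(1).text().equals("Geography")) {
                        Element nextsib = h2.nextElementSibling();
                        while (nextsib != null) {
                            if (nextsib.tagName().equals("p")) {
                                // here you will get an Elements object which you
                                // can iterate through to get the links in the
                                // geography section
                                System.out.println(nextsib.select("a"));
                                nextsib = nextsib.nextElementSibling();
                            } else if (nextsib.tagName().equals("h2")) {
                                // The next <h2> starts a new section, so stop here.
                                nextsib = null;
                            } else {
                                nextsib = nextsib.nextElementSibling();
                            }
                        }
                    }
                }
            }
        }
    }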
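Since the answer itself says the code is not very reusable, here is a minimal sketch of the same sibling walk pulled into a helper that takes the section title as a parameter. The class name `SectionLinks`, the method `linksInSection`, and the inline test HTML are all hypothetical. The sketch also assumes the `<h2>` text equals the section title exactly; the answer's code suggests the real page wraps the title in extra spans, so the matching line would likely need adjusting for live Wikipedia markup.

    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class SectionLinks {

        // Collects the href of every link found in <p> elements that sit
        // between the <h2> titled `heading` and the next <h2>.
        public static List<String> linksInSection(Document doc, String heading) {
            List<String> hrefs = new ArrayList<>();
            for (Element h2 : doc.select("h2")) {
                // Assumption: the heading text is exactly the section title.
                if (!h2.text().trim().equals(heading)) {
                    continue;
                }
                for (Element sib = h2.nextElementSibling();
                     sib != null && !sib.tagName().equals("h2");
                     sib = sib.nextElementSibling()) {
                    if (sib.tagName().equals("p")) {
                        for (Element a : sib.select("a[href]")) {
                            hrefs.add(a.attr("href"));
                        }
                    }
                }
                break; // only the first matching section is used
            }
            return hrefs;
        }

        public static void main(String[] args) {
            // Hypothetical stand-in for a page, so the sketch runs without network access.
            String html = "<h2>Geography</h2><p><a href='/wiki/A'>A</a></p>"
                        + "<h3>Climate</h3><p><a href='/wiki/B'>B</a></p>"
                        + "<h2>History</h2><p><a href='/wiki/C'>C</a></p>";
            Document doc = Jsoup.parse(html);
            System.out.println(linksInSection(doc, "Geography")); // prints [/wiki/A, /wiki/B]
        }
    }

Returning a list of `href` strings rather than printing `Elements` makes the result easier to feed into later processing. To resolve relative links, one option is to parse with a base URI and call `a.absUrl("href")` instead of `a.attr("href")`.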

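This isn't an answer to your question, but maybe it would be easier for you to use the Wikimedia downloads.

What have you tried? It looks like you will have to iterate over the elements until you find another heading tag, which marks a section header.

@beerbajay While related, this is clearly not a duplicate, since this asks specifically about a single element.

@beerbajay It's not a duplicate! I want to know how to use the select() method to extract links from a specific section of a Wikipedia article.

Thanks!!! I tried your code, and it works well for the introduction... I can't say the same for the Geography section: some links are missing, and some come from the Climate section! But anyway, it's a good step! Thank you very much, and if you find a solution, please let me know! I'll do the same! :)

I moved the line nextsib = nextsib.nextElementSibling(); below System.out.println(nextsib.select("a")); that should fix it.

Hmm... I think it's the same!!! Now I see what was missing!! The links at the beginning of the Geography section (the first paragraph of Geography). Sorry, the first version was working!!! It was my mistake... now I'm checking other Wikipedia pages!

So... I see that your code works for some articles, such as Boston, Massachusetts, and New England, but not for London! I'll try to understand why! :) Thanks again.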
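One of the comments above mentions links from the Climate section appearing in the Geography output. A possible explanation, offered as a guess rather than a diagnosis: if Climate is a subsection introduced by a lower-level heading such as `<h3>`, a walk that stops only at the next `<h2>` will pass straight through it. The sketch below stops at any heading instead, so it collects only the paragraphs directly under the starting heading. The class name `SectionLeadLinks`, the method `linksBeforeNextHeading`, and the test HTML are hypothetical.

    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class SectionLeadLinks {

        // Collects link hrefs from the <p> siblings that follow `heading`,
        // stopping at the first heading of any level (h1 to h6).
        public static List<String> linksBeforeNextHeading(Element heading) {
            List<String> hrefs = new ArrayList<>();
            for (Element sib = heading.nextElementSibling();
                 sib != null && !sib.tagName().matches("h[1-6]");
                 sib = sib.nextElementSibling()) {
                if (sib.tagName().equals("p")) {
                    for (Element a : sib.select("a[href]")) {
                        hrefs.add(a.attr("href"));
                    }
                }
            }
            return hrefs;
        }

        public static void main(String[] args) {
            // Hypothetical markup: Climate is an <h3> subsection of Geography.
            String html = "<h2>Geography</h2><p><a href='/wiki/A'>A</a></p>"
                        + "<h3>Climate</h3><p><a href='/wiki/B'>B</a></p>"
                        + "<h2>History</h2><p><a href='/wiki/C'>C</a></p>";
            Document doc = Jsoup.parse(html);
            Element geography = doc.select("h2").first();
            System.out.println(linksBeforeNextHeading(geography)); // prints [/wiki/A]
        }
    }

Whether subsection links belong in the result depends on the NLP task, so the stop condition is the part to adapt.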