Java: Using Jsoup, how do I get all the information from each link?


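If you connect to a URL, Jsoup parses only that one page. But you can 1.) connect to a URL, 2.) parse the information you need, 3.) select all further links, 4.) connect to them, and 5.) continue as long as there are new links.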

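Considerations:

You need a list (?) or some other place to store the links you have already parsed
You have to decide whether you want only links within this site or external links as well
You have to skip pages such as "About", "Contact", etc.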
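
The last two considerations can be sketched as a small filter that runs before a link is followed. This is only a minimal sketch using the JDK, not part of Jsoup: the class name `LinkFilter`, the method `shouldFollow`, and the keyword list are assumptions made for illustration, and the substring match on the path is deliberately coarse.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Locale;

public class LinkFilter {
    // Hypothetical path keywords for pages that carry no data of interest.
    private static final List<String> SKIPPED_PATH_KEYWORDS = List.of("about", "contact");

    /** Returns true if href is on the same host as startUrl and is not a skipped page. */
    public static boolean shouldFollow(String startUrl, String href) {
        try {
            String startHost = new URI(startUrl).getHost();
            URI link = new URI(href);
            String linkHost = link.getHost();
            if (startHost == null || linkHost == null) {
                return false; // relative, empty, or non-hierarchical link (e.g. mailto:)
            }
            if (!startHost.equalsIgnoreCase(linkHost)) {
                return false; // external link
            }
            String path = link.getPath() == null ? "" : link.getPath().toLowerCase(Locale.ROOT);
            for (String keyword : SKIPPED_PATH_KEYWORDS) {
                if (path.contains(keyword)) {
                    return false; // coarse match: also skips e.g. /roundabout
                }
            }
            return true;
        } catch (URISyntaxException e) {
            return false; // malformed link
        }
    }
}
```

If external links are wanted as well, the host comparison can simply be dropped.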

Edit:
(Note: you will have to add some changes and error-handling code.)

Here is an example of how to parse the next links:

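import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Set<String> visitedUrls = new HashSet<>(); // Stores all links you have already visited
Set<String> ignore = new HashSet<>(); // Stores all keywords you want to ignore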

// ...


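/*
 * Add keywords to the ignore list. Each link that contains one of these
 * keywords will be skipped.
 *
 * Do this in e.g. a constructor, a static block or an init method.
 *
 * Note: because of the leading dot, ".twitter.com" matches www.twitter.com
 * and other subdomains, but not https://twitter.com/... itself.
 */
ignore.add(".twitter.com");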

// ...


public void visitUrl(String url) throws IOException
{
    url = url.toLowerCase(); // Case-insensitive comparison (note: this also lower-cases the path)

    if( !visitedUrls.contains(url) ) // Do this only if the URL has not been visited yet
    {
        visitedUrls.add(url); // Mark as visited before recursing, otherwise cyclic links never terminate

        Document doc = Jsoup.connect(url).get(); // Connect to the URL and parse the document

        /* ... Select your data here ... */

        Elements nextLinks = doc.select("a[href]"); // Select the next links - add more restrictions!

        for( Element next : nextLinks ) // Iterate over all links
        {
            final String href = next.absUrl("href"); // Absolute URL of the 'href' attribute -> next link to parse
            boolean skip = href.isEmpty(); // true: skip the link (absUrl returns "" if it cannot be resolved)

            for( String s : ignore ) // Iterate over all ignored keywords - maybe there's a better solution for this
            {
                if( href.contains(s) ) // A URL that contains an ignored keyword is skipped
                {
                    skip = true;
                    break;
                }
            }

            if( !skip )
                visitUrl(href); // Recursive call for each remaining link
        }
    }
}

But you should probably add some more stop conditions to this part.
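
Two common stop conditions are a maximum link depth and a maximum number of pages. The following is a minimal sketch of that idea, not the method from the answer: it swaps the recursion for a level-by-level (breadth-first) traversal, and the class name `BoundedCrawler`, both limits, and the `linkExtractor` function are assumptions made for illustration. The page fetching is passed in as a function so the traversal can be tried without network access.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class BoundedCrawler {
    private final int maxDepth;  // 0 = visit only the start page
    private final int maxPages;  // upper bound on the number of visited pages
    private final Function<String, List<String>> linkExtractor; // url -> absolute links on that page

    public BoundedCrawler(int maxDepth, int maxPages,
                          Function<String, List<String>> linkExtractor) {
        this.maxDepth = maxDepth;
        this.maxPages = maxPages;
        this.linkExtractor = linkExtractor;
    }

    /** Visits pages level by level and returns them in visiting order. */
    public Set<String> crawl(String startUrl) {
        Set<String> visited = new LinkedHashSet<>();
        List<String> currentLevel = List.of(startUrl);

        for (int depth = 0; depth <= maxDepth && !currentLevel.isEmpty(); depth++) {
            List<String> nextLevel = new ArrayList<>();
            for (String url : currentLevel) {
                if (visited.size() >= maxPages) {
                    return visited; // stop condition: page budget used up
                }
                if (!visited.add(url)) {
                    continue; // already visited
                }
                nextLevel.addAll(linkExtractor.apply(url));
            }
            currentLevel = nextLevel; // stop condition: loop ends once depth exceeds maxDepth
        }
        return visited;
    }
}
```

With Jsoup, `linkExtractor` would wrap `Jsoup.connect(url).get()`, select `a[href]`, and collect `absUrl("href")` for each element, handling the `IOException` inside the function.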

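Thanks, ollo. I can connect to the URL and get all the link names. But how do I connect to all the other links and parse their information? Please give me some suggestions. Thanks in advance.

See the "Edit" for a short example.

If my post helped you, feel free to upvote it. But does it work, or do you need further help?

Hi ollo, I need to know how to skip a specific URL and how to move on to the next link. My problem is that if my URL contains any Twitter link, the crawl stays on the Twitter domain, like a loop. I want that loop to break and move on to the next link in the list. I have tried a lot, but I am stuck. Please help me, ollo.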
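
One way to keep the crawl off a whole domain is to compare the host of each link against a blocklist instead of searching the full URL string. This is only a sketch under assumptions: the class name `DomainBlocklist` and the method `isBlocked` are made up for illustration and are not part of Jsoup. Unlike a plain `contains` check, it matches `twitter.com` and its subdomains but not a URL that merely mentions the domain in its query string.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

public class DomainBlocklist {
    private final Set<String> blockedDomains; // lower-case domains, e.g. "twitter.com"

    public DomainBlocklist(Set<String> blockedDomains) {
        this.blockedDomains = blockedDomains;
    }

    /** Returns true if the URL's host is a blocked domain or one of its subdomains. */
    public boolean isBlocked(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return true; // unparseable URL: treat as not crawlable
        }
        if (host == null) {
            return true; // no host (empty, relative, mailto:, ...): nothing to connect to
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : blockedDomains) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }
}
```

Inside the link loop, a blocked link is then skipped with `continue`, so the loop moves on to the next link in the list.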

// Simpler variant without the ignore list
Set<String> visitedUrls = new HashSet<>(); // Stores all links you have already visited


public void visitUrl(String url) throws IOException
{
    url = url.toLowerCase(); // Case-insensitive comparison (note: this also lower-cases the path)

    if( !url.isEmpty() && !visitedUrls.contains(url) ) // Do this only if the URL has not been visited yet
    {
        visitedUrls.add(url); // Mark as visited before recursing, otherwise cyclic links never terminate

        Document doc = Jsoup.connect(url).get(); // Connect to the URL and parse the document

        /* ... Select your data here ... */

        Elements nextLinks = doc.select("a[href]"); // Select the next links - add more restrictions!

        for( Element next : nextLinks ) // Iterate over all links
        {
            visitUrl(next.absUrl("href")); // Recursive call for each next link
        }
    }
}
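
The `url.toLowerCase()` call above also lower-cases the path and query, which are case-sensitive on many servers. A gentler way to build the key for the visited check is to lower-case only the scheme and host and to drop the fragment. The following is a sketch under assumptions: the class name `UrlNormalizer` and the method `normalize` are made up for illustration.

```java
import java.net.URI;
import java.util.Locale;

public class UrlNormalizer {
    /**
     * Lower-cases scheme and host, removes "." and ".." path segments,
     * and drops the #fragment. Path and query keep their case.
     */
    public static String normalize(String url) {
        URI uri = URI.create(url).normalize(); // throws IllegalArgumentException if malformed
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute URL: " + url);
        }
        StringBuilder result = new StringBuilder();
        result.append(uri.getScheme().toLowerCase(Locale.ROOT));
        result.append("://");
        result.append(uri.getHost().toLowerCase(Locale.ROOT));
        if (uri.getPort() != -1) {
            result.append(':').append(uri.getPort());
        }
        result.append(uri.getRawPath()); // raw = percent-encoding left untouched
        if (uri.getRawQuery() != null) {
            result.append('?').append(uri.getRawQuery());
        }
        return result.toString();
    }
}
```

The normalized string is meant only as the key stored in `visitedUrls`; the page itself can still be requested with the original URL.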

