Java: extracting https URLs with jsoup

The code below uses jsoup to extract the URLs from a given page:

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {

        String url = "http://shopping.yahoo.com";
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.getElementsByTag("a");

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            // absUrl resolves relative hrefs against the page URL; attr("href") would return the raw value
            print(" * a: <%s>  (%s)", link.absUrl("href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}

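What I want to do is build a crawler that extracts only https sites. I give the crawler a seed link first; it should extract all https links from that page, then visit each extracted link and do the same, until a certain number of URLs has been collected.

My question: the code above extracts every link on a given page. What do I need to change so that it extracts only the links that start with https://?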

You can use jsoup's selectors; they are quite powerful.

doc.select("a[href*=https]");  // href contains "https" anywhere (closest to what you want; for a strict prefix match use a[href^=https://])
doc.select("a[href^=www]");    // href starts with "www"
doc.select("a[href$=.com]");   // href ends with ".com"

And so on. Experiment a little and you will find the selector that matches the links you need.
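To make the difference between "contains https" and "starts with https://" concrete, here is a minimal sketch. It assumes jsoup is on the classpath; the class name `HttpsLinkFilter`, the inline HTML, and the base URI are made up for illustration. Instead of relying on the raw attribute value, it resolves each href with `absUrl` and then checks the prefix, so relative links on an https page are kept as well.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup
class HttpsLinkFilter {
    /** Returns the absolute URL of every anchor whose resolved href starts with https://. */
    static List<String> httpsLinks(Document doc) {
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // absUrl resolves relative hrefs against the document's base URI
            String abs = link.absUrl("href");
            if (abs.startsWith("https://")) {
                result.add(abs);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Inline HTML and base URI are invented for the example; no network access is needed
        String html = "<a href='https://example.com/a'>A</a>"
                + "<a href='http://example.com/b'>B</a>"
                + "<a href='/c'>C</a>"
                + "<a href='http://example.com/?next=https'>D</a>";
        Document doc = Jsoup.parse(html, "https://example.org/");
        System.out.println(httpsLinks(doc));
    }
}
```

Link D is the case where `a[href*=https]` would give a false positive: its href contains "https" in the query string but is served over plain http, so the prefix check drops it.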

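Some sites automatically redirect visitors to their HTTPS version when they arrive over HTTP. Do you want those links as well? That case is a bit harder, because you would have to issue an HTTP request to find out.

Thanks. No, I just want to collect https sites from the internet.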
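The crawl loop described in the question (start from a seed, keep only https links, follow them, stop at a limit) could be sketched roughly as follows. This is one possible shape, not a definitive implementation: `HttpsCrawlSketch`, `crawl`, and the `linkSource` hook are hypothetical names, and the hook stands in for whatever fetches a page and returns its absolute links (in practice, something built on jsoup). The sketch uses an in-memory map of made-up URLs so it runs without network access.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a breadth-first https-only crawl
class HttpsCrawlSketch {
    /**
     * Collects at most `limit` distinct https URLs reachable from `seed`.
     * `linkSource` is an assumed hook that returns the absolute links found on a page.
     */
    static Set<String> crawl(String seed, int limit, Function<String, List<String>> linkSource) {
        Set<String> collected = new LinkedHashSet<>(); // keeps discovery order, drops duplicates
        Set<String> queued = new HashSet<>();          // pages already scheduled for a visit
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        queued.add(seed);
        while (!frontier.isEmpty() && collected.size() < limit) {
            String page = frontier.poll();
            for (String link : linkSource.apply(page)) {
                if (collected.size() >= limit) break;
                if (!link.startsWith("https://")) continue; // skip everything that is not https
                collected.add(link);
                if (queued.add(link)) {
                    frontier.add(link); // visit each https page only once
                }
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" standing in for real pages (all URLs are made up)
        Map<String, List<String>> web = Map.of(
                "https://a.example/", List.of("https://b.example/", "http://plain.example/"),
                "https://b.example/", List.of("https://a.example/", "https://c.example/"));
        Set<String> found = crawl("https://a.example/", 10, page -> web.getOrDefault(page, List.of()));
        System.out.println(found);
    }
}
```

Keeping the queue of pages to visit separate from the set of collected URLs means a page is never fetched twice, even when several pages link back to it. A real crawler would also want error handling and politeness delays, which are left out here.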