Java web crawling by breadth rather than depth

java web-crawler

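I'm building my first web crawler with Java and jsoup. I found the code below, which works, but it isn't what I want. The problem is that it follows links depth-first, whereas I want to crawl pages breadth-first. I spent some time rewriting the code to focus on breadth, but it still goes too deep starting from the first link. Any idea how I can crawl by breadth instead?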

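import java.io.IOException;
import java.util.HashSet;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WebCrawlerWithDepth {
    private static final int MAX_DEPTH = 4;
    private HashSet<String> links;

    public WebCrawlerWithDepth() {
        links = new HashSet<>();
    }

    // Recursive: each link is followed down to MAX_DEPTH before its siblings are visited.
    public void getPageLinks(String URL, int depth) throws IOException {
        if (!links.contains(URL) && depth < MAX_DEPTH) {
            System.out.println("Depth: " + depth + " " + URL);
            links.add(URL);

            Document document = Jsoup.connect(URL).get();
            Elements linksOnPage = document.select("a[href]");

            depth++;
            for (Element page : linksOnPage) {
                getPageLinks(page.attr("abs:href"), depth);
            }
        }
    }
}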

It's basically the same as switching from depth-first to breadth-first in any algorithm: you need a queue.

Add every link you extract to the queue, and take the next page to crawl from that queue.

Here is my take on your code:

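import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WebCrawlerWithDepth {

    private static final int MAX_DEPTH = 4;
    private Set<String> visitedLinks;
    private Queue<Link> remainingLinks;

    public WebCrawlerWithDepth() {
        visitedLinks = new HashSet<>();
        remainingLinks = new LinkedList<>();
    }

    public void getPageLinks(String url, int depth) throws IOException {
        // Mark the start page as seen so a link back to it is not queued again.
        visitedLinks.add(url);
        remainingLinks.add(new Link(url, 0));
        // Clamp the requested depth to the range [1, MAX_DEPTH].
        int maxDepth = Math.max(1, Math.min(depth, MAX_DEPTH));
        processLinks(maxDepth);
    }

    private void processLinks(final int maxDepth) throws IOException {
        while (!remainingLinks.isEmpty()) {
            // FIFO order: every page at level n is fetched before any page at level n + 1.
            Link link = remainingLinks.poll();
            int depth = link.level;
            if (depth < maxDepth) {
                Document document = Jsoup.connect(link.url).get();
                Elements linksOnPage = document.select("a[href]");
                for (Element page : linksOnPage) {
                    // "abs:href" resolves relative links against the page URL;
                    // it is empty when no absolute URL can be built.
                    String href = page.attr("abs:href");
                    if (!href.isEmpty() && visitedLinks.add(href)) {
                        remainingLinks.offer(new Link(href, depth + 1));
                    }
                }
            }
        }
    }

    // A URL paired with the level (distance from the start page) at which it was found.
    static class Link {

        final String url;
        final int level;

        Link(final String url, final int level) {
            this.url = url;
            this.level = level;
        }
    }
}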

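Instead of iterating directly over the links on the current page, you need to store them in a queue. The queue should hold all the links still to be visited, from all pages. You then take the next link to visit from the queue.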
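To see how the queue changes the visiting order without touching the network, here is a minimal sketch that runs both strategies over a small in-memory link graph. The class name `CrawlOrderDemo`, the page names "A" to "E", and the `LINKS` map are all made up for illustration; in a real crawler the links would come from jsoup instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlOrderDemo {

    // Hypothetical link graph standing in for real pages: page -> links found on it.
    static final Map<String, List<String>> LINKS = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("D"),
            "C", List.of("E"),
            "D", List.of(),
            "E", List.of());

    // Depth-first: the recursion follows the first link all the way down
    // before it returns and tries the next link on the same page.
    static void depthFirst(String page, int depth, int maxDepth,
                           Set<String> seen, List<String> order) {
        if (depth >= maxDepth || !seen.add(page)) {
            return;
        }
        order.add(page);
        for (String next : LINKS.getOrDefault(page, List.of())) {
            depthFirst(next, depth + 1, maxDepth, seen, order);
        }
    }

    // Breadth-first: links go to the back of a FIFO queue, so all pages at
    // one level are visited before any page at the next level.
    static List<String> breadthFirst(String start, int maxDepth) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Queue<Map.Entry<String, Integer>> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(Map.entry(start, 0));
        while (!queue.isEmpty()) {
            Map.Entry<String, Integer> current = queue.poll();
            String page = current.getKey();
            int depth = current.getValue();
            if (depth >= maxDepth) {
                continue; // too deep: do not visit or expand this page
            }
            order.add(page);
            for (String next : LINKS.getOrDefault(page, List.of())) {
                if (seen.add(next)) {
                    queue.add(Map.entry(next, depth + 1));
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<String> dfsOrder = new ArrayList<>();
        depthFirst("A", 0, 3, new HashSet<>(), dfsOrder);
        System.out.println("depth-first:   " + dfsOrder);
        System.out.println("breadth-first: " + breadthFirst("A", 3));
    }
}
```

With this graph the recursive version reaches D (two levels down) before it ever looks at C, while the queue version visits B and C first and only then moves on to D and E.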
Do you mean you want to use breadth-first search?