Java 8 CompletableFuture web crawler doesn't crawl beyond one URL

multithreading concurrency java-8 web-crawler

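I'm playing with the concurrency features newly introduced in Java 8, working through the exercises in Cay S. Horstmann's book "Java SE 8 for the Really Impatient". I created the following web crawler using the new CompletableFuture and jsoup. The basic idea is that, given a URL, it finds the first m URLs on that page and repeats the process n times; m and n are, of course, parameters. The problem is that the program fetches the URLs of the initial page but does not recurse. What am I missing?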

static class WebCrawler {
    CompletableFuture<Void> crawl(final String startingUrl,
        final int depth, final int breadth) {
        if (depth <= 0) {
            return completedFuture(startingUrl, depth);
        }

        final CompletableFuture<Void> allDoneFuture = allOf((CompletableFuture[]) of(
            startingUrl)
            .map(url -> supplyAsync(getContent(url)))
            .map(docFuture -> docFuture.thenApply(getURLs(breadth)))
            .map(urlsFuture -> urlsFuture.thenApply(doForEach(
                depth, breadth)))
            .toArray(size -> new CompletableFuture[size]));

        allDoneFuture.join();

        return allDoneFuture;
    }

    private CompletableFuture<Void> completedFuture(
        final String startingUrl, final int depth) {
        LOGGER.info("Link: {}, depth: {}.", startingUrl, depth);

        CompletableFuture<Void> future = new CompletableFuture<>();
        future.complete(null);

        return future;
    }

    private Supplier<Document> getContent(final String url) {
        return () -> {
            try {
                return connect(url).get();
            } catch (IOException e) {
                throw new UncheckedIOException(
                    " Something went wrong trying to fetch the contents of the URL: "
                        + url, e);
            }
        };
    }

    private Function<Document, Set<String>> getURLs(final int limit) {
        return doc -> {
            LOGGER.info("Getting URLs for document: {}.", doc.baseUri());

            return doc.select("a[href]").stream()
                .map(link -> link.attr("abs:href")).limit(limit)
                .peek(LOGGER::info).collect(toSet());
        };
    }

    private Function<Set<String>, Stream<CompletableFuture<Void>>> doForEach(
          final int depth, final int breadth) {
        return urls -> urls.stream().map(
            url -> crawl(url, depth - 1, breadth));
    }
}

The problem is in the following code:

final CompletableFuture<Void> allDoneFuture = allOf(
  (CompletableFuture[]) of(startingUrl)
    .map(url -> supplyAsync(getContent(url)))
    .map(docFuture -> docFuture.thenApply(getURLs(breadth)))
    .map(urlsFuture -> urlsFuture.thenApply(doForEach(depth, breadth)))
    .toArray(size -> new CompletableFuture[size]));
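
One way to see why this snippet stops after the first page: doForEach returns a Stream<CompletableFuture<Void>>, and thenApply merely wraps that stream in a future. Streams are lazy and nothing runs a terminal operation on this one, so the nested crawl calls never execute; allOf only waits for the stream object to be created. The comments mention thenCompose as the fix, and the sketch below illustrates that idea under some assumptions: it replaces jsoup with a hypothetical in-memory link map, and LINKS, linksOf, crawlLazy and crawlEager are illustrative names that do not appear in the original code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

class LazyCrawlDemo {
    // Hypothetical in-memory "web" (page -> links) standing in for jsoup fetches.
    static final Map<String, List<String>> LINKS = new HashMap<>();
    static {
        LINKS.put("a", Arrays.asList("b", "c"));
        LINKS.put("b", Arrays.asList("d"));
    }

    static List<String> linksOf(String url) {
        return LINKS.getOrDefault(url, Collections.<String>emptyList());
    }

    // Same shape as the question: thenApply only wraps a lazy Stream in a future.
    static CompletableFuture<?> crawlLazy(String url, int depth, Set<String> visited) {
        visited.add(url);
        if (depth <= 0) {
            return CompletableFuture.completedFuture(null);
        }
        CompletableFuture<Stream<CompletableFuture<?>>> pending = CompletableFuture
            .supplyAsync(() -> linksOf(url))
            .thenApply(links -> links.stream()
                .map(link -> crawlLazy(link, depth - 1, visited)));
        return pending; // completes once the Stream exists; map() has not run
    }

    // thenCompose flattens the nested future, and toArray is a terminal
    // operation, so every child crawl is actually started and awaited.
    static CompletableFuture<Void> crawlEager(String url, int depth, Set<String> visited) {
        visited.add(url);
        if (depth <= 0) {
            return CompletableFuture.completedFuture(null);
        }
        return CompletableFuture
            .supplyAsync(() -> linksOf(url))
            .thenCompose(links -> CompletableFuture.allOf(links.stream()
                .map(link -> crawlEager(link, depth - 1, visited))
                .toArray(CompletableFuture[]::new)));
    }

    public static void main(String[] args) {
        Set<String> lazy = ConcurrentHashMap.newKeySet();
        crawlLazy("a", 2, lazy).join();
        System.out.println("lazy visited:  " + new TreeSet<>(lazy));

        Set<String> eager = ConcurrentHashMap.newKeySet();
        crawlEager("a", 2, eager).join();
        System.out.println("eager visited: " + new TreeSet<>(eager));
    }
}
```

In this sketch the lazy variant records only the starting page, while the thenCompose variant reaches every page within the depth limit. Note that crawlEager never calls join() itself; only the outermost caller blocks.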

In allOf((CompletableFuture[]) of(startingUrl) ..., what are allOf and of? What is Document? Please post a reproducible example.

@SotiriosDelimanolis This is working code; allOf and of are static imports, and Document is a jsoup class. I didn't want to clutter the post with a pile of imports.

Here it is. It seems fine to me. How do you call crawl? Please also post the result you get and the result you actually expect.

Thanks. I found the problem and solved it in a slightly different way. I'll accept your answer because it makes more sense. Unfortunately, code in a comment looks ugly:

of(startingUrl).map(url -> supplyAsync(getContent(url))).map(docFuture -> docFuture.thenApply(getURLs(breadth))).map(urlsFuture -> urlsFuture.thenAccept(doForEach(depth, breadth))).findFirst().orElseThrow(completionException("Error crawling URL: " + startingUrl)).join()

I also made some other changes to support the above; I would show them, but not in a comment. By the way, the thenCompose in your answer can be changed to thenAccept, which I think is more appropriate. In this case they work the same.

I'm doing something similar and ran into a problem when running with an ExecutorService and multiple threads... If I pass the executor service to supplyAsync, it only crawls N pages (N = number of threads). Any ideas?
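
On the last question (only N pages with an N-thread executor), the thread gives no confirmed answer. One plausible cause, stated here as an assumption, is that each task calls join() on a pool thread while its child tasks are queued on the same fixed pool, so every thread ends up waiting and nothing else can run. The sketch below shows a non-blocking alternative in which no pool thread ever waits. It uses a hypothetical in-memory link map instead of jsoup, and the names BoundedPoolCrawl, LINKS, linksOf and crawlAll are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BoundedPoolCrawl {
    // Hypothetical in-memory "web" (page -> links) standing in for real fetches.
    static final Map<String, List<String>> LINKS = new HashMap<>();
    static {
        LINKS.put("a", Arrays.asList("b", "c"));
        LINKS.put("b", Arrays.asList("d"));
        LINKS.put("c", Arrays.asList("e"));
        LINKS.put("d", Arrays.asList("f"));
    }

    static List<String> linksOf(String url) {
        return LINKS.getOrDefault(url, Collections.<String>emptyList());
    }

    // Fetches run on the given pool; child crawls are chained with thenCompose
    // instead of being awaited with join(), so pool threads are never blocked.
    static CompletableFuture<Void> crawl(String url, int depth,
                                         Set<String> visited, Executor pool) {
        visited.add(url);
        if (depth <= 0) {
            return CompletableFuture.completedFuture(null);
        }
        return CompletableFuture
            .supplyAsync(() -> linksOf(url), pool)
            .thenCompose(links -> CompletableFuture.allOf(links.stream()
                .map(link -> crawl(link, depth - 1, visited, pool))
                .toArray(CompletableFuture[]::new)));
    }

    // Crawls from start with a fixed pool of the given size and returns the
    // pages seen; only the calling thread blocks while waiting for the result.
    static Set<String> crawlAll(String start, int depth, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Set<String> visited = ConcurrentHashMap.newKeySet();
            crawl(start, depth, visited, pool).join();
            return visited;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Six pages are reachable within depth 3, more than the two pool threads.
        System.out.println(new TreeSet<>(crawlAll("a", 3, 2)));
    }
}
```

The design choice is to keep join() at the outermost call only. Whether this matches the commenter's actual setup cannot be told from the thread, so treat it as a starting point for diagnosis.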