Java: Why doesn't crawler4j's non-blocking method wait for links in the queue?


Given the following simple code:

// Imports needed by this snippet (Config is my own helper class):
import java.util.concurrent.TimeUnit;
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

CrawlConfig config = new CrawlConfig();
config.setMaxDepthOfCrawling(1);
config.setPolitenessDelay(1000);
config.setResumableCrawling(false);
config.setIncludeBinaryContentInCrawling(false);
config.setCrawlStorageFolder(Config.get(Config.CRAWLER_SHARED_DIR) + "test/");
config.setShutdownOnEmptyQueue(false);
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://localhost/test");

controller.startNonBlocking(WebCrawler.class, 1);


long counter = 1;
while(Thread.currentThread().isAlive()) {
    System.out.println(config.toString());
    for (int i = 0; i < 4; i++) {
        System.out.println("Adding link");
        controller.addSeed("http://localhost/test" + ++counter + "/");
    }

    try {
        TimeUnit.SECONDS.sleep(5);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
Why does crawler4j not visit test6, test7 and beyond?

As you can see, all 4 links added before that point are added and visited correctly.

When I set "" as the seed URL (before starting the crawler), it processes at most 13 links and then the problem described above occurs.

What I am trying to achieve is a setup where I can add URLs to be visited, at runtime, from threads other than the one running the crawler.
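That desired behavior (a crawler thread that waits for seeds added at runtime by other threads) can be sketched with a plain `BlockingQueue`, independent of crawler4j; the class and method names below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class RuntimeSeedSketch {
    private static final String STOP = "STOP"; // sentinel that shuts the worker down

    // The worker blocks in take() whenever the queue is empty, which is the
    // behavior the crawler thread should have on an empty frontier.
    static List<String> visitSeeds(List<String> seeds) throws InterruptedException {
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
        List<String> visited = new ArrayList<>();
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String url = frontier.take(); // waits here until a seed arrives
                    if (STOP.equals(url)) {
                        break;
                    }
                    synchronized (visited) {
                        visited.add(url); // stand-in for actually fetching the page
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        for (String seed : seeds) {
            frontier.put(seed); // "addSeed" called from another thread at runtime
        }
        frontier.put(STOP);
        worker.join();
        return visited;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(visitSeeds(List.of(
            "http://localhost/test1/", "http://localhost/test2/")));
        // → [http://localhost/test1/, http://localhost/test2/]
    }
}
```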

@Edit: I have looked at the thread dump as @Seth suggested, but I cannot figure out why it is not working:

"Thread-1" #25 prio=5 os_prio=0 tid=0x00007ff32854b800 nid=0x56e3 waiting on condition [0x00007ff2de403000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at edu.uci.ics.crawler4j.crawler.CrawlController.sleep(CrawlController.java:367)
    at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:243)
    - locked <0x00000005959baff8> (a java.lang.Object)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - None

"Crawler 1" #24 prio=5 os_prio=0 tid=0x00007ff328544000 nid=0x56e2 in Object.wait() [0x00007ff2de504000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x0000000596afdd28> (a java.lang.Object)
    at java.lang.Object.wait(Object.java:502)
    at edu.uci.ics.crawler4j.frontier.Frontier.getNextURLs(Frontier.java:151)
    - locked <0x0000000596afdd28> (a java.lang.Object)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:259)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - None
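The dump shows the "Crawler 1" thread parked in `Object.wait()` inside `Frontier.getNextURLs`: it is blocked on the frontier's monitor, waiting to be notified that new work has been scheduled. A minimal wait/notify sketch of that state (my own simplified class names, not crawler4j's actual code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FrontierSketch {
    private final Object mutex = new Object();
    private final Deque<String> workQueue = new ArrayDeque<>();

    // Blocks just like the crawler thread in the dump: WAITING on the monitor
    // until another thread schedules a URL and notifies.
    public String getNextUrl() throws InterruptedException {
        synchronized (mutex) {
            while (workQueue.isEmpty()) {
                mutex.wait(); // this is the Object.wait() frame in the dump
            }
            return workQueue.poll();
        }
    }

    // If the scheduling side never calls notifyAll(), the waiter above stays
    // in WAITING forever, even after the queue is no longer empty.
    public void schedule(String url) {
        synchronized (mutex) {
            workQueue.add(url);
            mutex.notifyAll();
        }
    }

    public static void main(String[] args) throws Exception {
        FrontierSketch frontier = new FrontierSketch();
        Thread crawler = new Thread(() -> {
            try {
                System.out.println("got: " + frontier.getNextUrl());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        crawler.start();
        Thread.sleep(100); // give the crawler time to reach wait()
        frontier.schedule("http://localhost/test2/");
        crawler.join();
        // → got: http://localhost/test2/
    }
}
```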

So I found where the problem is. The issue is related to

Why the for loop with a limit of 4? @Seth I wanted to simulate adding 4 links to crawl from a remote source; the number does not matter here.
I mention it because I am currently trying to understand your crawler, that is why I am asking.
@Seth What I am trying to build is a crawler, running in a separate thread, that only crawls sites added at runtime by other threads. The crawler needs to wait for seeds if it was not initialized with any.
Ah! OK, I see. May I ask why you are doing it this way? You could just collect links, wait until you have collected 10, then call the crawl method with those links and wait again. Or is it for a specific task?
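The batching approach suggested in the last comment (collect links until you have enough, then crawl them as one batch) could be sketched like this; the class name, `offer` helper, and batch size are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchCollector {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();

    BatchCollector(int batchSize) {
        this.batchSize = batchSize;
    }

    // Buffers a link; returns a full batch once batchSize links have
    // accumulated, or null while we are still collecting.
    synchronized List<String> offer(String link) {
        pending.add(link);
        if (pending.size() >= batchSize) {
            List<String> batch = new ArrayList<>(pending);
            pending.clear();
            return batch; // caller would now run a (blocking) crawl over it
        }
        return null;
    }

    public static void main(String[] args) {
        BatchCollector collector = new BatchCollector(3);
        for (int i = 1; i <= 7; i++) {
            List<String> batch = collector.offer("http://localhost/test" + i + "/");
            if (batch != null) {
                System.out.println("crawl batch: " + batch); // fires after 3 and 6 links
            }
        }
    }
}
```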