Java: Determining the parameters for crawler4j


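I am trying to use crawler4j as shown in its example, but no matter how I set the number of crawlers or change the root folder, the code below keeps printing this message:

"Needed parameters: rootFolder (it will contain intermediate crawl data) numberOfCralwers (number of concurrent threads)". The main code is as follows: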

public class Controller {

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.out.println("Needed parameters: ");
            System.out.println("\t rootFolder (it will contain intermediate crawl data)");
            System.out.println("\t numberOfCralwers (number of concurrent threads)");
            return;
        }

        /*
         * crawlStorageFolder is a folder where intermediate crawl data is
         * stored.
         */
        String crawlStorageFolder = args[0];

        /*
         * numberOfCrawlers shows the number of concurrent threads that should
         * be initiated for crawling.
         */
        int numberOfCrawlers = Integer.parseInt(args[1]);

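There is a similar question that asks exactly what I want to know, but I did not really understand the solution, for example where I am supposed to type java BasicCrawler Controller "arg1" "arg2". I am running this code in Eclipse and I am still fairly new to programming. I would appreciate it if someone could help me understand the problem.

To use crawler4j in your project, you have to create two classes. One is the crawl controller (it starts the crawlers according to the parameters), and the other is the crawler itself.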

Just run the main method in the Controller class and watch the crawled pages being printed.

Here is the Controller.java file:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public static void main(String[] args) throws Exception {
        // Both values are hard-coded, so no program arguments are needed.
        String crawlStorageFolder = "/crawler/testdata"; // intermediate crawl data goes here
        int numberOfCrawlers = 4;                        // number of concurrent crawler threads

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();

        // Print the default robots.txt settings for reference.
        System.out.println(robotstxtConfig.getCacheSize());
        System.out.println(robotstxtConfig.getUserAgentName());

        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // The crawl starts from this seed URL.
        controller.addSeed("http://cyesilkaya.wordpress.com/");
        // Starts the crawler threads and blocks until the crawl is finished.
        controller.start(Crawler.class, numberOfCrawlers);
    }
}

Here is the Crawler.java file:

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class Crawler extends WebCrawler {

    @Override
    public boolean shouldVisit(WebURL url) {
        // Write your own filter here to decide whether the incoming URL should be crawled.
        return true;
    }

    @Override
    public void visit(Page page) {
        // Called for each page that was fetched; here we only print its URL.
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }
}
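The comment in shouldVisit leaves the filter up to you. As a minimal sketch, assuming you only want ordinary pages from the seed's own site, the decision could be moved into a small helper like the one below. The class name UrlFilter, the extension list, and the allowed prefix are illustrative assumptions and not part of crawler4j.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper (not part of crawler4j): decides whether a URL is worth crawling.
final class UrlFilter {

    // Assumed list of static/binary file extensions that should be skipped.
    private static final Pattern SKIPPED_EXTENSIONS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz)$");

    // Assumed site prefix, taken from the seed URL used in Controller.
    private static final String ALLOWED_PREFIX = "http://cyesilkaya.wordpress.com/";

    private UrlFilter() {
        // Utility class, no instances.
    }

    // Returns true only for URLs on the allowed site that do not end in a skipped extension.
    static boolean shouldVisit(String href) {
        String lower = href.toLowerCase(Locale.ROOT);
        return lower.startsWith(ALLOWED_PREFIX)
                && !SKIPPED_EXTENSIONS.matcher(lower).matches();
    }
}
```

Inside Crawler.shouldVisit you would then return UrlFilter.shouldVisit(url.getURL()) instead of true.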

You get that message when you run the file without supplying any arguments. Either comment out the following block or delete it:

if (args.length != 2) {
    System.out.println("Needed parameters: ");
    System.out.println("\t rootFolder (it will contain intermediate crawl data)");
    System.out.println("\t numberOfCralwers (number of concurrent threads)");
    return;
}

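Then set the root folder to the folder where you want the crawl metadata stored, assigning it directly in the code instead of reading it from args[0].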
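If you would rather keep the option of passing arguments, one possible middle ground is to fall back to fixed values when none are given. The sketch below assumes a hypothetical CrawlSettings helper; the default folder and crawler count are simply the values used in the Controller shown earlier.

```java
// Hypothetical helper: resolves the two crawl settings from args, with defaults as a fallback.
final class CrawlSettings {

    // Assumed defaults, copied from the hard-coded Controller example.
    static final String DEFAULT_FOLDER = "/crawler/testdata";
    static final int DEFAULT_CRAWLERS = 4;

    final String crawlStorageFolder;
    final int numberOfCrawlers;

    private CrawlSettings(String crawlStorageFolder, int numberOfCrawlers) {
        this.crawlStorageFolder = crawlStorageFolder;
        this.numberOfCrawlers = numberOfCrawlers;
    }

    // Uses args when exactly two are supplied, otherwise the defaults.
    static CrawlSettings fromArgs(String[] args) {
        if (args.length == 2) {
            return new CrawlSettings(args[0], Integer.parseInt(args[1]));
        }
        return new CrawlSettings(DEFAULT_FOLDER, DEFAULT_CRAWLERS);
    }
}
```

In main, CrawlSettings.fromArgs(args) would replace both the args.length check and the two direct reads from args.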

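In Eclipse: -> click Run -> click Run Configurations

In the pop-up window:

First, in the left column: make sure your application is selected under Java Application; otherwise create a new configuration (click New).

Then, in the central pane, go to the "Arguments" tab.

Under "Program arguments", type the first argument, press Enter, type the second argument, and so on (one per line, because args is an array).

Then click Apply.

Then click Run.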
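To confirm that the run configuration really passes the values to main, a tiny check like the following could be run with the same Program arguments. ArgsCheck is a hypothetical class name, and the folder in the usage note is only an example value.

```java
// Hypothetical diagnostic class: reports the program arguments that main received.
final class ArgsCheck {

    private ArgsCheck() {
        // Utility class, no instances.
    }

    // Builds a readable summary with one line per argument.
    static String describe(String[] args) {
        StringBuilder summary = new StringBuilder();
        summary.append("Received ").append(args.length).append(" argument(s)");
        for (int i = 0; i < args.length; i++) {
            summary.append(System.lineSeparator())
                   .append("  args[").append(i).append("] = ").append(args[i]);
        }
        return summary.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

With /tmp/crawl and 4 entered under Program arguments, this should report two arguments; from a terminal, the equivalent would be java ArgsCheck /tmp/crawl 4.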