Java: determining the arguments for crawler4j
I am trying to use crawler4j as shown in its example. No matter how I define the number of crawlers or change the root folder, the following code keeps giving me this message:

"Needed parameters: rootFolder (it will contain intermediate crawl data) numberOfCralwers (number of concurrent threads)"

The main code is as follows:
public class Controller {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Needed parameters: ");
            System.out.println("\t rootFolder (it will contain intermediate crawl data)");
            System.out.println("\t numberOfCralwers (number of concurrent threads)");
            return;
        }
        /*
         * crawlStorageFolder is a folder where intermediate crawl data is
         * stored.
         */
        String crawlStorageFolder = args[0];
        /*
         * numberOfCrawlers shows the number of concurrent threads that should
         * be initiated for crawling.
         */
        int numberOfCrawlers = Integer.parseInt(args[1]);
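There is a similar question that asks exactly what I want to know, but I don't quite understand the solution, for example where I am supposed to type java BasicCrawler Controller "arg1" "arg2". I am running this code in Eclipse and I am still fairly new to programming. I would appreciate it if someone could help me understand this.

To use crawler4j in your project you have to create two classes. One is the crawl controller (which starts the crawlers according to the parameters) and the other is the crawler itself.

Just run the main method in the Controller class and watch the pages being crawled.

Here is the Controller.java file: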
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        // Hard-coded values take the place of the two command-line arguments.
        String crawlStorageFolder = "/crawler/testdata";
        int numberOfCrawlers = 4;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(config);

        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        // Debug output: robots.txt cache size and user-agent name.
        System.out.println(robotstxtConfig.getCacheSize());
        System.out.println(robotstxtConfig.getUserAgentName());
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        // The crawl starts from this seed URL.
        controller.addSeed("http://cyesilkaya.wordpress.com/");
        // Blocks until the crawl has finished.
        controller.start(Crawler.class, numberOfCrawlers);
    }
}
Here is the Crawler.java file:
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class Crawler extends WebCrawler {
    // Note: this is the older one-argument signature; newer crawler4j releases
    // use shouldVisit(Page referringPage, WebURL url) instead.
    @Override
    public boolean shouldVisit(WebURL url) {
        // Write your own filter here to decide whether to crawl the incoming URL.
        return true;
    }

    @Override
    public void visit(Page page) {
        // Called once for every page that was fetched successfully.
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }
}
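The comment in shouldVisit says you can write your own filter. As a rough sketch of what such a filter might look like, here is a stand-alone version that works on plain strings so it runs without crawler4j on the classpath. The class name UrlFilterSketch, the SKIP pattern and the choice to stay on the seed site are assumptions made for illustration, not part of the answer above.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class UrlFilterSketch {
    // Assumed rule: skip links that end in a common static-resource extension.
    private static final Pattern SKIP =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip)$");

    // Assumed rule: only follow links that stay on the seed site.
    private static final String SEED_PREFIX = "http://cyesilkaya.wordpress.com/";

    // Returns true when the link should be crawled under the two rules above.
    static boolean shouldVisit(String href) {
        String lower = href.toLowerCase(Locale.ROOT);
        return lower.startsWith(SEED_PREFIX) && !SKIP.matcher(lower).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://cyesilkaya.wordpress.com/about/"));    // true
        System.out.println(shouldVisit("http://cyesilkaya.wordpress.com/style.css")); // false
        System.out.println(shouldVisit("http://example.com/"));                       // false
    }
}
```

Inside the real Crawler class the same check would be applied to url.getURL() in shouldVisit.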
That message appears when you run the file without supplying any arguments. Either comment out the following code or delete it:
    if (args.length != 2) {
        System.out.println("Needed parameters: ");
        System.out.println("\t rootFolder (it will contain intermediate crawl data)");
        System.out.println("\t numberOfCralwers (number of concurrent threads)");
        return;
    }
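Then set the root folder to the folder where you want the metadata stored (that is, hard-code it in place of args[0], and likewise the number of crawlers in place of args[1]).

In Eclipse: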
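-> Click Run
-> Click Run Configurations
In the pop-up window:
First, in the left column: make sure your application is selected under the "Java Application" node; otherwise create a new configuration (click New).
Then, in the central pane, go to the "Arguments" tab.
Under "Program arguments", type the first argument, press Enter, type the second argument, and so on (one per line, because args is an array).
Then click Apply
Then click Run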
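To check that the run configuration really passes the arguments, and to avoid the "Needed parameters" exit when none are given, the argument handling could fall back to defaults. This is only a sketch: the class name CrawlArgsSketch and its fields are hypothetical, and the default values are simply the ones hard-coded in the Controller.java answer above.

```java
import java.util.Arrays;

class CrawlArgsSketch {
    // Defaults borrowed from the hard-coded values in the Controller.java answer.
    static final String DEFAULT_FOLDER = "/crawler/testdata";
    static final int DEFAULT_CRAWLERS = 4;

    final String crawlStorageFolder;
    final int numberOfCrawlers;

    CrawlArgsSketch(String crawlStorageFolder, int numberOfCrawlers) {
        this.crawlStorageFolder = crawlStorageFolder;
        this.numberOfCrawlers = numberOfCrawlers;
    }

    // Uses the two program arguments when both are present, else the defaults.
    static CrawlArgsSketch parse(String[] args) {
        if (args.length == 2) {
            return new CrawlArgsSketch(args[0], Integer.parseInt(args[1]));
        }
        return new CrawlArgsSketch(DEFAULT_FOLDER, DEFAULT_CRAWLERS);
    }

    public static void main(String[] args) {
        // Shows exactly what Eclipse (or the command line) handed to the program.
        System.out.println("Received args: " + Arrays.toString(args));
        CrawlArgsSketch parsed = parse(args);
        System.out.println("Storage folder: " + parsed.crawlStorageFolder);
        System.out.println("Crawler threads: " + parsed.numberOfCrawlers);
    }
}
```

Run with no program arguments, it prints an empty list and the defaults; with two arguments set in the run configuration, it prints those instead.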