Java 具有身份验证的Crawler4j_Java_Web Crawler_Crawler4j

Java 具有身份验证的Crawler4j

java web-crawler

Java 具有身份验证的Crawler4j,java,web-crawler,crawler4j,Java,Web Crawler,Crawler4j,为了测试的目的，我正在尝试在个人redmine中执行crawler4j。我想对应用程序中的几个深度级别进行身份验证和爬网我遵循crawler4j的常见问题解答。并创建下一个代码段： import edu.uci.ics.crawler4j.crawler.Page; import edu.uci.ics.crawler4j.crawler.WebCrawler; import edu.uci.ics.crawler4j.parser.HtmlParseData; import edu.uci.

为了测试的目的，我正在尝试在个人redmine中执行crawler4j。我想对应用程序中的几个深度级别进行身份验证和爬网

我遵循crawler4j的常见问题解答。并创建下一个代码段：

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class CustomWebCrawler extends WebCrawler{

    @Override
    public void visit(final Page pPage) {
        if (pPage.getParseData() instanceof HtmlParseData) {
            System.out.println("URL: " + pPage.getWebURL().getURL());
        }
    }

    @Override
    public boolean shouldVisit(final Page pPage, final WebURL pUrl) {
        WebURL webUrl = new WebURL();
        webUrl.setURL(Test.URL_LOGOUT);
        if (pUrl.equals(webUrl)) {
            return false;
        }        
        if(Test.MY_REDMINE_HOST.equals(pUrl.getDomain())){
            return true;
        }
        return false;
    }
}

在这个类中，我从WebCrawler扩展而来，在每次访问中，只显示URL访问，检查URL是否在同一个域中，而不访问注销URL

我还有一个测试类来配置这个爬虫程序，带有身份验证信息和要爬网的站点

import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.BasicAuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Test {

    private static Logger rootLogger;

    public final static String MY_REDMINE_HOST = "my-redmine-server";
    public final static String URL_LOGOUT = "http://"+MY_REDMINE_HOST+"/redmine/logout";

    public static void main(String[] args) throws Exception {
        configureLogger();

        // Create the configuration
        CrawlConfig config = new CrawlConfig();
        String frontier = "/tmp/webCrawler/tmp_" + System.currentTimeMillis();
        config.setCrawlStorageFolder(frontier);

        //Starting point to crawl
        String seed = "http://"+MY_REDMINE_HOST+"/redmine/";

        // Data for the authentication methods
        String userName = "my-user";
        String password = "my-passwd";
        String urlLogin = "http://"+MY_REDMINE_HOST+"/redmine/login";
        String nameUsername = "username";
        String namePassword = "password";
        AuthInfo authInfo1 = new FormAuthInfo(userName, password, urlLogin,
                nameUsername, namePassword);
        config.addAuthInfo(authInfo1);
        AuthInfo authInfo2 = new BasicAuthInfo(userName, password, urlLogin);
        config.addAuthInfo(authInfo2);
        config.setMaxDepthOfCrawling(3);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig,
                pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher,
                robotstxtServer);

        controller.addSeed(seed);

        controller.start(CustomWebCrawler.class, 5);
        controller.shutdown();

    }

    private static void configureLogger() {
        // This is the root logger provided by log4j
        rootLogger = Logger.getRootLogger();
        rootLogger.setLevel(Level.INFO);

        // Define log pattern layout
        PatternLayout layout = new PatternLayout(
                "%d{ISO8601} [%t] %-5p %c %x - %m%n");

        // Add console appender to root logger
        if (!rootLogger.getAllAppenders().hasMoreElements()) {
            rootLogger.addAppender(new ConsoleAppender(layout));
        }
    }
}

我希望登录应用程序并在站点的其余部分保持爬网，但我不能。爬网只访问起点中的链接

我不知道是少了什么还是出了什么问题。或者，我可能对爬网的当前配置有错误的方法。

它是否进入登录页面？是否成功登录？日志对登录页面有何说明？此页面有任何更新吗？我也有同样的问题：我可以（根据日志记录）成功登录，但是我得到了错误的页面（我尝试了w/e.g.javaforum.org）。