Web crawler: What does the statistics database do in the Crawler4j open-source code?

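I am trying to understand the Crawler4j open-source web crawler, and I have a few questions, listed below.

Questions:

  • What does the statistics database do in the Counters class? Please explain the following part of the code.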

     public Counters(Environment env, CrawlConfig config) throws DatabaseException {
        super(config);
    
        this.env = env;
        this.counterValues = new HashMap<String, Long>();
    
        /*
         * When crawling is set to be resumable, we have to keep the statistics
         * in a transactional database to make sure they are not lost if crawler
         * is crashed or terminated unexpectedly.
         */
        if (config.isResumableCrawling()) {
            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            dbConfig.setTransactional(true);
            dbConfig.setDeferredWrite(false);
            statisticsDB = env.openDatabase(null, "Statistics", dbConfig);
    
            OperationStatus result;
            DatabaseEntry key = new DatabaseEntry();
            DatabaseEntry value = new DatabaseEntry();
            Transaction tnx = env.beginTransaction(null, null);
            Cursor cursor = statisticsDB.openCursor(tnx, null);
            result = cursor.getFirst(key, value, null);
    
            while (result == OperationStatus.SUCCESS) {
                if (value.getData().length > 0) {
                    String name = new String(key.getData());
                    long counterValue = Util.byteArray2Long(value.getData());
                    counterValues.put(name, counterValue);
                }
                result = cursor.getNext(key, value, null);
            }
            cursor.close();
            tnx.commit();
        }
    }
    
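  • As far as I understand, it stores the crawled URLs, which helps when the crawler crashes, because the web crawler then does not need to start from the beginning. Please explain the code above line by line.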

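    2. I could not find any good links explaining SleepyCat, which Crawler4j uses to store intermediate information. Please point me to some good resources where I can learn the basics of SleepyCat. (I do not know what a transaction means, or how the cursor is used in the code above.)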


    Please help me. I look forward to your reply.

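    Basically, Crawler4j loads the existing statistics by reading every key/value pair from the database. The code is arguably sloppy, because a transaction is opened even though no modifications are made to the database. The lines dealing with tnx could therefore be removed.

    Line-by-line commentary: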

    // Create a database configuration object
    DatabaseConfig dbConfig = new DatabaseConfig();
    // Allow the database to be created if it is missing, make it transactional,
    // and disable deferred writes
    dbConfig.setAllowCreate(true);
    dbConfig.setTransactional(true);
    dbConfig.setDeferredWrite(false);
    // Open the database named "Statistics" with the configuration created above
    statisticsDB = env.openDatabase(null, "Statistics", dbConfig);
    
    OperationStatus result;
    // Create empty entries that the cursor fills with each key and value
    DatabaseEntry key = new DatabaseEntry();
    DatabaseEntry value = new DatabaseEntry();
    // Start a transaction
    Transaction tnx = env.beginTransaction(null, null);
    // Open a cursor on the database within that transaction
    Cursor cursor = statisticsDB.openCursor(tnx, null);
    // Position the cursor on the first key/value pair
    result = cursor.getFirst(key, value, null);
    // Loop for as long as a record was read successfully
    while (result == OperationStatus.SUCCESS) {
        // If the value at the current cursor position is non-empty, decode the
        // counter name and value and add them to the counterValues HashMap
        if (value.getData().length > 0) {
            String name = new String(key.getData());
            long counterValue = Util.byteArray2Long(value.getData());
            counterValues.put(name, counterValue);
        }
        // Advance the cursor to the next key/value pair
        result = cursor.getNext(key, value, null);
    }
    cursor.close();
    // Commit the transaction; nothing was written, so this only ends the read
    tnx.commit();
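
The loading step can be tried without Berkeley DB on the classpath. The following is a minimal sketch, not Crawler4j's actual code: a `TreeMap` stands in for the "Statistics" database, iterating over its entries stands in for the cursor loop, and the keys are kept as strings rather than byte arrays. The class `CounterLoadSketch` and its helpers `longToBytes`, `bytesToLong` and `loadCounters` are hypothetical names, the counter names are only illustrative, and the 8-byte big-endian encoding is an assumption about how `Util.byteArray2Long` interprets the stored value.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class CounterLoadSketch {

    // Hypothetical encoder: assumes a counter is stored as 8 big-endian bytes.
    public static byte[] longToBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // Hypothetical decoder, playing the role of Util.byteArray2Long.
    // Throws BufferUnderflowException if fewer than 8 bytes are supplied.
    public static long bytesToLong(byte[] data) {
        return ByteBuffer.wrap(data).getLong();
    }

    // Mirrors the cursor loop: visit every key/value pair, skip empty values,
    // and copy the decoded counters into an in-memory HashMap.
    public static Map<String, Long> loadCounters(Map<String, byte[]> store) {
        Map<String, Long> counterValues = new HashMap<>();
        for (Map.Entry<String, byte[]> entry : store.entrySet()) {
            byte[] value = entry.getValue();
            if (value.length > 0) {
                counterValues.put(entry.getKey(), bytesToLong(value));
            }
        }
        return counterValues;
    }

    public static void main(String[] args) {
        // The TreeMap plays the role of the on-disk "Statistics" database.
        Map<String, byte[]> store = new TreeMap<>();
        store.put("Processed-Pages", longToBytes(42L));
        store.put("Scheduled-Pages", longToBytes(100L));
        store.put("Empty", new byte[0]); // skipped by the length check

        Map<String, Long> counters = loadCounters(store);
        System.out.println(counters.get("Processed-Pages")); // prints 42
        System.out.println(counters.containsKey("Empty"));   // prints false
    }
}
```

In the real constructor the same loop runs only when `config.isResumableCrawling()` is true, so a fresh, non-resumable crawl starts with an empty `counterValues` map.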
    
    I also answered a similar question.
    As for SleepyCat, which one are you referring to?

    If it answered your question, please upvote/accept the answer.

    @JulienS. It answered my question.