
Web crawler: Heritrix content filtering


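I need to aggregate content from several different websites (mostly HTML pages and PDF documents). I am currently trying out Heritrix (3.2.0) to see whether it meets my needs.

Although the documentation is very detailed, the engine does not seem to work the way I expect. I have set up a few simple jobs and configured the DecideRules in many different ways, but whatever I do, Heritrix either downloads far too much content or nothing at all.

Here is an example of what I am trying to do. I point Heritrix at a URL such as …example.com/news/speeches. That page contains an HTML table with links to the individual speeches (for example, example.com/news/speech1.html, example.com/news/speech2.html, and so on). I really only need the HTML and PDF documents one level down from the parent page. I want to prevent Heritrix from navigating more than one level deep, from pulling content on the example.com domain that is not below this specific path, and from navigating to another domain, and I want to restrict it to HTML and PDF content.

The configuration below is what I thought should work, but it does not.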

<bean id="longerOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
 <property name="properties">
  <props>
   <prop key="seeds.textSource.value">

# URLS HERE
example.com/news/speeches

   </prop>
  </props>
 </property>
</bean>

<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- <property name="logToFile" value="false" /> -->
  <property name="rules">
   <list>
    <!-- Begin by REJECTing all... -->
    <bean class="org.archive.modules.deciderules.RejectDecideRule">
    </bean>
    <!-- ...then ACCEPT those within configured/seed-implied SURT prefixes... -->
    <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
     <!-- <property name="seedsAsSurtPrefixes" value="true" /> -->
     <!-- <property name="alsoCheckVia" value="false" /> -->
     <!-- <property name="surtsSourceFile" value="" /> -->
     <!-- <property name="surtsDumpFile" value="${launchId}/surts.dump" /> -->
     <property name="surtsSource">
      <bean class="org.archive.spring.ConfigString">
       <property name="value">
        <value>
         example.com/news/speeches
        </value>
       </property>
      </bean>
     </property>
    </bean>
    <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
    <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
     <property name="decision" value="REJECT"/>
     <property name="listLogicalOr" value="true" />
     <property name="regexList">
      <list>
       <value>.*(?i)(\.(avi|wmv|mpe?g|mp3))$</value>
       <value>.*(?i)(\.(rar|zip|tar|gz))$</value>
       <value>.*(?i)(\.(xls|odt))$</value>
       <value>.*(?i)(\.(xml))$</value>
       <value>.*(?i)(\.(txt|conf|pdf))$</value>
       <value>.*(?i)(\.(swf))$</value>
       <value>.*(?i)(\.(js|css))$</value>
       <value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
      </list>
     </property>
    </bean>
    <!-- ...but REJECT those more than a configured link-hop-count from start... -->
    <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
     <!-- <property name="maxHops" value="20" /> -->
    </bean>
    <!-- ...but ACCEPT those more than a configured link-hop-count from start... -->
    <!--bean class="org.archive.modules.deciderules.TransclusionDecideRule"-->
     <!-- <property name="maxTransHops" value="2" /> -->
     <!-- <property name="maxSpeculativeHops" value="1" /> -->
    <!--/bean-->
    <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->
    <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
          <property name="decision" value="REJECT"/>
          <property name="seedsAsSurtPrefixes" value="false"/>
          <property name="surtsDumpFile" value="${launchId}/negative-surts.dump" /> 
     <!-- <property name="surtsSource">
           <bean class="org.archive.spring.ConfigFile">
            <property name="path" value="negative-surts.txt" />
           </bean>
          </property> -->
    </bean>
    <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
    <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
          <property name="decision" value="REJECT"/>
     <!-- <property name="listLogicalOr" value="true" /> -->
     <!-- <property name="regexList">
           <list>
           </list>
          </property> -->
    </bean>
    <!-- ...and REJECT those with suspicious repeating path-segments... -->
    <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
     <!-- <property name="maxRepetitions" value="2" /> -->
    </bean>
    <!-- ...and REJECT those with more than threshold number of path-segments... -->
    <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
     <!-- <property name="maxPathDepth" value="20" /> -->
    </bean>
    <!-- ...but always ACCEPT those marked as prerequisitee for another URI... -->
    <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
    </bean>
    <!-- ...but always REJECT those with unsupported URI schemes -->
    <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
    </bean>
   </list>
  </property>
 </bean>

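I expected my crawl to pull down only a dozen or so HTML documents, because that is all the /speech path contains. I stopped the crawl after about half an hour because it had downloaded more than 800 documents, and I found that it was traversing back up into parent paths. I have also tried regular-expression rules, without any luck. Any help would be appreciated.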

A good way to debug this kind of problem is to enable logging of scope decisions: uncomment the line with logToFile and set it to true. The log then records, for each URI, the rule that decided to include or reject it, so you can see which rules are misconfigured and are accepting URIs that should be rejected.
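
As a minimal sketch of that change, this is how the top of the scope bean from the question might look with the logToFile line uncommented and set to true. The rule list is elided here and would stay exactly as in the job configuration. Where the decisions are written is an assumption to verify against your Heritrix version: they are expected to go to a log file named after the bean id in the job's logs directory.

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- Uncommented and switched from false to true: log every scope decision -->
  <property name="logToFile" value="true" />
  <property name="rules">
   <list>
    <!-- existing DecideRule beans from the job configuration, unchanged -->
   </list>
  </property>
</bean>
```

After relaunching the job, look up one of the unwanted parent-path URIs in the log and note which rule made the final decision. That rule is the first place to look for a misconfiguration.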