Nutch segments folder keeps growing every day

I have configured Nutch/Solr 1.6 to crawl and index an intranet of about 4000 documents and HTML pages every 12 hours.

If I run the crawler against an empty database, the whole process takes about 30 minutes. After the crawl has been running for a few days, it becomes very slow. Looking at the log files, the last step of tonight's run (SolrIndexer) only started after 1 hour 20 minutes and took more than an hour by itself.

Since the number of indexed documents is not growing, I am wondering why it has become so slow.

Nutch is executed with the following command:

bin/nutch crawl -urlDir urls -solr http://localhost:8983/solr -dir nutchdb -depth 15 -topN 3000
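
Each run with -dir nutchdb generates a new timestamped segment under nutchdb/segments, and in the one-shot crawl command shipped with Nutch 1.x the final LinkDb and SolrIndexer steps walk every segment in that directory, so both the folder and the indexing time keep growing if old segments are never removed. A minimal shell sketch for inspecting and pruning old segments (illustrative only; it assumes GNU coreutils, the default nutchdb/segments/<timestamp> layout, and that keeping the four newest segments is enough given the short re-fetch interval configured below):

# List the accumulated segments; names are generation timestamps (yyyyMMddHHmmss).
ls -1d nutchdb/segments/*

# Remove everything except the four newest segments (head -n -4 is GNU-specific).
ls -1d nutchdb/segments/* | head -n -4 | xargs -r rm -rf

With db.fetch.interval set to 10 seconds (see nutch-site.xml below), every page is re-fetched on each run, so segments older than the most recent cycles no longer contain anything that is not also present in a newer segment.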
nutch-site.xml contains:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>http.agent.name</name>
        <value>Internet Site Agent</value>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata|more|http-header)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
    <!-- Used only if plugin parse-metatags is enabled. -->
    <property>
        <name>metatags.names</name>
        <value>description;keywords;published;modified</value>
        <description> Names of the metatags to extract, separated by;.
            Use '*' to extract all metatags. Prefixes the names with 'metatag.'
            in the parse-metadata. For instance to index description and keywords,
            you need to activate the plugin index-metadata and set the value of the
            parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
        </description>
    </property>
    <property>
        <name>index.parse.md</name>
        <value>metatag.description,metatag.keywords,metatag.published,metatag.modified</value>
        <description> Comma-separated list of keys to be taken from the parse metadata to generate fields.
            Can be used e.g. for 'description' or 'keywords' provided that these values are generated
            by a parser (see parse-metatags plugin)
        </description>
    </property>       
    <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>Set this to false if you start crawling your website from
       for example http://www.example.com but you would like to crawl
       xyz.example.com. Set it to true otherwise if you want to exclude external links
    </description>
    </property>
    <property>
        <name>http.content.limit</name>
        <value>10000000</value>
        <description>The length limit for downloaded content using the http
            protocol, in bytes. If this value is nonnegative (>=0), content longer
            than it will be truncated; otherwise, no truncation at all. Do not
            confuse this setting with the file.content.limit setting.
        </description>
    </property> 

    <property>
        <name>fetcher.max.crawl.delay</name>
        <value>1</value>
        <description>
            If the Crawl-Delay in robots.txt is set to greater than this value (in
            seconds) then the fetcher will skip this page, generating an error report.
            If set to -1 the fetcher will never skip such pages and will wait the
            amount of time retrieved from robots.txt Crawl-Delay, however long that
            might be.
        </description>
    </property>

    <property>
        <name>fetcher.threads.fetch</name>
        <value>10</value>
        <description>The number of FetcherThreads the fetcher should use.
        This is also determines the maximum number of requests that are
        made at once (each FetcherThread handles one connection). The total
        number of threads running in distributed mode will be the number of
        fetcher threads * number of nodes as fetcher has one map task per node.
        </description>
    </property>

    <property>
        <name>fetcher.threads.fetch</name>
        <value>10</value>
        <description>The number of FetcherThreads the fetcher should use.
            This is also determines the maximum number of requests that are
            made at once (each FetcherThread handles one connection). The total
            number of threads running in distributed mode will be the number of
            fetcher threads * number of nodes as fetcher has one map task per node.
        </description>
    </property>

    <property>
        <name>fetcher.server.delay</name>
        <value>1.0</value>
        <description>The number of seconds the fetcher will delay between
            successive requests to the same server.</description>
    </property>

    <property>
        <name>http.redirect.max</name>
        <value>0</value>
        <description>The maximum number of redirects the fetcher will follow when
            trying to fetch a page. If set to negative or 0, fetcher won't immediately
            follow redirected URLs, instead it will record them for later fetching.
        </description>
    </property>

    <property>
        <name>fetcher.threads.per.queue</name>
        <value>2</value>
        <description>This number is the maximum number of threads that
           should be allowed to access a queue at one time. Replaces
           deprecated parameter 'fetcher.threads.per.host'.
        </description>
    </property>

    <property>
        <name>link.delete.gone</name>
        <value>true</value>
        <description>Whether to delete gone pages from the web graph.</description>
   </property>

   <property>
       <name>link.loops.depth</name>
       <value>20</value>
       <description>The depth for the loops algorithm.</description>
   </property>

<!-- moreindexingfilter plugin properties -->

    <property>
      <name>moreIndexingFilter.indexMimeTypeParts</name>
      <value>false</value>
      <description>Determines whether the index-more plugin will split the mime-type
      in sub parts, this requires the type field to be multi valued. Set to true for backward
      compatibility. False will not split the mime-type.
      </description>
    </property>

    <property>
      <name>moreIndexingFilter.mapMimeTypes</name>
      <value>false</value>
      <description>Determines whether MIME-type mapping is enabled. It takes a
      plain text file with mapped MIME-types. With it the user can map both
      application/xhtml+xml and text/html to the same target MIME-type so it
      can be treated equally in an index. See conf/contenttype-mapping.txt.
      </description>
    </property>

    <!-- Fetch Schedule Configuration --> 
    <property>
      <name>db.fetch.interval.default</name>
              <!-- for now always re-fetch everything -->
      <value>10</value>
      <description>The default number of seconds between re-fetches of a page (less than 1 day).
      </description>
    </property>

    <property>
      <name>db.fetch.interval.max</name>
              <!-- for now always re-fetch everything -->
      <value>10</value>
      <description>The maximum number of seconds between re-fetches of a page
      (less than one day). After this period every page in the db will be re-tried, no
       matter what is its status.
      </description>
    </property>

    <!--property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
      <description>The implementation of fetch schedule. DefaultFetchSchedule simply
      adds the original fetchInterval to the last fetch time, regardless of
      page changes.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.inc_rate</name>
      <value>0.4</value>
      <description>If a page is unmodified, its fetchInterval will be
      increased by this rate. This value should not
      exceed 0.5, otherwise the algorithm becomes unstable.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.dec_rate</name>
      <value>0.2</value>
      <description>If a page is modified, its fetchInterval will be
      decreased by this rate. This value should not
      exceed 0.5, otherwise the algorithm becomes unstable.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.min_interval</name>
      <value>60.0</value>
      <description>Minimum fetchInterval, in seconds.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.max_interval</name>
      <value>31536000.0</value>
      <description>Maximum fetchInterval, in seconds (365 days).
      NOTE: this is limited by db.fetch.interval.max. Pages with
      fetchInterval larger than db.fetch.interval.max
      will be fetched anyway.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.sync_delta</name>
      <value>true</value>
      <description>If true, try to synchronize with the time of page change.
      by shifting the next fetchTime by a fraction (sync_rate) of the difference
      between the last modification time, and the last fetch time.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
      <value>0.3</value>
      <description>See sync_delta for description. This value should not
      exceed 0.5, otherwise the algorithm becomes unstable.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
      <value>0.3</value>
      <description>See sync_delta for description. This value should not
      exceed 0.5, otherwise the algorithm becomes unstable.</description>
    </property-->

    <property>
      <name>fetcher.threads.fetch</name>
      <value>1</value>
      <description>The number of FetcherThreads the fetcher should use.
         This is also determines the maximum number of requests that are
         made at once (each FetcherThread handles one connection). The total
         number of threads running in distributed mode will be the number of
         fetcher threads * number of nodes as fetcher has one map task per node.
      </description>
    </property>

    <property>
       <name>hadoop.tmp.dir</name>
       <value>/opt/apache-nutch/tmp/</value>
    </property>

    <!-- Boilerpipe -->
    <property>
      <name>tika.boilerpipe</name>
      <value>true</value>
    </property>
    <property>
      <name>tika.boilerpipe.extractor</name>
      <value>ArticleExtractor</value>
    </property>
</configuration>
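
Since db.fetch.interval.default and db.fetch.interval.max are both set to 10 seconds, every page is due for re-fetch on every 12-hour run, so each cycle writes a complete new segment even though the number of documents in Solr stays flat. A quick way to confirm where the growth comes from is the stock Nutch read tools (a sketch; paths assume the nutchdb crawl directory from the command above):

bin/nutch readdb nutchdb/crawldb -stats        # total URLs and per-status counts in the CrawlDb
bin/nutch readseg -list -dir nutchdb/segments  # one line per segment with generated/fetched/parsed counts
du -sh nutchdb/segments/*                      # on-disk size of each timestamped segment

If the CrawlDb counts stay constant while the number of segments (and their total size) grows with every run, the extra time is being spent re-reading all of the old segments rather than on new content.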
