Web crawler: Nutch crawl stops after parsing one page

When crawling with Nutch, only one page is parsed and the crawl does not move forward. Can anyone help? Below is the Nutch output.

After parsing the first page it stops and goes no further; parsing does not complete successfully.

[Naveen@01hw5189 apache-nutch-1.7]$ bin/nutch crawl urls -dir crawlwiki -depth 10 -topN 10
solrUrl is not set, indexing will be skipped...
crawl started in: crawlwiki
rootUrlDir = urls
threads = 10
depth = 10
solrUrl=null
topN = 10
Injector: starting at 2013-09-12 15:51:45
Injector: crawlDb: crawlwiki/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-12 15:51:47, elapsed: 00:00:02
Generator: starting at 2013-09-12 15:51:47
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawlwiki/segments/20130912155149
Generator: finished at 2013-09-12 15:51:50, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-12 15:51:50
Fetcher: segment: crawlwiki/segments/20130912155149
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
fetching http://en.wikipedia.org/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-12 15:51:53, elapsed: 00:00:03
ParseSegment: starting at 2013-09-12 15:51:53
ParseSegment: segment: crawlwiki/segments/20130912155149
ParseSegment: finished at 2013-09-12 15:51:54, elapsed: 00:00:01
CrawlDb update: starting at 2013-09-12 15:51:54
CrawlDb update: db: crawlwiki/crawldb
CrawlDb update: segments: [crawlwiki/segments/20130912155149]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-09-12 15:51:56, elapsed: 00:00:02
Generator: starting at 2013-09-12 15:51:56
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawlwiki/segments/20130912155159
Generator: finished at 2013-09-12 15:52:00, elapsed: 00:00:04
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-12 15:52:00
Fetcher: segment: crawlwiki/segments/20130912155159
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching http://en.wikipedia.org/wiki/Main_Page (queue crawl delay=5000ms)
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-12 15:52:02, elapsed: 00:00:02
ParseSegment: starting at 2013-09-12 15:52:02
ParseSegment: segment: crawlwiki/segments/20130912155159
Parsed (8ms):http://en.wikipedia.org/wiki/Main_Page

Take a look at Wikipedia's robots.txt file.

robots.txt may be disallowing the crawl from going any deeper. The robots file defines what a web crawler is allowed to access, and Nutch respects this "netiquette".
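
One quick way to confirm whether robots.txt is the culprit is to test the URLs from the log against Wikipedia's robots rules, using the same agent name the crawler identifies itself with. Below is a minimal sketch with Python's standard urllib.robotparser; the agent string "MyNutchCrawler" is only a placeholder for whatever http.agent.name is configured to in nutch-site.xml.

import urllib.robotparser

# Placeholder agent name -- substitute the value of http.agent.name
# from conf/nutch-site.xml.
AGENT = "MyNutchCrawler"

# Load and parse Wikipedia's robots.txt.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://en.wikipedia.org/robots.txt")
rp.read()

# Check the URLs the fetcher actually requested in the log above.
for url in ("http://en.wikipedia.org/",
            "http://en.wikipedia.org/wiki/Main_Page"):
    verdict = "allowed" if rp.can_fetch(AGENT, url) else "disallowed"
    print(url, "->", verdict)

If the check reports these pages as allowed for your agent name, robots.txt is probably not what is stopping the crawl.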


Hope this helps anyone hitting the same problem.

Did you ever figure this out? I am now having the same issue with Nutch 2.3. Unfortunately, I had to switch to Nutch 1.9. I am still looking for a solution to this problem.