Hadoop: Unable to run the fetcher job in Nutch deploy mode


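I have successfully run a crawl with Nutch (v1.4) in local mode on my Ubuntu 11.10 system. However, when I switch to "deploy" mode (with everything else unchanged), I get an error during the fetch cycle.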

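I have Hadoop running successfully on the machine in pseudo-distributed mode (replication factor of 1, with only 1 map and 1 reduce task configured). "jps" shows that all the Hadoop daemons are up and running:

18920 Jps
14799 DataNode
15127 JobTracker
14554 NameNode
15361 TaskTracker
15044 SecondaryNameNode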

I have also added the HADOOP_HOME/bin path to my PATH variable:

PATH=$PATH:/home/jimb/hadoop/bin

I then ran the crawl from the nutch/deploy directory, as follows:

bin/nutch crawl /data/runs/ar/seedurls -dir /data/runs/ar/crawls

Here is the output I get:

  12/01/25 13:55:49 INFO crawl.Crawl: crawl started in: /data/runs/ar/crawls
  12/01/25 13:55:49 INFO crawl.Crawl: rootUrlDir = /data/runs/ar/seedurls
  12/01/25 13:55:49 INFO crawl.Crawl: threads = 10
  12/01/25 13:55:49 INFO crawl.Crawl: depth = 5
  12/01/25 13:55:49 INFO crawl.Crawl: solrUrl=null
  12/01/25 13:55:49 INFO crawl.Injector: Injector: starting at 2012-01-25 13:55:49
  12/01/25 13:55:49 INFO crawl.Injector: Injector: crawlDb: /data/runs/ar/crawls/crawldb
  12/01/25 13:55:49 INFO crawl.Injector: Injector: urlDir: /data/runs/ar/seedurls
  12/01/25 13:55:49 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
  12/01/25 13:56:53 INFO mapred.FileInputFormat: Total input paths to process : 1
...
...
  12/01/25 13:57:21 INFO crawl.Injector: Injector: Merging injected urls into crawl db.
...
  12/01/25 13:57:48 INFO crawl.Injector: Injector: finished at 2012-01-25 13:57:48, elapsed: 00:01:59
  12/01/25 13:57:48 INFO crawl.Generator: Generator: starting at 2012-01-25 13:57:48
  12/01/25 13:57:48 INFO crawl.Generator: Generator: Selecting best-scoring urls due for fetch.
  12/01/25 13:57:48 INFO crawl.Generator: Generator: filtering: true
  12/01/25 13:57:48 INFO crawl.Generator: Generator: normalizing: true
  12/01/25 13:57:48 INFO mapred.FileInputFormat: Total input paths to process : 2
...
  12/01/25 13:58:15 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
  12/01/25 13:58:16 INFO crawl.Generator: Generator: segment: /data/runs/ar/crawls/segments/20120125135816
...
  12/01/25 13:58:42 INFO crawl.Generator: Generator: finished at 2012-01-25 13:58:42, elapsed: 00:00:54
  12/01/25 13:58:42 ERROR fetcher.Fetcher: Fetcher: No agents listed in 'http.agent.name' property.

Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
        at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

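Now, the configuration files for "local" mode are set up correctly (since the crawl in local mode succeeded). For running in deploy mode, since the "deploy" folder does not have a "conf" subdirectory, I assumed that either:
a) the conf files need to be copied under "deploy/conf", or
b) the conf files need to be placed on HDFS.

I have verified that option (a) above does not help. So I am assuming that the Nutch configuration files need to exist in HDFS for the fetcher to run successfully? However, I don't know at which path in HDFS I should place these Nutch conf files, or perhaps I am barking up the wrong tree.

If Nutch in "deploy" mode reads its configuration from the files under "local/conf", then why does the local crawl work fine while the deploy-mode crawl does not?

What am I missing here?

Thanks in advance!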

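This is probably because you have not rebuilt. Can you run "ant" and see what happens? Obviously, if you have not yet updated http.agent.name in nutch-site.xml, you will need to do that as well.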

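Try the following:

In the Nutch source directory, edit the file conf/nutch-site.xml so that http.agent.name is set correctly.

Rebuild the code using ant.

Go to the runtime/deploy directory, set the required environment variables, and try the crawl again.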
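
For illustration, the http.agent.name setting mentioned above could look roughly like this in conf/nutch-site.xml. This is only a minimal sketch: the value MyTestCrawler is a hypothetical placeholder that does not come from the original post, and any other properties already in your file should be kept.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml in the Nutch source directory (sketch only) -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Hypothetical placeholder; replace with the name of your own crawler -->
    <value>MyTestCrawler</value>
  </property>
</configuration>
```

As the steps above suggest, the change is made in the source tree's conf directory and only takes effect in deploy mode after the ant rebuild.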
Brother, can you explain how to use Nutch in deploy mode in the 2.x versions?