Hadoop Nutch 1.10 job fails with a Bad Request error when indexing to Solr 5.3.1

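I have put together a crawler in a test environment. It runs fine against 2 small sites, including successfully indexing into Solr, so the integration between Nutch and Solr appears to be working.

The only change I made was to add another site to seed.txt and another line to regex-urlfilters.txt, using exactly the same syntax as for the other sites.

Now when I run the crawler, it runs normally for a while and then crashes with a "Job failed!" error and very little useful information.

Here is the console output. Note that this is the third segment created during the crawl, so it had already indexed 2 segments successfully before the error occurred. Could something in the new site be causing corruption?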

Indexing 20151030150906 to index
/opt/apache-nutch-1.10/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/TestCrawlCore TestCrawl//crawldb -linkdb TestCrawl//linkdb TestCrawl//segments/20151030150906
Indexer: starting at 2015-10-30 15:14:00
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)

Error running:
  /opt/apache-nutch-1.10/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/TestCrawlCore TestCrawl//crawldb -linkdb TestCrawl//linkdb TestCrawl//segments/20151030150906
Failed with exit value 255.
Here is the relevant data from hadoop.log:

2015-10-30 15:14:00,854 INFO  indexer.IndexingJob - Indexer: starting at 2015-10-30 15:14:00
2015-10-30 15:14:00,909 INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
2015-10-30 15:14:00,909 INFO  indexer.IndexingJob - Indexer: URL filtering: false
2015-10-30 15:14:00,910 INFO  indexer.IndexingJob - Indexer: URL normalizing: false
2015-10-30 15:14:01,113 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-10-30 15:14:01,113 INFO  indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication


2015-10-30 15:14:01,118 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: TestCrawl/crawldb
2015-10-30 15:14:01,118 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: TestCrawl/linkdb
2015-10-30 15:14:01,119 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: TestCrawl/segments/20151030150906
2015-10-30 15:14:01,264 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-10-30 15:14:01,722 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2015-10-30 15:14:02,253 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: content dest: content
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: title dest: title
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: host dest: host
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: segment dest: segment
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: boost dest: boost
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: digest dest: digest
2015-10-30 15:14:02,271 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2015-10-30 15:14:02,370 INFO  solr.SolrIndexWriter - Indexing 38 documents
2015-10-30 15:14:02,487 INFO  solr.SolrIndexWriter - Indexing 38 documents
2015-10-30 15:14:02,524 WARN  mapred.LocalJobRunner - job_local593696138_0001
org.apache.solr.common.SolrException: Bad Request

Bad Request

request: http://localhost:8983/solr/TestCrawlCore/update?wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2015-10-30 15:14:03,508 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)

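I am only just getting to grips with this, so I don't know what the next troubleshooting step should be. Any help would be greatly appreciated, and I'm happy to provide more information if something in particular would help.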

It turned out to be a mismatch between the Nutch schema and the Solr schema.

Thanks to TMBT (see the comments), I found an additional error in the Solr logs complaining about an unrecognized field: "anchor".
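The Solr-side check described above can be sketched in a few lines of Python. This is only an illustration: the helper name `rejected_fields` is hypothetical, and the error wording the pattern looks for ("unknown", "undefined" or "unrecognized" followed by "field") is an assumption that may need adjusting to match what your Solr log actually prints.

```python
import re

# Assumed wording of the complaint, e.g.: unknown field 'anchor'
# Adjust the pattern if your Solr log phrases it differently.
FIELD_ERROR = re.compile(
    r"\b(?:unknown|undefined|unrecognized)\s+field:?\s*['\"]?([\w.-]+)",
    re.IGNORECASE,
)


def rejected_fields(log_lines):
    """Return the sorted, de-duplicated field names mentioned in field errors."""
    found = set()
    for line in log_lines:
        for match in FIELD_ERROR.finditer(line):
            found.add(match.group(1))
    return sorted(found)
```

To use it, open the Solr log file (its location depends on the installation) and pass the file object to `rejected_fields`; any name it returns is a candidate for a field that Nutch sends but the Solr schema does not declare.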


All I had to do was copy the anchor field declaration from the Nutch schema into the Solr schema and restart the Solr service. It now runs fine.
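A mismatch like this can also be spotted before indexing by comparing the two schema files. The sketch below is a minimal illustration, not part of Nutch or Solr: the function names are hypothetical, and it only compares explicit `<field>` declarations, so a field that Solr would accept through a `<dynamicField>` pattern is still reported as missing.

```python
import xml.etree.ElementTree as ET


def field_names(schema_xml):
    """Collect the name attribute of every explicit <field> element."""
    root = ET.fromstring(schema_xml)
    return {f.get("name") for f in root.iter("field") if f.get("name")}


def missing_in_solr(nutch_schema_xml, solr_schema_xml):
    """Return fields declared in the Nutch schema but not in the Solr schema."""
    nutch_fields = field_names(nutch_schema_xml)
    solr_fields = field_names(solr_schema_xml)
    return sorted(nutch_fields - solr_fields)
```

Read both schema files in binary mode (for example `open(path, "rb").read()`) and pass the contents to `missing_in_solr`. In the situation described in this answer, the result would be expected to include `anchor`.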

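Is there anything in the Solr logs?

Ah, good question. The Solr logs do contain an undefined field: "anchor" error. That sounds like a schema problem? It is the same schema that works for the other two sites.

I thought it might be a schema problem, but I needed to see the logs before saying anything definite or troubleshooting further.

I think I may have found it. The Nutch schema has an "anchor" field that my Solr schema does not. I'll know shortly whether that fixes it. Thanks for pointing me in the right direction.

Glad to hear it. Solr schemas can be a bit tricky and annoying at times.

Even though I copied schema.xml from my Nutch 1.12 into the example/solr/collection1/conf folder of Solr 4.3, I still have this problem... This error is very annoying.

The first thing I would try is a newer version of Solr.

Unfortunately, Solr and Nutch are not developed in step with each other, and only a few versions are mutually compatible. I decided to go with Nutch 2.3.1 and Solr 4.10.3, as suggested elsewhere, and I still have a lot of problems. I have never run into so many problems that were the fault of the software's authors.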