Solr indexing with the Nutch crawler


I am using Apache Nutch 1.13 and Solr 6.6.0.

I run the following command to crawl and index the content:

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch urls/seed.txt TestCrawl 2
I get this exception:

Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

Error running:
  /Users/myedlapalli/documents/nutch-solr-3/apache-nutch-1.13/runtime/local/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch TestCrawl/crawldb -linkdb TestCrawl/linkdb TestCrawl/segments/20171017090519
Failed with exit value 255.
In the logs:

2017-10-17 09:36:35,032 INFO  solr.SolrIndexWriter - Indexing 1/1 documents
2017-10-17 09:36:35,032 INFO  solr.SolrIndexWriter - Deleting 0 documents
2017-10-17 09:36:35,161 INFO  solr.SolrIndexWriter - Indexing 1/1 documents
2017-10-17 09:36:35,161 INFO  solr.SolrIndexWriter - Deleting 0 documents
2017-10-17 09:36:35,174 WARN  mapred.LocalJobRunner - job_local193014604_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/nutch: ERROR: [doc=http://www.cmo.com/features/articles/2017/8/21/5-emerging-technologies-rewrite-the-media-and-entertainment-script-.html] unknown field 'sp_type'
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/nutch: ERROR: [doc=http://www.cmo.com/features/articles/2017/8/21/5-emerging-technologies-rewrite-the-media-and-entertainment-script-.html] unknown field 'sp_type'
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:210)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:188)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2017-10-17 09:36:36,109 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Can anyone help me?
Thanks in advance.

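Checking the logs on the Solr side is usually a good idea in cases like this, but here the error is a specific one and the Nutch log already contains the answer, in particular this part: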

ERROR: [doc=http://www.cmo.com/features/articles/2017/8/21/5-emerging-technologies-rewrite-the-media-and-entertainment-script-.html] unknown field 'sp_type'
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
Solr is complaining that you are sending a document (id http://www.cmo.com/features/articles/2017/8/21/5…) that contains a field not defined in the schema: sp_type
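
To spot this kind of failure quickly, the field name can be pulled out of the Nutch log with a small script. The following is a minimal sketch, assuming the message always has the form `unknown field '<name>'` as in the log above; the function name `unknown_fields` is hypothetical.

```python
import re

# Pattern for the message Solr returns when a document carries a field
# the schema does not define (form taken from the log in the question).
UNKNOWN_FIELD = re.compile(r"unknown field '([^']+)'")


def unknown_fields(log_text):
    """Return the distinct field names reported as unknown, in first-seen order."""
    seen = []
    for name in UNKNOWN_FIELD.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# The error message from the question, which appears twice in the log.
message = (
    "ERROR: [doc=http://www.cmo.com/features/articles/2017/8/21/"
    "5-emerging-technologies-rewrite-the-media-and-entertainment-script-.html] "
    "unknown field 'sp_type'"
)
sample_log = "java.lang.Exception: " + message + "\nCaused by: " + message + "\n"

print(unknown_fields(sample_log))  # ['sp_type']
```

In practice you would pass in the contents of the Nutch log file instead of `sample_log`. Solr reports only the field it rejected, so other undefined fields may surface one at a time on later runs.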

You should either check what is being sent in that field, or simply add the field to your Solr schema.
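
If the core uses a managed schema, one way to add the field is Solr's Schema API, which accepts an `add-field` command posted to the core's `/schema` endpoint. The sketch below only builds the request; the field type `string` and the `stored`/`indexed` settings are assumptions, since the thread does not say what `sp_type` contains, and the helper names are hypothetical. If the core was set up with a hand-edited `schema.xml` instead, the field would have to be declared in that file and the core reloaded.

```python
import json
import urllib.request


def add_field_payload(name, field_type="string", stored=True, indexed=True):
    """Build a Schema API 'add-field' command (type and flags are assumptions)."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
        }
    }


def build_request(solr_core_url, payload):
    """Prepare a POST to <core>/schema without sending it."""
    return urllib.request.Request(
        solr_core_url.rstrip("/") + "/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    "http://localhost:8983/solr/nutch",  # core URL from the question
    add_field_payload("sp_type"),
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually apply the change against a running Solr:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The network call is left commented out so the payload can be reviewed first; choose the field type to match the data before sending it.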

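Keep in mind that if more of your fields are missing from the Solr schema, this error will keep appearing, once for each of them. It is usually a good idea to run the bin/nutch indexchecker command to see what Nutch is going to send to Solr.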
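
Once you know which fields Nutch produces, you can compare them with the fields Solr knows about and fix all the gaps in one pass. This is a rough sketch under several assumptions: the document field names are copied by hand from the indexchecker output (its exact format is not shown in this thread), the schema fields come from the JSON returned by the Schema API `/schema/fields` endpoint, and the sample data below is invented for illustration. Dynamic field patterns are not taken into account.

```python
import json


def schema_field_names(schema_fields_json):
    """Extract field names from a Schema API /schema/fields response body."""
    return {field["name"] for field in json.loads(schema_fields_json)["fields"]}


def missing_fields(document_fields, schema_fields):
    """Return document fields that the schema does not define, sorted by name."""
    return sorted(set(document_fields) - set(schema_fields))


# Invented sample data: a trimmed-down /schema/fields response and a list of
# field names as they might be copied from the indexchecker output.
sample_response = json.dumps(
    {"fields": [{"name": "id"}, {"name": "title"}, {"name": "content"}]}
)
document_fields = ["id", "title", "content", "sp_type"]

print(missing_fields(document_fields, schema_field_names(sample_response)))
```

Any name this prints is a candidate for the same "unknown field" error when the document is indexed.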


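The first thing I can suggest is to use bin/nutch rather than the crawl script so that you can trace the crawl process step by step. It gives you more detail.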
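
As an illustration of what running the steps individually looks like, the sketch below lists the bin/nutch commands that roughly correspond to one round of the crawl script. The final `index` command mirrors the one shown in the error output of the question; the earlier steps follow the usual Nutch 1.x sequence and should be treated as an assumption to check against your version. The segment name is created by the `generate` step, so it is only known after that step has run. The function name `crawl_steps` is hypothetical, and the script only prints the commands.

```python
def crawl_steps(crawl_dir, seed_path, segment, solr_url):
    """Return one crawl round as a list of bin/nutch argument lists."""
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    segments = f"{crawl_dir}/segments"
    segment_path = f"{segments}/{segment}"
    return [
        ["bin/nutch", "inject", crawldb, seed_path],
        ["bin/nutch", "generate", crawldb, segments],
        ["bin/nutch", "fetch", segment_path],
        ["bin/nutch", "parse", segment_path],
        ["bin/nutch", "updatedb", crawldb, segment_path],
        ["bin/nutch", "invertlinks", linkdb, "-dir", segments],
        # Same form as the failing command in the question's output.
        ["bin/nutch", "index", f"-Dsolr.server.url={solr_url}",
         crawldb, "-linkdb", linkdb, segment_path],
    ]


steps = crawl_steps(
    "TestCrawl",
    "urls/seed.txt",
    "20171017090519",  # segment name taken from the question's output
    "http://localhost:8983/solr/nutch",
)
for step in steps:
    print(" ".join(step))
```

Running each printed command by hand and reading its output makes it clear which stage fails; in this thread only the last one, `index`, does.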