Solr indexing fails after Nutch crawl, reports "Job failed"

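I have a website hosted on my local machine, and I am trying to crawl it with Nutch and index it in Solr (both also on my local machine). I installed Solr 4.6.1 and Nutch 1.7 following the instructions on the Nutch website, and Solr runs in my browser without problems.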

I am running the following command:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 1 -topN 2
The crawl works fine, but when it tries to put the data into Solr, it fails with the following output:

Indexer: starting at 2014-02-06 16:29:28
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : use authentication (default false)
    solr.auth : username for authentication
    solr.auth.password : password for authentication


Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
I went to the Nutch logs directory and tailed the hadoop.log file, which shows the following:

2014-02-06 16:29:28,920 INFO  solr.SolrIndexWriter - Indexing 1 documents
2014-02-06 16:29:28,921 INFO  httpclient.HttpMethodDirector - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server localhost failed to respond
2014-02-06 16:29:28,921 INFO  httpclient.HttpMethodDirector - Retrying request
2014-02-06 16:29:28,924 WARN  mapred.LocalJobRunner - job_local331896790_0009
java.io.IOException
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:173)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:159)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:478)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155)
    ... 6 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
    at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
    at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422)

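However, I can still access Solr in the browser. This is my first attempt at Solr/Nutch, so any help from someone with more knowledge would be greatly appreciated. Thanks.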

This happens when not all of the fields Nutch requires are present in Solr's schema.xml. Have you added the fields from Nutch's schema.xml?

It should work if you add the following to the "fields" section:

<field name="id" type="string" stored="true" indexed="true"/>
<!-- core fields -->
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>

<!-- fields for index-basic plugin -->
<field name="host" type="string" stored="false" indexed="true"/>
<field name="url" type="url" stored="true" indexed="true"
    required="true"/>
<field name="content" type="text_general" stored="false" indexed="true"/>
<field name="title" type="text_general" stored="true" indexed="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>

<!-- fields for index-anchor plugin -->
<field name="anchor" type="string" stored="true" indexed="true"
    multiValued="true"/>

<!-- fields for index-more plugin -->
<field name="type" type="string" stored="true" indexed="true"
    multiValued="true"/>
<field name="contentLength" type="long" stored="true"
    indexed="false"/>
<field name="lastModified" type="date" stored="true"
    indexed="false"/>
<field name="date" type="date" stored="true" indexed="true"/>

<!-- fields for languageidentifier plugin -->
<field name="lang" type="string" stored="true" indexed="true"/>

<!-- fields for subcollection plugin -->
<field name="subcollection" type="string" stored="true"
    indexed="true" multiValued="true"/>

<!-- fields for feed plugin (tag is also used by microformats-reltag)-->
<field name="author" type="string" stored="true" indexed="true"/>
<field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="feed" type="string" stored="true" indexed="true"/>
<field name="publishedDate" type="date" stored="true"
    indexed="true"/>
<field name="updatedDate" type="date" stored="true"
    indexed="true"/>

<!-- fields for creativecommons plugin -->
<field name="cc" type="string" stored="true" indexed="true"
    multiValued="true"/>

<!-- fields for tld plugin -->    
<field name="tld" type="string" stored="false" indexed="false"/>
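
To see which of these fields a given schema.xml still lacks before re-running the indexer, a small check along the following lines may help. This is only a sketch: the function name `missing_fields`, the `NUTCH_FIELDS` list (a subset copied from the list above) and the trimmed `SAMPLE` schema are illustrative assumptions, and which fields are actually mandatory depends on the Nutch indexing plugins you have enabled.

```python
import xml.etree.ElementTree as ET

# Field names copied from the list above (core, index-basic, index-anchor).
# Assumption: this subset is what your plugin configuration sends to Solr.
NUTCH_FIELDS = ["id", "segment", "digest", "boost", "host", "url",
                "content", "title", "cache", "tstamp", "anchor"]


def missing_fields(schema_xml, wanted=NUTCH_FIELDS):
    """Return the names in `wanted` that have no <field> declaration."""
    root = ET.fromstring(schema_xml)
    # iter() finds <field> elements whether they sit inside a <fields>
    # wrapper or directly under <schema>. <dynamicField> is not considered.
    declared = {f.get("name") for f in root.iter("field")}
    return [name for name in wanted if name not in declared]


# Hypothetical, heavily trimmed schema used only to exercise the function.
SAMPLE = """<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
    <field name="title" type="text_general" stored="true" indexed="true"/>
    <field name="content" type="text_general" stored="false" indexed="true"/>
  </fields>
</schema>"""

if __name__ == "__main__":
    print(missing_fields(SAMPLE))
```

To run it against a real installation, read the text of your own schema.xml from disk and pass it in place of `SAMPLE`. An empty result only means the names are declared; it does not check that the field types match what Nutch sends.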

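I had a similar problem with Nutch 1.8 and Solr 4.8.0. In fact, Diaa's answer helped me solve it. After removing the entries that overlapped between my schema.xml and Diaa's field list, and changing the two entries marked "added by wb" and "changed by wb", I ended up with the following field list, which works for me. Unlike earlier versions of Nutch and Solr, there is no longer a "fields" tag; the entries tagged "field" now sit directly inside "schema". Here is the complete field list: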

   <field name="_root_" type="string" indexed="true" stored="false"/>

   <!-- Only remove the "id" field if you have a very good reason to. While not strictly
     required, it is highly recommended. A <uniqueKey> is present in almost all Solr 
     installations. See the <uniqueKey> declaration below where <uniqueKey> is set to "id".
   -->   
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 

   <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
   <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

   <field name="weight" type="float" indexed="true" stored="true"/>
   <field name="price"  type="float" indexed="true" stored="true"/>
   <field name="popularity" type="int" indexed="true" stored="true" />
   <field name="inStock" type="boolean" indexed="true" stored="true" />

   <field name="store" type="location" indexed="true" stored="true"/>

   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them. Some metadata is parsed from the documents,
     but there are some which come from the client context:
       "content_type": From the HTTP headers of incoming stream
       "resourcename": From SolrCell request param resource.name
   -->
   <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="subject" type="text_general" indexed="true" stored="true"/>
   <field name="description" type="text_general" indexed="true" stored="true"/>
   <field name="comments" type="text_general" indexed="true" stored="true"/>
   <field name="author" type="text_general" indexed="true" stored="true"/>
   <field name="keywords" type="text_general" indexed="true" stored="true"/>
   <field name="category" type="text_general" indexed="true" stored="true"/>
   <field name="resourcename" type="text_general" indexed="true" stored="true"/>

   <!-- added by wb: required="true" -->
   <field name="url" type="text_general" indexed="true" stored="true" required="true"/> 

   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>

   <!-- Main body of document extracted by SolrCell.
        NOTE: This field is not indexed by default, since it is also copied to "text"
        using copyField below. This is to save space. Use this field for returning and
        highlighting document content. Use the "text" field to search the content. -->

   <!-- changed by wb: indexed="true" -->
   <field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/> 


   <!-- catchall field, containing all other searchable text fields (implemented
        via copyField further on in this schema  -->
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

   <!-- catchall text field that indexes tokens both normally and in reverse for efficient
        leading wildcard queries. -->
   <field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>

   <!-- non-tokenized version of manufacturer to make it easier to sort or group
        results by manufacturer.  copied from "manu" via copyField -->
   <field name="manu_exact" type="string" indexed="true" stored="false"/>

   <field name="payloads" type="payloads" indexed="true" stored="true"/>

   <!-- Fields needed for Nutch 1.8 integration: -->

    <field name="segment" type="string" stored="true" indexed="false"/>
    <field name="digest" type="string" stored="true" indexed="false"/>
    <field name="boost" type="float" stored="true" indexed="false"/>

    <!-- fields for index-basic plugin -->
    <field name="host" type="string" stored="false" indexed="true"/>
    <field name="cache" type="string" stored="true" indexed="false"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>

    <!-- fields for index-anchor plugin -->
    <field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>

    <!-- fields for index-more plugin -->
    <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="contentLength" type="long" stored="true" indexed="false"/>
    <field name="lastModified" type="date" stored="true" indexed="false"/>
    <field name="date" type="date" stored="true" indexed="true"/>

    <!-- fields for languageidentifier plugin -->
    <field name="lang" type="string" stored="true" indexed="true"/>

    <!-- fields for subcollection plugin -->
    <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>

    <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
    <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="feed" type="string" stored="true" indexed="true"/>
    <field name="publishedDate" type="date" stored="true" indexed="true"/>
    <field name="updatedDate" type="date" stored="true" indexed="true"/>

    <!-- fields for creativecommons plugin -->
    <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>

    <!-- fields for tld plugin -->    
    <field name="tld" type="string" stored="false" indexed="false"/>

   <!-- End of fields needed for Nutch 1.8 integration: -->
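
When merging a stock schema.xml with the Nutch field list as described above, it is easy to leave the same field name declared twice. A quick way to spot such overlaps might look like the following sketch. The function name `duplicate_fields` and the `MERGED` sample are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_fields(schema_xml):
    """Return field names declared more than once, sorted alphabetically."""
    root = ET.fromstring(schema_xml)
    # Count every <field name="..."> declaration anywhere in the schema.
    counts = Counter(f.get("name") for f in root.iter("field"))
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical merged schema: "title" and "url" appear in both the stock
# section and the pasted Nutch section.
MERGED = """<schema name="example" version="1.5">
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <field name="url" type="text_general" indexed="true" stored="true" required="true"/>
  <!-- pasted from the Nutch list without removing the overlap -->
  <field name="title" type="text_general" stored="true" indexed="true"/>
  <field name="url" type="url" stored="true" indexed="true" required="true"/>
</schema>"""

if __name__ == "__main__":
    print(duplicate_fields(MERGED))
```

Each name it reports should be kept in only one place, with the attributes you actually want (for example `required="true"` on `url`).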

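Hi, I know this question is old, but for anyone using Nutch and Solr in 2017 (Nutch 1.13, Solr 5.5.0): I ran into the same problem, and the following solved it.

bin/crawl -i -D solr.server.url=url/TestCrawl2/1

The above is the command I was using to crawl, and it produced the same error. With this command it worked:

bin/crawl -i -D solr.server.url=url TestCrawl2

I simply removed the "/" after url/TestCrawl2/ and it worked for me.
Thanks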

I tried this, but I still get the same error. Can you help? I am using Solr 4.8 and Nutch 1.12.