
Apache Nutch indexing with Elasticsearch


I am currently building a search engine on an Apache Nutch and Elasticsearch stack. I am using Apache Nutch 2.1 and Elasticsearch 1.7.3.

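I am trying to index directly from Nutch by following a set of online instructions. Both Nutch and Elasticsearch are running on my localhost, and the cluster is named "elasticsearch".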

Here are the parts of nutch-site.xml that I changed:

<property>
    <name>plugin.includes</name>
    <value>protocol-selenium|protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include.  Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
</property>
But it responds with:

Exception in thread "main" java.lang.RuntimeException: job failed: name=elastic-index [elasticsearch], jobid=job_local_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:52)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.indexElastic(ElasticIndexerJob.java:60)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:73)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.main(ElasticIndexerJob.java:78)
I am not sure what went wrong. Here is my hadoop.log:

2016-01-15 15:46:24,106 INFO  elastic.ElasticIndexerJob - Starting
2016-01-15 15:46:24,733 INFO  plugin.PluginRepository - Plugins: looking in: /home/gabrielgagno/apache-nutch-2.1/runtime/local/plugins
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository - Registered Plugins:
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     the nutch core extension points (nutch-extensionpoints)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Basic URL Normalizer (urlnormalizer-basic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Basic Indexing Filter (index-basic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Html Parse Plug-in (parse-html)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Http / Https Protocol Plug-in (protocol-httpclient)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     HTTP Framework (lib-http)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Filter (urlfilter-regex)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Pass-through URL Normalizer (urlnormalizer-pass)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Normalizer (urlnormalizer-regex)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Tika Parser Plug-in (parse-tika)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     OPIC Scoring Plug-in (scoring-opic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     CyberNeko HTML Parser (lib-nekohtml)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Anchor Indexing Filter (index-anchor)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Filter Framework (lib-regex-filter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository - Registered Extension-Points:
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Protocol (org.apache.nutch.protocol.Protocol)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Parse Filter (org.apache.nutch.parse.ParseFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch URL Filter (org.apache.nutch.net.URLFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Content Parser (org.apache.nutch.parse.Parser)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2016-01-15 15:46:24,822 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:24,822 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:24,824 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:24,824 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:25,827 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-01-15 15:46:26,521 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2016-01-15 15:46:26,727 INFO  elasticsearch.node - [Layla Miller] version[1.7.3], pid[18188], build[05d4530/2015-10-15T09:14:17Z]
2016-01-15 15:46:26,727 INFO  elasticsearch.node - [Layla Miller] initializing ...
2016-01-15 15:46:26,852 INFO  elasticsearch.plugins - [Layla Miller] loaded [], sites []
2016-01-15 15:46:28,229 WARN  elasticsearch.bootstrap - JNA not found. native methods will be disabled.
2016-01-15 15:46:28,756 INFO  elasticsearch.node - [Layla Miller] initialized
2016-01-15 15:46:28,756 INFO  elasticsearch.node - [Layla Miller] starting ...
2016-01-15 15:46:28,824 INFO  elasticsearch.transport - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/172.16.3.72:9301]}
2016-01-15 15:46:28,836 INFO  elasticsearch.discovery - [Layla Miller] elasticsearch/_tzxV-I7SSeduY9b8enpPw
2016-01-15 15:46:58,836 WARN  elasticsearch.discovery - [Layla Miller] waited for 30s and no initial state was set by the discovery
2016-01-15 15:46:58,845 INFO  elasticsearch.http - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/172.16.3.72:9201]}
2016-01-15 15:46:58,845 INFO  elasticsearch.node - [Layla Miller] started
2016-01-15 15:46:58,848 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:58,848 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:58,848 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:58,848 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:59,438 INFO  elastic.ElasticWriter - Processing remaining requests [docs = 147, length = 1011442, total docs = 147]
2016-01-15 15:46:59,445 INFO  elastic.ElasticWriter - Processing to finalize last execute
2016-01-15 15:47:59,452 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2016-01-15 15:47:59,453 WARN  mapred.LocalJobRunner - job_local_0001
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:151)
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:141)
    at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:215)
    at org.elasticsearch.action.bulk.TransportBulkAction.access$000(TransportBulkAction.java:67)
    at org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(TransportBulkAction.java:153)
    at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(TransportAction.java:137)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Can anyone help me? Thanks.

Make sure the Elasticsearch version in Nutch's elastic dependency is the same as the version running on your local server.


If they are not the same, don't waste time on it: push from Nutch to Elasticsearch directly over HTTP instead of through the Java API.
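To make the HTTP route concrete, here is a minimal sketch of how documents could be packed for Elasticsearch's _bulk endpoint, which accepts newline-delimited JSON: one action line followed by one source line per document. The function names, the index name "nutch", the type name "doc", and the sample document are all hypothetical; how the documents are read out of Nutch's storage is not covered here.

```python
import json
import urllib.request


def build_bulk_body(docs, index, doc_type="doc"):
    """Build a newline-delimited JSON body for the _bulk endpoint.

    docs is an iterable of (doc_id, source_dict) pairs. The index and
    doc_type values are placeholders chosen for this example.
    """
    lines = []
    for doc_id, source in docs:
        # Action line: says which index/type/id the next line belongs to.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


def send_bulk(body, base_url="http://localhost:9200"):
    """POST the body to <base_url>/_bulk and return the decoded response.

    The default URL assumes Elasticsearch's HTTP port on localhost.
    """
    request = urllib.request.Request(
        base_url + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    sample = [("http://example.com/", {"title": "Example", "content": "Hello"})]
    print(build_bulk_body(sample, index="nutch"))
```

Because this path only speaks HTTP, it does not depend on the Elasticsearch client library bundled with Nutch matching the server version.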

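Sorry, I only just saw your comment. What do you mean by "save version"? You mean "same", right? If so, what I did was change this in my ivy.xml: My local Nutch is 2.1 and my local Elasticsearch is 1.7.3. Where did I go wrong? I would like to stick with the default Nutch indexing as far as possible, because I need Nutch's ParseMetatags functionality.

Also, in the log I noticed that it does not show the cluster name I am using (i.e. cluster1) but a different name [in the log above it shows "Layla Miller"]. The inet address also differs from the one I assigned. I wonder whether this helps in solving the problem.

The default cluster name of a local Elasticsearch server is elasticsearch. Can you share your Nutch Elasticsearch configuration? Finally, run bin/crawl instead of bin/nutch.

I have barely touched the ES settings, and the default cluster name is unchanged. As for Nutch, I have not changed anything other than my nutch-site.xml, where I explicitly specified the following: elastic.cluster is elasticsearch ("The cluster name to discover. Either host and port must be defined or cluster.") and elastic.index is globe_search ("Default index to send documents to."). I also set elastic.host to localhost and elastic.port to 9300, but that does not seem to take effect. It points to the wrong Elasticsearch host, as shown above.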
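The details raised in these comments (node name, bound ports, the discovery timeout, and the final block reason) can be pulled out of hadoop.log mechanically. The following is only a sketch: the regular expressions are written against the log lines quoted in the question, and the function name and the shape of the returned dictionary are made up for illustration.

```python
import re

# Matches lines such as:
# "... elasticsearch.transport - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, ..."
BOUND = re.compile(
    r"elasticsearch\.(?P<service>transport|http) - \[(?P<node>[^\]]+)\] "
    r"bound_address \{inet\[/[^\]]*:(?P<port>\d+)\]\}"
)
# Matches the exception line that lists the cluster blocks.
BLOCK = re.compile(r"ClusterBlockException: blocked by: (?P<blocks>.+)$")


def summarize_log(log_lines):
    """Collect the Elasticsearch-related facts from an iterable of log lines."""
    summary = {"node": None, "ports": {}, "discovery_timeout": False, "blocks": None}
    for line in log_lines:
        bound = BOUND.search(line)
        if bound:
            summary["node"] = bound.group("node")
            summary["ports"][bound.group("service")] = int(bound.group("port"))
        if "no initial state was set by the discovery" in line:
            summary["discovery_timeout"] = True
        block = BLOCK.search(line)
        if block:
            summary["blocks"] = block.group("blocks").strip()
    return summary


if __name__ == "__main__":
    with open("hadoop.log", encoding="utf-8") as handle:
        print(summarize_log(handle))
```

For the log in the question this would report a node named Layla Miller bound to transport port 9301 and HTTP port 9201, a discovery timeout, and a "no master" block. Ports one above the defaults 9300 and 9200 are consistent with a second Elasticsearch node having been started inside the indexing job while another process already held the default ports, although the log alone does not prove that.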