Stormcrawler not fetching/indexing pages for Elasticsearch

I am using Stormcrawler with Elasticsearch, and when crawling a website, no pages with status FETCHED show up in Kibana.

Still, on the console the pages appear to be fetched and parsed:

48239 [Thread-26-fetcher-executor[3 3]] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Threads : 0    queues : 1      in_queues : 1
48341 [FetcherThread #7] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched http://books.toscrape.com/catalogue/category/books_1/index.html with status 200 in msec 86
48346 [Thread-46-parse-executor[5 5]] INFO  c.d.s.b.JSoupParserBolt - Parsing : starting http://books.toscrape.com/catalogue/category/books_1/index.html
48362 [Thread-46-parse-executor[5 5]] INFO  c.d.s.b.JSoupParserBolt - Parsed http://books.toscrape.com/catalogue/category/books_1/index.html in 13 msec
Also, the Elasticsearch index does seem to receive some documents, even though they have no title.

I extended com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt to store each page's metadata in a local file, and it looks like it does not receive any tuples at all. Since the indexer is also what marks a URL's status as FETCHED, this would explain the observation in Kibana mentioned above.

Is there any explanation for this behaviour? I have reverted the crawler configuration to the standard one, except for the index bolt in crawler.flux, which points to my class.
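
One way to check whether any tuples reach the indexing stage at all would be to temporarily point the index bolt at a bolt that only logs what it receives (and to drop the index → status stream from the flux while doing so, since this bolt declares no output streams). A minimal sketch; TupleLoggerBolt is a hypothetical name, not part of StormCrawler:

package de.hpi.bpStormcrawler;

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical debugging bolt: wire it in place of "index" in crawler.flux
// to see whether the parse bolt emits anything on the default stream at all.
public class TupleLoggerBolt extends BaseRichBolt {

    private static final Logger LOG = LoggerFactory
            .getLogger(TupleLoggerBolt.class);

    private OutputCollector collector;

    @Override
    @SuppressWarnings("rawtypes")
    public void prepare(Map conf, TopologyContext context,
            OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // every tuple on the default stream of the parse bolt carries a "url" field
        LOG.info("Received tuple from {}/{} for {}",
                tuple.getSourceComponent(), tuple.getSourceStreamId(),
                tuple.getStringByField("url"));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // log only, nothing emitted downstream
    }
}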

The topology configuration (crawler.flux):

name: "crawler"

includes:
    - resource: true
      file: "/crawler-default.yaml"
      override: false

    - resource: false
      file: "crawler-conf.yaml"
      override: true

    - resource: false
      file: "es-conf.yaml"
      override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
  - id: "index"
    className: "de.hpi.bpStormcrawler.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE

  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["url"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
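
For context: the status view in Kibana is fed from the status index, which the StatusUpdaterBolt fills from whatever (url, metadata, status) tuples arrive on the dedicated "status" stream wired above. A bolt marks a page as successfully processed with an emit like this (the standard StormCrawler idiom, shown as an isolated fragment):

// Inside a bolt's execute(): emitting on the status stream is what makes the
// StatusUpdaterBolt persist the URL as FETCHED in the status index.
_collector.emit(StatusStreamName, tuple, new Values(url, metadata, Status.FETCHED));
_collector.ack(tuple);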
The modified IndexerBolt:

package de.hpi.bpStormcrawler;

/**
 * Licensed to DigitalPebble Ltd under one or more contributor license
 * agreements. See the NOTICE file distributed with this work for additional
 * information regarding copyright ownership. DigitalPebble licenses this file
 * to You under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy
 * of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations
 * under the License.
 */

import static com.digitalpebble.stormcrawler.Constants.StatusStreamName;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

import java.io.*;
import java.util.Iterator;
import java.util.Map;

import org.apache.storm.metric.api.MultiCountMetric;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.elasticsearch.action.DocWriteRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.elasticsearch.ElasticSearchConnection;
import com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt;
import com.digitalpebble.stormcrawler.persistence.Status;
import com.digitalpebble.stormcrawler.util.ConfUtils;

/**
 * Sends documents to ElasticSearch. Indexes all the fields from the tuples or
 * a Map<String, Object> from a named field.
 */
@SuppressWarnings("serial")
public class IndexerBolt extends AbstractIndexerBolt {

    private static final Logger LOG = LoggerFactory
            .getLogger(IndexerBolt.class);

    private static final String ESBoltType = "indexer";

    static final String ESIndexNameParamName = "es.indexer.index.name";
    static final String ESDocTypeParamName = "es.indexer.doc.type";
    private static final String ESCreateParamName = "es.indexer.create";

    private OutputCollector _collector;

    private String indexName;
    private String docType;

    // whether the document gets created only if it does not exist or
    // is overwritten
    private boolean create = false;

    File indexFile;

    private MultiCountMetric eventCounter;

    private ElasticSearchConnection connection;

    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public void prepare(Map conf, TopologyContext context,
            OutputCollector collector) {
        super.prepare(conf, context, collector);
        _collector = collector;

        indexName = ConfUtils.getString(conf, IndexerBolt.ESIndexNameParamName,
                "fetcher");
        docType = ConfUtils.getString(conf, IndexerBolt.ESDocTypeParamName,
                "doc");
        create = ConfUtils.getBoolean(conf, IndexerBolt.ESCreateParamName,
                false);

        try {
            connection = ElasticSearchConnection
                    .getConnection(conf, ESBoltType);
        } catch (Exception e1) {
            LOG.error("Can't connect to ElasticSearch", e1);
            throw new RuntimeException(e1);
        }

        this.eventCounter = context.registerMetric("ElasticSearchIndexer",
                new MultiCountMetric(), 10);

        indexFile = new File("/Users/jonaspohlmann/code/HPI/BP/stormcrawlerSpike/spikeStormcrawler2/index.log");
    }

    @Override
    public void cleanup() {
        if (connection != null)
            connection.close();
    }

    @Override
    public void execute(Tuple tuple) {
        String url = tuple.getStringByField("url");

        // distinguish the value used for indexing
        // from the one used for the status
        String normalisedurl = valueForURL(tuple);

        Metadata metadata = (Metadata) tuple.getValueByField("metadata");
        String text = tuple.getStringByField("text");

        // BP: added content field
        String content = new String(tuple.getBinaryByField("content"));

        boolean keep = filterDocument(metadata);
        if (!keep) {
            eventCounter.scope("Filtered").incrBy(1);
            // treat it as successfully processed even if
            // we do not index it
            _collector.emit(StatusStreamName, tuple, new Values(url, metadata,
                    Status.FETCHED));
            _collector.ack(tuple);
            return;
        }

        try {
            XContentBuilder builder = jsonBuilder().startObject();

            // display text of the document?
            if (fieldNameForText() != null) {
                builder.field(fieldNameForText(), trimText(text));
            }
The relevant part of crawler-conf.yaml:

  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - domain=domain
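
For reference, the indexer.md.mapping entries map a metadata key set at parse time (left-hand side) onto the field name used in the Elasticsearch document (right-hand side). In the stock IndexerBolt that the class above is based on, the mapping is applied inside execute() roughly like this (a paraphrased fragment, not verbatim; builder and metadata come from the surrounding method):

// Roughly how indexer.md.mapping is applied: filterMetadata() (inherited from
// AbstractIndexerBolt) returns the metadata values keyed by the destination
// field name, which are then written into the document being built.
Map<String, String[]> keyVals = filterMetadata(metadata);
for (String fieldName : keyVals.keySet()) {
    for (String value : keyVals.get(fieldName)) {
        builder.field(fieldName, value); // e.g. "title" <- value of parse.title
    }
}

So documents arriving without a title would be consistent with no parse.title metadata being attached to the tuples, e.g. because the corresponding parse filter is not active.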