Nutch crawler doesn't retrieve news article content

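I am trying to crawl news articles from the link:-

But the text of the page does not end up in the content field of the index (Elasticsearch).

The result of the crawl is:-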

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.09492774,
    "hits": [
      {
        "_index": "news",
        "_type": "doc",
        "_id": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
        "_score": 0.09492774,
        "_source": {
          "tstamp": "2016-08-04T07:21:59.614Z",
          "segment": "20160804125156",
          "digest": "d583a81c0c4c7510f5c842ea3b557992",
          "host": "www.bloomberg.com",
          "boost": "1.0",
          "id": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
          "url": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
          "content": ""
        }
      },
      {
        "_index": "news",
        "_type": "doc",
        "_id": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
        "_score": 0.009845509,
        "_source": {
          "tstamp": "2016-08-04T07:22:05.708Z",
          "segment": "20160804125156",
          "digest": "2a94a32ffffd0e03647928755e055e30",
          "host": "www.bloomberg.com",
          "boost": "1.0",
          "id": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
          "url": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
          "content": ""
        }
      }
    ]
  }
}

Note that the content field is empty in both hits. I tried different options in nutch-site.xml, but the result is still the same. Please help me with this.

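This is a bit of an off-topic answer, but try Apache ManifoldCF. It provides a built-in connector for Elasticsearch and keeps a better log history, which helps in finding out why data is not being indexed. The connector section in ManifoldCF lets you specify the field the content should be indexed into. It is a good open-source alternative to try.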

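I don't know why Nutch was unable to extract the article content, but I found a workaround using Jsoup. I developed a custom parse-filter plugin that parses the whole document and sets the parsed text in the ParseResult returned by the parse filter, and I configured it in parse-plugins.xml.

It looks something like this:-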

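   // Re-parse the raw fetched bytes with Jsoup (assumes the page is UTF-8 encoded)
   document = Jsoup.parse(new String(content.getContent(), "UTF-8"), content.getUrl());
   // Look up the parse that Nutch already produced for this URL
   parse = parseResult.get(content.getUrl());
   status = parse.getData().getStatus();
   title = document.title();
   // Keep the original outlinks and metadata; only the title and the text are replaced
   parseData = new ParseData(status, title, parse.getData().getOutlinks(), parse.getData().getContentMeta(), parse.getData().getParseMeta());
   parseResult.put(content.getUrl(), new ParseText(document.body().text()), parseData);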
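
To see what the Jsoup calls in the snippet above produce, they can be tried outside Nutch. The following is only a minimal sketch, assuming Jsoup is on the classpath and the page is UTF-8 encoded. The class `JsoupTextSketch`, the holder `Extracted`, and the helper `extract` are hypothetical names introduced for illustration; they are not part of Nutch or of the plugin described above.

```java
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTextSketch {

    /** Holds the two values the snippet above copies into ParseData and ParseText. */
    static final class Extracted {
        final String title;
        final String text;

        Extracted(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    // Hypothetical helper: rawContent plays the role of content.getContent(),
    // baseUrl the role of content.getUrl(). Assumes UTF-8 encoded HTML.
    static Extracted extract(byte[] rawContent, String baseUrl) {
        String html = new String(rawContent, StandardCharsets.UTF_8);
        Document document = Jsoup.parse(html, baseUrl);
        // title() returns the <title> text; body().text() returns the
        // text of the body with all markup stripped.
        return new Extracted(document.title(), document.body().text());
    }

    public static void main(String[] args) {
        byte[] raw = ("<html><head><title>Sample</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>")
                .getBytes(StandardCharsets.UTF_8);
        Extracted result = extract(raw, "http://example.com/");
        System.out.println("title: " + result.title);
        System.out.println("text:  " + result.text);
    }
}
```

If the text printed here is also empty for a real page, the article body is probably not in the fetched HTML at all (for example, because it is rendered by JavaScript), and a parse filter alone would not help.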

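Thanks :). I will take a look. I want to select the links inside a specific div (or any other tag), fetch the content those links point to, and index it. Can we do something like that with ManifoldCF?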