Web crawler: How do I seed URLs from a text file in StormCrawler?

I have a large number of URLs (around 40,000) that I need to crawl with StormCrawler. Is there any way to pass these URLs as a text file instead of listing them in crawler.flux? Something like this:

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    parallelism: 1
    constructorArgs:
      - "URLs.txt"

For Solr and Elasticsearch there are injectors which read the URLs from files and add them as discovered items to the status index. Of course, this requires that Solr or Elasticsearch is used to hold the status index. The injector is launched as a topology, e.g.

storm ... com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector .../seeds '*' -conf ...
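The seed files themselves are plain text. Assuming the line format handled by StringTabScheme (a URL, optionally followed by tab-separated key=value metadata), a seed file could look like the sketch below; the metadata key is purely illustrative:

http://www.example.com/
http://www.example.org/	source=seedlist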

There is a dedicated spout for exactly this purpose: FileSpout. The topologies mentioned by @sebastian-nagel use it, and you can use it in your own topology as well, see for instance the ESSeedInjector mentioned above.
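As a rough illustration, here is a minimal sketch of wiring FileSpout into your own topology with the plain Storm Java API instead of Flux. The constructor arguments mirror the flux file in the next answer, and the status bolt is the Solr one used there; the class name is a placeholder:

import org.apache.storm.topology.TopologyBuilder;

import com.digitalpebble.stormcrawler.spout.FileSpout;
import com.digitalpebble.stormcrawler.solr.persistence.StatusUpdaterBolt;
import com.digitalpebble.stormcrawler.util.URLStreamGrouping;

public class SeedTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Same arguments as in the flux below: read seed files matching "seeds"
        // from the current directory and emit their URLs as discovered.
        builder.setSpout("filespout", new FileSpout(".", "seeds", true), 1);

        // Send the discovered URLs to the status index on the "status" stream,
        // partitioned by domain, like the CUSTOM grouping in the flux.
        builder.setBolt("status", new StatusUpdaterBolt(), 1)
               .customGrouping("filespout", "status", new URLStreamGrouping("byDomain"));

        // ... add the fetcher/parser/indexer bolts and submit the topology as usual.
    }
}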

Based on Julien Nioche's answer, I wrote a crawler.flux that does what I need. Here is the file:

name: "crawler"

includes:
    - resource: true
      file: "/crawler-default.yaml"
      override: false

    - resource: false
      file: "crawler-conf.yaml"
      override: true

    - resource: false
      file: "solr-conf.yaml"
      override: true



spouts:

  - id: "spout"
    className: "com.digitalpebble.stormcrawler.solr.persistence.SolrSpout"
    parallelism: 1

  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds"
      - true

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 5
  - id: "index"
    className: "com.digitalpebble.stormcrawler.solr.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.solr.persistence.StatusUpdaterBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"


  - from: "filespout"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"

Instead of setting the URL file name, you set the directory in which the URL file(s) are placed.
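For completeness, the flux file is then submitted through Storm's Flux runner in the usual way; the jar name below is just a placeholder for your own topology uber-jar (replace --local with --remote to submit to a cluster):

storm jar crawler-topology.jar org.apache.storm.flux.Flux --local crawler.flux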

Using SolrSpout I get this error: Exception in thread "main" java.lang.IllegalArgumentException: Couldn't find a suitable constructor for class 'com.digitalpebble.stormcrawler.util.StringTabScheme' with arguments '[DISCOVERED]'.
Check that the constructor you are using matches your version of StormCrawler. See #664.
Could this branch and this file be updated? They seem outdated and do not work for my case. Should I apply any modification to FileSpout.java to make it work? Thanks.
That branch is no longer active; it would need a one-off comparison.