Web crawler: how to seed URLs from a text file in StormCrawler?
I have a large number of URLs (about 40,000) that I need to crawl with StormCrawler. Is there any way to pass these URLs in as a text file, rather than as a list inside crawler.flux? Something like this:
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    parallelism: 1
    constructorArgs:
      - "URLs.txt"
For Solr and Elasticsearch there are injectors which read URLs from files and add them to the status index as discovered items. This of course requires Solr or Elasticsearch to hold the status index. The injector is launched as a topology, e.g.
storm ... com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector .../seeds '*' -conf ...
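The seed files read by the injector are plain text with one URL per line; StormCrawler's `StringTabScheme` also accepts optional tab-separated metadata after the URL. As a minimal sketch (the input name `URLs.txt`, the output directory `seeds/`, and the filtering rules are placeholder assumptions, not part of StormCrawler itself), a raw list of ~40,000 URLs could be normalised into that shape like this:

```python
import os

def write_seed_file(urls, out_dir="seeds", name="part-00000"):
    """Write one URL per line into a seed file the injector can read.
    Blank lines and entries that are not http(s) URLs are skipped."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name)
    with open(path, "w") as f:
        for url in urls:
            url = url.strip()
            if url and url.startswith(("http://", "https://")):
                f.write(url + "\n")
    return path

path = write_seed_file(["https://example.com/", "not-a-url", "https://example.org/a"])
print(open(path).read())
```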
There is a dedicated solution (FileSpout) for exactly this purpose. The topologies mentioned by @sebastian nagel use it, and you can also use it in your own topologies; see for instance the crawler.flux in the answer below.

Based on Julien Nioche's answer, I wrote a crawler.flux that meets my requirements. Here is the file:
name: "crawler"

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "solr-conf.yaml"
    override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.solr.persistence.SolrSpout"
    parallelism: 1
  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds"
      - true

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 5
  - id: "index"
    className: "com.digitalpebble.stormcrawler.solr.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.solr.persistence.StatusUpdaterBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]
  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "filespout"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"
Instead of "." you can set the directory where your URL file is located, and instead of "seeds" you can set the name of your URL file.
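In the flux above, FileSpout is constructed with the directory "." and the file name "seeds". A minimal sketch, assuming the raw URL list lives in a file such as `raw_urls.txt` (a placeholder name), for deduplicating ~40,000 URLs into that single seeds file:

```python
def build_seeds(in_path, out_path="seeds"):
    # Deduplicate a raw URL list into the "seeds" file that
    # FileSpout(".", "seeds", true) in the flux above will read.
    seen = set()
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                fout.write(url + "\n")
    return len(seen)

# Example: build_seeds("raw_urls.txt") returns the number of unique URLs written.
```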
Using the SolrSpout, I get this error: Exception in thread "main" java.lang.IllegalArgumentException: Could not find a suitable constructor for class 'com.digitalpebble.stormcrawler.util.StringTabScheme' with arguments '[DISCOVERED]'.
Check that the constructor you are using matches your version of StormCrawler; see #664.
Could this branch and this file be updated? They seem outdated and don't work for my case! Should I apply any modifications to FileSpout.java to make it work? Thanks.
That branch is not open; you would need to do a one-off comparison.