Web crawler: how to seed URLs from a text file in StormCrawler?
I have a large number of URLs (about 40,000) that I need to crawl with StormCrawler. Is there any way to pass these URLs in as a text file, rather than as a list inside crawler.flux? Something like this:
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    parallelism: 1
    constructorArgs:
      - "URLs.txt"
For Solr and Elasticsearch there are injectors which read URLs from files and add them to the status index as discovered items. This of course requires Solr or Elasticsearch to hold the status index. The injector is launched as a topology, e.g.
storm ... com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector .../seeds '*' -conf ...
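The seed files read by the injector are plain text with one URL per line; StormCrawler's `StringTabScheme` also accepts optional tab-separated metadata after the URL. As a minimal sketch (the input name `URLs.txt`, the output directory `seeds/`, and the filtering rules are placeholder assumptions, not part of StormCrawler itself), a raw list of ~40,000 URLs could be normalised into that shape like this:

```python
import os

def write_seed_file(urls, out_dir="seeds", name="part-00000"):
    """Write one URL per line into a seed file the injector can read.
    Blank lines and entries that are not http(s) URLs are skipped."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name)
    with open(path, "w") as f:
        for url in urls:
            url = url.strip()
            if url and url.startswith(("http://", "https://")):
                f.write(url + "\n")
    return path

path = write_seed_file(["https://example.com/", "not-a-url", "https://example.org/a"])
print(open(path).read())
```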
There is a dedicated solution (FileSpout) for exactly this purpose. The topologies mentioned by @sebastian nagel use it, and you can also use it in your own topologies; see for instance the crawler.flux in the answer below.

Based on Julien Nioche's answer, I wrote a crawler.flux that meets my requirements. Here is the file:
name: "crawler"

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "solr-conf.yaml"
    override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.solr.persistence.SolrSpout"
    parallelism: 1
  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds"
      - true

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 5
  - id: "index"
    className: "com.digitalpebble.stormcrawler.solr.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.solr.persistence.StatusUpdaterBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]
  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "filespout"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"
Instead of "." you can set the directory where your URL file is located, and instead of "seeds" you can set the name of your URL file.
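In the flux above, FileSpout is constructed with the directory "." and the file name "seeds". A minimal sketch, assuming the raw URL list lives in a file such as `raw_urls.txt` (a placeholder name), for deduplicating ~40,000 URLs into that single seeds file:

```python
def build_seeds(in_path, out_path="seeds"):
    # Deduplicate a raw URL list into the "seeds" file that
    # FileSpout(".", "seeds", true) in the flux above will read.
    seen = set()
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                fout.write(url + "\n")
    return len(seen)

# Example: build_seeds("raw_urls.txt") returns the number of unique URLs written.
```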
Using the SolrSpout, I get this error: Exception in thread "main" java.lang.IllegalArgumentException: Could not find a suitable constructor for class 'com.digitalpebble.stormcrawler.util.StringTabScheme' with arguments '[DISCOVERED]'.
Check that the constructor you are using matches your version of StormCrawler; see #664.
Could this branch and this file be updated? They seem outdated and don't work for my case! Should I apply any modifications to FileSpout.java to make it work? Thanks.
That branch is not open; you would need to do a one-off comparison.