Web crawler StormCrawler: best topology for a cluster
I am using StormCrawler to crawl 40k sites, with max depth = 2, and I want to crawl as fast as possible. I have 5 Storm nodes (with different static IPs) and 3 Elasticsearch nodes. Currently my best topology is:
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.CollapsingSpout"
    parallelism: 10

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 5
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 5
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 100
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 25
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 25
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 5
And the crawler configuration:
config:
  topology.workers: 5
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 250
  topology.debug: false
  fetcher.threads.number: 500
  worker.heap.memory.mb: 4096
Questions:
1) Should I use AggregationSpout or CollapsingSpout, and what is the difference? I tried AggregationSpout, but its performance was equivalent to a single machine with the default configuration.
2) Is this parallelism configuration correct?
3) When I went from a 1-node to a 5-node configuration, I found that "fetch errors" increased by about 20%, and many sites were not fetched correctly. What could be the cause?
Update:
es-conf.yaml:
# configuration for Elasticsearch resources
config:
  # ES indexer bolt
  # addresses can be specified as a full URL
  # if not we assume that the protocol is http and the port 9200
  es.indexer.addresses: "1.1.1.1"
  es.indexer.index.name: "index"
  es.indexer.doc.type: "doc"
  es.indexer.create: false
  es.indexer.settings:
    cluster.name: "webcrawler-cluster"

  # ES metricsConsumer
  es.metrics.addresses: "http://1.1.1.1:9200"
  es.metrics.index.name: "metrics"
  es.metrics.doc.type: "datapoint"
  es.metrics.settings:
    cluster.name: "webcrawler-cluster"

  # ES spout and persistence bolt
  es.status.addresses: "http://1.1.1.1:9200"
  es.status.index.name: "status"
  es.status.doc.type: "status"
  #es.status.user: "USERNAME"
  #es.status.password: "PASSWORD"
  # the routing is done on the value of 'partition.url.mode'
  es.status.routing: true
  # stores the value used for the routing as a separate field
  # needed by the spout implementations
  es.status.routing.fieldname: "metadata.hostname"
  es.status.bulkActions: 500
  es.status.flushInterval: "5s"
  es.status.concurrentRequests: 1
  es.status.settings:
    cluster.name: "webcrawler-cluster"

  ################
  # spout config #
  ################
  # positive or negative filter parsable by the Lucene Query Parser
  # es.status.filterQuery: "-(metadata.hostname:stormcrawler.net)"
  # time in secs for which the URLs will be considered for fetching after an ack or a fail
  es.status.ttl.purgatory: 30
  # Min time (in msecs) to allow between 2 successive queries to ES
  es.status.min.delay.queries: 2000
  es.status.max.buckets: 50
  es.status.max.urls.per.bucket: 2
  # field to group the URLs into buckets
  es.status.bucket.field: "metadata.hostname"
  # field to sort the URLs within a bucket
  es.status.bucket.sort.field: "nextFetchDate"
  # field to sort the buckets
  es.status.global.sort.field: "nextFetchDate"
  # Delay since previous query date (in secs) after which the nextFetchDate value will be reset
  es.status.reset.fetchdate.after: -1
  # CollapsingSpout: limits the deep paging by resetting the start offset for the ES query
  es.status.max.start.offset: 500
  # AggregationSpout: sampling improves the performance on large crawls
  es.status.sample: false
  # AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
  # uses it as nextFetchDate
  es.status.recentDate.increase: -1
  es.status.recentDate.min.gap: -1

  topology.metrics.consumer.register:
    - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
      parallelism.hint: 1
      #whitelist:
      #  - "fetcher_counter"
      #  - "fetcher_average.bytes_fetched"
      #blacklist:
      #  - "__receive.*"
1) Should I use AggregationSpout or CollapsingSpout, and what is the difference? I tried AggregationSpout, but its performance was equivalent to a single machine with the default configuration.
As the name suggests, AggregationSpout uses aggregations as a mechanism for grouping URLs by host (or domain, IP, or whatever else), whereas CollapsingSpout uses collapsing. The latter is likely to be slower if you configure it with more than 1 URL per bucket (es.status.max.urls.per.bucket), because it issues sub-queries for each bucket. AggregationSpout should perform well, especially with es.status.sample set to true. CollapsingSpout is experimental at this stage.
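Following that advice, trying the AggregationSpout with sampling enabled would only take two changes: one in the topology definition and one in es-conf.yaml (a sketch; the class name is the one shipped alongside the CollapsingSpout in the question, and `es.status.sample` already appears in the config above):

```yaml
# topology definition: swap the spout implementation
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10
---
# es-conf.yaml: enable sampling, which the answer suggests helps on large crawls
config:
  es.status.sample: true
```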
2) Is this parallelism configuration correct?
This is probably more JSoupParserBolts than needed. In practice, a ratio of 1:4 compared to the FetcherBolts is fine, even with 500 fetching threads. The Storm UI is useful for spotting bottlenecks and seeing which components need scaling. Everything else looks OK, but realistically you should use the Storm UI and the metrics to tune the topology to the best settings for your crawl.
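Applied to the topology in the question, the suggested 1:4 fetcher-to-parser ratio would look something like this (illustrative numbers, to be validated against the Storm UI rather than taken as definitive):

```yaml
bolts:
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 5
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    # 4 parser instances per FetcherBolt instead of the original 100 total
    parallelism: 20
```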
3) When I went from a 1-node to a 5-node configuration, I found that "fetch errors" increased by about 20%, and many sites were not fetched correctly. What could be the cause?
This could suggest that your network connection is saturated, but that should not be the case when using more nodes; quite the opposite. Maybe check with the Storm UI how the FetcherBolts are distributed across the nodes: is one worker running all the instances, or do they all get an equal number? Look at the logs to see what is happening, e.g. whether there are lots of timeout exceptions.

Comments:
- Could you share your es-conf.yaml?
- 1) That's strange, but on my cluster the CollapsingSpout works faster. 2) The FetcherBolts are well distributed: one fetcher per node, each node an independent dedicated server. Maybe I should reduce the number of threads?
- 1) Interesting. 2) Yes, worth a try.
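If reducing the per-node fetch threads is worth trying, it is a single setting in the crawler config. With 5 workers, 500 threads per FetcherBolt means up to 2500 concurrent fetches cluster-wide; the value below is purely illustrative and should be tuned against the Storm UI:

```yaml
config:
  # fewer threads per FetcherBolt instance eases pressure on the network
  # when scaling from 1 node to 5 (illustrative value, not a recommendation)
  fetcher.threads.number: 200
```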