Web crawler: How to crawl documents (.pdf, .docx, etc.) with StormCrawler
I am using StormCrawler 1.10 and trying to crawl documents with it. Based on some research I added the Tika parser, but the crawler does not fetch the .pdf URLs. Also, once the Tika parser is applied, the newlines (\n) from the HTML pages end up in the indexed content, which looks odd when I check it in Kibana. The regex URL filters do not restrict documents. I am sharing my configuration below. Can anyone help me crawl only the documents in this case?
**es-crawler.flux:**
name: "crawler"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "es-conf.yaml"
    override: true
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10
bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 1
  - id: "redirection_bolt"
    className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
    parallelism: 1
  - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1
streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE
  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]
  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "parse"
    to: "redirection_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "redirection_bolt"
    to: "parser_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "redirection_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "parser_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "redirection_bolt"
    to: "parser_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE
      streamId: "tika"
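Did you set jsoup.treat.non.html.as.error: false in your configuration? See the Tika module.

I tried your Flux topology and I can see PDF documents getting indexed, so I am not sure where your problem is. Maybe try running a MemorySpout on a PDF URL, e.g.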
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    parallelism: 1
    constructorArgs:
      - ["http://blog.marc-seeger.de/assets/papers/thesis_seeger-building_blocks_of_a_scalable_webcrawler.pdf"]
and check in the logs and in the ES index that the document is fetched correctly.
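The jsoup.treat.non.html.as.error setting asked about above belongs in the crawler configuration rather than in the Flux file. A minimal sketch of the relevant part of crawler-conf.yaml, assuming the usual top-level `config:` key; the `parser.mimetype.whitelist` entries are an assumption recalled from the Tika module's documentation and should be verified against your StormCrawler version:

```yaml
config:
  # Let JSoupParserBolt pass non-HTML content along instead of flagging it
  # as an error, so that the Tika ParserBolt downstream can handle it.
  jsoup.treat.non.html.as.error: false

  # Assumption: limit Tika to the mimetypes of interest (regular expressions).
  # Check the key name and the patterns against the Tika module of your version.
  parser.mimetype.whitelist:
    - application/.*pdf.*
    - application/.+word.*
```

With this in place, a single PDF URL fed through MemorySpout is a quick way to confirm in the logs that the document reaches the Tika parser.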
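Alternatively, you could parse with Tika only instead of the JSoup parser, so that it handles all documents regardless of their mimetype. It does not work as well as JSoup for XPath extraction, which is why the latter is the preferred option for HTML content.

Thanks for the reply. I added jsoup.treat.non.html.as.error: false in crawler-conf.

Hi Julien, I had missed the Tika parser configuration in CrawlTopology and have now added those changes. I can index PDF documents now.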
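The Tika-only alternative suggested above could be sketched as the following Flux fragment. It reuses the bolt ids from the question's es-crawler.flux and only swaps the class behind "parse"; it is an untested outline under those assumptions, not a complete topology (the spout, fetcher, sitemap, index and status definitions stay as they are, and the redirection_bolt and parser_bolt entries and their streams would be dropped):

```yaml
bolts:
  # Tika handles every fetched document, whatever its mimetype.
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1

streams:
  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
  # Status updates from the parser still go to the status bolt.
  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
```

The trade-off is the one noted in the answer: XPath-based extraction on HTML pages works less well with Tika than with JSoup.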
**es-injector.flux:**
name: "injector"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "es-conf.yaml"
    override: true
components:
  - id: "scheme"
    className: "com.digitalpebble.stormcrawler.util.StringTabScheme"
    constructorArgs:
      - DISCOVERED
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - ref: "scheme"
bolts:
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1
streams:
  - from: "spout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"