Apache Nutch: No URLs to fetch - check your seed list and URL filters


I'm using Nutch 1.2. When I run the crawl command like this:

bin/nutch crawl urls -dir crawl -depth 2 -topN 1000

Injector: starting at 2011-07-11 12:18:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-11 12:18:44, elapsed: 00:00:07
Generator: starting at 2011-07-11 12:18:45
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
**No URLs to fetch - check your seed list and URL filters.**
crawl finished: crawl
The problem is that it keeps complaining: No URLs to fetch - check your seed list and URL filters.

I have the list of URLs to crawl in the nutch_root/url/nutch file, and I have also set up my crawl-urlfilter.txt.

Why is it complaining about my URL list and filters? It never did this before.

Here is my crawl-urlfilter.txt:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.


# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*152.111.1.87/
+^http://([a-z0-9]*\.)*152.111.1.88/

# skip everything else
-.
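To make the filter file above easier to reason about, here is a minimal sketch of how a regex URL filter of this kind evaluates its rules (assumption: rules are tried top to bottom, the first pattern found in the URL wins, `+` accepts, `-` rejects, and a URL matching no rule is rejected). Only a few representative rules from the file are reproduced:

```python
import re

# Assumed first-match-wins semantics; a subset of the rules above.
RULES = [
    ("-", re.compile(r"^(file|ftp|mailto):")),                  # skip non-http schemes
    ("-", re.compile(r"\.(gif|jpg|png|ico|css|zip|exe)$", re.I)),  # skip unparseable suffixes
    ("+", re.compile(r"^http://152\.111\.1\.87/")),             # accept the target host
    ("-", re.compile(r".")),                                    # skip everything else
]

def accepts(url):
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject

print(accepts("http://152.111.1.87/news.html"))  # True: accepted by the '+' rule
print(accepts("http://example.com/"))            # False: caught by the final '-.'
print(accepts("ftp://152.111.1.87/"))            # False: scheme rule fires first
```

The order matters: because the catch-all `-.` comes last, any seed URL that fails to match an earlier `+` rule is silently dropped, which is exactly the symptom in the log above.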

Your URL filter rules look odd; I don't think they match valid URLs. Something like this should work better, shouldn't it?

+^http://152\.111\.1\.87/
+^http://152\.111\.1\.88/
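The difference is the escaping of the dots. In a regex, an unescaped `.` matches any single character, so the original rule still matches the real seed URL, but it also matches hosts it was never meant to. A quick check (hypothetical URLs for illustration):

```python
import re

# The original rule: the bare dots in the IP match ANY character.
original = re.compile(r"^http://([a-z0-9]*\.)*152.111.1.87/")

# The suggested rule: escaped dots match only the literal IP.
fixed = re.compile(r"^http://152\.111\.1\.87/")

seed = "http://152.111.1.87/index.html"
print(bool(original.match(seed)))   # True: both rules accept the real seed
print(bool(fixed.match(seed)))      # True

bogus = "http://152x111y1z87/"      # hypothetical host; dots replaced by letters
print(bool(original.match(bogus)))  # True: unescaped dots match 'x', 'y', 'z'
print(bool(fixed.match(bogus)))     # False
```

So the unescaped rule is too loose rather than too strict; by itself it does not explain "No URLs to fetch", which is why the seed list is also worth checking.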


Maybe add your crawl-urlfilter.txt and seed.url files... it's possible the filter is actually filtering out your seeds.

Scanning the crawl-urlfilter.txt you posted: since the dot "." matches any single character, it should also match a literal ".". I got a working configuration with the url filter set to +^