Nutch regex urlfilter: crawling multiple websites

regex url web-crawler

I have seen this, but my problem is quite different. My seed.txt looks like:

http://a.b.c/ 
http://d.e.f/

My regex-urlfilter.txt looks like this:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://a.b.c/*

I want to crawl some URLs like the following:

http://a.b.c/index.php?id=1
http://a.b.c/about.php
http://a.b.c/help.html
http://a.b.c/test1/test2/
http://a.b.c/index.php?usv=contact
http://a.b.c/index.php?usv=vdetailpro&id=104&sid=74

and so on.

I tested this with the command bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined and found that the regex does not match.

Thank you.

Use these regular expressions in regex-urlfilter.txt:

Solution 1

+^http://([a-z0-9]*\.)*a.b.c/
+^http://([a-z0-9]*\.)*d.e.f/

Solution 2

+^http://([a-z0-9]*\.)*(a.b.c|d.e.f)/
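To see why the original accept rule lets only one site through, the host patterns can be tried outside Nutch. This is only a rough sketch under stated assumptions: Python's re.search stands in for the Java regex matching that Nutch does, the helper name accepts is made up for illustration, and the ESCAPED variant is my own addition (the patterns above leave the dots unescaped, so each "." matches any character).

```python
import re

# Accept patterns, without the leading "+" used in regex-urlfilter.txt.
ORIGINAL = r"^http://a.b.c/*"
SOLUTION_2 = r"^http://([a-z0-9]*\.)*(a.b.c|d.e.f)/"
# Assumption: a stricter variant with the dots escaped, not part of the answer.
ESCAPED = r"^http://([a-z0-9]*\.)*(a\.b\.c|d\.e\.f)/"


def accepts(pattern: str, url: str) -> bool:
    """Return True if the pattern is found in the URL (hypothetical helper)."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    urls = [
        "http://a.b.c/",
        "http://d.e.f/",
        "http://www.d.e.f/help.html",
        "http://axbxc/",  # not one of the seed hosts
    ]
    for url in urls:
        print(url, accepts(ORIGINAL, url), accepts(SOLUTION_2, url), accepts(ESCAPED, url))
```

The original pattern is anchored to a.b.c, so nothing under d.e.f can ever match it, while the alternation in Solution 2 covers both hosts and their subdomains.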

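Note that -[?*!@=] will match at least the first URL because of the question mark, so that URL is rejected. Is that what you expect?

Thank you, Jordan. That is very simple.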
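The point about -[?*!@=] comes down to rule order: the comments in the stock regex-urlfilter.txt say that the first matching pattern decides whether a URL is kept, and that a URL matching no pattern is ignored. Here is a minimal sketch of that behaviour, assuming Python's re.search as a stand-in for Nutch's Java regex matching. RULES and filter_url are hypothetical names, and the suffix rule is left out for brevity.

```python
import re

# (sign, pattern) pairs in file order: "+" accepts, "-" rejects.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"[?*!@=]"),
    ("+", r"^http://([a-z0-9]*\.)*(a.b.c|d.e.f)/"),
]


def filter_url(url, rules=RULES):
    """Apply the first rule whose pattern matches; reject if none matches."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in ["http://a.b.c/about.php", "http://a.b.c/index.php?id=1"]:
        print(("+" if filter_url(url) else "-") + url)
```

With these rules, the URLs containing "?" or "=" never reach the accept line. If they are meant to be crawled, one option is to comment out the -[?*!@=] rule or move it below the accept rules.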