Crawling all links of the same domain in Nutch (Solr, Nutch)

Can anyone tell me how to crawl all the other pages of the same domain?

For example, I provide a single website in seed.txt.

I added the following to nutch-site.xml:

<property>
<name>db.ignore.internal.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from
the same host are ignored.  This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>

<property>
 <name>db.ignore.external.links</name>
 <value>true</value>
 <description>If true, outlinks leading from a page to external hosts
 will be ignored. This will limit your crawl to the host on your seeds file.
 </description>
</property>



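The following was added to regex-urlfilter.txt:

# accept anything else
+.

Note: with other URLs in seed.txt I can crawl all of their other pages, but not the pages of techcrunch.com, even though it also has many other pages.

Please help.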

I think you are using the wrong property. First use db.ignore.external.links in nutch-site.xml:

<property>
<name>db.ignore.internal.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from
the same host are ignored.  This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>

<property>
 <name>db.ignore.external.links</name>
 <value>true</value>
 <description>If true, outlinks leading from a page to external hosts
 will be ignored. This will limit your crawl to the host on your seeds file.
 </description>
</property>

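However, I think your real problem is that Nutch obeys the robots.txt file, and in this case techcrunch has a Crawl-Delay value of 3600! The default value of fetcher.max.crawl.delay is 30 seconds, which makes Nutch drop all pages from techcrunch.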

+^(http|https)://.*techcrunch.com/

From the description of fetcher.max.crawl.delay in nutch-default.xml:

"If the Crawl-Delay in robots.txt is set to greater than this value (in
seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the
amount of time retrieved from robots.txt Crawl-Delay, however long that
might be."
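
As a sketch of how that description could be applied, the override below would go in nutch-site.xml. The value -1 comes from the quoted description; whether to use it is a judgment call, because with a Crawl-Delay of 3600 the fetcher would then wait an hour between requests to that host (roughly 24 pages per day from that host).

```xml
<!-- Sketch for nutch-site.xml: do not skip pages whose robots.txt
     Crawl-Delay exceeds the default limit of 30 seconds; wait instead. -->
<property>
 <name>fetcher.max.crawl.delay</name>
 <value>-1</value>
</property>
```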

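You may want to play with the fetcher.threads.fetch and fetcher.threads.per.queue values to speed up your crawl. You could also look at and tweak the Nutch code, or even use a different approach to crawl sites with long crawl delays.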
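
For illustration only, these two properties could be overridden in nutch-site.xml as shown here. The values 20 and 2 are arbitrary placeholders, not recommendations. Raising fetcher.threads.per.queue above 1 allows more than one simultaneous request to the same host, which is less polite to the server.

```xml
<!-- Sketch for nutch-site.xml; the values are placeholders to tune. -->
<property>
 <name>fetcher.threads.fetch</name>
 <value>20</value>
</property>
<property>
 <name>fetcher.threads.per.queue</name>
 <value>2</value>
</property>
```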

Hope this is useful to you.

Cheers

In nutch-default.xml, set db.ignore.external.links to true and db.ignore.external.links.mode to byDomain, like this:

<property>
 <name>db.ignore.external.links</name>
 <value>true</value>
</property>
<property>
 <name>db.ignore.external.links.mode</name>
 <value>byDomain</value>
</property>



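By default, db.ignore.external.links.mode is set to byHost. That means that while crawling http://www.techcrunch.com/, the URL http://subdomain1.techcrunch.com is treated as external and is therefore ignored. But you want the subdomain1 pages to be crawled as well, so set db.ignore.external.links.mode to byDomain.

Nothing needs to be changed in regex-urlfilter.txt. Use regex-urlfilter.txt only to handle more complex cases.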

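Hi, can you share the logs from the crawl with us? They should tell you why the other pages are not being crawled. I suspect the site has too many links, and the crawl has a setting that limits the number of outlinks processed per page; if that value is low, that may be why this site is not being crawled.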
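
If the setting meant here is db.max.outlinks.per.page (an assumption; the comment does not name the property), a sketch of lifting the limit in nutch-site.xml could look like this. A negative value means that all outlinks of a page are processed.

```xml
<!-- Sketch for nutch-site.xml: process every outlink found on a page
     instead of only the first N. -->
<property>
 <name>db.max.outlinks.per.page</name>
 <value>-1</value>
</property>
```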