How to prevent crawling external links with Apache Nutch?


solr web-crawler


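I only want to crawl a specific domain with Nutch. To do that, I set db.ignore.external.links to true, as described in this post.

The problem is that Nutch then crawls only the links in the seed list. For example, if I put "nutch.apache.org" in seed.txt, it finds only that same URL (nutch.apache.org).

I got this result by running the crawl script with a depth of 200. It completes one cycle and then produces the output below.

How can I solve this problem?

I am using Apache Nutch 1.11.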

Generator: starting at 2016-04-05 22:36:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

Best regards

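You want to fetch pages only from a specific domain.

You have already tried db.ignore.external.links, but it restricted the crawl to nothing beyond the URLs in seed.txt.

You should try restricting the crawl in conf/regex-urlfilter.txt instead.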
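To make the regex-urlfilter.txt suggestion concrete, here is a minimal Python sketch of how such a rule file might behave. It is not Nutch code. The rules in RULES_TEXT are hypothetical (the domain is taken from the question), and the evaluation model (rules checked in order, first match decides, unmatched URLs rejected) reflects my understanding of the urlfilter-regex plugin, so verify it against your own installation. The rule lines themselves could be pasted into conf/regex-urlfilter.txt in place of the default final "+." rule.

```python
import re

# Hypothetical rule set in regex-urlfilter.txt syntax:
# a leading '+' accepts, a leading '-' rejects, '#' starts a comment.
RULES_TEXT = r"""
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# accept anything on nutch.apache.org (example domain from the question)
+^https?://([a-z0-9-]+\.)*nutch\.apache\.org/
# reject everything else
-.
"""

def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected

rules = parse_rules(RULES_TEXT)
for url in ("http://nutch.apache.org/downloads.html",
            "https://www.apache.org/",
            "mailto:dev@nutch.apache.org"):
    print(url, "->", "accept" if accepts(rules, url) else "reject")
```

With a filter like this, db.ignore.external.links can stay at its default, because the URL filter alone keeps the crawl on the chosen domain.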


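Are you using the "crawl" script? If so, make sure the depth (number of rounds) you pass is greater than 1. If you run something like "bin/crawl seedfoldername crawldb 1", it will crawl only the URLs listed in seed.txt.

To crawl a specific domain, you can use the regex-urlfilter.txt file.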
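The point about the number of rounds can be illustrated with a toy model of the generate / fetch / update cycle. This is a sketch under stated assumptions, not Nutch code: the LINKS graph and its page paths are hypothetical, and the loop only mimics the idea that outlinks discovered in one round become fetchable in the next.

```python
# Hypothetical link graph: page -> outlinks discovered when that page is parsed.
LINKS = {
    "http://nutch.apache.org/": [
        "http://nutch.apache.org/a.html",
        "http://nutch.apache.org/b.html",
    ],
    "http://nutch.apache.org/a.html": ["http://nutch.apache.org/c.html"],
}

def crawl(seeds, rounds):
    """Return the URLs fetched, in order, after at most `rounds` rounds."""
    crawldb = set(seeds)  # every URL known so far (inject step)
    fetched = []
    for _ in range(rounds):
        # "generate": select known URLs that have not been fetched yet
        due = sorted(u for u in crawldb if u not in fetched)
        if not due:
            break  # comparable to "0 records selected for fetching"
        for url in due:
            fetched.append(url)                 # "fetch"
            crawldb.update(LINKS.get(url, []))  # "parse" + "updatedb"
    return fetched

seeds = ["http://nutch.apache.org/"]
print(len(crawl(seeds, 1)))    # one round: only the seed is fetched
print(len(crawl(seeds, 2)))    # two rounds: the seed plus its outlinks
print(len(crawl(seeds, 200)))  # stops early once nothing new is due
```

In this model a single round can never go beyond the seed list. A run of 200 rounds that stops after the first one, as in the question, suggests that no outlinks from the seed page made it into the crawldb.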


Add the following property to nutch-site.xml:

<property> 
<name>db.ignore.external.links</name> 
<value>true</value> 
<description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters. </description> 
</property>
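The property description speaks of "external hosts", so the check is presumably a comparison between the host of the page and the host of each outlink. The Python sketch below illustrates that idea only; the function names are hypothetical and the exact-match comparison is my assumption about how Nutch 1.11 behaves, not a copy of its implementation.

```python
from urllib.parse import urlsplit

def is_external(from_url, to_url):
    """Assumed rule: a link is external when the two host names differ."""
    from_host = (urlsplit(from_url).hostname or "").lower()
    to_host = (urlsplit(to_url).hostname or "").lower()
    return from_host != to_host

def keep_internal_outlinks(from_url, outlinks):
    """Drop every outlink that points to a different host than the page itself."""
    return [u for u in outlinks if not is_external(from_url, u)]

page = "http://nutch.apache.org/"
outlinks = [
    "http://nutch.apache.org/downloads.html",
    "http://www.apache.org/",           # same domain, different host
    "https://github.com/apache/nutch",  # different domain
]
print(keep_internal_outlinks(page, outlinks))
```

Under this assumption, links to sibling hosts such as www.apache.org are dropped as well, which makes the setting stricter than a domain-wide URL filter.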



Yes, I am using the crawl script with a depth of 200. When I edited the regex URL filter, the result was the same as before.

If you are running the crawl script, I suggest deleting your crawldb folder and running it again. Also make sure that your seed URL page contains other links the crawler can follow.