How do I read a sitemap and its directories?


xml web-crawler


I am building a web crawler for this particular website.

I checked its robots.txt:

User-agent: *
Disallow: /site=
Disallow: /5480.iac.
Disallow: /go/
Disallow: /audio.html/
Disallow: /houseads/
Disallow: /askhome/
Disallow: /cite.html
Disallow: /23219321/iac.

Allow: /
Sitemap: http://www.dictionary.com/dictionary-sitemap/sitemap.xml

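I can download and read the file at the Sitemap link. So my question is: how do I read the sitemap and find the directories I am not allowed to crawl?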

Sorry if my question is too vague. I can't work out how this fits together, and I am new to this.

You are not allowed to crawl URLs whose paths start with /site=, /5480.iac., …, /cite.html, or /23219321/iac.

For example, because of the rule Disallow: /go/, you are not allowed to crawl URLs like these:

http://www.dictionary.com/go/
http://www.dictionary.com/go/foo
http://www.dictionary.com/go/foo/bar

You are allowed to crawl URLs like these:

http://www.dictionary.com/go
http://www.dictionary.com/go.html
http://www.dictionary.com/foo/go/
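
The prefix matching described above can be checked programmatically. The following is a minimal sketch using Python's standard `urllib.robotparser`, assuming the rules are exactly those quoted in the question; the user-agent name `MyCrawler` and the helper `build_parser` are made up for illustration. It parses the rules from a string rather than fetching them, so it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt quoted in the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /site=
Disallow: /5480.iac.
Disallow: /go/
Disallow: /audio.html/
Disallow: /houseads/
Disallow: /askhome/
Disallow: /cite.html
Disallow: /23219321/iac.
Allow: /
Sitemap: http://www.dictionary.com/dictionary-sitemap/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Hypothetical helper: build a parser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
BASE = "http://www.dictionary.com"

# "MyCrawler" is an assumed user-agent name; the "*" group applies to it.
for path in ["/go/", "/go/foo", "/go/foo/bar", "/go", "/go.html", "/foo/go/"]:
    allowed = parser.can_fetch("MyCrawler", BASE + path)
    print(f"{path:15} allowed={allowed}")
```

The first three paths should be reported as not allowed and the last three as allowed, matching the two lists above. In a real crawler you would typically call `set_url()` and `read()` on the parser to fetch the live robots.txt instead of embedding the text.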
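If the sitemap contains URLs that robots.txt disallows, you are still not allowed to crawl them.

Including URLs in a sitemap that should not be crawled may seem counterintuitive, but it can make sense (for example, because the sitemap is used by agents other than crawlers, or because only a few specific bots are disallowed from crawling).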
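
Since the sitemap itself does not say what is forbidden, one way to combine the two files is to read every `<loc>` entry from the sitemap and keep only those that robots.txt permits. The sketch below assumes the sitemap follows the standard sitemaps.org XML schema; the sample sitemap content, the `/browse/example` path, the `MyCrawler` user-agent name, and the helper `sitemap_locs` are all invented for illustration and are not taken from the real site.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by the sitemaps.org schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.dictionary.com/browse/example</loc></url>
  <url><loc>http://www.dictionary.com/go/foo</loc></url>
  <url><loc>http://www.dictionary.com/cite.html</loc></url>
</urlset>
"""

# A subset of the rules quoted in the question.
ROBOTS_RULES = """\
User-agent: *
Disallow: /go/
Disallow: /cite.html
Allow: /
"""


def sitemap_locs(xml_text: str) -> list[str]:
    """Return every <loc> value, for both <urlset> and <sitemapindex> files."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


robots = RobotFileParser()
robots.parse(ROBOTS_RULES.splitlines())

locs = sitemap_locs(SAMPLE_SITEMAP)
# Keep only the sitemap URLs that robots.txt lets this crawler fetch.
crawlable = [url for url in locs if robots.can_fetch("MyCrawler", url)]
print(crawlable)
```

With the sample data, only the `/browse/example` URL survives the filter. If the file at the Sitemap link is a sitemap index rather than a URL set, the same function returns the URLs of the child sitemaps, which would then need to be downloaded and parsed in turn.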
The purpose of a sitemap is to help search engines index a website. It should not contain any URLs that the robots.txt file disallows.

@DanNagle So can I crawl the website with my own web crawler?