What's the best way to write robots.txt for GitHub Pages with multiple repos? (Jekyll, Sitemap, GitHub Pages, robots.txt)


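I'm using GitHub Pages with Jekyll to build my personal website. I have a main site in the username.github.io repo, a site in the projectA repo, a site in the projectB repo, and so on. I put a CNAME file in the username.github.io repo so that all of my sites are served under a custom domain (www.mydomain.com). I noticed that the robots.txt file points to the sitemap.txt file under each repo, and that each sitemap.txt can only contain links to the pages in its own repo. So I have a few questions: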

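Since my site is structured as www.mydomain.com, www.mydomain.com/projectA, www.mydomain.com/projectB, and so on, each corresponding to the pages in an individual repo, will search engines recognize all of my site's pages even though the sitemap.txt in the head username.github.io repo only lists links to the pages in that single repo?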

What is the best way to write the robots.txt file in my case?

Thanks! Qi

The short answer: in the top-level directory of your web server. Source:

You can also learn from it that a URL such as www.mydomain.com/folder/robots.txt will not be crawled.
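To make the point concrete, here is a minimal Python sketch of how a crawler locates robots.txt for any page. The function name robots_url_for is made up for this illustration. It only assumes the behaviour described above: the path of the page is ignored, and only the scheme and host matter.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that applies to page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; the page's own path, query and fragment are dropped,
    # so a robots.txt placed in a subfolder is never consulted.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("https://www.mydomain.com/projectA/docs/index.html"))
```

So pages served from the projectA and projectB repos are all governed by the single robots.txt published from the username.github.io repo.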

A basic www.mydomain.com/robots.txt can be:

User-agent: *

This instructs crawlers to go through the whole www.mydomain.com file hierarchy by following links.

If no page on www.mydomain.com links to your project pages, you can add:

User-agent: *
Allow: /projectA
Allow: /projectB

Standards and disclaimer

Keep in mind that Sitemap: is a non-standard extension in robots.txt. Wikipedia also lists Allow: as a non-standard extension.

Multiple sitemaps in robots.txt

When you specify more than one sitemap in robots.txt, use this format:

Sitemap: http://www.example.com/sitemap-host1.xml

Sitemap: http://www.example.com/sitemap-host2.xml
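If the list of project repos changes often, the Sitemap lines could be generated rather than typed by hand. The following Python sketch is only an illustration: the helper name sitemap_lines, the project names, and the assumption that every repo publishes its sitemap as sitemap.xml at its own root are not part of the original answer.

```python
from typing import Iterable, List


def sitemap_lines(base_url: str, projects: Iterable[str],
                  filename: str = "sitemap.xml") -> List[str]:
    """Build one 'Sitemap:' line for the main site and one per project repo."""
    base = base_url.rstrip("/")
    # The main site's sitemap lives at the domain root.
    urls = [f"{base}/{filename}"]
    # Each project repo is served under /<repo name>/ on the same domain.
    urls += [f"{base}/{name.strip('/')}/{filename}" for name in projects]
    return [f"Sitemap: {url}" for url in urls]


print("\n".join(sitemap_lines("http://www.example.com", ["projectA", "projectB"])))
```

Pass filename="sitemap.txt" if the repos publish plain-text sitemaps, as in the question.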

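Sitemap index

There is also another type of sitemap file, the sitemap index file.

If you have a sitemap index file, you can include the location of just that file. You do not need to list each individual sitemap that the index file already lists.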

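Alternatively, you can list every sitemap on its own Sitemap: line, in the format shown above.

Recommendation

Or, if you are using a sitemap index file: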

User-agent: *
Disallow: /project_to_disallow/
Disallow: /projectname/page_to_disallow.html

Sitemap: http://www.example.com/siteindex.xml

where http://www.example.com/siteindex.xml looks like:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/projectA/sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/projectB/sitemap.xml</loc>
   </sitemap>
</sitemapindex>


http://www.example.com/sitemap.xml
http://www.example.com/projectA/sitemap.xml
http://www.example.com/projectB/sitemap.xml
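A sitemap index like the one above could also be produced by a small script instead of being edited by hand. This is a sketch using only Python's standard library. The function name build_sitemap_index is hypothetical, and ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index document (as a str) listing the given sitemap URLs."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        # One <sitemap><loc>...</loc></sitemap> entry per child sitemap.
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = url
    ET.indent(root, space="   ")  # pretty-print; Python 3.9+
    data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
    return data.decode("utf-8")


print(build_sitemap_index([
    "http://www.example.com/sitemap.xml",
    "http://www.example.com/projectA/sitemap.xml",
    "http://www.example.com/projectB/sitemap.xml",
]))
```

The output would be saved as siteindex.xml in the username.github.io repo, so that it is served from the domain root next to robots.txt.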

For how to set up robots.txt with GitHub Pages, see my answer.

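I have special rules for each project repo, and I put the allowed links in a sitemap.txt file under each project repo. Can I use something like Sitemap: https://www.example.com/sitemap.txt; Sitemap: https://www.example.com/ProjectA/sitemap.txt; Sitemap: https://www.example.com/ProjectB/sitemap.txt (on three separate lines, of course)? That way I would not need to update the top-level repo whenever the robots rules under a project repo change. Thanks for your reply.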

User-agent: *
Disallow: /project_to_disallow/
Disallow: /projectname/page_to_disallow.html

Sitemap: http://www.example.com/sitemap.xml

Sitemap: http://www.example.com/projectA/sitemap.xml

Sitemap: http://www.example.com/projectB/sitemap.xml
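Before publishing, rules like the ones above can be checked locally with Python's standard urllib.robotparser module. This is a sketch, not part of the original answer. The helper name allowed and the tested paths are made up for the example, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /project_to_disallow/
Disallow: /projectname/page_to_disallow.html

Sitemap: http://www.example.com/sitemap.xml
Sitemap: http://www.example.com/projectA/sitemap.xml
Sitemap: http://www.example.com/projectB/sitemap.xml
"""

parser = RobotFileParser()
# parse() takes the file content as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Return True if a generic crawler may fetch the given path (hypothetical helper)."""
    return parser.can_fetch("*", "http://www.example.com" + path)


print(allowed("/projectA/index.html"))
print(allowed("/project_to_disallow/index.html"))
print(parser.site_maps())
```

Python's parser matches rules by simple path prefix, so this is a quick sanity check rather than a guarantee of how every search engine interprets the file.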
