Read the contents of robots.txt in Python and print it


python html python-2.7


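I want to check whether a given website has a robots.txt, read the entire contents of that file, and print it. Perhaps also store the contents in a dictionary.

I tried experimenting, but could not figure out how to do it.

I only want to use modules that ship with the standard Python 2.7 distribution.

Following @Stefano Sanfilippo's suggestion, I ran: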

from urllib.request import urlopen

which returned:

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    from urllib.request import urlopen
ImportError: No module named request

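But when I used urlopen in a with statement, I got:

Traceback (most recent call last):
  File "", line 1, in
    with urlopen("") as stream:
AttributeError: addinfourl instance has no attribute '__exit__'

Judging by this, using urlopen in a with statement is not supported in 2.7. Indeed, the code runs fine on Python 3.

Any idea how to solve this?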

Yes, robots.txt is just a file: download it and print it.

Python 3:

from urllib.request import urlopen

with urlopen("https://www.google.com/robots.txt") as stream:
    print(stream.read().decode("utf-8"))

Python 2:

from urllib import urlopen
from contextlib import closing

with closing(urlopen("https://www.google.com/robots.txt")) as stream:
    print stream.read()
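
If the same script has to run on both interpreters, one option is to fall back between the two import locations. The following is only a sketch under a few assumptions: `fetch_robots` is a hypothetical helper name, the file is assumed to be UTF-8 text, and an HTTP error status (404, for example) is treated as "this site has no robots.txt".

```python
from contextlib import closing

try:  # Python 3
    from urllib.request import urlopen
    from urllib.error import HTTPError
except ImportError:  # Python 2.7
    from urllib2 import urlopen, HTTPError


def fetch_robots(url):
    """Return the body at url as text, or None on an HTTP error status.

    Hypothetical helper for illustration. Network failures (URLError)
    are not caught and propagate to the caller.
    """
    try:
        # closing() gives the response a context manager on Python 2.7,
        # where the object returned by urlopen has no __exit__ method.
        with closing(urlopen(url)) as stream:
            return stream.read().decode("utf-8", "replace")
    except HTTPError:
        return None
```

Calling `fetch_robots("https://www.google.com/robots.txt")` and printing the result should then behave the same way on either version.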

Note that the path is always /robots.txt

If you need to put the contents into a dictionary, .split(":") and .strip() are your friends.
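
As a rough illustration of that idea, the sketch below groups directives by field name. `parse_robots` and the sample text are made up for this example, and the result is deliberately simple: it ignores which User-agent group a rule belongs to, so treat it as a starting point rather than a full parser.

```python
def parse_robots(text):
    """Map each lower-cased field name to the list of its values."""
    rules = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # blank or malformed line
        # Split on the first colon only: values such as URLs contain colons.
        field, value = line.split(":", 1)
        rules.setdefault(field.strip().lower(), []).append(value.strip())
    return rules


# Hypothetical sample content, not taken from a real site.
sample = """\
User-agent: *
Disallow: /search
Allow: /search/about
Sitemap: https://www.example.com/sitemap.xml
"""
print(parse_robots(sample))
```

Splitting with a limit of 1 matters here: a plain .split(":") would cut the Sitemap URL in two at the colon after "https".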

You don't need to know the structure of the website to know where robots.txt must be. It is always at which.site.name/robots.txt
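
Because the location is fixed, the robots.txt address can be derived from any page URL on the same site. A minimal sketch, assuming a helper named `robots_url` (hypothetical) and with an import fallback so it runs on Python 2.7 as well:

```python
try:  # Python 3
    from urllib.parse import urlsplit, urlunsplit
except ImportError:  # Python 2.7
    from urlparse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.google.com/search?q=python"))
# prints https://www.google.com/robots.txt
```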

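@jonsharpe, I have rewritten the question. Is it narrow enough now? The problem is solved, but I would like to know whether the "on hold" status can be removed.

Thanks. Your code works on Python 3 but not on Python 2.7. Can you suggest how to make it work on Python 2.7?

See the edit. However, you should really use Python 3 unless you have a specific reason to stick with Python 2. Python 2 is legacy; I cannot say that enough.

Thanks @Stefano Sanfilippo, I will look into the 2to3 tool to convert my code. I don't know why I thought that using 2.7 was still a good idea.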
