Python urllib.robotparser.RobotFileParser() gives a different result on each run - HTTP status?

Tags: python, robots.txt

urllib.robotparser.RobotFileParser() gives me a different result every time I run it, even though the site's robots.txt says "Disallow: /search.htm*":
# robots.txt for https://www.alza.cz/
User-Agent: *
Disallow: /Order1.htm
Disallow: /Order2.htm
Disallow: /Order3.htm
Disallow: /Order4.htm
Disallow: /Order5.htm
Disallow: /download/
Disallow: /muj-ucet/
Disallow: /Secure/
Disallow: /LostPassword.htm
Disallow: /search.htm*
Sitemap: https://www.alza.cz/_sitemap-categories.xml
Sitemap: https://www.alza.cz/_sitemap-categories-producers.xml
Sitemap: https://www.alza.cz/_sitemap-live-product.xml
Sitemap: https://www.alza.cz/_sitemap-dead-product.xml
Sitemap: https://www.alza.cz/_sitemap-before_listing.xml
Sitemap: https://www.alza.cz/_sitemap-seo-sorted-categories.xml
Sitemap: https://www.alza.cz/_sitemap-bazaar-categories.xml
Sitemap: https://www.alza.cz/_sitemap-sale-categories.xml
Sitemap: https://www.alza.cz/_sitemap-parametrically-generated-pages.xml
Sitemap: https://www.alza.cz/_sitemap-parametrically-generated-pages-producer.xml
Sitemap: https://www.alza.cz/_sitemap-articles.xml
Sitemap: https://www.alza.cz/_sitemap-producers.xml
Sitemap: https://www.alza.cz/_sitemap-econtent.xml
Sitemap: https://www.alza.cz/_sitemap-dead-econtent.xml
Sitemap: https://www.alza.cz/_sitemap-branch-categories.xml
Sitemap: https://www.alza.cz/_sitemap-installments.xml
Sitemap: https://www.alza.cz/_sitemap-detail-page-slots-of-accessories.xml
Sitemap: https://www.alza.cz/_sitemap-reviews.xml
Sitemap: https://www.alza.cz/_sitemap-detail-page-bazaar.xml
Sitemap: https://www.alza.cz/_sitemap-productgroups.xml
Sitemap: https://www.alza.cz/_sitemap-accessories.xml
However, the first time I ran the following check I got FALSE (which is correct), but now I get TRUE (which is incorrect) on every run:
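The original test snippet was not included in the question; a minimal local reproduction (hypothetical, feeding the robots.txt above straight to the parser so no network is involved) behaves like this:

```python
from urllib.robotparser import RobotFileParser

# Feed a subset of the robots.txt shown above directly to the parser,
# bypassing read() and the network entirely.
robots_txt = """\
User-Agent: *
Disallow: /Order1.htm
Disallow: /search.htm*
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Plain prefix rules match as expected:
print(rp.can_fetch("*", "https://www.alza.cz/Order1.htm"))  # False

# Note: the stdlib parser does not support wildcards, so the '*' in
# "Disallow: /search.htm*" is treated as a literal character and this
# check is allowed:
print(rp.can_fetch("*", "https://www.alza.cz/search.htm"))  # True
```

Because parse() loads the rules without touching the network, a stable result here suggests the flip-flopping in the question comes from read() and the HTTP response, not from the matching logic.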
I found this code in the urllib source; it suggests the server is responding with an HTTP status code between 400 and 499, which would be very strange, and unfortunately I cannot check it myself:
    def read(self):
        """Reads the robots.txt URL and feeds it to the parser."""
        try:
            f = urllib.request.urlopen(self.url)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400 and err.code < 500:
                self.allow_all = True
        else:
            raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())

    # Until the robots.txt file has been read or found not
    # to exist, we must assume that no url is allowable.
    # This prevents false positives when a user erroneously
    # calls can_fetch() before calling read().
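That branch would explain the symptom: if the server ever answers with a 4xx status other than 401/403 (for example a 429 from rate limiting - an assumption, not confirmed by the post), read() sets allow_all, no rules are parsed, and every subsequent can_fetch() call returns True. The effect can be simulated without any network access:

```python
from urllib.robotparser import RobotFileParser

# Normal path: rules are parsed, the Disallow line is honoured.
rp = RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /Order1.htm"])
print(rp.can_fetch("*", "https://www.alza.cz/Order1.htm"))  # False

# Simulated 4xx path: read() would set allow_all and never feed any
# rules to the parser, so the very same check now passes.
rp_err = RobotFileParser()
rp_err.allow_all = True
print(rp_err.can_fetch("*", "https://www.alza.cz/Order1.htm"))  # True
```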
Any ideas what might be going on?
EDIT: I updated the source code and there is no bad status; it gives 200. I don't understand why this URL is being given a pass.
    def read(self):
        """Reads the robots.txt URL and feeds it to the parser."""
        try:
            f = urllib.request.urlopen(self.url)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400 and err.code < 500:
                self.allow_all = True
        else:
            raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())

    # Until the robots.txt file has been read or found not
    # to exist, we must assume that no url is allowable.
    # This prevents false positives when a user erroneously
    # calls can_fetch() before calling read().
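To trace which branch fires on a given run, the status logic can be mirrored in a small helper (a sketch; `classify_status` is not part of the stdlib) and fed with the code observed from urlopen or from the HTTPError:

```python
def classify_status(code):
    """Mirror the branch in RobotFileParser.read(): 401/403 lock
    everything out, any other 4xx waves everything through, and any
    other status means the body is actually fetched and parsed."""
    if code in (401, 403):
        return "disallow_all"
    elif 400 <= code < 500:
        return "allow_all"
    return "parse body"

# E.g. an intermittent 429 (rate limiting) would flip the parser into
# allow_all mode, making can_fetch() return True for everything:
print(classify_status(200))  # "parse body"
print(classify_status(429))  # "allow_all"
print(classify_status(403))  # "disallow_all"
```

Logging err.code inside the except branch of a copy of read() would show whether an intermittent 4xx (rather than the 200 seen in the edit) is what switches the parser into allow_all mode between runs.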