Python urllib.robotparser.RobotFileParser() gives different results on each run - HTTP status?


urllib.robotparser.RobotFileParser() gives me a different result every time I run it.

The site's robots.txt says this, disallowing /search.htm*:

# robots.txt for https://www.alza.cz/

User-Agent: *
Disallow: /Order1.htm
Disallow: /Order2.htm
Disallow: /Order3.htm
Disallow: /Order4.htm
Disallow: /Order5.htm
Disallow: /download/
Disallow: /muj-ucet/
Disallow: /Secure/
Disallow: /LostPassword.htm
Disallow: /search.htm*

Sitemap: https://www.alza.cz/_sitemap-categories.xml
Sitemap: https://www.alza.cz/_sitemap-categories-producers.xml
Sitemap: https://www.alza.cz/_sitemap-live-product.xml
Sitemap: https://www.alza.cz/_sitemap-dead-product.xml
Sitemap: https://www.alza.cz/_sitemap-before_listing.xml
Sitemap: https://www.alza.cz/_sitemap-seo-sorted-categories.xml
Sitemap: https://www.alza.cz/_sitemap-bazaar-categories.xml
Sitemap: https://www.alza.cz/_sitemap-sale-categories.xml
Sitemap: https://www.alza.cz/_sitemap-parametrically-generated-pages.xml
Sitemap: https://www.alza.cz/_sitemap-parametrically-generated-pages-producer.xml
Sitemap: https://www.alza.cz/_sitemap-articles.xml
Sitemap: https://www.alza.cz/_sitemap-producers.xml
Sitemap: https://www.alza.cz/_sitemap-econtent.xml
Sitemap: https://www.alza.cz/_sitemap-dead-econtent.xml
Sitemap: https://www.alza.cz/_sitemap-branch-categories.xml
Sitemap: https://www.alza.cz/_sitemap-installments.xml
Sitemap: https://www.alza.cz/_sitemap-detail-page-slots-of-accessories.xml
Sitemap: https://www.alza.cz/_sitemap-reviews.xml
Sitemap: https://www.alza.cz/_sitemap-detail-page-bazaar.xml
Sitemap: https://www.alza.cz/_sitemap-productgroups.xml
Sitemap: https://www.alza.cz/_sitemap-accessories.xml
However, when I first ran the following I got False (which is correct), but now every time I run it I get True (which is incorrect):
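A minimal sketch of the kind of check being described, assuming the tested URL is something under /search.htm (the exact URL checked here is an assumption; only the robots.txt URL comes from the question):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.alza.cz/robots.txt")
rp.read()

# Hypothetical search URL under the disallowed /search.htm* pattern; the
# question reports a call like this returning False on the first run and
# True on every run since.
print(rp.can_fetch("*", "https://www.alza.cz/search.htm?exps=notebook"))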

I found this code in the library's source; it suggests the server responded with an HTTP status code between 400 and 499, which is really strange, and unfortunately I cannot check that myself:

def read(self):
    """Reads the robots.txt URL and feeds it to the parser."""
    try:
        f = urllib.request.urlopen(self.url)
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            self.disallow_all = True
        elif err.code >= 400 and err.code < 500:
            self.allow_all = True
    else:
        raw = f.read()
        self.parse(raw.decode("utf-8").splitlines())

    # Until the robots.txt file has been read or found not
    # to exist, we must assume that no url is allowable.
    # This prevents false positives when a user erroneously
    # calls can_fetch() before calling read().
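Following the branches above, a 401 or 403 makes every subsequent can_fetch() call return False (disallow_all), while any other 4xx makes every call return True (allow_all), so a server that answers differently from one run to the next would produce exactly this flip-flop. One way to see what urllib actually receives is to fetch robots.txt directly and print the status; the idea that the response might depend on the User-Agent header is an assumption on my part, not something confirmed by the question:

import urllib.error
import urllib.request

def probe(url, user_agent=None):
    # Report the HTTP status urllib sees for this URL with the given agent.
    req = urllib.request.Request(url)
    if user_agent:
        req.add_header("User-Agent", user_agent)
    try:
        with urllib.request.urlopen(req) as resp:
            print(user_agent or "default urllib agent", "->", resp.status)
    except urllib.error.HTTPError as err:
        print(user_agent or "default urllib agent", "-> HTTPError", err.code)

probe("https://www.alza.cz/robots.txt")                 # default Python-urllib/3.x agent
probe("https://www.alza.cz/robots.txt", "Mozilla/5.0")  # browser-like agent, for comparison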
Any ideas about what might be happening?

EDIT: I updated the source code and there is no bad status, it gives 200. I do not understand why it gives this URL a pass:

def read(self):
    """Reads the robots.txt URL and feeds it to the parser."""
    try:
        f = urllib.request.urlopen(self.url)
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            self.disallow_all = True
        elif err.code >= 400 and err.code < 500:
            self.allow_all = True
    else:
        raw = f.read()
        self.parse(raw.decode("utf-8").splitlines())

    # Until the robots.txt file has been read or found not
    # to exist, we must assume that no url is allowable.
    # This prevents false positives when a user erroneously
    # calls can_fetch() before calling read().
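Since the edit reports a 200, it may also help to check which branch read() actually took and what got parsed; RobotFileParser exposes that state through the disallow_all and allow_all attributes set above and through the parsed entries. A small diagnostic sketch, using only attributes that appear in the quoted source or in urllib.robotparser:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("https://www.alza.cz/robots.txt")
rp.read()

# Which branch of read() ran: 401/403 sets disallow_all, any other 4xx sets
# allow_all; on a 200 both stay False and the rules are parsed instead.
print("disallow_all:", rp.disallow_all)
print("allow_all:", rp.allow_all)
print("rules for 'User-Agent: *':")
print(rp.default_entry)

# Note: robotparser matches Disallow paths as plain prefixes, so the trailing
# '*' in "Disallow: /search.htm*" is taken literally rather than as a wildcard.
print(rp.can_fetch("*", "https://www.alza.cz/search.htm"))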