Python: checking URLs for 404 errors


python web-scraping scrapy


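I'm crawling a set of pages and I don't know how many there are, but the current page is indicated by a simple number in the URL (e.g. "").

I'd like to use a loop in Scrapy to increment my current guess for the page number and stop when it reaches a 404. I know the response returned for a request contains this information, but I'm not sure how to get the response from a request automatically.

Is there a way to do this?

Currently, my code looks roughly like this: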

def start_requests(self):
    baseUrl = "http://website.com/page/"
    currentPage = 0
    stillExists = True
    while(stillExists):
        currentUrl = baseUrl + str(currentPage)
        test = Request(currentUrl)
        if test.response.status != 404: #This is what I'm not sure of
            yield test
            currentPage += 1
        else:
            stillExists = False

You can do it like this:

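from __future__ import print_function
import urllib2  # Python 2 only; Python 3 splits this into urllib.request and urllib.error

baseURL = "http://www.website.com/page/"

for n in xrange(100):
    fullURL = baseURL + str(n)
    try:
        resp = urllib2.urlopen(fullURL)
        # Do your normal stuff here if the page is found.
        # urlopen only returns normally for successful responses.
        print("URL: {0} Response: {1}".format(fullURL, resp.getcode()))
    except urllib2.HTTPError as e:
        # urlopen raises HTTPError for 4xx/5xx statuses, so a 404 lands here
        # rather than in resp.getcode().
        if e.code == 404:
            # Do whatever you want if a 404 is found (e.g. break to stop).
            print("404 Found!")
        else:
            print("URL: {0} Response: {1}".format(fullURL, e.code))
    except urllib2.URLError:
        # No HTTP response at all (DNS failure, refused connection, ...).
        print("Could not connect to URL: {0}".format(fullURL))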

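This iterates over the range and tries to connect to each URL via urllib2. I don't know how Scrapy or your example function opens URLs, but this is an example of how to do it with urllib2.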
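For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request and urllib.error. This is only a sketch under assumptions: the helper names `fetch_status` and `pages_until_404`, the 10-second timeout, and the `max_pages` cap are illustrative and not part of the original answer. Unlike the loop above, it stops at the first 404, which is what the question asks for.

```python
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the HTTP status code for url, or None if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return err.code
    except urllib.error.URLError:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return None


def pages_until_404(base_url, fetch=fetch_status, max_pages=100):
    """Collect consecutive page URLs, stopping at the first 404 or failed connection."""
    found = []
    for n in range(max_pages):
        url = base_url + str(n)
        status = fetch(url)
        if status is None or status == 404:
            break
        found.append(url)
    return found
```

Calling `pages_until_404("http://www.website.com/page/")` would probe page 0, 1, 2, ... in order. The `fetch` parameter is there so the stopping logic can be exercised with a stub instead of real network calls.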

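Note that most sites using this URL format typically run a CMS, which may automatically redirect nonexistent pages to a custom "404 - Page Not Found" page that is still served with HTTP status code 200. In that case, the best way to spot pages that appear to exist but are really just the custom 404 page is to do some screen scraping and look for anything that would not appear in a "normal" page response, such as text saying "Page not found" or something similarly unique to the custom 404 page.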
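The content check described above might look roughly like the following. It is a sketch, not a definitive detector: the function name `looks_like_soft_404` and the default marker string are hypothetical, and the right markers depend entirely on what the target site's custom 404 page actually contains.

```python
def looks_like_soft_404(status, body, markers=("page not found",)):
    """Guess whether a response is a 'not found' page despite its status code.

    markers are assumed phrases taken from the site's custom 404 page;
    adjust them after inspecting a known-missing page on the real site.
    """
    if status == 404:
        return True
    text = body.lower()
    # Case-insensitive substring match against each marker phrase.
    return any(marker.lower() in text for marker in markers)
```

A loop over page numbers could then stop as soon as `looks_like_soft_404(status, html)` returns True, treating a real 404 and a disguised one the same way.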
You need to yield/return the request in order to check its status; creating a Request object does not actually send the request.
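import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):  # BaseSpider is the old name for scrapy.Spider
    name = 'website.com'
    baseUrl = "http://website.com/page/"
    # By default Scrapy does not pass non-2xx responses to callbacks;
    # this lets 404 responses reach parse() so the status check is meaningful.
    handle_httpstatus_list = [404]

    def start_requests(self):
        yield Request(self.baseUrl + '0')

    def parse(self, response):
        if response.status != 404:
            # Page exists: request the next page, carrying its number in meta.
            page = response.meta.get('page', 0) + 1
            return Request('%s%s' % (self.baseUrl, page), meta=dict(page=page))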


In my experience, most custom 404 pages do return a 404 status code.

It turns out their pages don't return a 404 status code. I can't really solve this without checking their content, which would make it too slow, but this answer would normally solve the problem.