Python: I created my first web crawler. How do I get a "URL stack trace" (a history of URLs) for each endpoint?

python url web-crawler python-requests scrapy-spider

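I created a web crawler that, given a base URL, crawls and finds all possible endpoints. I am able to get all of the endpoints, but I also need a way to work out how I reached each one: a "URL stack trace", or breadcrumb trail of the URLs that lead to each endpoint.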

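I start by finding every URL reachable from the given base URL. Since the sub-links I'm looking for are inside JSON, I figured the best approach was a variation of a recursive dictionary example I found elsewhere; the code is listed at the end of this post.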

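To reiterate, get_leaf_nodes_list performs a GET request and looks through the JSON for any value that is a URL (that is, any value containing the string "http"), then recursively performs more GET requests until there are no URLs left.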
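The "find every URL value in nested JSON" step described above can be separated from the network calls. The following is a minimal sketch, not the code from the question: `iter_urls` and `sample` are hypothetical names, and the check is `startswith("http://" / "https://")`, which is an assumption that is stricter than the question's `"http" in data` test.

```python
def iter_urls(data):
    """Yield every string value in nested JSON data that looks like a URL."""
    if isinstance(data, dict):
        for value in data.values():
            yield from iter_urls(value)
    elif isinstance(data, (list, tuple)):
        for item in data:
            yield from iter_urls(item)
    elif isinstance(data, str) and data.startswith(("http://", "https://")):
        # Assumption: a URL value starts with a scheme, rather than merely
        # containing the substring "http" somewhere.
        yield data


# Made-up JSON payload, used only for illustration.
sample = {
    "self": "https://www.my-website.com/",
    "items": [{"href": "https://www.my-website.com/a"}, {"count": 3}],
    "note": "not a link",
}
found = list(iter_urls(sample))
```

Because the generator only walks data, it can be tested without any HTTP requests, and the crawl logic can decide separately which of the yielded URLs to fetch.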

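So, to restate my questions:

How do I get a linear history of all the URLs I hit on the way to each endpoint?
As a corollary, how should I store this history? Processing gets slower and slower as the list of leaf nodes grows, and I wonder whether there is a better data type for storing this information, or a more efficient way to do what the code above does.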
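One possible way to approach both questions is to record, for every URL, the URL on which it was first found, and to rebuild the breadcrumb trail by walking those parent links backwards. The sketch below is an illustration under stated assumptions, not the method from the question: it swaps the recursive depth-first walk for a breadth-first queue, and `crawl_with_history`, `path_to`, `fetch_json`, and `FAKE_SITE` are hypothetical names. `fetch_json` stands in for the HTTP call so the example runs without network access.

```python
from collections import deque


def iter_urls(data):
    """Yield every string value in nested JSON data that looks like a URL."""
    if isinstance(data, dict):
        for value in data.values():
            yield from iter_urls(value)
    elif isinstance(data, (list, tuple)):
        for item in data:
            yield from iter_urls(item)
    elif isinstance(data, str) and data.startswith(("http://", "https://")):
        yield data


def crawl_with_history(base_url, fetch_json):
    """Breadth-first crawl that maps each URL to the URL it was found on.

    fetch_json is a hypothetical callable: url -> decoded JSON (dict or list).
    """
    parent = {base_url: None}  # doubles as the "already visited" lookup
    queue = deque([base_url])
    while queue:
        url = queue.popleft()
        for link in iter_urls(fetch_json(url)):
            if link not in parent:  # dict membership is O(1) on average
                parent[link] = url
                queue.append(link)
    return parent


def path_to(url, parent):
    """Rebuild the chain of URLs from the base URL down to url."""
    path = []
    while url is not None:
        path.append(url)
        url = parent[url]
    return path[::-1]


# Made-up site: each URL maps to the JSON it would return.
FAKE_SITE = {
    "https://www.my-website.com/": {
        "a": "https://www.my-website.com/a",
        "b": "https://www.my-website.com/b",
    },
    "https://www.my-website.com/a": {"c": ["https://www.my-website.com/a/c"]},
    "https://www.my-website.com/b": {"back": "https://www.my-website.com/"},
    "https://www.my-website.com/a/c": {"name": "leaf"},
}

parent = crawl_with_history("https://www.my-website.com/", FAKE_SITE.__getitem__)
history = path_to("https://www.my-website.com/a/c", parent)
```

With the `requests` library, `fetch_json` could be something like `lambda url: requests.get(url).json()`, plus whatever delay between requests is appropriate. The dict lookup replaces the `not in ns.results` scan over a list, which gets slower as the list grows. Note that this records only the first path by which each URL was discovered; storing every path would need a list of parents per URL.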

You could also look into using the json module to load the JSON:

import json
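As a small illustration of that suggestion, `json.loads` turns a JSON string into ordinary Python dicts and lists. The `body` string below is made up for the example; for a `requests` response, calling `r.json()` gives a similarly decoded structure.

```python
import json

# Made-up response body, standing in for the text of an HTTP response.
body = '{"next": "https://www.my-website.com/page/2", "count": 2}'

data = json.loads(body)   # parse the JSON text into a dict
next_url = data["next"]   # nested values are plain str, int, list, dict
```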

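It is simpler and more efficient with BeautifulSoup. It has methods for parsing HTML files and getting the links. That will make your code cleaner, with fewer loops.

@david zemens: Sorry, I don't see how that helps with getting a linear history of URL endpoints (i.e. how I got to an endpoint, given the base URL).

@b10n1k Thanks, I'll definitely check it out.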

import requests
import pytest
import time

BASE_URL = "https://www.my-website.com/"

def get_leaf_nodes_list(base_url):
    """
    :base_url: The starting point to crawl
    :return: List of all possible endpoints
    """

    class Namespace(object):
        # Plain namespace object so the nested dict_crawler function can mutate ns.results
        pass
    ns = Namespace()
    ns.results = []

    r = requests.get(base_url)  # use the argument, not the BASE_URL global
    time.sleep(0.5)  # so we don't cause a DDOS?
    data = r.json()

    def dict_crawler(data):
        # Retrieve all nodes from nested dict
        if isinstance(data, dict):
            for item in data.values():
                dict_crawler(item)
        elif isinstance(data, (list, tuple)):
            for item in data:
                dict_crawler(item)
        else:
            if isinstance(data, str):  # JSON strings are str in Python 3 (unicode in Python 2)
                if "http" in data:  # If http in value, keep going
                    # Only record and follow a URL that is not already in ns.results
                    if data not in ns.results:
                        ns.results.append(data)
                        sub_r = requests.get(data)
                        time.sleep(0.5)  # so we don't cause a DDOS?
                        sub_r_data = sub_r.json()
                        dict_crawler(sub_r_data)

    dict_crawler(data)
    return ns.results