Python: data model and datastore technology for fast multidimensional data lookup

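I have a parent hashmap data structure with strings as keys and hashmap data structures as children (say child1, child2, ..., childN). Each child is a simple key-value map, with numbers as keys and strings as values. In pseudocode: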

parent['key1'] = child1;    // child1 is a hash map data structure
child1[0] = 'foo';
child1[1] = 'bar';
...
I need to implement this data structure as a fast lookup table in a database system. Let's take Python as the reference language.
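
For reference, the pseudocode above maps directly onto nested Python dictionaries. A minimal sketch (the names parent and child1 follow the question; the sample values are only illustrative):

```python
# Parent map: string keys -> child maps (int keys -> string values).
child1 = {0: "foo", 1: "bar"}
parent = {"key1": child1}

# Two-step lookup: first the child map, then the value inside it.
child = parent["key1"]
value = child[1]
print(value)
```

Any datastore chosen for this needs to reproduce these two lookups: parent key to child map, then child key to string.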

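Requirements for the solution:

  • retrieve the child hashmaps as fast as possible
  • the estimated total size of the parent hash is at most 500 MB
  • the use case is as follows:

    1. the client Python program queries the datastore for a specific child hash
    2. the datastore returns the child hash
    3. the Python program passes the whole hash to a specific function, extracts a specific value from the hash (it already knows which key to use), and passes it to a second function

Would you recommend an in-memory key-value datastore (such as Redis) or a more classic "relational" database solution? Which data model do you suggest I use?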

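    Absolutely. Redis is not only very fast, it also handles exactly the structure you need: hashes.

    In your case you could avoid reading the whole "child hash", since the client "extracts a specific value from the hash (it already knows which key to use)".

    Or, if you do want the whole hash: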

    redis> HGETALL myhash
    1) "field1"
    2) "Hello"
    3) "field2"
    4) "World"
    redis>
    
    Of course, using a client library gives you the result as a workable object (in your case, a Python dictionary).
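
    The two access patterns described above (a single field, or the whole hash) might look like the following sketch from Python. It assumes the redis-py client, whose hget and hgetall calls return bytes by default. The helper names fetch_value and fetch_child are made up for illustration, and the client is passed in as an argument so the helpers do not depend on a particular connection.

```python
def fetch_value(client, parent_key, field):
    """Read a single field of one child hash (HGET) without loading the rest."""
    raw = client.hget(parent_key, str(field))
    return None if raw is None else raw.decode("utf-8")


def fetch_child(client, parent_key):
    """Read a whole child hash (HGETALL) and rebuild it as {int: str}."""
    raw = client.hgetall(parent_key)
    return {int(field): value.decode("utf-8") for field, value in raw.items()}
```

    With a live server this could be called as fetch_child(redis.Redis(), 'key1'), where key1 is a hash written with HSET key1 0 foo 1 bar.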

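    After a quick search based on the hints, I came up with this solution: I could implement a single parent hash in Redis, where the value fields are the string representations of the child hashes. That way I can quickly read them and evaluate them from the Python program.

    To give an example, my Redis data structure would be similar to: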

    //write a hash with N key-value pairs: each value is an M key-value pairs hash
    redis> HMSET parent_key1 child_hash "c1k1:c1v1, c1k2:c1v2, [...], c1kM:c1vM"
      OK
    redis> HMSET parent_key2 child_hash "c2k1:c2v1, c2k2:c2v2, [...], c2kM:c2vM"
      OK
    [...]
    redis> HMSET parent_keyN child_hash "cNk1:cNv1, cNk2:cNv2, [...], cNkM:cNvM"
      OK
    
    //read data
    redis> HGET parent_key1 child_hash
      "c1k1:c1v1, c1k2:c1v2, [...], c1kM:c1vM"
    
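    My Python code then only needs to use the Redis bindings to query the desired child hashes and get back their string representations. All that remains is to convert each string representation into the corresponding dictionary, which can then be looked up conveniently.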
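
    Turning the stored string back into a dictionary could be sketched as below. This assumes the exact "key:value, key:value" layout shown above, that the text has already been decoded to str, and that neither keys nor values contain ':' or ','. parse_child is a hypothetical helper name.

```python
def parse_child(text):
    """Parse 'k1:v1, k2:v2' into a dict of string keys and string values."""
    child = {}
    for pair in text.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty input or a trailing comma
        key, value = pair.split(":", 1)
        child[key.strip()] = value.strip()
    return child


child = parse_child("c1k1:c1v1, c1k2:c1v2, c1kM:c1vM")
print(child["c1k2"])
```

    A standard format such as JSON avoids the restrictions on ':' and ',' and needs no hand-written parser.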

    I hope I haven't missed anything.

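    Example code, assuming Redis is already installed, that saves each parent as a hash with its children as serialized string fields, and handles serialization and deserialization on the client side:

    JSON version: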

    ## JSON version
    import json
    # you could use pickle instead:
    # just replace json.dumps/json.loads with pickle.dumps/pickle.loads

    import redis

    # set up the redis client (adjust host and port to your setup)
    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    # sample parent dicts
    parent0 = {'child0': {0: 'a', 1: 'b', 2: 'c'}, 'child1': {5: 'e', 6: 'f', 7: 'g'}}
    parent1 = {'child0': {0: 'h', 1: 'i', 2: 'j'}, 'child1': {5: 'k', 6: 'l', 7: 'm'}}

    # save each parent as a hash, with the children as serialized strings
    # bear in mind that JSON converts the int keys to strings in dumps()
    # (hset with mapping= needs redis-py 3.5+; older versions use hmset)
    r.hset('parent0', mapping={key: json.dumps(child) for key, child in parent0.items()})
    r.hset('parent1', mapping={key: json.dumps(child) for key, child in parent1.items()})


    # Get a child dict from a parent
    # say child1 of parent0
    childstring = r.hget('parent0', 'child1')
    childdict = json.loads(childstring)
    # this could have been done in a single line...

    # if you want to convert the keys back to ints
    # (build a new dict rather than mutating it while iterating):
    childdict = {int(key): value for key, value in childdict.items()}

    print(childdict)
    
    pickle version:

    ## pickle version
    # pickle.dumps/pickle.loads work directly on byte strings,
    # so no file-like buffer (StringIO) is needed.
    import pickle

    import redis

    # set up the redis client (adjust host and port to your setup)
    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    # sample parent dicts
    parent0 = {'child0': {0: 'a', 1: 'b', 2: 'c'}, 'child1': {5: 'e', 6: 'f', 7: 'g'}}
    parent1 = {'child0': {0: 'h', 1: 'i', 2: 'j'}, 'child1': {5: 'k', 6: 'l', 7: 'm'}}

    # save each parent as a hash, with the children as pickled byte strings
    # (hset with mapping= needs redis-py 3.5+; older versions use hmset)
    r.hset('parent0', mapping={key: pickle.dumps(child) for key, child in parent0.items()})
    r.hset('parent1', mapping={key: pickle.dumps(child) for key, child in parent1.items()})


    # Get a child dict from a parent
    # say child1 of parent0
    childstring = r.hget('parent0', 'child1')
    # this could be done in a single line...
    childdict = pickle.loads(childstring)

    # we don't need to do any str to int conversion on the keys.

    print(childdict)
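
    The difference in key handling between the two versions can be checked without a Redis server. The sketch below round-trips one of the sample child dicts through both serializers, using only the standard library.

```python
import json
import pickle

child = {5: "e", 6: "f", 7: "g"}

# JSON object keys are always strings, so int keys come back as str.
via_json = json.loads(json.dumps(child))
restored = {int(key): value for key, value in via_json.items()}

# pickle produces bytes and preserves the key types as they were.
via_pickle = pickle.loads(pickle.dumps(child))

print(sorted(via_json))    # string keys
print(sorted(via_pickle))  # int keys
```

    JSON has the advantage of being readable from other languages and tools, while pickle keeps Python types intact at the cost of being Python-specific.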
    

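    Thanks for the answer. I described step 3 of the use case badly (it is fixed now); in fact I want the whole child hash to be read and processed by the Python code. Is Redis still a good fit?

    Redis is still fine, but it would be faster to save the child hashes as serialized/pickled strings and deserialize/unpickle them on the client after reading. Also, make sure hiredis is installed for your client; it speeds up converting data from Redis to Python and back.

    OK, you can use this object marshalling to store strings, but if you want it to be faster you should store each Python hash in a Redis hash. When several key spaces are needed, the usual idiom is to concatenate the keys: HMSET "parentkey1:childkeyX" f1 v1 f2 v2 f3 v3. This lets Redis optimize small hashes with ziplists (more compact and faster for fewer than 100 or so fields).

    Alternatively, if you really want to store the objects as strings, you can use pickle, since Redis strings are 8-bit clean. But if you want to do some server-side processing, consider JSON or MessagePack, since both can be decoded by the embedded Lua engine.

    Of course, it is even easier if you don't stringify your objects (as mentioned in the previous comment).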
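
    The key-concatenation idiom from the comments could be sketched as follows. flatten_parent is a hypothetical helper that turns one parent dict into per-child Redis hash keys and field mappings. The ':' separator follows the comment above; writing the result with redis-py's hset(name, mapping=...) is an assumption about the client in use.

```python
def flatten_parent(parent_key, children):
    """Map each child dict to a 'parent:child' key and a str -> str mapping."""
    flat = {}
    for child_key, child in children.items():
        redis_key = f"{parent_key}:{child_key}"
        flat[redis_key] = {str(field): value for field, value in child.items()}
    return flat


parent0 = {"child0": {0: "a", 1: "b", 2: "c"}, "child1": {5: "e", 6: "f", 7: "g"}}
flat = flatten_parent("parent0", parent0)

# With a live client, each entry would become one small Redis hash:
# for redis_key, mapping in flat.items():
#     client.hset(redis_key, mapping=mapping)
print(sorted(flat))
```

    With this layout a single value is one HGET parent0:child1 5 away, and HGETALL parent0:child1 returns the whole child without any client-side deserialization step.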