Computing Tanimoto coefficients for multiple vectors with Python and SQL

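I think my problem is easier to explain with an example.

I have a table containing recipe ingredients, and I have implemented a function that calculates the Tanimoto coefficient between ingredients. It is fast enough for the coefficient between two ingredients (3 SQL queries are needed), but it does not scale well. Calculating the coefficient for every possible pair of ingredients takes N + (N*(N-1))/2 queries, which is 500500 queries for just 1k ingredients. Is there a faster way to do this? Here is what I have so far: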

import sqlite3 as sqlite  # assumption: the original import is not shown

class Filtering():
  def __init__(self):
    self._connection=sqlite.connect('database.db')

  def n_recipes(self, ingredient_id):
    # Number of recipes that contain the given ingredient.
    cursor = self._connection.cursor()
    cursor.execute('''select count(recipe_id) from recipe_ingredient
        where ingredient_id = ? ''', (ingredient_id, ))
    return cursor.fetchone()[0]

  def n_recipes_intersection(self, ingredient_a, ingredient_b):
    # Number of recipes that contain both ingredients.
    cursor = self._connection.cursor()
    cursor.execute('''select count(recipe_id) from recipe_ingredient where
        ingredient_id = ? and recipe_id in (
        select recipe_id from recipe_ingredient
        where ingredient_id = ?) ''', (ingredient_a, ingredient_b))
    return cursor.fetchone()[0]

  def tanimoto(self, ingredient_a, ingredient_b):
    # Shared recipes divided by recipes containing either ingredient.
    # Costs three queries per pair: two counts and one intersection.
    n_a, n_b = map(self.n_recipes, (ingredient_a, ingredient_b))
    n_ab = self.n_recipes_intersection(ingredient_a, ingredient_b)
    return float(n_ab) / (n_a + n_b - n_ab)
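
For reference, the quantity returned by tanimoto and the query count quoted in the question can be written out as below. The symbol R_a is introduced here only for illustration (it is not in the original post) and stands for the set of recipes that contain ingredient a. The count assumes one n_recipes query per ingredient plus one intersection query per pair.

```latex
T(a, b) = \frac{|R_a \cap R_b|}{|R_a| + |R_b| - |R_a \cap R_b|}
        = \frac{n_{ab}}{n_a + n_b - n_{ab}}

\text{queries} = N + \frac{N(N-1)}{2}
               = 1000 + \frac{1000 \cdot 999}{2}
               = 1000 + 499500 = 500500 \qquad (N = 1000)
```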

Why don't you simply fetch all the recipes into memory and then calculate the Tanimoto coefficients in memory?

It is simpler and much faster.

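I think this cuts the intersection down to 2 selects per pair, and the total down to 4 queries per pair. You can't get away from O(N^2), because you are trying all pairs: N*(N-1)/2 is simply how many pairs there are.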

def n_recipes_intersection(self, ingredient_a, ingredient_b):
  cursor = self._cur
  cursor.execute('''
    select count(recipe_id)
      from recipe_ingredient as A 
        join recipe_ingredient as B using (recipe_id)
      where A.ingredient_id = ? 
        and B.ingredient_id = ?;
      ''', (ingredient_a, ingredient_b))
  return cursor.fetchone()[0]
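
If every pair is needed rather than one pair at a time, the same self-join can in principle be grouped so that a single statement returns all non-zero intersection counts. What follows is only a sketch under assumptions, not part of the original answer: it uses the standard-library sqlite3 module with an in-memory database, a made-up toy recipe_ingredient table, and a hypothetical helper named all_intersections. It also assumes each (recipe_id, ingredient_id) row appears at most once.

```python
import sqlite3


def all_intersections(connection):
    """Return {(a, b): shared_recipe_count} for every pair a < b that co-occurs."""
    cursor = connection.cursor()
    cursor.execute('''
        select A.ingredient_id, B.ingredient_id, count(*)
          from recipe_ingredient as A
            join recipe_ingredient as B
              on A.recipe_id = B.recipe_id
             and A.ingredient_id < B.ingredient_id
          group by A.ingredient_id, B.ingredient_id''')
    return {(a, b): n for a, b, n in cursor.fetchall()}


# Toy data, made up for illustration:
# recipe 1 contains ingredients {10, 20}, recipe 2 contains {10, 20, 30}.
connection = sqlite3.connect(':memory:')
connection.execute(
    'create table recipe_ingredient (recipe_id integer, ingredient_id integer)')
connection.executemany(
    'insert into recipe_ingredient values (?, ?)',
    [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30)])

pair_counts = all_intersections(connection)
```

Pairs that never share a recipe do not appear in the result at all, so a caller would treat a missing key as an intersection of 0.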

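If you have 1000 ingredients, 1000 queries will suffice to map each ingredient to a set of recipes in memory. If (say) an ingredient is typically part of about 100 recipes, each set takes a few KB, so the whole dictionary takes only a few MB. There is absolutely no problem holding the whole thing in memory, and it still would not be a serious memory problem if the average number of recipes per ingredient grew by an order of magnitude.

After those 1000 queries, each of the roughly 500,000 pairwise Tanimoto coefficient calculations is done in memory. As a further speedup, you can precompute the squared lengths of the various sets (and park them in another dict); the key "A dot-product B" component for each pair is, of course, the length of the sets' intersection.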
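
A minimal sketch of this idea, assuming the recipe_ingredient table from the question. The names load_recipe_sets and all_tanimoto are hypothetical, the toy data is made up, and the sketch fetches the whole table in one query instead of issuing one query per ingredient. Because every vector entry is 0 or 1, an ingredient's squared length is just the size of its recipe set, and the dot product of two ingredients is the size of the intersection.

```python
import sqlite3
from itertools import combinations


def load_recipe_sets(connection):
    """Map each ingredient_id to the set of recipe_ids that contain it."""
    recipes_by_ingredient = {}
    cursor = connection.cursor()
    cursor.execute('select ingredient_id, recipe_id from recipe_ingredient')
    for ingredient_id, recipe_id in cursor:
        recipes_by_ingredient.setdefault(ingredient_id, set()).add(recipe_id)
    return recipes_by_ingredient


def all_tanimoto(recipes_by_ingredient):
    """Return {(a, b): coefficient} for every unordered pair with a < b."""
    # Set sizes are computed once; for 0/1 vectors this is the squared length.
    sizes = {i: len(s) for i, s in recipes_by_ingredient.items()}
    result = {}
    for a, b in combinations(sorted(recipes_by_ingredient), 2):
        # Dot product of two 0/1 vectors = size of the set intersection.
        n_ab = len(recipes_by_ingredient[a] & recipes_by_ingredient[b])
        result[(a, b)] = n_ab / float(sizes[a] + sizes[b] - n_ab)
    return result


# Toy data, made up for illustration:
# recipe 1 = {10, 20}, recipe 2 = {10, 20, 30}, recipe 3 = {30}.
connection = sqlite3.connect(':memory:')
connection.execute(
    'create table recipe_ingredient (recipe_id integer, ingredient_id integer)')
connection.executemany(
    'insert into recipe_ingredient values (?, ?)',
    [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 30)])

coefficients = all_tanimoto(load_recipe_sets(connection))
```

Every ingredient that appears in the table has at least one recipe, so the denominator cannot be zero here.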

In case anyone is interested, this is the code I came up with after Alex's and S.Lott's suggestions. Thanks, guys.

def __init__(self):
    self._connection=sqlite.connect('database.db')
    self._counts = None
    self._intersections = {}

def inc_intersections(self, ingredients):
    # Count one co-occurrence for every pair of ingredients in one recipe.
    # Pairs are stored as self._intersections[larger_id][smaller_id].
    ingredients.sort()
    length = len(ingredients)
    for i in range(1, length):
        a = ingredients[i]
        for j in range(0, i):
            b = ingredients[j]
            pair_counts = self._intersections.setdefault(a, {})
            pair_counts[b] = pair_counts.get(b, 0) + 1


def precompute_tanimoto(self):
    # One pass over the table: per-ingredient recipe counts plus
    # pairwise co-occurrence counts.
    counts = {}
    self._intersections = {}

    cursor = self._connection.cursor()
    cursor.execute('''select recipe_id, ingredient_id
        from recipe_ingredient
        order by recipe_id, ingredient_id''')
    rows = cursor.fetchall()

    print(len(rows))  # debug output: number of rows fetched

    last_recipe = None
    ingredients = []
    for recipe, ingredient in rows:
        if recipe != last_recipe:
            # A new recipe starts: record the pairs of the previous one.
            if last_recipe is not None:
                self.inc_intersections(ingredients)
            last_recipe = recipe
            ingredients = [ingredient]
        else:
            ingredients.append(ingredient)

        counts[ingredient] = counts.get(ingredient, 0) + 1

    # Record the final recipe (skipped when the table is empty).
    if ingredients:
        self.inc_intersections(ingredients)

    self._counts = counts

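def tanimoto(self, ingredient_a, ingredient_b):
    if self._counts is None:
        self.precompute_tanimoto()

    # Pairs are keyed as [larger_id][smaller_id], so order the arguments.
    if ingredient_b > ingredient_a:
        ingredient_b, ingredient_a = ingredient_a, ingredient_b

    n_a, n_b = self._counts[ingredient_a], self._counts[ingredient_b]
    # Ingredients that never share a recipe have no entry; treat that as 0.
    n_ab = self._intersections.get(ingredient_a, {}).get(ingredient_b, 0)

    print('%d %d %d' % (n_a, n_b, n_ab))  # debug output

    return float(n_ab) / (n_a + n_b - n_ab)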
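
The same bookkeeping can be written more compactly with collections.Counter and itertools.combinations. This is a sketch rather than the code above: it takes (recipe_id, ingredient_id) rows as a plain list instead of reading the database, it keys pairs as (smaller_id, larger_id) tuples, and the names cooccurrence_counts and tanimoto_from_counts are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(rows):
    """rows: iterable of (recipe_id, ingredient_id) pairs, in any order."""
    by_recipe = defaultdict(set)
    for recipe_id, ingredient_id in rows:
        by_recipe[recipe_id].add(ingredient_id)

    counts = Counter()         # ingredient_id -> number of recipes
    intersections = Counter()  # (smaller_id, larger_id) -> shared recipes
    for ingredients in by_recipe.values():
        counts.update(ingredients)
        intersections.update(combinations(sorted(ingredients), 2))
    return counts, intersections


def tanimoto_from_counts(counts, intersections, a, b):
    """Tanimoto coefficient of two distinct ingredients that both occur in rows."""
    # A Counter returns 0 for a pair that never shares a recipe.
    n_ab = intersections[tuple(sorted((a, b)))]
    return float(n_ab) / (counts[a] + counts[b] - n_ab)


# Toy data, made up for illustration.
rows = [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 30)]
counts, intersections = cooccurrence_counts(rows)
score = tanimoto_from_counts(counts, intersections, 30, 10)
```

Counting pairs recipe by recipe makes the work proportional to the sum of squared recipe sizes, so pairs of ingredients that never appear together cost nothing.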

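That was my first idea, but how would you implement it? Loop over the ingredients of every recipe and increment a counter for each ingredient and each combination found? I have over 60k items in my database, so even that would take some time.

Facepalm! This approach is much faster than I expected. Computing all the coefficients takes only 4 seconds. Thanks.

In general, that's my experience. People write too much SQL.

Thanks, Alex.

Thanks for your advice, but I managed to do the whole calculation in memory, fetching all the data at once. The whole process took less than 4 seconds.

I'm really curious why you chose Tanimoto over cosine or some other similarity measure. I'm considering a similar calculation and would like to hear your reasoning.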