Python/SQL: computing the Tanimoto coefficient for multiple vectors


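I think my problem is easier to explain with an example.

I have a table of recipe ingredients, and I have implemented a function that computes the Tanimoto coefficient between two ingredients. It is fast enough for a single pair of ingredients (it needs 3 SQL queries), but it does not scale well. Computing the coefficient for every possible pair of ingredients takes N + (N*(N-1))/2 queries, which is 500,500 queries for just 1k ingredients. Is there a faster way? Here is what I have so far: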

import sqlite3

class Filtering():
  def __init__(self):
    self._connection = sqlite3.connect('database.db')

  def n_recipes(self, ingredient_id):
    cursor = self._connection.cursor()
    cursor.execute('''select count(recipe_id) from recipe_ingredient
        where ingredient_id = ? ''', (ingredient_id, ))
    return cursor.fetchone()[0]

  def n_recipes_intersection(self, ingredient_a, ingredient_b):
    cursor = self._connection.cursor()
    cursor.execute('''select count(recipe_id) from recipe_ingredient where
        ingredient_id = ? and recipe_id in (
        select recipe_id from recipe_ingredient
        where ingredient_id = ?) ''', (ingredient_a, ingredient_b))
    return cursor.fetchone()[0]

  def tanimoto(self, ingredient_a, ingredient_b):
    n_a, n_b = map(self.n_recipes, (ingredient_a, ingredient_b))
    n_ab = self.n_recipes_intersection(ingredient_a, ingredient_b)
    return float(n_ab) / (n_a + n_b - n_ab)

Why don't you just load all the recipes into memory and compute the Tanimoto coefficients there?

It's simpler, and it's much faster.
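A minimal sketch of what that could look like, assuming the `recipe_ingredient(recipe_id, ingredient_id)` table from the question. The helper names `load_recipe_sets` and `tanimoto` and the in-memory sample rows are illustrative assumptions, not part of the original post.

```python
import sqlite3
from collections import defaultdict


def load_recipe_sets(connection):
    """Map each ingredient_id to the set of recipe_ids that use it (one query)."""
    recipes_by_ingredient = defaultdict(set)
    cursor = connection.execute(
        "select recipe_id, ingredient_id from recipe_ingredient")
    for recipe_id, ingredient_id in cursor:
        recipes_by_ingredient[ingredient_id].add(recipe_id)
    return dict(recipes_by_ingredient)


def tanimoto(recipes_a, recipes_b):
    """Tanimoto coefficient of two sets of recipe ids."""
    n_ab = len(recipes_a & recipes_b)
    n_union = len(recipes_a) + len(recipes_b) - n_ab
    # Two empty sets have no defined coefficient; 0.0 is an arbitrary choice here.
    return n_ab / n_union if n_union else 0.0


# Hypothetical sample data in an in-memory database, for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute(
    "create table recipe_ingredient (recipe_id integer, ingredient_id integer)")
connection.executemany(
    "insert into recipe_ingredient values (?, ?)",
    [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 10)])

sets = load_recipe_sets(connection)
print(tanimoto(sets[10], sets[20]))
```

After the single query, every pairwise coefficient is a set intersection in memory, with no further round trips to the database.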

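I think this will cut you down to 2 selects per pair for the intersection and 4 queries per pair. You can't escape O(N^2), because you are trying all pairs: N*(N-1)/2 is how many pairs there are.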

def n_recipes_intersection(self, ingredient_a, ingredient_b):
  cursor = self._connection.cursor()
  cursor.execute('''
    select count(recipe_id)
      from recipe_ingredient as A 
        join recipe_ingredient as B using (recipe_id)
      where A.ingredient_id = ? 
        and B.ingredient_id = ?;
      ''', (ingredient_a, ingredient_b))
  return cursor.fetchone()[0]
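The same self-join can also be grouped so that one query returns the intersection size for every pair at once. This is a different technique from the per-pair function above, sketched under the assumption that each (recipe_id, ingredient_id) row is unique; the name `pair_intersections` and the sample rows are hypothetical.

```python
import sqlite3

# One grouped self-join: a row per ingredient pair that shares at least one recipe.
PAIR_COUNTS_SQL = """
    select A.ingredient_id, B.ingredient_id, count(*)
      from recipe_ingredient as A
        join recipe_ingredient as B using (recipe_id)
      where A.ingredient_id < B.ingredient_id
      group by A.ingredient_id, B.ingredient_id
"""


def pair_intersections(connection):
    """Return {(smaller_id, larger_id): number of shared recipes}."""
    return {(a, b): n for a, b, n in connection.execute(PAIR_COUNTS_SQL)}


# Hypothetical sample data, for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute(
    "create table recipe_ingredient (recipe_id integer, ingredient_id integer)")
connection.executemany(
    "insert into recipe_ingredient values (?, ?)",
    [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 10)])

shared = pair_intersections(connection)
print(shared)
```

Pairs that never appear in the same recipe are simply absent from the result, so a lookup should treat a missing key as an intersection of 0.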

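If you have 1000 ingredients, 1000 queries are enough to map each ingredient to a set of recipes in memory. If (say) an ingredient is typically part of about 100 recipes, each set takes a few KB, so the whole dictionary takes just a few MB. Keeping the whole thing in memory is absolutely no problem (and it is still not a serious memory problem if the average number of recipes per ingredient grows by an order of magnitude).

After those 1000 queries, each of the roughly 500,000 pairwise Tanimoto coefficient calculations is done in memory. As a further speedup you can precompute the squared lengths of the various sets (and store them in another dict), and the key "A dot product B" component for each pair is, of course, the length of the sets' intersection.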
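As a rough sketch of that in-memory step: for 0/1 membership vectors the squared length of a vector is just the size of its set, so the "precomputed squares" become a dict of set sizes. The sample dictionary below is hypothetical and stands in for the result of the per-ingredient queries.

```python
from itertools import combinations

# Hypothetical data: ingredient_id -> set of recipe_ids containing it.
recipes_by_ingredient = {
    10: {1, 2, 3},
    20: {1, 2},
    30: {2},
}

# Precompute each set's size once (the squared length of the 0/1 vector).
sizes = {
    ingredient: len(recipes)
    for ingredient, recipes in recipes_by_ingredient.items()
}

coefficients = {}
for a, b in combinations(sorted(recipes_by_ingredient), 2):
    # The "dot product" of two 0/1 vectors is the size of the intersection.
    n_ab = len(recipes_by_ingredient[a] & recipes_by_ingredient[b])
    coefficients[(a, b)] = n_ab / (sizes[a] + sizes[b] - n_ab)

print(coefficients)
```

The denominator cannot be zero here because every set in the dictionary is non-empty.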

In case anyone is interested, this is the code I came up with after Alex's and S.Lott's suggestions. Thank you, guys.

# Revised methods of the Filtering class (Python 3; assumes `import sqlite3`).
def __init__(self):
    self._connection = sqlite3.connect('database.db')
    self._counts = None        # ingredient_id -> number of recipes using it
    self._intersections = {}   # larger id -> {smaller id: shared recipe count}

def inc_intersections(self, ingredients):
    # Add one shared recipe for every pair of ingredients in a single recipe.
    ingredients.sort()
    length = len(ingredients)
    for i in range(1, length):
        a = ingredients[i]
        for j in range(0, i):
            b = ingredients[j]
            if a not in self._intersections:
                self._intersections[a] = {b: 1}
            elif b not in self._intersections[a]:
                self._intersections[a][b] = 1
            else:
                self._intersections[a][b] += 1


def precompute_tanimoto(self):
    counts = {}
    self._intersections = {}

    cursor = self._connection.cursor()
    cursor.execute('''select recipe_id, ingredient_id
        from recipe_ingredient
        order by recipe_id, ingredient_id''')
    rows = cursor.fetchall()

    print(len(rows))

    # Rows arrive grouped by recipe, so collect one recipe's ingredients at a time.
    last_recipe = None
    ingredients = []
    for recipe, ingredient in rows:
        if recipe != last_recipe:
            if last_recipe is not None:
                self.inc_intersections(ingredients)
            last_recipe = recipe
            ingredients = [ingredient]
        else:
            ingredients.append(ingredient)

        if ingredient not in counts:
            counts[ingredient] = 1
        else:
            counts[ingredient] += 1

    # Flush the last recipe (a no-op if the table is empty).
    self.inc_intersections(ingredients)

    self._counts = counts

def tanimoto(self, ingredient_a, ingredient_b):
    if self._counts is None:
        self.precompute_tanimoto()

    # Intersections are keyed with the larger id first.
    if ingredient_b > ingredient_a:
        ingredient_b, ingredient_a = ingredient_a, ingredient_b

    n_a, n_b = self._counts[ingredient_a], self._counts[ingredient_b]
    if ingredient_a == ingredient_b:
        n_ab = n_a
    else:
        # Pairs that never share a recipe have no entry; count them as 0.
        n_ab = self._intersections.get(ingredient_a, {}).get(ingredient_b, 0)

    print(n_a, n_b, n_ab)

    return float(n_ab) / (n_a + n_b - n_ab)
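For comparison, the same per-recipe pair counting can be written more compactly with the standard library. This is only a sketch of an alternative, not the code used above: `count_pairs`, the standalone `tanimoto` function, and the sample rows are hypothetical, and it assumes both ingredients occur in at least one recipe.

```python
from collections import Counter
from itertools import combinations, groupby
from operator import itemgetter


def count_pairs(rows):
    """rows: (recipe_id, ingredient_id) tuples sorted by recipe_id.

    Returns (counts, intersections): recipes per ingredient, and shared
    recipes per (smaller_id, larger_id) ingredient pair.
    """
    counts = Counter()
    intersections = Counter()
    for _, group in groupby(rows, key=itemgetter(0)):
        # Deduplicate and sort so each pair comes out as (smaller, larger).
        ingredients = sorted({ingredient for _, ingredient in group})
        counts.update(ingredients)
        intersections.update(combinations(ingredients, 2))
    return counts, intersections


def tanimoto(counts, intersections, a, b):
    if a == b:
        return 1.0
    # A Counter returns 0 for pairs that never share a recipe.
    n_ab = intersections[(min(a, b), max(a, b))]
    return n_ab / (counts[a] + counts[b] - n_ab)


# Hypothetical sample rows, already ordered by recipe_id.
rows = [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 10)]
counts, intersections = count_pairs(rows)
print(tanimoto(counts, intersections, 20, 10))
```

Keying the intersections by a sorted tuple avoids the nested dictionaries and the explicit swap of the two ids.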

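That was my first thought, but how would you implement it? Loop over the ingredients of every recipe and increment a counter for each ingredient and each combination found? I have over 60k entries in my database, so even that would take some time.

Facepalm! This approach is much faster than I thought. Computing all the coefficients takes only 4 seconds. Thanks.

In general, that's my experience. People write too much SQL.

Thanks, Alex, for the advice, but I managed to do the whole calculation in memory, fetching all the data at once. The whole process took less than 4 seconds.

Really curious why you chose Tanimoto over cosine or another similarity measure. I'm considering a similar calculation and would like to hear your reasoning.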
