Python, SQL: computing the Tanimoto coefficient for multiple vectors
I think it is easier to explain my problem with an example. I have a table that holds the ingredients of recipes, and I have implemented a function that computes the Tanimoto coefficient between ingredients. It is fast enough for computing the coefficient between two ingredients (it needs 3 SQL queries), but it does not scale well. Computing the coefficient between all possible ingredient combinations takes N + (N*(N-1))/2 queries, which is 500500 queries for just 1k ingredients. Is there a faster way? Here is what I have so far:
import sqlite3 as sqlite

class Filtering():
    def __init__(self):
        self._connection = sqlite.connect('database.db')

    def n_recipes(self, ingredient_id):
        # Number of recipes that use the given ingredient.
        cursor = self._connection.cursor()
        cursor.execute('''select count(recipe_id) from recipe_ingredient
                          where ingredient_id = ? ''', (ingredient_id, ))
        return cursor.fetchone()[0]

    def n_recipes_intersection(self, ingredient_a, ingredient_b):
        # Number of recipes that use both ingredients.
        cursor = self._connection.cursor()
        cursor.execute('''select count(recipe_id) from recipe_ingredient where
                          ingredient_id = ? and recipe_id in (
                          select recipe_id from recipe_ingredient
                          where ingredient_id = ?) ''', (ingredient_a, ingredient_b))
        return cursor.fetchone()[0]

    def tanimoto(self, ingredient_a, ingredient_b):
        # Three queries per pair: two counts plus one intersection.
        n_a, n_b = map(self.n_recipes, (ingredient_a, ingredient_b))
        n_ab = self.n_recipes_intersection(ingredient_a, ingredient_b)
        return float(n_ab) / (n_a + n_b - n_ab)
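For reference, the arithmetic in the question can be checked in a few lines. This is only a sketch in Python 3: the helper names `queries_needed` and `tanimoto_from_counts` are mine and do not appear in the question, and the zero-union guard is an assumption about how an empty pair should be scored.

```python
def queries_needed(n):
    # One count query per ingredient plus one intersection query per pair.
    return n + n * (n - 1) // 2


def tanimoto_from_counts(n_a, n_b, n_ab):
    # |A and B| / |A or B|, with |A or B| = |A| + |B| - |A and B|.
    union = n_a + n_b - n_ab
    # Assumption: two ingredients with no recipes at all score 0.0.
    return n_ab / union if union else 0.0


n_queries = queries_needed(1000)
example = tanimoto_from_counts(3, 2, 1)
```

With 1000 ingredients `queries_needed` gives the 500500 figure quoted above, so the cost comes from the number of round trips rather than from any single query.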
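Why don't you simply fetch all the recipes into memory and compute the Tanimoto coefficients there? It is simpler and it is faster.

I think this will cut you down to 2 selects per pair for the intersection, and 4 queries per pair. You cannot get away from the O(N^2), since you are trying all pairs: N*(N-1)/2 is just how many pairs there are.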
def n_recipes_intersection(self, ingredient_a, ingredient_b):
    # Assumes a cursor was stored on the instance as self._cur.
    cursor = self._cur
    cursor.execute('''
        select count(recipe_id)
        from recipe_ingredient as A
        join recipe_ingredient as B using (recipe_id)
        where A.ingredient_id = ?
        and B.ingredient_id = ?;
        ''', (ingredient_a, ingredient_b))
    return cursor.fetchone()[0]
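To try the join-based intersection without the real database, here is a small Python 3 sketch against an in-memory SQLite table. The sample rows are made up, and `all_intersections` is a hypothetical helper that is not part of the answer: it pushes the same self-join through a `group by` so that one query returns the intersection size for every pair that shares at least one recipe. It assumes each (recipe_id, ingredient_id) row appears only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table recipe_ingredient (recipe_id integer, ingredient_id integer)")
# Made-up data: recipe 1 = {10, 20}, recipe 2 = {10, 20, 30}, recipe 3 = {10}.
conn.executemany(
    "insert into recipe_ingredient values (?, ?)",
    [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 10)])


def n_recipes_intersection(conn, ingredient_a, ingredient_b):
    # Same self-join as above, for a single pair of ingredients.
    row = conn.execute('''
        select count(*)
        from recipe_ingredient as A
        join recipe_ingredient as B using (recipe_id)
        where A.ingredient_id = ? and B.ingredient_id = ?
        ''', (ingredient_a, ingredient_b)).fetchone()
    return row[0]


def all_intersections(conn):
    # Hypothetical helper: intersection sizes for all co-occurring pairs,
    # keyed as (smaller id, larger id). Pairs that never co-occur are absent.
    rows = conn.execute('''
        select A.ingredient_id, B.ingredient_id, count(*)
        from recipe_ingredient as A
        join recipe_ingredient as B using (recipe_id)
        where A.ingredient_id < B.ingredient_id
        group by A.ingredient_id, B.ingredient_id
        ''').fetchall()
    return {(a, b): n for a, b, n in rows}


pair_count = n_recipes_intersection(conn, 10, 20)
every_pair = all_intersections(conn)
```

Whether the grouped query is fast enough on a large table depends on indexing, which I have not measured here.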
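If you have 1000 ingredients, 1000 queries will suffice to map each ingredient to a set of recipes in memory. If (say) an ingredient is typically part of about 100 recipes, each set will take a few KB, so the whole dictionary will take just a few MB. There is absolutely no problem holding the whole thing in memory (and it is still not a serious memory problem if the average number of recipes per ingredient grows by an order of magnitude).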
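A minimal Python 3 sketch of that mapping, under a few assumptions of mine: the sample rows are made up, the function names `load_recipe_sets` and `all_pairs` are hypothetical, and for brevity the sets are filled from a single query over the whole table rather than one query per ingredient as described above.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table recipe_ingredient (recipe_id integer, ingredient_id integer)")
# Made-up data: recipe 1 = {10, 20}, recipe 2 = {10, 20, 30}, recipe 3 = {10}.
conn.executemany(
    "insert into recipe_ingredient values (?, ?)",
    [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 10)])


def load_recipe_sets(conn):
    # Map each ingredient_id to the set of recipe_ids that use it.
    recipes_by_ingredient = {}
    query = "select recipe_id, ingredient_id from recipe_ingredient"
    for recipe_id, ingredient_id in conn.execute(query):
        recipes_by_ingredient.setdefault(ingredient_id, set()).add(recipe_id)
    return recipes_by_ingredient


def tanimoto(set_a, set_b):
    # Intersection size over union size, computed entirely in memory.
    n_ab = len(set_a & set_b)
    union = len(set_a) + len(set_b) - n_ab
    return n_ab / union if union else 0.0


def all_pairs(recipes_by_ingredient):
    # One coefficient per unordered pair, keyed as (smaller id, larger id).
    ids = sorted(recipes_by_ingredient)
    return {(a, b): tanimoto(recipes_by_ingredient[a], recipes_by_ingredient[b])
            for a, b in combinations(ids, 2)}


coefficients = all_pairs(load_recipe_sets(conn))
```

The pair loop is still O(N^2), but each step is a set intersection in memory instead of a round trip to the database.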
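After those 1000 queries, each of the 500,000 pairwise Tanimoto coefficient computations is done in memory. As a further speedup you can precompute the squared length of each ingredient's vector (for a 0/1 membership vector this is simply the size of its set) and park the values in another dict. The key "A dot product B" component for each pair is, of course, the length of the sets' intersection.

If anyone is interested, this is the code I came up with after Alex's and S.Lott's suggestions. Thank you guys.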
import sqlite3 as sqlite

class Filtering():
    def __init__(self):
        self._connection = sqlite.connect('database.db')
        self._counts = None
        self._intersections = {}

    def inc_intersections(self, ingredients):
        # Count one co-occurrence for every pair of ingredients in a recipe.
        # After sorting, entries are keyed as [larger id][smaller id].
        ingredients.sort()
        length = len(ingredients)
        for i in range(1, length):
            a = ingredients[i]
            for j in range(0, i):
                b = ingredients[j]
                if a not in self._intersections:
                    self._intersections[a] = {b: 1}
                elif b not in self._intersections[a]:
                    self._intersections[a][b] = 1
                else:
                    self._intersections[a][b] += 1

    def precompute_tanimoto(self):
        # Single pass over the table, grouped by recipe via the ORDER BY.
        counts = {}
        self._intersections = {}
        cursor = self._connection.cursor()
        cursor.execute('''select recipe_id, ingredient_id
                          from recipe_ingredient
                          order by recipe_id, ingredient_id''')
        rows = cursor.fetchall()
        print(len(rows))
        last_recipe = None
        ingredients = []  # stays empty if the table has no rows
        for recipe, ingredient in rows:
            if recipe != last_recipe:
                if last_recipe is not None:
                    self.inc_intersections(ingredients)
                last_recipe = recipe
                ingredients = [ingredient]
            else:
                ingredients.append(ingredient)
            if ingredient not in counts:
                counts[ingredient] = 1
            else:
                counts[ingredient] += 1
        # Flush the last recipe.
        self.inc_intersections(ingredients)
        self._counts = counts

    def tanimoto(self, ingredient_a, ingredient_b):
        if self._counts is None:
            self.precompute_tanimoto()
        if ingredient_b > ingredient_a:
            ingredient_b, ingredient_a = ingredient_a, ingredient_b
        n_a, n_b = self._counts[ingredient_a], self._counts[ingredient_b]
        # Pairs that never share a recipe have no entry, so default to 0.
        n_ab = self._intersections.get(ingredient_a, {}).get(ingredient_b, 0)
        print(n_a, n_b, n_ab)
        return float(n_ab) / (n_a + n_b - n_ab)
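The same single-pass counting can be written more compactly with the standard library. This is only a sketch in Python 3, not the code from the answer: `precompute` and `tanimoto` here are stand-alone functions of my own naming, the rows are made-up sample data passed in as a plain list instead of being read from SQLite, and scoring an ingredient against itself as its own count is an assumption.

```python
from collections import Counter, defaultdict
from itertools import combinations


def precompute(rows):
    # rows: iterable of (recipe_id, ingredient_id) pairs, in any order.
    by_recipe = defaultdict(set)
    for recipe_id, ingredient_id in rows:
        by_recipe[recipe_id].add(ingredient_id)
    counts = Counter()         # ingredient -> number of recipes using it
    intersections = Counter()  # (smaller id, larger id) -> shared recipes
    for ingredients in by_recipe.values():
        counts.update(ingredients)
        intersections.update(combinations(sorted(ingredients), 2))
    return counts, intersections


def tanimoto(counts, intersections, a, b):
    if a == b:
        n_ab = counts[a]  # assumption: an ingredient fully overlaps itself
    else:
        # A Counter returns 0 for pairs that never share a recipe.
        n_ab = intersections[(min(a, b), max(a, b))]
    union = counts[a] + counts[b] - n_ab
    return n_ab / union if union else 0.0


# Made-up data: recipe 1 = {10, 20}, recipe 2 = {10, 20, 30}, recipe 3 = {10}.
counts, intersections = precompute(
    [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 10)])
```

Using tuple keys instead of nested dicts keeps the lookup in `tanimoto` to a single line, at the cost of one tuple per co-occurring pair.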
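That was my first thought, but how would you implement it? Loop over the ingredients of every recipe and increment a counter for each ingredient and each combination found? I have more than 60k items in my database, so even that would take some time.

Facepalm! This approach is much faster than I imagined. Computing all the coefficients takes only 4 seconds. Thanks.

In general, that is my experience. People write too much SQL.

Thanks Alex, +1.

Thanks for the advice, but I managed to do the whole computation in memory, fetching all the data at once. The whole process took less than 4 seconds.

Really curious why you chose Tanimoto over cosine or other similarity algorithms. I am looking into performing a similar computation and would like to hear your reasoning.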