Updating a database with a callback in Parallel Python
I'm trying to do some text processing on around 200,000 entries in a SQLite database, which I access via SQLAlchemy. I'd like to parallelize it (I'm looking at Parallel Python), but I'm not sure exactly how to go about it.

I want to commit the session each time an entry is processed, so that if I need to stop the script, I won't lose the work it has already done. However, when I try to pass the session.commit() command to the callback function, it does not seem to work.
    from assignDB import *
    from sqlalchemy.orm import sessionmaker
    import pp, sys, fuzzy_substring

    def matchIng(rawIng, ingreds):
        maxScore = 0
        choice = ""
        for (ingred, parentIng) in ingreds.iteritems():
            score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
            if score > maxScore:
                maxScore = score
                choice = ingred
                refIng = parentIng
        return (refIng, choice, maxScore)

    def callbackFunc(match, session, inputTuple):
        print inputTuple
        match.refIng_id = inputTuple[0]
        match.refIng_name = inputTuple[1]
        match.matchScore = inputTuple[2]
        session.commit()

    # tuple of all parallel python servers to connect with
    ppservers = ()
    #ppservers = ("10.0.0.1",)

    if len(sys.argv) > 1:
        ncpus = int(sys.argv[1])
        # Creates jobserver with ncpus workers
        job_server = pp.Server(ncpus, ppservers=ppservers)
    else:
        # Creates jobserver with automatically detected number of workers
        job_server = pp.Server(ppservers=ppservers)

    print "Starting pp with", job_server.get_ncpus(), "workers"

    ingreds = {}
    for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng):
        ingreds[synonym] = parentIng

    jobs = []
    for match in session.query(Ingredient).filter(Ingredient.refIng_id == None):
        rawIng = match.ingredient
        jobs.append((match, job_server.submit(matchIng, (rawIng, ingreds), (fuzzy_substring,),
                                              callback=callbackFunc, callbackargs=(match, session))))
The session is imported from assignDB. I don't get any errors; the database just isn't updated.

Thanks for your help.

UPDATE: Here is the code for fuzzy_substring:
    def fuzzy_substring(needle, haystack):
        """Calculates the fuzzy match of needle in haystack,
        using a modified version of the Levenshtein distance
        algorithm.
        The function is modified from the levenshtein function
        in the bktree module by Adam Hupp"""
        m, n = len(needle), len(haystack)
        # base cases
        if m == 1:
            return not needle in haystack
        if not n:
            return m
        row1 = [0] * (n+1)
        for i in range(0, m):
            row2 = [i+1]
            for j in range(0, n):
                cost = (needle[i] != haystack[j])
                row2.append(min(row1[j+1]+1,  # deletion
                                row2[j]+1,    # insertion
                                row1[j]+cost) # substitution
                           )
            row1 = row2
        return min(row1)
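For reference, a quick sanity check of the distance function's behavior (the function body is the one from above; the test strings are my own): an exact substring match scores 0, and a window differing by one substitution scores 1.

```python
def fuzzy_substring(needle, haystack):
    """Minimum edit distance from needle to any substring of haystack
    (the same modified-Levenshtein routine as above)."""
    m, n = len(needle), len(haystack)
    if m == 1:
        return needle not in haystack
    if not n:
        return m
    row1 = [0] * (n + 1)
    for i in range(m):
        row2 = [i + 1]
        for j in range(n):
            cost = (needle[i] != haystack[j])
            row2.append(min(row1[j + 1] + 1,  # deletion
                            row2[j] + 1,      # insertion
                            row1[j] + cost))  # substitution
        row1 = row2
    return min(row1)

print(fuzzy_substring("tomato", "chopped tomatoes"))  # exact substring -> 0
print(fuzzy_substring("tomato", "tomaro sauce"))      # one substitution -> 1
```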
I got it from here: . In my case, the "needle" is one of roughly 8,000 possible choices, while the haystack is the raw string I'm trying to match. I loop over all the possible "needles" and pick the one with the best score.

Without looking at your particular code, it's fair to say that the pp secret sauce is probably not buying you much, even when it works well.
Added in response to the comments:

If fuzzy_substring matching is the bottleneck, it appears to be completely decoupled from the database access, and you should keep that in mind. Without seeing what fuzzy_substring does, a good starting assumption is that you can make algorithmic improvements that may make single-threaded programming computationally feasible. Fuzzy matching is a very well studied problem, and choosing the right algorithm is often far better than "throwing more processors at it".
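To make that concrete, here is one sketch of such an algorithmic improvement (my own illustration, not from the answer): since score = len(ingred)/(distance+1) can never exceed len(ingred), visiting the candidates longest-first lets the loop stop as soon as no remaining candidate could beat the best score seen, skipping the expensive distance calls entirely. Note this sketch uses Python 3 true division, unlike the original Python 2 code.

```python
def fuzzy_substring(needle, haystack):
    # Same modified-Levenshtein distance as in the question.
    m, n = len(needle), len(haystack)
    if m == 1:
        return needle not in haystack
    if not n:
        return m
    row1 = [0] * (n + 1)
    for i in range(m):
        row2 = [i + 1]
        for j in range(n):
            cost = (needle[i] != haystack[j])
            row2.append(min(row1[j + 1] + 1, row2[j] + 1, row1[j] + cost))
        row1 = row2
    return min(row1)

def matchIng_pruned(rawIng, ingreds):
    # Visit candidates longest-first; score = len/(dist+1) <= len, so once
    # len(ingred) can no longer beat the best score, nothing after it can.
    maxScore, choice, refIng = 0, "", None
    for ingred, parentIng in sorted(ingreds.items(), key=lambda kv: -len(kv[0])):
        if len(ingred) <= maxScore:
            break  # upper bound reached: skip remaining distance calls
        score = len(ingred) / (fuzzy_substring(ingred, rawIng) + 1)
        if score > maxScore:
            maxScore, choice, refIng = score, ingred, parentIng
    return (refIng, choice, maxScore)

# "tomato" matches exactly (distance 0), scoring 6.0; "salt" (length 4)
# can never beat that, so its distance is never computed.
print(matchIng_pruned("chopped tomatoes", {"tomato": "TOMATO", "salt": "SALT"}))
```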
Better still in that sense: you get cleaner code, you don't waste the overhead of splitting up and reassembling the problem, and you end up with a more extensible and debuggable program.

@msw offers a general way of thinking about parallelization. Despite these comments, here is what I got working in the end:
    from assignDB import *
    from sqlalchemy.orm import sessionmaker
    import pp, sys, fuzzy_substring

    def matchIng(rawIng, ingreds):
        maxScore = 0
        choice = ""
        for (ingred, parentIng) in ingreds.iteritems():
            score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
            if score > maxScore:
                maxScore = score
                choice = ingred
                refIng = parentIng
        return (refIng, choice, maxScore)

    # tuple of all parallel python servers to connect with
    ppservers = ()
    #ppservers = ("10.0.0.1",)

    if len(sys.argv) > 1:
        ncpus = int(sys.argv[1])
        # Creates jobserver with ncpus workers
        job_server = pp.Server(ncpus, ppservers=ppservers)
    else:
        # Creates jobserver with automatically detected number of workers
        job_server = pp.Server(ppservers=ppservers)

    print "Starting pp with", job_server.get_ncpus(), "workers"

    ingreds = {}
    for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng):
        ingreds[synonym] = parentIng

    rawIngredients = session.query(Ingredient).filter(Ingredient.refIng_id == None)
    numIngredients = session.query(Ingredient).filter(Ingredient.refIng_id == None).count()

    stepSize = 30
    for i in range(0, numIngredients, stepSize):
        print i
        print numIngredients
        if i + stepSize > numIngredients:
            stop = numIngredients
        else:
            stop = i + stepSize
        jobs = []
        for match in rawIngredients[i:stop]:
            rawIng = match.ingredient
            jobs.append((match, job_server.submit(matchIng, (rawIng, ingreds), (fuzzy_substring,))))
        job_server.wait()
        for match, job in jobs:
            inputTuple = job()
            print match.ingredient
            print inputTuple
            match.refIng_id = inputTuple[0]
            match.refIng_name = inputTuple[1]
            match.matchScore = inputTuple[2]
        session.commit()
Basically, I've broken the problem into chunks. After matching 30 substrings in parallel, the results are returned and committed to the database. I chose 30 somewhat arbitrarily, so there may be gains in optimizing that number. It does seem to have sped things up a bit, since I'm now using all 3 (!) of the cores in my processor.

Comments:

Hi, thanks for your answer — what you say makes complete sense. My code is working; the bottleneck is the fuzzy_substring function, since it has to (fuzzily) compare the raw input against roughly 8,000 possible results. What I'm now thinking is to compute several results in parallel (maybe 30 or so), then do a session.commit() and start another batch of 30?

@abroekhof see the "Added" section in the body of the answer.

I've updated the question with the fuzzy_substring code. I looked into several fuzzy substring methods, but this one seemed to perform best; many other algorithms have trouble matching substrings. I've implemented the problem subdivision I mentioned in my first comment, and it does seem to speed things up a bit.
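The chunk-then-commit pattern above doesn't depend on pp. As a minimal sketch of the same shape using only the standard library's multiprocessing (my substitution, not the original setup — the worker is a trivial stand-in, and the commit step is just a comment):

```python
from multiprocessing import Pool

def work(item):
    # stand-in for the expensive matchIng call
    return item * item

def process_in_chunks(items, step_size=30):
    results = []
    with Pool() as pool:
        for i in range(0, len(items), step_size):
            chunk = items[i:i + step_size]
            # pool.map blocks until the whole chunk is done,
            # playing the role of job_server.wait()
            results.extend(pool.map(work, chunk))
            # ...this is where each per-chunk session.commit() would go...
    return results

if __name__ == "__main__":
    print(process_in_chunks(list(range(100)))[:5])  # -> [0, 1, 4, 9, 16]
```

As in the final code above, results only flow back to the parent process as return values; all database writes stay in the parent, which sidesteps the problem of sharing a session with workers.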