Updating a database with a callback in Parallel Python
I'm trying to do some text processing on around 200,000 entries in a SQLite database, which I access via SQLAlchemy. I'd like to parallelize it (I'm looking at Parallel Python), but I'm not sure exactly how to go about it.

I want to commit the session each time an entry is processed, so that if I need to stop the script, I won't lose the work it has already done. However, when I try to pass the session.commit() command to the callback function, it does not seem to work.
    from assignDB import *
    from sqlalchemy.orm import sessionmaker
    import pp, sys, fuzzy_substring

    def matchIng(rawIng, ingreds):
        maxScore = 0
        choice = ""
        for (ingred, parentIng) in ingreds.iteritems():
            score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
            if score > maxScore:
                maxScore = score
                choice = ingred
                refIng = parentIng
        return (refIng, choice, maxScore)

    def callbackFunc(match, session, inputTuple):
        print inputTuple
        match.refIng_id = inputTuple[0]
        match.refIng_name = inputTuple[1]
        match.matchScore = inputTuple[2]
        session.commit()

    # tuple of all parallel python servers to connect with
    ppservers = ()
    #ppservers = ("10.0.0.1",)

    if len(sys.argv) > 1:
        ncpus = int(sys.argv[1])
        # Creates jobserver with ncpus workers
        job_server = pp.Server(ncpus, ppservers=ppservers)
    else:
        # Creates jobserver with automatically detected number of workers
        job_server = pp.Server(ppservers=ppservers)

    print "Starting pp with", job_server.get_ncpus(), "workers"

    ingreds = {}
    for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng):
        ingreds[synonym] = parentIng

    jobs = []
    for match in session.query(Ingredient).filter(Ingredient.refIng_id == None):
        rawIng = match.ingredient
        jobs.append((match, job_server.submit(matchIng, (rawIng, ingreds), (fuzzy_substring,),
                                              callback=callbackFunc, callbackargs=(match, session))))
The session is imported from assignDB. I don't get any errors; the database just isn't updated.

Thanks for your help.

UPDATE: Here is the code for fuzzy_substring:
    def fuzzy_substring(needle, haystack):
        """Calculates the fuzzy match of needle in haystack,
        using a modified version of the Levenshtein distance
        algorithm.
        The function is modified from the levenshtein function
        in the bktree module by Adam Hupp"""
        m, n = len(needle), len(haystack)
        # base cases
        if m == 1:
            return not needle in haystack
        if not n:
            return m
        row1 = [0] * (n+1)
        for i in range(0, m):
            row2 = [i+1]
            for j in range(0, n):
                cost = (needle[i] != haystack[j])
                row2.append(min(row1[j+1]+1,  # deletion
                                row2[j]+1,    # insertion
                                row1[j]+cost) # substitution
                           )
            row1 = row2
        return min(row1)
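For reference, a quick sanity check of the distance function's behavior (the function body is the one from above; the test strings are my own): an exact substring match scores 0, and a window differing by one substitution scores 1.

```python
def fuzzy_substring(needle, haystack):
    """Minimum edit distance from needle to any substring of haystack
    (the same modified-Levenshtein routine as above)."""
    m, n = len(needle), len(haystack)
    if m == 1:
        return needle not in haystack
    if not n:
        return m
    row1 = [0] * (n + 1)
    for i in range(m):
        row2 = [i + 1]
        for j in range(n):
            cost = (needle[i] != haystack[j])
            row2.append(min(row1[j + 1] + 1,  # deletion
                            row2[j] + 1,      # insertion
                            row1[j] + cost))  # substitution
        row1 = row2
    return min(row1)

print(fuzzy_substring("tomato", "chopped tomatoes"))  # exact substring -> 0
print(fuzzy_substring("tomato", "tomaro sauce"))      # one substitution -> 1
```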
I got it from here: . In my case, the "needle" is one of roughly 8,000 possible choices, while the haystack is the raw string I'm trying to match. I loop over all the possible "needles" and pick the one with the best score.

Without looking at your particular code, it's fair to say that the pp secret sauce is probably not buying you much, even when it works well.
Added in response to the comments:

If fuzzy_substring matching is the bottleneck, it appears to be completely decoupled from the database access, and you should keep that in mind. Without seeing what fuzzy_substring does, a good starting assumption is that you can make algorithmic improvements that may make single-threaded programming computationally feasible. Fuzzy matching is a very well studied problem, and choosing the right algorithm is often far better than "throwing more processors at it".
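To make that concrete, here is one sketch of such an algorithmic improvement (my own illustration, not from the answer): since score = len(ingred)/(distance+1) can never exceed len(ingred), visiting the candidates longest-first lets the loop stop as soon as no remaining candidate could beat the best score seen, skipping the expensive distance calls entirely. Note this sketch uses Python 3 true division, unlike the original Python 2 code.

```python
def fuzzy_substring(needle, haystack):
    # Same modified-Levenshtein distance as in the question.
    m, n = len(needle), len(haystack)
    if m == 1:
        return needle not in haystack
    if not n:
        return m
    row1 = [0] * (n + 1)
    for i in range(m):
        row2 = [i + 1]
        for j in range(n):
            cost = (needle[i] != haystack[j])
            row2.append(min(row1[j + 1] + 1, row2[j] + 1, row1[j] + cost))
        row1 = row2
    return min(row1)

def matchIng_pruned(rawIng, ingreds):
    # Visit candidates longest-first; score = len/(dist+1) <= len, so once
    # len(ingred) can no longer beat the best score, nothing after it can.
    maxScore, choice, refIng = 0, "", None
    for ingred, parentIng in sorted(ingreds.items(), key=lambda kv: -len(kv[0])):
        if len(ingred) <= maxScore:
            break  # upper bound reached: skip remaining distance calls
        score = len(ingred) / (fuzzy_substring(ingred, rawIng) + 1)
        if score > maxScore:
            maxScore, choice, refIng = score, ingred, parentIng
    return (refIng, choice, maxScore)

# "tomato" matches exactly (distance 0), scoring 6.0; "salt" (length 4)
# can never beat that, so its distance is never computed.
print(matchIng_pruned("chopped tomatoes", {"tomato": "TOMATO", "salt": "SALT"}))
```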
Better still in that sense: you get cleaner code, you don't waste the overhead of splitting up and reassembling the problem, and you end up with a more extensible and debuggable program.

@msw offers a general way of thinking about parallelization. Despite these comments, here is what I got working in the end:
    from assignDB import *
    from sqlalchemy.orm import sessionmaker
    import pp, sys, fuzzy_substring

    def matchIng(rawIng, ingreds):
        maxScore = 0
        choice = ""
        for (ingred, parentIng) in ingreds.iteritems():
            score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
            if score > maxScore:
                maxScore = score
                choice = ingred
                refIng = parentIng
        return (refIng, choice, maxScore)

    # tuple of all parallel python servers to connect with
    ppservers = ()
    #ppservers = ("10.0.0.1",)

    if len(sys.argv) > 1:
        ncpus = int(sys.argv[1])
        # Creates jobserver with ncpus workers
        job_server = pp.Server(ncpus, ppservers=ppservers)
    else:
        # Creates jobserver with automatically detected number of workers
        job_server = pp.Server(ppservers=ppservers)

    print "Starting pp with", job_server.get_ncpus(), "workers"

    ingreds = {}
    for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng):
        ingreds[synonym] = parentIng

    rawIngredients = session.query(Ingredient).filter(Ingredient.refIng_id == None)
    numIngredients = session.query(Ingredient).filter(Ingredient.refIng_id == None).count()

    stepSize = 30
    for i in range(0, numIngredients, stepSize):
        print i
        print numIngredients
        if i + stepSize > numIngredients:
            stop = numIngredients
        else:
            stop = i + stepSize
        jobs = []
        for match in rawIngredients[i:stop]:
            rawIng = match.ingredient
            jobs.append((match, job_server.submit(matchIng, (rawIng, ingreds), (fuzzy_substring,))))
        job_server.wait()
        for match, job in jobs:
            inputTuple = job()
            print match.ingredient
            print inputTuple
            match.refIng_id = inputTuple[0]
            match.refIng_name = inputTuple[1]
            match.matchScore = inputTuple[2]
        session.commit()
Basically, I've broken the problem into chunks. After matching 30 substrings in parallel, the results are returned and committed to the database. I chose 30 somewhat arbitrarily, so there may be gains in optimizing that number. It does seem to have sped things up a bit, since I'm now using all 3 (!) of the cores in my processor.

Comments:

Hi, thanks for your answer — what you say makes complete sense. My code is working; the bottleneck is the fuzzy_substring function, since it has to (fuzzily) compare the raw input against roughly 8,000 possible results. What I'm now thinking is to compute several results in parallel (maybe 30 or so), then do a session.commit() and start another batch of 30?

@abroekhof see the "Added" section in the body of the answer.

I've updated the question with the fuzzy_substring code. I looked into several fuzzy substring methods, but this one seemed to perform best; many other algorithms have trouble matching substrings. I've implemented the problem subdivision I mentioned in my first comment, and it does seem to speed things up a bit.
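The chunk-then-commit pattern above doesn't depend on pp. As a minimal sketch of the same shape using only the standard library's multiprocessing (my substitution, not the original setup — the worker is a trivial stand-in, and the commit step is just a comment):

```python
from multiprocessing import Pool

def work(item):
    # stand-in for the expensive matchIng call
    return item * item

def process_in_chunks(items, step_size=30):
    results = []
    with Pool() as pool:
        for i in range(0, len(items), step_size):
            chunk = items[i:i + step_size]
            # pool.map blocks until the whole chunk is done,
            # playing the role of job_server.wait()
            results.extend(pool.map(work, chunk))
            # ...this is where each per-chunk session.commit() would go...
    return results

if __name__ == "__main__":
    print(process_in_chunks(list(range(100)))[:5])  # -> [0, 1, 4, 9, 16]
```

As in the final code above, results only flow back to the parent process as return values; all database writes stay in the parent, which sidesteps the problem of sharing a session with workers.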