Parallelizing a function in a Python for loop


I have a function that I would like to parallelize:

import re
import multiprocessing as mp
from pathos.multiprocessing import ProcessingPool as Pool

cores = mp.cpu_count()

# create the multiprocessing pool
pool = Pool(cores)

def clean_preprocess(text):
    """
    Given a string of text, the function:
    1. Removes all punctuation and numbers and converts the text to lower case
    2. Handles the negation words defined above
    3. Keeps only tokens that are more than one character long
    """
    cores = mp.cpu_count()
    pool = Pool(cores)
    lower = re.sub(r'[^a-zA-Z\s\']', "", text).lower()
    lower_neg_handled = n_pattern.sub(lambda x: n_dict[x.group()], lower)
    letters_only = re.sub(r'[^a-zA-Z\s]', "", lower_neg_handled)
    words = [i for i in tok.tokenize(letters_only) if len(i) > 1]  # parallelize this?
    return ' '.join(words)

I have been reading the multiprocessing documentation, but I am still a bit confused about how to parallelize my function properly. I would appreciate it if someone could point me in the right direction for parallelizing a function like mine.

In your function, you could decide to parallelize by splitting the text into subparts, applying the tokenization to each subpart, and then joining the results.

Something along the lines of:

text0 = text[:len(text)//2]
text1 = text[len(text)//2:]
Then apply the processing to these two parts with:

# here, I suppose that clean_preprocess is the sequential version,
# and we manage the pool outside of it
with Pool(2) as p:
    words0, words1 = p.map(clean_preprocess, [text0, text1])
words = words0 + words1
# or continue with words0 and words1 to save the cost of joining the lists
However, your function appears to be memory-bound, so you should not expect a dramatic speedup (typically, a factor of 2 is the best you can hope for on a standard machine nowadays).

You could therefore try splitting the text into more than two parts, but it may not run any faster. You might even get disappointing performance, because splitting the text could be more expensive than processing it.
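As a self-contained sketch of the chunk-and-rejoin idea above: `clean_chunk` below is a simplified stand-in for your `clean_preprocess` (it omits the negation handling and uses a plain whitespace split instead of your tokenizer), and the split is done on word boundaries so that no word is cut in half by a character-index split.

```python
import re
from multiprocessing import Pool

def clean_chunk(text):
    """Simplified stand-in for clean_preprocess: lower-case the text,
    strip non-letter characters, keep tokens longer than one character."""
    letters_only = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    return ' '.join(w for w in letters_only.split() if len(w) > 1)

def parallel_clean(text, n_chunks=2):
    """Split the text on whitespace so no word is cut in half,
    clean each chunk in a separate process, then rejoin the results."""
    words = text.split()
    step = max(1, len(words) // n_chunks)
    chunks = [' '.join(words[i:i + step]) for i in range(0, len(words), step)]
    with Pool(n_chunks) as p:
        cleaned = p.map(clean_chunk, chunks)
    return ' '.join(cleaned)

if __name__ == '__main__':
    print(parallel_clean("Hello, World!! It is a 1st test."))
```

Note that because each chunk is cleaned independently, the per-chunk work must not depend on context outside the chunk (your negation handling might, if a negation word and its target end up in different chunks).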