How to use threads to improve Python performance

I have a list of sentences, about 500,000 of them, and a list of concepts containing about 13,000,000 concepts. For each sentence I want to extract the concepts occurring in it, in the order in which they appear in the sentence, and write them to an output.

For example, my Python program looks like this:

import re

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

output = []
counting = 0

# escape each concept so any regex metacharacters in it are matched literally
re_concepts = [re.escape(t) for t in concepts]

# one big alternation; findall returns non-overlapping matches in order of appearance
find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

for sentence in sentences:
    output.append(find_all_concepts(sentence))

print(output)
The output is:

[['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process']]

However, the order of the output does not matter to me, i.e. my output could also look as follows (in other words, the lists within the output can be shuffled):

[['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]

[['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]
However, due to the length of my sentences and the number of concepts, this program is still quite slow.


Is it possible to further improve the performance (in terms of time) of this program by using multithreading in Python?

Whether multithreading will yield an actual performance improvement depends not only on the implementation in Python and on the amount of data, but also on the hardware the program runs on. In some cases, where the hardware offers no advantage, multithreading can actually slow things down because of the added overhead.

However, assuming you are running this on a modern standard PC or better, you may see some improvement from multithreading. The problem then becomes setting up a number of workers, handing them the work, and collecting the results.

Staying close to your example's structure, implementation and naming:

import re
import queue
import threading

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall


def do_find_all_concepts(q_in, l_out):
    while True:
        sentence = q_in.get()
        l_out.append(find_all_concepts(sentence))
        q_in.task_done()


# Queue with default maxsize of 0, infinite queue size
sentences_q = queue.Queue()
output = []

# any reasonable number of workers
num_threads = 2
for i in range(num_threads):
    worker = threading.Thread(target=do_find_all_concepts, args=(sentences_q, output))
    # once there's nothing but daemon threads left, Python exits the program
    worker.daemon = True
    worker.start()

# put all the input on the queue
for s in sentences:
    sentences_q.put(s)

# wait for the entire queue to be processed
sentences_q.join()
print(output)
User @wwii asked whether multiple threads would actually help a CPU-bound problem like this one - under CPython's GIL only one thread executes Python bytecode at a time, so CPU-bound work rarely benefits from threads. Instead of multiple threads appending to the same output list, you can also use multiple processes with a shared output queue, like this:

import re
import queue
import multiprocessing

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall


def do_find_all_concepts(q_in, q_out):
    try:
        while True:
            sentence = q_in.get(False)
            q_out.put(find_all_concepts(sentence))
    except queue.Empty:
        pass


if __name__ == '__main__':
    # default maxsize of 0, infinite queue size
    sentences_q = multiprocessing.Queue()
    output_q = multiprocessing.Queue()

    # put all the input on the queue *before* starting the workers; since a
    # worker exits as soon as it sees an empty queue, filling the queue
    # afterwards would race against the workers starting up
    for s in sentences:
        sentences_q.put(s)

    # any reasonable number of workers
    num_processes = 2
    pool = multiprocessing.Pool(num_processes, do_find_all_concepts, (sentences_q, output_q))

    # wait for all the workers to finish draining the queue
    pool.close()
    pool.join()
    while not output_q.empty():
        print(output_q.get())

This approach has more overhead, but it also uses the CPU resources available on other cores.
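
A small refinement worth trying: rather than hard-coding two workers, size the pool to the machine. This is a minimal sketch, reusing the names from the block above; leaving one core free is just a heuristic to keep the machine responsive, not a rule.

import multiprocessing

# use all logical cores but one; never go below 1
num_processes = max(1, multiprocessing.cpu_count() - 1)
pool = multiprocessing.Pool(num_processes, do_find_all_concepts, (sentences_q, output_q))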

Here are two solutions using concurrent.futures.ProcessPoolExecutor, which distributes the tasks to different processes. Your task appears to be CPU-bound, not I/O-bound, so threads probably won't help.

import re
import concurrent.futures

# using the lists in your example

re_concepts = [re.escape(t) for t in concepts]
all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL)

def f(sequence, regex=all_concepts):
    result = regex.findall(sequence)
    return result

if __name__ == '__main__':

    out1 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(f, s) for s in sentences]
        for future in concurrent.futures.as_completed(futures):
            try:
                result = future.result()
            except Exception as e:
                print(e)
            else:
                #print(result)
                out1.append(result)   

    out2 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for result in executor.map(f, sentences):
            #print(result)
            out2.append(result)

Executor.map() has a chunksize parameter: sending chunks larger than a single item of the iterable might be beneficial. I thought the function would need to be refactored to account for that, but when I tested with a function that just returns what it is sent, the test function only ever received individual items, regardless of the chunksize I specified:

def h(sequence):
    return sequence
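
For what it's worth, that behavior is expected: chunksize only controls how many items are bundled together for transport to a worker process; the mapped function is still called once per item and the results are flattened back, so no refactoring is needed. A minimal sketch, reusing f and sentences from the example above:

# chunksize batches the IPC traffic; f still receives one sentence per call
with concurrent.futures.ProcessPoolExecutor() as executor:
    out3 = list(executor.map(f, sentences, chunksize=1000))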

One downside of multiprocessing is that the data must be serialized (pickled) to be sent to the worker processes. That takes time, and for a compiled regex as large as yours it may be significant - it could cancel out the gains from using multiple processes.

I made a set of 13e6 random strings, each 20 characters long, to approximate your compiled regex:

import random
import string

data = set(''.join(random.choice(string.printable) for _ in range(20)) for _ in range(13000000))
Pickling it to an io.BytesIO stream took about 7.5 seconds, and unpickling it from the io.BytesIO stream took 9 seconds. If using a multiprocessing solution, it might be beneficial to pickle the concepts object (in whatever form) to the hard drive once and have each process unpickle it from disk, rather than pickling/unpickling on each side of the IPC every time a new process is created - definitely worth testing, YMMV. The pickled set takes up 380 MB on my hard drive.
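
A rough sketch of how those timings can be reproduced (numbers will of course vary by machine), assuming data is the set built above:

import io
import pickle
import time

buf = io.BytesIO()

t0 = time.perf_counter()
pickle.dump(data, buf)                     # serialize the 13e6-string set
print('pickle:  ', time.perf_counter() - t0)

buf.seek(0)
t0 = time.perf_counter()
data2 = pickle.load(buf)                   # read it back again
print('unpickle:', time.perf_counter() - t0)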

When I tried some experiments with concurrent.futures.ProcessPoolExecutor, I kept blowing up my computer, because each process needs its own copy of the set and my machine doesn't have enough RAM.
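
One possible mitigation, offered as an assumption rather than something I have tested: on POSIX systems the 'fork' start method lets worker processes inherit the parent's memory, so a module-level data set can be read copy-on-write instead of being pickled into every child (CPython's reference counting will still gradually copy some pages, so the savings are not perfect).

import multiprocessing
import concurrent.futures

# assumption: POSIX-only sketch - the 'fork' start method is not available on Windows
ctx = multiprocessing.get_context('fork')
executor = concurrent.futures.ProcessPoolExecutor(mp_context=ctx)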


I will post another answer about methods of testing whether the concepts occur in a sentence.

Here are two functions that check a sentence for concepts without using a regex: f iterates over the n-word groups ("memes") in the sentence and tests each one for membership in the set, while g builds a set of all the sentence's n-word groups and intersects it with the set (nwise, which produces the n-word groups, is defined further below):

def f(sentence, data=data, nwise=nwise):
    '''iterate over memes in sentence and see if they are in data'''
    sentence = sentence.strip().split()
    found = []
    for n in [1,2,3,4,5,6,7,8,9,10]:
        for meme in nwise(sentence,n):
            meme = ' '.join(meme)
            if meme in data:
                found.append(meme)
    return found

def g(sentence, data=data, nwise=nwise):
    '''make a set of the memes in sentence then find its intersection with data'''
    sentence = sentence.strip().split()
    test_strings = set(' '.join(meme) for n in range(1,11) for meme in nwise(sentence,n))
    found = test_strings.intersection(data)
    return found
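
A quick micro-benchmark sketch for comparing the two approaches, assuming data and nwise are defined as above (I have not benchmarked this exact snippet, so treat the setup as illustrative):

import timeit

test_sentence = 'data mining is the analysis step of the knowledge discovery in databases process or kdd'
for func in (f, g):
    t = timeit.timeit(lambda: func(test_sentence), number=100)
    print(func.__name__, t)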

Using the sentences and concepts from your example, here is the complete solution - it records the position of each match so that repeated concepts are reported in sentence order:

from itertools import tee

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

concepts = set(concepts)

def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:],1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)

def f(sentence, concepts=concepts, nwise=nwise):
    '''iterate over memes in sentence and see if they are in concepts'''
    indices = set()
    #print(sentence)
    words = sentence.strip().split()
    for n in [1,2,3,4,5,6,7,8,9,10]:
        for meme in nwise(words,n):
            meme = ' '.join(meme)
            if meme in concepts:
                start = sentence.find(meme)
                end = len(meme)+start
                while (start,end) in indices:
                    #print(f'{meme} already found at character:{start} - looking for another one...') 
                    start = sentence.find(meme, end)
                    end = len(meme)+start
                indices.add((start, end))
    return [sentence[start:end] for (start,end) in sorted(indices)]


###########
results = []
for sentence in sentences:
    results.append(f(sentence))
    #print(f'{sentence}\n\t{results[-1]}')


In [20]: results
Out[20]: 
[['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'knowledge discovery', 'databases process', 'process']]