Python: multithreading with Numpy causes a segmentation fault

python multithreading numpy

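I am trying to generate a report that contains different "groupings" of data. For each grouping I have to query Postgres in a different way and apply different logic, which can take quite a long time (about 1 hour).

To improve performance, I created one thread per task, each with its own connection, because psycopg2 executes queries serially on any single connection. I use numpy to compute the median and mean of a portion of the data (this step is common to all groupings).

Here is a short example of my code: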

# -*- coding: utf-8 -*-

from threading import Thread

import numpy  # used by _get_averages below

from postgres import Connection
from lookup import Lookup
from queries import QUERY1, QUERY2

class Report(object):

    def __init__(self, **credentials):
        # Keep the credentials so each method can open its own connection.
        self._credentials = credentials
        self.conn = self.__get_conn()
        self._lookup = Lookup(self.conn)
        self.data = {}

    def __get_conn(self):
        return Connection(**self._credentials)

    def _get_averages(self, data):
        return {
            'mean' : numpy.mean(data),
            'median' : numpy.median(data)
        }

    def method1(self):
        conn = self.__get_conn()
        cursor = conn.get_cursor()
        data = cursor.execute(QUERY1)

        for row in data:
            # Logic specific to the results returned by the query.
            row['arg1'] = self._lookup.find_data_by_method_1(row)
            avgs = self._get_averages(row['data'])
            row['mean'] = avgs['mean']
            row['median'] = avgs['median']

        return data

    def method2(self):
        conn = self.__get_conn()
        cursor = conn.get_cursor()
        data = cursor.execute(QUERY2)

        for row in data:
            # Logic specific to the results returned by the query.
            row['arg2'] = self._lookup.find_data_by_method_2(row)
            avgs = self._get_averages(row['data'])
            row['mean'] = avgs['mean']
            row['median'] = avgs['median']

        return data

    def lookup(self, arg):

        methods = {
            'arg1' : self.method1,
            'arg2' : self.method2
        }

        method = methods[arg]  # dictionary lookup, not a call
        self.data[arg] = method()

    def lookup_args(self):
        return self._lookup.find_args()

    def do_something_with_data(self):
        print(self.data)

def main():

    creds = {
        'host':'host',
        'user':'postgres',
        'database':'mydatabase',
        'password':'mypassword'
    }
    reporter = Report(**creds)

    args = reporter.lookup_args()
    threads = []
    for arg in args:
        thread = Thread(target=reporter.lookup, args=(arg,))
        threads.append(thread)

    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    reporter.do_something_with_data()

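The imported Connection class is a simple wrapper around psycopg2 that makes it easier to create cursors and to connect to multiple Postgres databases.

The imported Lookup class takes a Connection instance and is used to run short queries that look up related data. Merging those lookups into the larger queries would slow them down considerably.

The data passed to the _get_averages method in the example is a list of decimal.Decimal objects.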

When I run all the threads at the same time, I get a segfault. If I run each thread on its own, the script completes successfully.
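One way to check whether concurrency itself is the trigger is to serialize the suspect computation behind a lock and see whether the crash goes away. The sketch below is only an illustration: the lock, the serialized decorator, and the compute stand-in are hypothetical names that do not appear in the original code, and compute uses plain Python arithmetic in place of the numpy-backed _get_averages.

```python
import functools
import threading
import time

# Hypothetical module-level lock shared by all worker threads.
_averages_lock = threading.Lock()


def serialized(func):
    """Decorator: allow only one thread at a time to run func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _averages_lock:
            return func(*args, **kwargs)
    return wrapper


# Stand-in for the numpy-backed _get_averages. It records how many
# threads are inside it at once so the serialization can be checked.
_active = 0
max_active = 0


@serialized
def compute(data):
    global _active, max_active
    _active += 1
    max_active = max(max_active, _active)
    time.sleep(0.01)  # widen the window in which threads could overlap
    result = sum(data) / len(data)
    _active -= 1
    return result


results = {}


def worker(key, data):
    # Each thread writes its result under its own key.
    results[key] = compute(data)


threads = [
    threading.Thread(target=worker, args=(i, [i, i + 1, i + 2]))
    for i in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

If the segfault disappears once the numpy calls are wrapped this way, that would point to the concurrent calls rather than the data. The queries themselves still run in parallel, since only the averaging step is serialized.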

Using gdb, I found that numpy is the culprit:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffedc8c700 (LWP 10997)]
0x00007ffff2ac33b7 in sortCompare (a=0x2e956298, b=0x2e956390) at numpy/core/src/multiarray/item_selection.c:1045
1045    numpy/core/src/multiarray/item_selection.c: No such file or directory.
        in numpy/core/src/multiarray/item_selection.c

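I know numpy has its share of problems, but this one seems to affect only the sorting of lists that mix class instances with other numeric types. The objects in my list are guaranteed to be decimal.Decimal instances. (Yes, I verified this.)

What would cause numpy to segfault when it is used inside a thread, but behave as expected otherwise?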
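Since the backtrace ends in sortCompare and the input is a list of decimal.Decimal objects, one thing worth trying is converting the values to plain floats before handing them to numpy, so the array is float64 instead of an array of Python objects. This is a minimal sketch under that assumption, not a confirmed fix. The function name get_averages_as_float is hypothetical, and the conversion gives up Decimal precision beyond what a double can hold.

```python
from decimal import Decimal

import numpy


def get_averages_as_float(data):
    """Return the mean and median of a list of Decimal values,
    computed on a float64 array rather than an object array."""
    # Explicit conversion; precision beyond a double is lost here.
    values = numpy.array([float(x) for x in data], dtype=numpy.float64)
    return {
        'mean': float(numpy.mean(values)),
        'median': float(numpy.median(values)),
    }


sample = [Decimal('1.5'), Decimal('2.5'), Decimal('5.0')]
averages = get_averages_as_float(sample)
print(averages)
```

If exact Decimal results are required, this approach is not suitable as written, because the returned values are floats.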

I don't know whether it matters, but if the list elements are decimal.Decimal, the array built from them will have dtype=object.

What puzzles me, though, is that the code even tries to access the .c source file. That should only happen during compilation.

NumPy can and will fault in C code, because it is mostly written in C with only a thin layer of Python on top.

@msw Understood. What I don't understand is why it fails when multiple threads are running.
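The dtype=object remark in the comments can be checked directly. The short sketch below uses made-up sample values; it shows that numpy keeps Decimal values as Python objects, so sorting such an array has to compare the objects themselves instead of raw numeric values.

```python
from decimal import Decimal

import numpy

decimals = [Decimal('5.0'), Decimal('1.5'), Decimal('2.5')]

# numpy has no native Decimal type, so the elements stay Python objects.
as_objects = numpy.array(decimals)
print(as_objects.dtype)
print(type(as_objects[0]))

# Sorting an object array compares the Decimal objects themselves.
sorted_objects = numpy.sort(as_objects)
print(list(sorted_objects))
```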