Improving the performance of scipy's Anderson-Darling two-sample test

Tags: python, performance, scipy, statistical-test

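I need to apply the Anderson-Darling test to two 1D samples hundreds of thousands of times. The implementation in scipy is straightforward and works properly, but it takes a considerable amount of time. I would like to improve its performance, given that I know that:

  • I will always compare exactly two samples
  • I only need the Anderson-Darling test statistic, i.e., I do not need the critical values or the p-value

By removing unnecessary checks from the test, I managed to improve performance by almost 30%.

Can this be improved any further?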

    import numpy as np
    from scipy.stats import anderson_ksamp
    import time as t
    
    def main():
        """
        Test scipy's original vs the simplified Anderson-Darling tests.
        """
        t1, t2 = 0., 0.
        AD_all = []
        for _ in range(1000):
            N = np.random.randint(10, 200)
            aa = np.random.uniform(0., 1., N)
            bb = np.random.uniform(0., 1., N)
    
            s = t.time()
            AD = anderson_ksamp([aa, bb])[0]
            t1 += t.time() - s
            s = t.time()
            AD2 = anderson_ksamp_new([aa, bb])
            t2 += t.time() - s
    
            # Check that both values are equal
            AD_all.append([AD, AD2])
    
        AD_all = np.array(AD_all).T
        print((AD_all[0] == AD_all[1]).all())
        print("Improvement: {:.1f}%".format(100. - 100. * t2 / t1))
    
    
    def anderson_ksamp_new(samples):
        """
        Normalized two-sample Anderson-Darling statistic, based on A2akN
        (equation 7 of Scholz and Stephens 1987).

        Parameters
        ----------
        samples : sequence of two 1-D numpy arrays
            The sample arrays (the `.size` attribute is used below).

        Returns
        -------
        A2 : float
            A2akN minus (k - 1), divided by sqrt(sigmasq).

        Notes
        -----
        Z : sorted array of all observations.
        Zstar : sorted array of unique observations.
        k : number of samples (always 2 here).
        n : number of observations in each sample.
        N : total number of observations.
        """
    
        k = 2  # This will always be 2
    
        Z = np.sort(np.hstack(samples))
        N = Z.size
        Zstar = np.unique(Z)
    
        n = np.array([sample.size for sample in samples])
    
        A2kN = 0.
        Z_ssorted_left = Z.searchsorted(Zstar, 'left')
        if N == Zstar.size:
            lj = 1.
        else:
            lj = Z.searchsorted(Zstar, 'right') - Z_ssorted_left
        Bj = Z_ssorted_left + lj / 2.
        for i in np.arange(0, k):
            s = np.sort(samples[i])
            s_ssorted_right = s.searchsorted(Zstar, side='right')
            Mij = s_ssorted_right.astype(float)
            fij = s_ssorted_right - s.searchsorted(Zstar, 'left')
            Mij -= fij / 2.
            inner = lj / float(N) * (N*Mij - Bj*n[i])**2 / (Bj*(N - Bj) - N*lj/4.)
            A2kN += inner.sum() / n[i]
        A2kN *= (N - 1.) / N
    
        H = (1. / n).sum()
        hs_cs = (1. / np.arange(N - 1, 1, -1)).cumsum()
        h = hs_cs[-1] + 1
        g = (hs_cs / np.arange(2, N)).sum()
    
        a = (4*g - 6) * (k - 1) + (10 - 6*g)*H
        b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
        c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
        d = (2*h + 6)*k**2 - 4*h*k
        sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
        m = k - 1
        A2 = (A2kN - m) / np.sqrt(sigmasq)
    
        return A2
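
A slightly more robust way to run this comparison is sketched below. It is not part of the original question: `compare_implementations`, `stat_ref` and `stat_new` are hypothetical names, and the two stand-in functions exist only so the snippet runs without scipy. The sketch times with `time.perf_counter`, hands each function its own copies of the samples (so an implementation that sorts its input in place cannot affect the other), and checks agreement with `np.allclose` rather than exact floating-point equality.

```python
import time

import numpy as np


def compare_implementations(f_ref, f_new, n_trials=1000, seed=0):
    """Time two functions on the same random two-sample inputs.

    Returns (agree, improvement), where improvement is the percentage
    of time saved by f_new relative to f_ref.
    """
    rng = np.random.default_rng(seed)
    t_ref = t_new = 0.0
    ref_vals, new_vals = [], []
    for _ in range(n_trials):
        size = rng.integers(10, 200)
        aa = rng.uniform(0.0, 1.0, size)
        bb = rng.uniform(0.0, 1.0, size)

        # Copies are made outside the timed region.
        args_ref = [aa.copy(), bb.copy()]
        args_new = [aa.copy(), bb.copy()]

        start = time.perf_counter()
        ref_vals.append(f_ref(args_ref))
        t_ref += time.perf_counter() - start

        start = time.perf_counter()
        new_vals.append(f_new(args_new))
        t_new += time.perf_counter() - start

    agree = np.allclose(ref_vals, new_vals)
    improvement = 100.0 - 100.0 * t_new / t_ref
    return agree, improvement


# Hypothetical stand-ins so the sketch is self-contained.
def stat_ref(samples):
    return np.sort(np.hstack(samples)).sum()


def stat_new(samples):
    z = np.concatenate(samples)
    z.sort()
    return z.sum()


agree, improvement = compare_implementations(stat_ref, stat_new, n_trials=200)
print(agree)
print("Improvement: {:.1f}%".format(improvement))
```

To compare the real functions, you would pass `lambda s: anderson_ksamp(s)[0]` and `anderson_ksamp_new` in place of the stand-ins.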
    

You can get a small improvement by sorting in place with np.ndarray.sort instead of np.sort:

    Z = np.sort(np.hstack(samples))

becomes:

    Z = np.hstack(samples)
    Z.sort()

and

    s = np.sort(samples[i])

becomes:

    s = samples[i]
    s.sort()

With these modifications I get a 12% improvement compared to anderson_ksamp_new (your version).
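
One thing to keep in mind with the in-place variant: `Z` is a fresh array, so sorting it in place has no side effects, but `s = samples[i]` does not copy, so `s.sort()` reorders the caller's array. A minimal sketch of the difference, using made-up sample data:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, 50)
b = rng.uniform(0.0, 1.0, 50)
a_before = a.copy()

# np.sort returns a sorted copy; the inputs are left untouched.
z_copy = np.sort(np.hstack([a, b]))
inputs_untouched = np.array_equal(a, a_before)

# ndarray.sort sorts in place and returns None. z_inplace is a new
# array created by hstack, so a and b are still not modified here.
z_inplace = np.hstack([a, b])
z_inplace.sort()
same_result = np.array_equal(z_copy, z_inplace)

# Sorting a sample in place does reorder the caller's array,
# because s and a refer to the same object.
s = a
s.sort()
caller_array_changed = not np.array_equal(a, a_before)

print(inputs_untouched, same_result, caller_array_changed)
```

If the original order of the samples matters elsewhere in your program, you would need to pass copies, which gives back some of the saving.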


In addition, you can use np.concatenate instead of np.hstack. Combined with the in-place sorting, this gives me a 16% improvement.
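
For 1-D inputs the two calls produce the same array; `np.hstack` does some extra input handling before it concatenates. A rough way to check both the equivalence and the timing on your own machine is sketched below (the function names and sample sizes are made up for illustration, and the timings will vary):

```python
import timeit

import numpy as np

rng = np.random.default_rng(1)
samples = [rng.uniform(0.0, 1.0, 100), rng.uniform(0.0, 1.0, 100)]


def merge_hstack():
    # Original approach: hstack, then a sorted copy.
    return np.sort(np.hstack(samples))


def merge_concatenate():
    # Suggested approach: concatenate, then sort the new array in place.
    z = np.concatenate(samples)
    z.sort()
    return z


same = np.array_equal(merge_hstack(), merge_concatenate())

t_hstack = timeit.timeit(merge_hstack, number=5000)
t_concat = timeit.timeit(merge_concatenate, number=5000)
print("same result:", same)
print("hstack + np.sort:     {:.4f} s".format(t_hstack))
print("concatenate + .sort(): {:.4f} s".format(t_concat))
```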

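Thanks Jacques, I think this will do. I guess this is as good as it gets without some major assumptions or restrictions.

Not sure this is the best you can do, but these are the most obvious changes. I wanted to try argsort instead of sort, but I haven't had the time yet.

Just tried argsort, and it turns out it is not a good idea!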
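
The comment does not say how argsort was used. One possible reading is that `np.argsort` only returns the ordering, so getting the sorted values takes an additional indexing step on top of the sort itself. A small sketch under that assumption (names and sizes are illustrative only):

```python
import timeit

import numpy as np

rng = np.random.default_rng(2)
z = rng.uniform(0.0, 1.0, 200)


def with_sort():
    out = z.copy()
    out.sort()
    return out


def with_argsort():
    order = np.argsort(z)  # indices that would sort z
    return z[order]        # extra gather step to obtain the sorted values


same = np.array_equal(with_sort(), with_argsort())

t_sort = timeit.timeit(with_sort, number=5000)
t_argsort = timeit.timeit(with_argsort, number=5000)
print("same result:", same)
print("sort:    {:.4f} s".format(t_sort))
print("argsort: {:.4f} s".format(t_argsort))
```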