Comparing a sample mean against random draws in Python


python pandas


Given a DataFrame df:

            A         B         C
Date            
2010-01-17  -0.9304   3.7477    0.0000
2010-01-24  -3.6348   1.5733   -3.6348
2010-01-31  -1.8950   0.4957   -1.8950
2010-02-07  -0.6990  -0.1480   -0.6990
2010-02-14   1.4635  -3.4206    1.4635

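I want to compare the mean of df['C'] with 10,000 random series, each built by picking one element from either df['A'] or df['B'] for every date, and see where that mean ranks (1 if it is the highest, 0.95 if it is higher than 9,500 of the random series, and so on).

I wrote a function for this a while ago, but I can no longer piece it together. Maybe it helps: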

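import numpy as np
import numpy.random as npr

def mean_diff(d):
    # d maps each key to a (population, sample) pair
    result = {}
    for k, (l, t) in d.items():  # Python 3: items() replaces iteritems()
        m = np.mean(t)
        len_ = len(t)
        # share of 10000 resampled means (with replacement) that m exceeds
        result[k] = np.mean([m > np.mean(npr.choice(l, len_, True))
                             for _ in range(10000)])
    return result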

Thanks

** 10,000 because the original data has far more than 5 rows.

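Update:

To solve this, I had to start by solving a smaller problem. See this

Well, there is a shortcut:

Since columns A and B have the same number of elements, we can put them into a single list, draw 10,000 random samples from that list, and compare their means with the mean of C.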

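import numpy as np
import numpy.random as npr

sample = df['C'].values
a = df['A'].values
b = df['B'].values
# pool A and B into a single population
population = np.concatenate((a, b), axis=0)

def mean_diff(s, p):
    m = np.mean(s)
    len_ = len(s)
    # share of 10000 resampled means (with replacement) that m exceeds
    result = np.mean([m > np.mean(npr.choice(p, len_, True))
                      for _ in range(10000)])
    return result

mean_diff(sample, population)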
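
Note that the pooled shortcut resamples with replacement from all of A and B, so a single random series can take both values from one date and none from another. If the per-date rule from the question matters (exactly one of A or B for each date), a vectorized sketch could look like the following. The function name rank_of_mean and the n_draws and seed parameters are made up for illustration, and the frame is just the five rows shown in the question.

```python
import numpy as np
import pandas as pd

# Example frame copied from the question.
df = pd.DataFrame(
    {
        "A": [-0.9304, -3.6348, -1.8950, -0.6990, 1.4635],
        "B": [3.7477, 1.5733, 0.4957, -0.1480, -3.4206],
        "C": [0.0000, -3.6348, -1.8950, -0.6990, 1.4635],
    },
    index=pd.to_datetime(
        ["2010-01-17", "2010-01-24", "2010-01-31", "2010-02-07", "2010-02-14"]
    ),
)
df.index.name = "Date"


def rank_of_mean(df, n_draws=10000, seed=0):
    """Fraction of random A-or-B series whose mean is below mean(C)."""
    # The fixed seed is an assumption, only there to make runs repeatable.
    rng = np.random.default_rng(seed)
    ab = df[["A", "B"]].to_numpy()  # shape (n_dates, 2)
    # For every draw and every date, pick column 0 (A) or column 1 (B).
    picks = rng.integers(0, 2, size=(n_draws, len(df)))
    # Date i of each draw takes ab[i, picks[draw, i]].
    series = ab[np.arange(len(df)), picks]  # shape (n_draws, n_dates)
    return float(np.mean(df["C"].mean() > series.mean(axis=1)))


print(rank_of_mean(df))
```

With only 5 rows there are just 32 distinct A-or-B series, so the result is coarse here. On the full data the 10,000 draws should give a finer ranking.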