Python: Reporting the two-sample K-S statistic from two precomputed histograms


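Problem:

Here, I plot two datasets stored in text files (listed in dataset), each containing 21.8 billion data points. That makes the data too large to hold in memory as an array. I am still able to plot them as histograms, but I am not sure how to quantify their difference with a two-sample K-S test, because I cannot figure out how to access each histogram in the plt object.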

Example:

Here is some code that generates dummy data:

import numpy as np

mu = [100, 120]
sigma = 30
dataset = ['gsl_test_1.txt', 'gsl_test_2.txt']
for idx, file in enumerate(dataset):
    dist = np.random.normal(mu[idx], sigma, 10000)
    with open(file, 'w') as g:
        for s in dist:
            g.write('{}\t{}\t{}\n'.format('stuff', 'stuff', str(s)))

This generates my two histograms:

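Question:

Most examples use two arrays of raw data/observations/points/etc., but I do not have enough memory for that approach. Given the example above, how can I access these precomputed bins (from 'gsl_test_1.txt' and 'gsl_test_2.txt') to compute the K-S statistic between the two distributions?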

Bonus karma:

Record the K-S statistic and p-value on the graph!

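I cleaned up your code a little. Writing to StringIO is more streamlined than writing to a file. I set the default style with seaborn instead of matplotlib to make it look more modern. The bins thresholds should be the same for both samples if you want the statistical test to line up. I think that if you iterate through and make the bins this way, the whole thing may take longer than it needs to.

Counter might be useful because you only have to loop through once, and you will get the same bin size for both samples. Convert the floats to ints, since you are binning them together anyway: from collections import Counter, then C = Counter() and C[value] += 1. At the end you will have a dict, and you can make the bins from list(C.keys()). That should work well given how unwieldy your data is.

Also, you should see whether there is a way to read the chunks with numpy instead of pandas, because numpy is much faster at indexing. Try %timeit on DF.iloc[i, j] and ARRAY[i, j] and you will see what I mean. I wrote much of it as functions to try to make it more modular.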

import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
from io import StringIO
from scipy.stats import ks_2samp
import seaborn as sns; sns.set()

# %matplotlib inline  # IPython/Jupyter magic; uncomment when running in a notebook

#Added seaborn b/c it looks mo betta

mu = [100, 120]
sigma = 30

def write_random(file,mu,sigma=30):
    dist = np.random.normal(mu, sigma, 10000)
    for i,s in enumerate(dist):
        file.write('{}\t{}\t{}\n'.format("label_A-%d" % i, "label_B-%d" % i, str(s)))
    return(file)

#Writing to StringIO instead of an actual file
gs1_test_1 = write_random(StringIO(),mu=100)
gs1_test_2 = write_random(StringIO(),mu=120)

chunksize = 1000

def make_hist(fh,ax):
    # find the min, max, line qty, for bins
    low = np.inf
    high = -np.inf

    loop = 0

    fh.seek(0)
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, sep='\t'):
        low = np.minimum(chunk.iloc[:, 2].min(), low) #btw, iloc is way slower than numpy array indexing
        high = np.maximum(chunk.iloc[:, 2].max(), high) #you might wanna import and do the chunks with numpy
        loop += 1
    lines = loop*chunksize  # upper bound on the line count; the last chunk may be shorter

    nbins = math.ceil(math.sqrt(lines))

    bin_edges = np.linspace(low, high, nbins + 1)
    total = np.zeros(nbins, np.int64)  # int64 accumulator for the per-bin counts

    fh.seek(0)
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, delimiter='\t'):

        # compute bin counts over the 3rd column
        subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)  # np.ndarray filled with np.int64

        # accumulate bin counts over chunks
        total += subtotal

    ax.hist(bin_edges[:-1], bins=bin_edges, weights=total, alpha=0.5)  # draw on the Axes that was passed in

    return(ax,bin_edges,total)

#Make the plot canvas to write on to give it to the function
fig,ax = plt.subplots()

test_1_data = make_hist(gs1_test_1,ax)
test_2_data = make_hist(gs1_test_2,ax)

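# test_1_data[1] == test_2_data[1]: the bins should be the same if you're going to try and compare them...
# Note: ks_2samp receives the per-bin counts here, not the raw samples
ks_stat, p_value = ks_2samp(test_1_data[2], test_2_data[2])
ax.set_title("ks: %f, p_in_the_v: %f" % (ks_stat, p_value))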
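
The Counter suggestion in the answer above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name count_rounded is made up, the value is assumed to sit in the third tab-separated column (as in the dummy data), and rounding to the nearest integer is assumed to be an acceptable bin width.

```python
from collections import Counter
from io import StringIO

import numpy as np


def count_rounded(fh, column=2):
    """Stream a tab-separated file once and count values rounded to integers.

    Hypothetical helper for illustration; not part of any library.
    """
    counts = Counter()
    fh.seek(0)
    for line in fh:
        value = float(line.rstrip("\n").split("\t")[column])
        counts[int(round(value))] += 1
    return counts


# Small in-memory stand-ins for the two large text files
rng = np.random.default_rng(0)
files = []
for mu in (100, 120):
    fh = StringIO()
    for s in rng.normal(mu, 30, 10000):
        fh.write("stuff\tstuff\t{}\n".format(s))
    files.append(fh)

c1, c2 = (count_rounded(fh) for fh in files)

# Integer keys give both samples identical bins: one bin per integer value
bins = np.arange(min(min(c1), min(c2)), max(max(c1), max(c2)) + 1)
hist1 = np.array([c1[int(b)] for b in bins])  # missing keys count as 0
hist2 = np.array([c2[int(b)] for b in bins])
```

Because each file is read line by line, only the counters (one entry per distinct integer) stay in memory, and both histograms end up on the same bins without a separate min/max pass.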

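I think the K-S statistic reported here is wrong. ks_2samp works on raw samples, but you passed it the precomputed histograms, so the test actually operates on the frequencies in each bin. The OP's question specifically asked how to run K-S on precomputed histograms.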
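
One way to act on this comment, sketched under assumptions rather than taken from the thread: if both histograms are computed on the same bin edges, the cumulative sums of the counts give the two empirical CDFs at those edges, and the largest gap between them approximates the K-S statistic. The helper name ks_from_histograms is hypothetical, the p-value uses the asymptotic Kolmogorov distribution (scipy.stats.kstwobign), and the statistic is only as fine as the binning, so it can slightly underestimate the value computed from raw samples.

```python
import numpy as np
from scipy.stats import kstwobign


def ks_from_histograms(counts1, counts2):
    """Approximate two-sample K-S statistic and p-value from binned counts.

    Both count arrays must come from the SAME bin edges.
    Hypothetical helper for illustration; not a SciPy function.
    """
    counts1 = np.asarray(counts1, dtype=float)
    counts2 = np.asarray(counts2, dtype=float)
    n1, n2 = counts1.sum(), counts2.sum()

    # Empirical CDFs evaluated at the right edge of every bin
    cdf1 = np.cumsum(counts1) / n1
    cdf2 = np.cumsum(counts2) / n2

    # K-S statistic: largest vertical distance between the two CDFs
    d = np.max(np.abs(cdf1 - cdf2))

    # Asymptotic p-value from the Kolmogorov distribution
    en = np.sqrt(n1 * n2 / (n1 + n2))
    p = kstwobign.sf(en * d)
    return d, p


# Demo on small samples that mimic the dummy data
rng = np.random.default_rng(0)
a = rng.normal(100, 30, 10000)
b = rng.normal(120, 30, 10000)

# Shared edges spanning both samples
edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 201)
h1, _ = np.histogram(a, bins=edges)
h2, _ = np.histogram(b, bins=edges)

d, p = ks_from_histograms(h1, h2)
```

The resulting pair can be written onto the plot in the same way as before, for example ax.set_title("ks: %f, p: %f" % (d, p)). For the full-size files, the counts would be accumulated chunk by chunk on the shared edges, so the raw data never has to be in memory at once.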
