Python 3.x: removing records from a dataset by comparing two histograms

Tags: python-3.x, pandas, numpy, histogram

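I have two multi-column datasets (on the order of 10 columns) of different lengths, where each row is a record. They must end up with the same number of rows. The criterion is to bin on several columns (from 2 to 4) and then remove the excess records in each bin from one of the two datasets, choosing them at random among all the records in that bin.

I am currently using numpy, but pandas would be fine too.

Since I know in advance that one dataset is smaller than the other, my (naive) idea is to compute the two histograms (the smaller one first), subtract one from the other to get the difference in each bin, and then loop over the dataset to delete the excess records. The catch: I have to know which records fall in which bin.

Snippet used to compute the histograms in Python (for simplicity, with a two-column dataset):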

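Is there a way to keep track of the dataset indices while binning? I know that pandas DataFrames can have an index, so they might be a natural choice if I stick with this algorithm.

Is there a smarter way to do this, changing the algorithm but sticking with Python?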
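On the index-tracking part of the question, one numpy-only possibility is `np.digitize`, which returns a bin id for every record. The sketch below is only an illustration under assumptions: the array `data`, the edge arrays `edges0` and `edges1`, and the dictionary `records_in_bin` are hypothetical names, and the equally spaced edges are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 2))  # hypothetical two-column dataset

# Bin edges per column (assumed equally spaced; any increasing edges work)
edges0 = np.linspace(data[:, 0].min(), data[:, 0].max(), 4)  # 3 bins
edges1 = np.linspace(data[:, 1].min(), data[:, 1].max(), 3)  # 2 bins

# np.digitize gives the bin id of each record. Passing only the interior
# edges yields ids 0..nbins-1 and places the column maximum in the last bin.
bin0 = np.digitize(data[:, 0], edges0[1:-1])
bin1 = np.digitize(data[:, 1], edges1[1:-1])

# Map each (bin0, bin1) pair to the row indices of the records it contains
records_in_bin = {}
for row, (b0, b1) in enumerate(zip(bin0, bin1)):
    records_in_bin.setdefault((int(b0), int(b1)), []).append(row)
```

With this mapping, the per-bin counts are just the list lengths, and the rows to delete from a bin can be drawn at random from its list.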

I found a nice solution using pandas:

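import numpy as np
import pandas as pd

x = 50 * np.random.randn(50, 5)  # 50 records, 5 columns
dfx = pd.DataFrame(x)
# Bin column 0 using 10 equally spaced edges between its min and max.
# pd.cut intervals are right-closed by default, so the row holding the column
# minimum gets NaN (and is skipped by groupby) unless include_lowest=True.
bins = np.linspace(dfx[0].min(), dfx[0].max(), 10)
first_binning = pd.cut(dfx[0], bins)
# Bin column 1 using 5 equally spaced edges
bins = np.linspace(dfx[1].min(), dfx[1].max(), 5)
second_binning = pd.cut(dfx[1], bins)
# Group the records by their (column 0 bin, column 1 bin) pair
groups = dfx.groupby([first_binning, second_binning])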
Now you can (depending on your data) look at the counts:

In [160]: groups.size()
Out[160]:
0                    1
(-101.273, -71.403]  (50.481, 109.902]      2
(-71.403, -41.532]   (-68.362, -8.94]       4
                     (-8.94, 50.481]        3
                     (50.481, 109.902]      1
(-41.532, -11.661]   (-68.362, -8.94]       4
                     (-8.94, 50.481]        3
                     (50.481, 109.902]      2
(-11.661, 18.21]     (-127.783, -68.362]    2
                     (-8.94, 50.481]        6
                     (50.481, 109.902]      1
(18.21, 48.0806]     (-127.783, -68.362]    2
                     (-68.362, -8.94]       5
                     (-8.94, 50.481]        3
                     (50.481, 109.902]      3
(48.0806, 77.951]    (-68.362, -8.94]       2
                     (-8.94, 50.481]        4
(77.951, 107.822]    (-68.362, -8.94]       1
dtype: int64

and, of course, look at the dataset record indices:

In [163]: groups.indices
Out[163]:
{('(-101.273, -71.403]', '(50.481, 109.902]'): array([20, 37]),
 ('(-11.661, 18.21]', '(-127.783, -68.362]'): array([26, 39]),
 ('(-11.661, 18.21]', '(-8.94, 50.481]'): array([ 4, 14, 18, 34, 35, 45]),
 ('(-11.661, 18.21]', '(50.481, 109.902]'): array([17]),
 ('(-41.532, -11.661]', '(-68.362, -8.94]'): array([ 3, 13, 16, 30]),
 ('(-41.532, -11.661]', '(-8.94, 50.481]'): array([25, 38, 48]),
 ('(-41.532, -11.661]', '(50.481, 109.902]'): array([0, 5]),
 ('(-71.403, -41.532]', '(-68.362, -8.94]'): array([ 1, 24, 32, 47]),
 ('(-71.403, -41.532]', '(-8.94, 50.481]'): array([ 6, 19, 31]),
 ('(-71.403, -41.532]', '(50.481, 109.902]'): array([12]),
 ('(18.21, 48.0806]', '(-127.783, -68.362]'): array([21, 46]),
 ('(18.21, 48.0806]', '(-68.362, -8.94]'): array([ 2, 15, 22, 33, 40]),
 ('(18.21, 48.0806]', '(-8.94, 50.481]'): array([ 7, 28, 36]),
 ('(18.21, 48.0806]', '(50.481, 109.902]'): array([ 9, 23, 49]),
 ('(48.0806, 77.951]', '(-68.362, -8.94]'): array([41, 42]),
 ('(48.0806, 77.951]', '(-8.94, 50.481]'): array([27, 29, 43, 44]),
 ('(77.951, 107.822]', '(-68.362, -8.94]'): array([11])}

Does that help?
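To connect the grouping back to the original goal, here is a rough sketch of how the larger dataset might be thinned so that its per-bin counts match the smaller one. It is not part of the answer above and rests on assumptions: the frames `small` and `large`, the helper `bin_keys`, the shared edges, and the choice to bin on columns 0 and 1 are all hypothetical. Bins where the larger dataset has fewer records than the smaller one are left untouched, so the row counts come out equal only when the larger dataset dominates in every bin.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
small = pd.DataFrame(rng.normal(size=(200, 3)))   # hypothetical smaller dataset
large = pd.DataFrame(rng.normal(size=(1000, 3)))  # hypothetical larger dataset

def bin_keys(df, edges0, edges1):
    """Return integer bin ids for columns 0 and 1, indexed like df."""
    return pd.DataFrame({
        "b0": pd.cut(df[0], edges0, labels=False, include_lowest=True),
        "b1": pd.cut(df[1], edges1, labels=False, include_lowest=True),
    }, index=df.index)

# Shared edges spanning both datasets, so the bins are comparable
both = pd.concat([small, large])
edges0 = np.linspace(both[0].min(), both[0].max(), 6)
edges1 = np.linspace(both[1].min(), both[1].max(), 5)

# Target number of records per bin, taken from the smaller dataset
small_counts = bin_keys(small, edges0, edges1).groupby(["b0", "b1"]).size().to_dict()
large_keys = bin_keys(large, edges0, edges1)

# For each bin of the larger dataset, keep a random subset of the target size
keep = []
for key, rows in large_keys.groupby(["b0", "b1"]).groups.items():
    n_keep = min(small_counts.get(key, 0), len(rows))
    keep.extend(rng.choice(rows, size=n_keep, replace=False))

large_matched = large.loc[sorted(keep)]
```

Selecting the rows to keep, rather than deleting the excess ones, avoids modifying the frame inside the loop; the random choice per bin follows the selection rule stated in the question.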