How to group data by value ranges in Python


I have data like this:

[312.281,
 370.401,
 254.245,
 272.256,
 312.325,
 286.243,
 271.231,  ...]
Then I want to group them by value range:

for i in data:
    if i in range(200,300):
        data_200_300.append(i)
    elif i in range(300,400):
        data_300_400.append(i)

It doesn't work. What code should I use?

`range(200, 300)` returns a list of the integers between the two numbers, while your data contains floats, so `i in range(200, 300)` is never true for them. When the data contains floats, compare directly with `>` and `<` instead.
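A minimal fix of the loop from the question, using chained comparisons instead of `in range(...)`:

```python
data = [312.281, 370.401, 254.245, 272.256, 312.325, 286.243, 271.231]

data_200_300 = []
data_300_400 = []
for i in data:
    if 200 <= i < 300:    # chained comparison works for floats
        data_200_300.append(i)
    elif 300 <= i < 400:
        data_300_400.append(i)
```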

@AKS's answer is correct; as an alternative, you can also try it with a lambda expression like this:

result = filter(lambda x: 200 < x < 300, data)
You can apply it to your data across all the bins like this:

filtered_data = []
for i in range(200,400,100):
    filtered_data.append( filter(lambda x: i < x < i+100, data) )

>>> filtered_data
[[254.245, 272.256, 286.243, 271.231], [312.281, 370.401, 312.325]]
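Note that the snippets above assume Python 2.7, where `filter` returns a list. Under Python 3, `filter` returns a lazy iterator, so wrap it in `list()` (or use a list comprehension):

```python
data = [312.281, 370.401, 254.245, 272.256, 312.325, 286.243, 271.231]

filtered_data = []
for i in range(200, 400, 100):
    # list() materializes the lazy filter object on Python 3
    filtered_data.append(list(filter(lambda x: i < x < i + 100, data)))
```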

If you have many such values and can import numpy, there is a faster option than a string of if conditions or lambda filters. It uses logical (boolean) indexing:

import numpy as np

def indexingversion(data, bin_start, bin_end, bin_step):
    x = np.array(data)
    bin_edges = np.arange(bin_start, bin_end + bin_step, bin_step)
    bin_number = bin_edges.size - 1
    # one boolean column per bin, marking which values fall into it
    cond = np.zeros((x.size, bin_number), dtype=bool)
    for i in range(bin_number):
        cond[:, i] = np.logical_and(bin_edges[i] < x,
                                    x < bin_edges[i + 1])
    return [list(x[cond[:, i]]) for i in range(bin_number)]
Profiling output:

All the same? - True
Wrote profile results to bla.py.lprof
Timer unit: 1e-06 s

Total time: 0.580098 s
File: bla.py
Function: run_all at line 32

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    32                                           @profile
    33                                           def run_all():
    34         1            1      1.0      0.0      n = 100000
    35         1         3311   3311.0      0.6      x = np.random.random_integers(200, 400, n) + np.random.ranf(n)
    36         1            2      2.0      0.0      bin_start = 200
    37         1            1      1.0      0.0      bin_end = 400
    38         1            0      0.0      0.0      bin_step = 100
    39         1       263073 263073.0     45.3      a = forloop(x)
    40         1       301819 301819.0     52.0      b = lambdaversion(x, bin_start, bin_end, bin_step)
    41         1         7514   7514.0      1.3      c = indexingversion(x, bin_start, bin_end, bin_step)
    42         1         4377   4377.0      0.8      print('All the same? - ' + str(a == b == c))

As you can see in the Time or % Time column, numpy indexing is roughly 40 to 50 times faster, at least for 100,000 numbers. For very small inputs it is actually slower on my machine, though; it starts to win at around 40 values.
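For this kind of binning numpy also ships a built-in, `np.digitize`, which maps each value to a bin index in one call; a sketch on the example data, with the same bin edges as the question:

```python
import numpy as np

data = [312.281, 370.401, 254.245, 272.256, 312.325, 286.243, 271.231]
x = np.array(data)

bin_edges = np.arange(200, 500, 100)   # [200, 300, 400]
idx = np.digitize(x, bin_edges)        # 1 -> [200, 300), 2 -> [300, 400)
groups = [list(x[idx == i]) for i in range(1, bin_edges.size)]
```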


What should I do if I want to group within a column, e.g. df = [id, v1, v2, v3; 1, 12, 32, 23; 2, 65, 45, 22; 3, 55, 34, 76; ...], grouping based on the v3 column?
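If the data lives in a pandas DataFrame as the comment's `df` suggests, `pd.cut` can bin a single column and `groupby` can then collect the rows per bin; a sketch, assuming pandas is available and using illustrative bin edges:

```python
import pandas as pd

# the three rows from the comment
df = pd.DataFrame({'id': [1, 2, 3],
                   'v1': [12, 65, 55],
                   'v2': [32, 45, 34],
                   'v3': [23, 22, 76]})

# label each row with the range its v3 value falls into
df['v3_bin'] = pd.cut(df['v3'], bins=[0, 50, 100])
groups = {str(k): g['id'].tolist() for k, g in df.groupby('v3_bin')}
```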