Python: how to store groups of data by the repeated values in one column


python pandas numpy


Say I have a NumPy array like this:

import numpy as np

x= np.array(
    [[100, 14, 12, 15],
    [100, 21, 16, 11],
    [100, 19, 10, 13],
    [160, 24, 15, 12],
    [160, 43, 12, 65],
    [160, 17, 53, 23],
    [300, 15, 17, 11],
    [300, 66, 23, 12],
    [300, 44, 70, 19]])

The original array is much larger, so my question is: is there a way to bin or group the rows by the value in the first column? For example:

{'100': [[14, 12, 15], [21, 16, 11], [19, 10, 13]],
 '160': [[24, 15, 12], [43, 12, 65], [17, 53, 23]],
 '300': [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}

You can group the data with a collections.defaultdict and a loop:

from collections import defaultdict

data = defaultdict(list)
for l in x.tolist():  # plain Python lists, so keys and rows print as shown below
    data[l[0]].append(l[1:])

print(dict(data))

Output:

{100: [[14, 12, 15], [21, 16, 11], [19, 10, 13]],
 160: [[24, 15, 12], [43, 12, 65], [17, 53, 23]],
 300: [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}

I think you want something like this (edited):

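ls_dict = {}
for ls in x.tolist():  # plain lists, so the printed result matches the output below
    key = ls[0]        # first column is the group key
    value = ls[1:]     # remaining columns are the row to store
    if key in ls_dict:
        ls_dict[key].append(value)
    else:
        ls_dict[key] = [value]
print(ls_dict)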

Output:

{100: [[14, 12, 15], [21, 16, 11], [19, 10, 13]], 160: [[24, 15, 12], [43, 12, 65], [17, 53, 23]], 300: [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}

We are talking about large datasets, so performance probably matters, and the input is already a NumPy array. This post lists two NumPy approaches.

Approach #1

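Here is one approach that uses np.unique to get the row indices that separate the groups, then a dict comprehension to build the dictionary output -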

unq, idx = np.unique(x[:,0], return_index=1)
idx1 = np.r_[idx,x.shape[0]]
dict_out = {unq[i]:x[idx1[i]:idx1[i+1],1:] for i in range(len(unq))}

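This assumes that the first column is sorted, as suggested by the question title - "... repeated values in one column". If that is not the case, we need to use x[:,0].argsort() to sort the rows of x first and then proceed.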
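If the first column is not already sorted, the sorting step could look like the following minimal sketch. The array `x_unsorted` and the choice of `kind='stable'` (which keeps rows within a group in their original order) are illustrative assumptions, not part of the original answer. The `int()` cast on the keys is also an addition, made only so the keys are plain Python integers.

```python
import numpy as np

# Hypothetical unsorted input: a few rows of the sample, shuffled by hand.
x_unsorted = np.array(
    [[300, 15, 17, 11],
     [100, 14, 12, 15],
     [160, 24, 15, 12],
     [100, 21, 16, 11],
     [300, 66, 23, 12],
     [160, 43, 12, 65]])

# Reorder the rows so that equal keys in column 0 become adjacent.
# A stable sort keeps the original relative order inside each group.
order = x_unsorted[:, 0].argsort(kind='stable')
x_sorted = x_unsorted[order]

# Approach #1 applied to the sorted rows.
unq, idx = np.unique(x_sorted[:, 0], return_index=True)
idx1 = np.r_[idx, x_sorted.shape[0]]
dict_out = {int(unq[i]): x_sorted[idx1[i]:idx1[i + 1], 1:]
            for i in range(len(unq))}
```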

Sample input, output -

In [41]: x
Out[41]: 
array([[100,  14,  12,  15],
       [100,  21,  16,  11],
       [100,  19,  10,  13],
       [160,  24,  15,  12],
       [160,  43,  12,  65],
       [160,  17,  53,  23],
       [300,  15,  17,  11],
       [300,  66,  23,  12],
       [300,  44,  70,  19]])

In [42]: dict_out
Out[42]: 
{100: array([[14, 12, 15],
        [21, 16, 11],
        [19, 10, 13]]), 160: array([[24, 15, 12],
        [43, 12, 65],
        [17, 53, 23]]), 300: array([[15, 17, 11],
        [66, 23, 12],
        [44, 70, 19]])}

Approach #2

Here is another one that gets rid of np.unique to boost performance further -

idx1 = np.concatenate(([0],np.flatnonzero(x[1:,0] != x[:-1,0])+1, [x.shape[0]]))
dict_out = {x[i,0]:x[i:j,1:] for i,j in zip(idx1[:-1], idx1[1:])}
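To make the one-liner easier to follow, here is the same computation split into steps on the sample array from the question. The intermediate names `changes` and `starts` are introduced only for illustration, and the `int()` cast on the keys is an addition so that the keys are plain Python integers.

```python
import numpy as np

x = np.array(
    [[100, 14, 12, 15],
     [100, 21, 16, 11],
     [100, 19, 10, 13],
     [160, 24, 15, 12],
     [160, 43, 12, 65],
     [160, 17, 53, 23],
     [300, 15, 17, 11],
     [300, 66, 23, 12],
     [300, 44, 70, 19]])

# True wherever a row's key differs from the key of the row before it.
changes = x[1:, 0] != x[:-1, 0]

# Row positions where a new group starts (the first group is added below).
starts = np.flatnonzero(changes) + 1                  # [3, 6]

# Add the start of the first group and the end of the last one.
idx1 = np.concatenate(([0], starts, [x.shape[0]]))    # [0, 3, 6, 9]

# Consecutive pairs (0, 3), (3, 6), (6, 9) delimit the groups.
dict_out = {int(x[i, 0]): x[i:j, 1:] for i, j in zip(idx1[:-1], idx1[1:])}
```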

Runtime test

Approaches -

# @COLDSPEED's soln
from collections import defaultdict
def defaultdict_app(x):
    data = defaultdict(list)
    for l in x:
        data[l[0]].append(l[1:])
    return data

# @David Z's soln-1
import pandas as pd
def pandas_groupby_app(x):
    df = pd.DataFrame(x)
    return {key: group.iloc[:,1:] for key, group in df.groupby(0)}

# @David Z's soln-2
import itertools as it
def groupby_app(x):
    return {key: list(map(list, group)) for key, group in \
                        it.groupby(x, lambda row: row[0])}

# Proposed in this post    
def numpy_app1(x):
    unq, idx = np.unique(x[:,0], return_index=1)
    idx1 = np.r_[idx,x.shape[0]]
    return {unq[i]:x[idx1[i]:idx1[i+1],1:] for i in range(len(unq))}

# Proposed in this post    
def numpy_app2(x):
    idx1 = np.concatenate(([0],np.flatnonzero(x[1:,0] != x[:-1,0])+1, [x.shape[0]]))
    return {x[i,0]:x[i:j,1:] for i,j in zip(idx1[:-1], idx1[1:])}

Timings -

In [84]: x = np.random.randint(0,100,(10000,4))

In [85]: x[:,0].sort()

In [86]: %timeit defaultdict_app(x)
    ...: %timeit pandas_groupby_app(x)
    ...: %timeit groupby_app(x)
    ...: %timeit numpy_app1(x)
    ...: %timeit numpy_app2(x)
    ...: 
100 loops, best of 3: 4.43 ms per loop
100 loops, best of 3: 15 ms per loop
100 loops, best of 3: 12.1 ms per loop
1000 loops, best of 3: 310 µs per loop
10000 loops, best of 3: 75.6 µs per loop
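The `%timeit` lines above are IPython magics. Outside IPython, a comparable measurement could be sketched with the standard `timeit` module, as below. The helper name `best_time` and its `number`/`repeat` defaults are hypothetical choices. Timings depend on the machine and library versions, so no numbers are shown.

```python
import timeit
import numpy as np

def numpy_app1(x):
    unq, idx = np.unique(x[:, 0], return_index=True)
    idx1 = np.r_[idx, x.shape[0]]
    return {unq[i]: x[idx1[i]:idx1[i + 1], 1:] for i in range(len(unq))}

def numpy_app2(x):
    idx1 = np.concatenate(
        ([0], np.flatnonzero(x[1:, 0] != x[:-1, 0]) + 1, [x.shape[0]]))
    return {x[i, 0]: x[i:j, 1:] for i, j in zip(idx1[:-1], idx1[1:])}

def best_time(func, x, number=100, repeat=3):
    """Hypothetical helper: best average seconds per call over `repeat` runs."""
    runs = timeit.repeat(lambda: func(x), number=number, repeat=repeat)
    return min(runs) / number

# Same setup as the session above: random data with a sorted first column.
x = np.random.randint(0, 100, (10000, 4))
x[:, 0].sort()

# Check that both approaches agree before comparing their speed.
out1, out2 = numpy_app1(x), numpy_app2(x)
same = list(out1) == list(out2) and all(
    np.array_equal(out1[k], out2[k]) for k in out1)

timings = {f.__name__: best_time(f, x) for f in (numpy_app1, numpy_app2)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```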

Since you tagged this pandas, you may want to use a DataFrame for this. You would create a DataFrame from the original array

import pandas as pd
df = pd.DataFrame(x)

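and group by the first column; you can then iterate over the resulting GroupBy object to get the groups of rows that share the same value in the first column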

{key: group for key, group in df.groupby(0)}

Of course, in this snippet group still includes the first column. You can remove it with indexing:

{key: group.iloc[:,1:] for key, group in df.groupby(0)}

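If you want to convert the sub-frames back to NumPy arrays, use group.iloc[:,1:].values instead. (If you want them as lists of lists, as shown in your question, it shouldn't be hard to write a function that does the conversion, but it will probably be more efficient to keep the data in Pandas, or at least NumPy, if you can.)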
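For the list-of-lists form shown in the question, one possible conversion is sketched below. It uses `to_numpy()`, which the pandas documentation recommends over `.values`, followed by `tolist()`. The shortened sample array and the `int()` cast on the keys are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Shortened version of the sample array from the question.
x = np.array(
    [[100, 14, 12, 15],
     [100, 21, 16, 11],
     [160, 24, 15, 12],
     [160, 43, 12, 65],
     [300, 15, 17, 11],
     [300, 66, 23, 12]])

df = pd.DataFrame(x)

# Drop the key column, convert each sub-frame to a NumPy array,
# then to nested Python lists; int() turns the key into a plain int.
groups = {int(key): group.iloc[:, 1:].to_numpy().tolist()
          for key, group in df.groupby(0)}
print(groups)
```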

Another way to do this is with itertools.groupby, which lets you avoid Pandas (if you have a reason to) and use plain old iteration

import itertools as it
{key: list(map(list, group))
    for key, group in it.groupby(x, lambda row: row[0])}

This again includes the key in the resulting rows, but you can trim it off with indexing

{key: list(map(lambda a: list(a)[1:], group))
    for key, group in it.groupby(x, lambda row: row[0])}
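Note that `itertools.groupby` only merges consecutive items with equal keys, so like the NumPy approaches it relies on the rows already being ordered by the first column. A minimal sketch for input that may be unsorted follows; `rows` is a hypothetical small sample of plain lists, and `operator.itemgetter` replaces the lambda purely as a style choice.

```python
import itertools as it
from operator import itemgetter

# Hypothetical sample with the keys deliberately out of order.
rows = [[160, 24, 15, 12],
        [100, 14, 12, 15],
        [160, 43, 12, 65],
        [100, 21, 16, 11]]

first = itemgetter(0)

# sorted() is stable, so rows keep their original order within each key.
grouped = {key: [row[1:] for row in group]
           for key, group in it.groupby(sorted(rows, key=first), key=first)}
print(grouped)
# {100: [[14, 12, 15], [21, 16, 11]], 160: [[24, 15, 12], [43, 12, 65]]}
```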

You can use more_itertools (not included in the standard Python library) to make the code slightly cleaner.

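Disclosure: I contributed the groupby_transform() function to more_itertools.

If this or any other answer solved your problem, please consider accepting it by clicking the check mark. This shows the wider community that you have found a solution and gives some reputation to both the answerer and yourself.

@Jelmed12 What he said. :)

What do you plan to do with the result? That may determine which way of creating and storing the groups is most efficient.

Does the first column have to be sorted? If it is pre-sorted, then using unq = np.array(set(x[:,0])) and idx1 = np.r_[np.searchsorted(x, unq), x.shape[0]] in numpy_app1 might be faster. Or just use searchsorted instead of taking idx from np.unique (which I think is the slow part).

@DanielF Yes, approach #2 explicitly exploits the sorted nature of the data and proves to be more efficient than np.unique. I included the np.unique version to keep the code compact.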
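The searchsorted idea from the comments could be sketched as follows. This is an adaptation, not the commenter's exact code: the unique keys come from np.unique (a Python set is unordered and does not convert directly into a usable array), and the search runs on the first column `x[:, 0]` rather than on the whole array. It assumes the first column is sorted. The function name `searchsorted_app` is hypothetical, and no claim is made here about its speed relative to the other approaches.

```python
import numpy as np

def searchsorted_app(x):
    """Group rows of x by a pre-sorted first column (hypothetical helper)."""
    keys = x[:, 0]
    unq = np.unique(keys)
    # Leftmost position of each unique key marks the start of its group.
    starts = np.searchsorted(keys, unq)
    idx1 = np.r_[starts, x.shape[0]]
    return {int(unq[i]): x[idx1[i]:idx1[i + 1], 1:] for i in range(len(unq))}

# Shortened sample with groups of different sizes.
x = np.array(
    [[100, 14, 12, 15],
     [100, 21, 16, 11],
     [160, 24, 15, 12],
     [300, 15, 17, 11],
     [300, 66, 23, 12]])
result = searchsorted_app(x)
```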