Grouping equivalent rows into a 2D array in Python for a very large dataset

python pandas numpy machine-learning

I have 100k rows that I want to group in Python as described below. A plain Python loop takes a long time. How can I optimize this using any Python ML library?

    [[1,2,3,4],[2,3],[1,2,3],[2,3],[1,2,3],[1,2,3,4],[1],[2]...]

    Output
    [[0,5],[1,3],[2,4],[6],[7]]

    Explanation:  indices 0 and 5 hold the same list;
                  indices 1 and 3 hold the same list;
                  indices 2 and 4 hold the same list;
                  indices 6 and 7 have no match

I have 100k sublists and I want to group them in Python as explained above.

A simple solution is to convert the lists to tuples, then just groupby and access the .groups attribute if you want to know the indices of each group.

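import pandas as pd

# One sublist per row; lists are unhashable, so they cannot serve as group keys directly
df = pd.DataFrame({'vals': [[1,2,3,4], [2,3], [1,2,3], [2,3],
                            [1,2,3], [1,2,3,4], [1], [2], [2,2], [2,1,3]]})

# Convert each list to a hashable tuple and group on it;
# .groups maps each distinct tuple to the index labels of its rows
df.groupby(df.vals.apply(tuple)).groups
# Output as printed by pandas < 2.0 (newer versions show Index instead of Int64Index):
#{(1,): Int64Index([6], dtype='int64'),
# (1, 2, 3): Int64Index([2, 4], dtype='int64'),
# (1, 2, 3, 4): Int64Index([0, 5], dtype='int64'),
# (2,): Int64Index([7], dtype='int64'),
# (2, 1, 3): Int64Index([9], dtype='int64'),
# (2, 2): Int64Index([8], dtype='int64'),
# (2, 3): Int64Index([1, 3], dtype='int64')}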

If you need the list of grouped indices, try this:

df.reset_index().groupby(df.vals.apply(tuple))['index'].apply(list).sort_values().tolist()
#[[0, 5], [1, 3], [2, 4], [6], [7], [8], [9]]
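
If pandas is not a requirement, the same tuple idea can probably be sketched with a plain dictionary, which makes a single pass over the data. This is only a minimal sketch: the function name group_equal_rows is hypothetical and not part of the answer above, and it assumes that order within a list matters, so [1,2,3] and [2,1,3] land in different groups.

```python
from collections import defaultdict

def group_equal_rows(rows):
    """Return lists of indices whose rows are identical (order-sensitive).

    Hypothetical helper for illustration, not from the original answer.
    """
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        # tuple(row) is hashable, so it can be used as a dictionary key
        groups[tuple(row)].append(idx)
    # dicts preserve insertion order, so groups appear by first occurrence
    return list(groups.values())

rows = [[1, 2, 3, 4], [2, 3], [1, 2, 3], [2, 3],
        [1, 2, 3], [1, 2, 3, 4], [1], [2]]
result = group_equal_rows(rows)
print(result)  # [[0, 5], [1, 3], [2, 4], [6], [7]]
```

Each row is hashed once, so the work grows linearly with the number of rows, which should be manageable for 100k sublists.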

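Does the order matter? Is [1,2,3] the same as [2,1,3]?

Yes, order matters, because in my case the lists are copies from a database.

How many elements are there typically in each list? What could be the maximum value across all elements?

Pad the lists to produce a 2D array, then use numpy.unique(my_array, axis=0) to find the unique rows and finally the indices.

@Divakar Each list has 20 elements.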
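
The padding suggestion in the comments could look roughly like the sketch below. It is an illustration under assumptions, not a tested recipe from the thread: the helper name group_rows_numpy and the pad_value of -1 are hypothetical, and the padding value must never occur in the real data, otherwise [2] and [2, -1] would be treated as equal. If every list really has the same length, the padding step does nothing beyond building the array.

```python
import numpy as np

def group_rows_numpy(rows, pad_value=-1):
    """Group indices of identical rows via np.unique on a padded 2D array.

    Hypothetical helper; pad_value is an assumption and must not
    appear in the data.
    """
    width = max(len(r) for r in rows)
    padded = np.full((len(rows), width), pad_value, dtype=np.int64)
    for i, r in enumerate(rows):
        padded[i, :len(r)] = r  # left-align each list, rest stays padded

    # axis=0 compares whole rows; inverse[i] is the group label of row i
    _, inverse = np.unique(padded, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)  # guard against shape differences across NumPy versions

    # A stable sort by label keeps the original indices ascending within each group
    order = np.argsort(inverse, kind="stable")
    split_points = np.cumsum(np.bincount(inverse))[:-1]
    return [g.tolist() for g in np.split(order, split_points)]

rows = [[1, 2, 3, 4], [2, 3], [1, 2, 3], [2, 3],
        [1, 2, 3], [1, 2, 3, 4], [1], [2]]
groups = group_rows_numpy(rows)
print(sorted(groups))  # [[0, 5], [1, 3], [2, 4], [6], [7]]
```

The groups come back in the sorted order of the unique rows rather than by first occurrence, hence the sorted() call in the usage line.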