Python: Remove rows from a numpy array when they repeat fewer than n times


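Motivation:

I have a dataset 1 GB in size. It has 29,118,021 samples and 108,390 classes.

However, some classes have only 1 sample, or 3 samples, and so on.

Problem: I want to remove from a numpy array the rows/classes that appear fewer than N times.

Reference

Failed attempt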

import numpy as np
import loader  # module that provides load()

train_x, train_y, test_x, test_id = loader.load()

n_samples = train_y.shape[0]
# train_y has shape (n, 1), so flatten it before counting
unique_labels, y_inversed = np.unique(train_y.ravel(), return_inverse=True)
label_counts = np.bincount(y_inversed)
min_labels = np.min(label_counts)

print("Total Rows ", n_samples)
print("unique_labels ", unique_labels.shape[0])
print("label_counts ", label_counts[:])
print("min labels ", min_labels)

# Count the occurrences of every label by brute force:
# one full pass over train_y per unique label
unique_amounts = np.zeros(shape=unique_labels.shape, dtype=np.int64)
for u in range(unique_labels.shape[0]):
    if u % 100 == 0:
        print("Processed ", u)
    for index in range(train_y.shape[0]):
        if train_y[index] == unique_labels[u]:
            unique_amounts[u] = unique_amounts[u] + 1

# Report the labels that occur exactly once
for k in range(unique_amounts.shape[0]):
    if unique_amounts[k] == 1:
        print("\n")
        print("value :", unique_amounts[k])
        print("at ", k)
The code above takes far too long. Even after I let it run on a server overnight, it had not gotten halfway through.


Load method

This is my load method. I can also load the data and keep it as a DataFrame.

import numpy as np
import pandas as pd

def load():
    train = pd.read_csv('input/train.csv', index_col=False, header='infer')
    test = pd.read_csv('input/test.csv', index_col=False, header='infer')

    # drop useless columns
    train.drop('row_id', axis=1, inplace=True)

    # training features and labels as column vectors of shape (n, 1)
    # (to_numpy() replaces as_matrix(), which newer pandas versions no longer provide)
    acc = train["accuracy"].to_numpy().reshape(-1, 1)
    x = train["x"].to_numpy().reshape(-1, 1)
    y = train["y"].to_numpy().reshape(-1, 1)
    time = train["time"].to_numpy().reshape(-1, 1)
    train_y = train["place_id"].to_numpy().reshape(-1, 1)

    # stack the feature columns side by side: shape (n, 4)
    train_x = np.hstack((acc, x, y, time))

    # same feature columns for the test set
    acc = test["accuracy"].to_numpy().reshape(-1, 1)
    x = test["x"].to_numpy().reshape(-1, 1)
    y = test["y"].to_numpy().reshape(-1, 1)
    time = test["time"].to_numpy().reshape(-1, 1)
    test_id = test['row_id'].to_numpy()

    test_x = np.hstack((acc, x, y, time))

    return train_x, train_y, test_x, test_id
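
I would keep your data in DataFrame format. That way you can use some useful methods from the pandas module, which should be faster than looping.

First, use df['labels'].value_counts() to get the distinct labels in df together with their counts. (I am assuming the label column is named 'labels'.)

Then select only the labels that occur in fewer than n_min rows of the dataframe, and keep the rows whose label is not among them:

vc = df['labels'].value_counts()
labels = vc[vc < n_min].index
# filter by label value; df.drop(labels) would look the labels up in the index instead
df = df[~df['labels'].isin(labels)]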

Hope this helps.
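
As a minimal sketch of the pandas approach, assuming a toy DataFrame with a 'labels' column (the data and the `n_min` value below are made up for illustration), rows whose label occurs fewer than `n_min` times can be filtered out with a boolean mask:

```python
import pandas as pd

# Toy data, purely illustrative: label 7 occurs 3 times, 8 twice, 9 once.
df = pd.DataFrame({
    "x": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "labels": [7, 7, 8, 9, 7, 8],
})
n_min = 2  # hypothetical threshold

# Count how often each label occurs.
vc = df["labels"].value_counts()

# Labels that occur fewer than n_min times.
rare_labels = vc[vc < n_min].index

# Keep only the rows whose label is not rare.
filtered = df[~df["labels"].isin(rare_labels)]

print(filtered["labels"].tolist())  # [7, 7, 8, 7, 8]
```

The mask is built from label values, so the result does not depend on how the DataFrame is indexed.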

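The numpy_indexed package (disclaimer: I am its author) contains a multiplicity function that provides a very readable way to perform this kind of operation: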

import numpy_indexed as npi
samples_mask = npi.multiplicity(train_y) >= n_min
filtered_train_y = train_y[samples_mask]

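A class or label only makes sense for a DataFrame, not for a numpy array.

I can load it as a DataFrame. Thanks for your reply, it will definitely help others with the same problem!

I will use this numpy_indexed implementation. Thanks. One question, though: if I remove row 'r' from the train_y array, I also have to remove row 'r' from train_x. Any ideas?

Just add
filtered_train_x = train_x[samples_mask]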
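
For reference, the same filtering can be sketched in plain NumPy without extra packages. This is an illustration under stated assumptions, not the answer's own method: the toy arrays and `n_min` are invented, and `np.unique` with `return_inverse` and `return_counts` stands in for the multiplicity function.

```python
import numpy as np

# Toy stand-ins for train_x / train_y (illustrative values only).
train_x = np.arange(12, dtype=float).reshape(6, 2)
train_y = np.array([7, 7, 8, 9, 7, 8])
n_min = 2  # hypothetical threshold

# inverse[i] is the position of train_y[i] among the sorted unique labels;
# counts[j] is how often the j-th unique label occurs.
_, inverse, counts = np.unique(
    train_y.ravel(), return_inverse=True, return_counts=True
)

# Occurrence count of each sample's own label.
per_sample_counts = counts[inverse]

# One mask, applied to both arrays so their rows stay aligned.
samples_mask = per_sample_counts >= n_min
filtered_train_x = train_x[samples_mask]
filtered_train_y = train_y[samples_mask]

print(filtered_train_y)        # [7 7 8 7 8]
print(filtered_train_x.shape)  # (5, 2)
```

Because both arrays are indexed with the same boolean mask, every removed label takes its feature row with it.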