Python: Why does np.median() return multiple rows?

python arrays numpy

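I have a numpy array with 100 rows and 16026 columns. I need to find the median of each column, so each column's median is computed from 100 observations (the 100 rows in this case). I'm using the following code to do this: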

for category in categories:
    indices = np.random.randint(0, len(os.listdir(filepath + category)) - 1, 100)
    tempArray = X_train[indices, ]
    medArray = np.median(tempArray, axis=0)
    print(medArray.shape)

This is the result I get:

(100, 16026)
(100, 16026)
(100, 16026)
(100, 16026)

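My question is: why does medArray have shape 100*16026 rather than 1*16026? Since I'm computing the median of each column, I expected a single row with 16026 columns. What am I missing?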

Note that X_train is a sparse matrix:

X_train.shape

Output:

(2034, 16026)

Any help with this is much appreciated.

Edit:

The problem above was solved with the toarray() function:

tempArray = X_train[indices, ].toarray()

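I also realize I was foolish to include all the zeros in my median calculation, which is why I always got 0 as the median. Is there a simple way to compute the median while removing/ignoring the zeros in every column?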

This is really strange. I think you should get (16026,). Are we missing something?

In [241]:

X_train=np.random.random((1000,16026)) #1000 can be any int.
indices = np.random.randint(0, 60, 100) #60 can be any int.
tempArray = X_train[indices, ]
medArray = np.median(tempArray, axis=0)
print(medArray.shape)

(16026,)

The only way to get a 2d array as the result is:

In [243]:

X_train=np.random.random((100,2,16026))
indices = np.random.randint(0, 60, 100)
tempArray = X_train[indices, ]
medArray = np.median(tempArray, axis=0)
print(medArray.shape)


(2, 16026)

that is, when you have a 3d array as input.
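The shape rule behind these two results can be sketched on a small dense array: np.median removes the axis it reduces over, so a 2d input yields a 1d result. If a 1*N row is wanted, as the question expected, the keepdims argument keeps the reduced axis with length 1. The array size and the seed here are made up for illustration.

```python
import numpy as np

# Small stand-in for the 100 x 16026 array in the question (sizes are arbitrary).
rng = np.random.default_rng(0)
dense = rng.random((100, 8))

# Reducing over axis 0 drops that axis: one median per column, as a 1d array.
col_medians = np.median(dense, axis=0)

# keepdims=True keeps the reduced axis with length 1, giving a single row.
col_medians_row = np.median(dense, axis=0, keepdims=True)

print(col_medians.shape)      # (8,)
print(col_medians_row.shape)  # (1, 8)
```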

When it is a sparse array, a dumb way to work around this might be:

In [319]:

from scipy import sparse
X_train = sparse.rand(112, 16026, 0.5, 'csr') #just make up a random sparse array
indices = np.random.randint(0, 60, 100)
tempArray = X_train[indices, ]
medArray = np.median(tempArray.toarray(), axis=0)
print(medArray.shape)
(16026,)

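.toarray() could also go on the third line. Either way, it means the zeros are included in the median, as @zhangxaochen pointed out. Unsurprisingly, there may be a better explanation.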

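The problem is that NumPy does not recognize sparse matrices as arrays or array-like objects. For example, calling asanyarray on a sparse matrix returns a 0D array whose single element is the original sparse matrix: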

In [8]: numpy.asanyarray(scipy.sparse.csc_matrix([[1,2,3],[4,5,6]]))
Out[8]:
array(<2x3 sparse matrix of type '<type 'numpy.int64'>'
        with 6 stored elements in Compressed Sparse Column format>, dtype=object)

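Like most of NumPy, numpy.median relies on receiving an array or array-like object as input. If you give it a sparse matrix, the routines it depends on, particularly sort, won't understand what they're looking at.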
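A minimal sketch of this behaviour, using a tiny made-up matrix: wrapping a SciPy sparse matrix with np.asanyarray produces a 0D object array, whereas converting it with toarray() first gives NumPy a real 2d array to reduce. The variable names and values are illustrative only.

```python
import numpy as np
from scipy import sparse

# A tiny sparse matrix standing in for X_train (values are made up).
sparse_matrix = sparse.csr_matrix(np.array([[1.0, 0.0, 3.0],
                                            [0.0, 5.0, 6.0]]))

# NumPy does not unpack the sparse matrix; it wraps it as a single object.
wrapped = np.asanyarray(sparse_matrix)
print(wrapped.shape, wrapped.dtype)  # 0D array of dtype object

# Converting to a dense array first gives the expected per-column medians.
dense = sparse_matrix.toarray()
col_medians = np.median(dense, axis=0)
print(dense.shape, col_medians)
```

Note that the dense medians still count the zeros, which is the second issue raised in the question.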

I was finally able to solve this. I used masked arrays and the following code:

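# Assumes X_train (sparse), newsgroups_train and vectorizer already exist, and
# that i is the index of the current category; none are defined in this snippet.
import numpy as np
from random import randrange

sample_size = 50
# np.flatnonzero stands in for matplotlib.mlab.find, which newer matplotlib removed.
idx = np.flatnonzero(newsgroups_train.target == i)
random_index = []
for j in range(sample_size):
    random_index.append(randrange(0, len(idx) - 1))
# The original never filled sample; storing the sampled row indices is an assumption.
sample = [idx[random_index]]

# Densify once, then mask the zeros so the median ignores them.
dense = X_train[sample[0]].toarray()
y = np.ma.masked_where(dense == 0, dense)
medArray = np.ma.median(y, axis=0).filled(0)
print('============median ' + newsgroups_train.target_names[i] + '=============')
# Print the ten words with the largest nonzero median.
# (Recent scikit-learn names this method get_feature_names_out().)
for k, word in enumerate(np.array(vectorizer.get_feature_names())[np.argsort(medArray)[::-1][0:10]]):
    print(word + ':' + str(np.sort(medArray)[::-1][k]))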

This gives me the medians with the zeros ignored.
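As a possible alternative to the masked-array approach, here is a sketch that never builds the dense array: in CSC format the stored (nonzero) values of each column sit contiguously in the data attribute, so the median can be taken per column slice. The helper name nonzero_column_medians and the example matrix are hypothetical, and returning 0.0 for an all-zero column is an assumption that mirrors the filled(0) call above.

```python
import numpy as np
from scipy import sparse


def nonzero_column_medians(matrix):
    """Median of the nonzero entries of each column; 0.0 for all-zero columns."""
    # Copy into CSC so each column's stored values are one contiguous slice.
    csc = sparse.csc_matrix(matrix, dtype=float, copy=True)
    csc.eliminate_zeros()  # drop explicitly stored zeros, if any
    medians = np.zeros(csc.shape[1])
    for j in range(csc.shape[1]):
        column_values = csc.data[csc.indptr[j]:csc.indptr[j + 1]]
        if column_values.size:
            medians[j] = np.median(column_values)
    return medians


# Made-up example: the last column is entirely zero.
example = sparse.csr_matrix(np.array([[0, 2, 0],
                                      [4, 0, 0],
                                      [6, 8, 0],
                                      [0, 10, 0]]))
result = nonzero_column_medians(example)
print(result)  # [5. 8. 0.]
```

The Python loop over columns is simple rather than fast; for 16026 columns it should still be manageable, but that has not been measured here.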

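The 2d array result makes sense. I'm trying to do exactly the same thing, but somehow my output gives me 100*16026.

Oh, actually, now that you mention it is sparse, things are different. Which format is it, csc?

It's actually csr, but I'm not sure what that means!

I highly suspect it has to do with the structure of the sparse array. If I just feed in a random array, say np.median(sparse.rand(5,5,0.6,'csr')[1,], axis=0), I get IndexError: axis 0 out of bounds (0).

I also tried using toarray(), but got incorrect results, mainly because I foolishly included the zeros in the median calculation. I'll do it again, this time without the zeros.

Sorry for misreading your question. What is your tempArray.shape?

tempArray.shape = (100, 16026)

Do you want the median of the dense array (including the 0s)?

@zhangxaochen: I think you have a very good point here. In my mad struggle with numpy I forgot the basics of statistics: I was including the zeros when computing the median. No wonder I got all zeros; I guess it was correctly computing the median of each column, which just happens to be zero. I want to exclude the zeros when computing the median.

Does this mean np.median(xxxxx, axis=0) will result in IndexError: axis 0 out of bounds (0)? Somehow the OP didn't get that exception.

@CTZhu: You'd think so, but when the input is 0D or 1D, sort doesn't seem to throw an error for an out-of-bounds axis. I suspect it assumes all possible axis values are equivalent for 0D and 1D arrays and doesn't bother looking at the specified axis value.

To get rid of this problem, I used the following code: tempArray[0, np.nonzero(tempArray[0])].tolist(). This converts my tempArray into a plain python list. Now, if I compute the median of each column of tempArray, it should work, but it doesn't. When I run the following code to compute the median: print np.nonzero(np.median(tempArray)), I get an empty array, and I'm 100% sure there are values in tempArray.

@Patthebug: If the data is sparse, the median of any given row may be 0. Could that be the cause?

@user2357112: I think your point really makes sense. In my mad struggle with numpy I forgot the basics of statistics: I was including the zeros when computing the median. No wonder I got all zeros; I guess it was correctly computing the median of each column, which just happens to be zero. I want to exclude the zeros when computing the median.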