Python NumPy matrix dimensions for tf-idf vectors


I am trying to solve a clustering problem. I have a list of tf-idf weighted vectors generated by the CountVectorizer() function. This is the data type:

<1000x5369 sparse matrix of type '<type 'numpy.float64'>'
with 42110 stored elements in Compressed Sparse Row format>
The similarity function is:

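import scipy.spatial.distance

def cosine_similarity(vector1, vector2):
    # scipy's cosine() returns a distance, so similarity is 1 - distance
    score = 1 - scipy.spatial.distance.cosine(vector1, vector2)
    return score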
I get an error:

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    sim_scores=[cosine_similarity(vector,centroid) for vector in tfidf_vec_list]
  File "/home/ashwin/Desktop/Python-2.7.9/programs/test_2.py", line 28, in cosine_similarity
    score=1-scipy.spatial.distance.cosine(vector1,vector2)
  File "/usr/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 287, in cosine
    dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 302, in __mul__
    raise ValueError('dimension mismatch')

I have tried everything, including converting the matrix to an array and converting each vector to a list, but I get the same error.

scipy.spatial.distance.cosine does not appear to support sparse matrix input. Specifically, np.linalg.norm(sparse_vector) fails.

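If you convert both input vectors (which here are actually row vectors in matrix form) to their dense versions before passing them in, it works fine: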

>>> xs
<1x4 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>
>>> ys
<1x4 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>
>>> cosine(xs, ys)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/scipy/spatial/distance.py", line 296, in cosine
    dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
  File "/usr/lib/python3.4/site-packages/scipy/sparse/base.py", line 308, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch
>>> cosine(xs.todense(), ys.todense())
-2.2204460492503131e-16

Converting to dense is fine for individual 5369-element vectors (as opposed to converting the whole matrix).
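As a minimal sketch of the dense conversion applied to the question's helper: the version below densifies one row at a time with toarray() and flattens it with ravel(), so that cosine() receives plain 1-D arrays. The small 3x4 matrix and the choice of the first row as the centroid are hypothetical stand-ins for the real 1000x5369 data. As far as I know, newer SciPy releases are stricter about the inputs being 1-D, which is why this uses toarray().ravel() rather than todense().

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cosine


def cosine_similarity(vector1, vector2):
    """Cosine similarity of two 1xN sparse rows, densified one row at a time."""
    # toarray() gives a 2-D ndarray of shape (1, N); ravel() flattens it to (N,)
    u = vector1.toarray().ravel()
    v = vector2.toarray().ravel()
    # cosine() returns a distance, so similarity is 1 - distance
    return 1.0 - cosine(u, v)


# Hypothetical stand-in for the 1000x5369 tf-idf matrix in the question
tfidf = csr_matrix(np.array([[1.0, 0.0, 2.0, 0.0],
                             [0.0, 3.0, 0.0, 1.0],
                             [2.0, 0.0, 4.0, 0.0]]))
centroid = tfidf[0]  # assumption: the centroid is itself a 1xN sparse row

# Index by row so that every item passed in is a 1xN sparse row
sim_scores = [cosine_similarity(tfidf[i], centroid)
              for i in range(tfidf.shape[0])]
```

Only one row is dense at any moment, so memory use stays small even when the full matrix would be expensive to densify.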

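It looks like vector and centroid have different dimensions, so check the lengths of these two vectors.
@Michael Plakhov Nope, they have the same dimensions: 1*5369. That is what I can't understand.
What are the elements of these two vectors? I mean, what is their typical size?
@Michael Plakhov "todense" works!
@HapeMask I had forgotten about this question. I did the same thing. When the matrix is converted to a dense matrix, it works fine! This is something to watch out for when using distance metrics with sparse matrices.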
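On the puzzle of two 1*5369 vectors producing a dimension mismatch: for SciPy sparse matrices the product of two rows is a matrix product, so a (1xN) times (1xN) multiplication is rejected even though the shapes are identical. The sketch below illustrates that and shows one possible way to get the similarity without densifying, by transposing the second row and using scipy.sparse.linalg.norm. The function name sparse_cosine_similarity and the sample rows are made up for illustration; this is not what scipy.spatial.distance.cosine does internally.

```python
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import norm as sparse_norm


def sparse_cosine_similarity(u, v):
    """Cosine similarity of two 1xN sparse rows without densifying them."""
    # (1xN) @ (Nx1) yields a 1x1 sparse matrix holding the dot product
    dot = (u @ v.T)[0, 0]
    # sparse_norm defaults to the Frobenius norm, i.e. the Euclidean
    # length of a single row
    return dot / (sparse_norm(u) * sparse_norm(v))


# Hypothetical sample rows; ys is a multiple of xs, so they are parallel
xs = csr_matrix([[1.0, 2.0, 0.0, 3.0]])
ys = csr_matrix([[2.0, 4.0, 0.0, 6.0]])

# Same shapes, but (1x4) @ (1x4) is not a valid matrix product
try:
    xs @ ys
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

similarity = sparse_cosine_similarity(xs, ys)
```

Like the dense version, this divides by zero if either row is all zeros, so empty documents would need separate handling.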