Python 2.7: Computing cosine similarity from tf-idf

In the DataFrame df, I have a column tf-idf:

       tf-idf
0      {u'selection': 3.83579393163, u'carltons': 7.0...
1      {u'precise': 6.43261849762, u'thomas': 3.31980...
2      {u'just': 2.70047792082, u'issued': 4.42829758...
3      {u'englishreading': 9.88788310056, u'all': 1.6...
4      {u'they': 1.89922701484, u'gangstergenka': 10....
5      {u'since': 1.45530416153, u'less': 3.956522477...
6      {u'exclusive': 10.4488880129, u'producer': 2.6...
7      {u'taxi': 6.04485296662, u'all': 1.64302370465...
8      {u'houston': 3.93463976627, u'frankie': 6.0306...
9      {u'phenomenon': 5.74474837417, u'deborash': 10...
10     {u'zwigoff': 19.7757662011, u'september': 1.90...
11     {u'gospels': 7.9419729515, u'theft': 6.0028887...

I am trying to find the cosine similarity between two samples, for example between df['tf-idf'][0] and df['tf-idf'][1].

You can use scikit-learn:

from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity

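# Turn the column of {term: weight} dicts into a sparse matrix, one row per sample
a = DictVectorizer().fit_transform(df['tf-idf'])
# Each a[i] is a 1 x n_features sparse row, so the result is a 1 x 1 array
cosine_similarity(a[0], a[1])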
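
As a cross-check on the scikit-learn call, cosine similarity can also be computed directly from two of the tf-idf dicts: it is the dot product over the shared terms divided by the product of the two vector norms. A minimal sketch follows; the function name dict_cosine and the toy dicts are illustrative assumptions, not part of the original data.

```python
import math

def dict_cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    # Only terms present in both dicts contribute to the dot product.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention for this sketch: an all-zero vector has similarity 0.
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy rows standing in for df['tf-idf'][0] and df['tf-idf'][1].
doc0 = {'selection': 3.0, 'all': 4.0}
doc1 = {'all': 2.0, 'thomas': 1.5}
sim = dict_cosine(doc0, doc1)
```

This avoids building a feature matrix when only a handful of pairs is needed. For many pairs, the vectorized approach is likely to be much faster.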

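@钦坦人, glad to hear it! Does it still work without .tolist()?

One more question. If there are 50,000 samples, i.e. df.shape[0] = 50000, is there a faster way to get the similarity matrix (without running two for loops)?

Try cosine_similarity(a). It should return all pairwise similarities.

OK... let me try. Thanks.

Yes, this is a CPU- and memory-intensive operation. Check whether you are swapping RAM, because that will slow everything down.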
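
On the memory point: a dense 50,000 x 50,000 matrix of float64 values takes about 20 GB, which may explain the swapping. One possible workaround, sketched here under assumptions, is to compute the similarities one block of rows at a time and keep only what is needed from each block (here, the best match per row). The helper name top_match_per_row, the chunk size, and the toy rows are hypothetical. The sketch relies on cosine_similarity accepting a second matrix Y, so each block is compared against all rows.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_match_per_row(a, chunk_size=1000):
    """For each row of sparse matrix `a`, return the index of the most similar other row."""
    n = a.shape[0]
    best = np.empty(n, dtype=int)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Dense block of shape (stop - start, n); only one block is held at a time.
        block = cosine_similarity(a[start:stop], a)
        # Mask each row's similarity with itself so it is never picked as the best match.
        block[np.arange(stop - start), np.arange(start, stop)] = -1.0
        best[start:stop] = block.argmax(axis=1)
    return best

# Hypothetical toy rows standing in for df['tf-idf'].
rows = [
    {'all': 1.0, 'taxi': 2.0},
    {'all': 1.0, 'taxi': 2.0, 'houston': 0.5},
    {'gospels': 3.0},
    {'gospels': 1.0, 'theft': 0.2},
]
a = DictVectorizer().fit_transform(rows)
best = top_match_per_row(a, chunk_size=2)
```

With chunk_size=1000 and 50,000 rows, each block is about 400 MB, so the chunk size can be tuned to the available RAM. If the full matrix really is required, this does not reduce the total work, only the peak memory.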