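Python: An efficient way to run a function on the columns of a DataFrame?

Tags: python, pandas, numpy, dataframe, vectorization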

I want to run a function over the columns of a DataFrame. corpus is a pandas DataFrame:

import pandas as pd 
import numpy as np
from scipy.spatial.distance import cosine

corpus = pd.DataFrame([[3,1,1,1,1,60],[2,2,0,2,0,20], [0,2,1,1,0,0], [0,0,2,1,0,1],[0,0,0,0,1,0]],index=["stark","groß","schwach","klein", "dick"],columns=["d1", "d2", "d3","d4","d5","d6"])
I also have a query. query is a Series:

query = pd.Series([1,1,0,0,0], index=["stark","groß","schwach","klein", "dick"])
Now I want to run the cosine function on the query and each column of the corpus:

for column in corpus:
    print("Similarity of Documents", column, " and query: \n", 1-cosine(query, corpus[column]))
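Is there a better way to run the cosine function over the columns? Maybe there is a way to take the columns and run the function on each one. I would like to avoid the for loop.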

You can use cdist with the 'cosine' metric for a vectorized solution, like so:

from scipy.spatial.distance import cdist

out = 1-cdist(query.values[None], corpus.values.T, 'cosine')
Sample run:

In [192]: corpus
Out[192]: 
         d1  d2  d3  d4  d5  d6
stark     3   1   1   1   1  60
groß      2   2   0   2   0  20
schwach   0   2   1   1   0   0
klein     0   0   2   1   0   1
dick      0   0   0   0   1   0

In [193]: query
Out[193]: 
stark      1
groß       1
schwach    0
klein      0
dick       0
dtype: int64

In [194]: from scipy.spatial.distance import cosine

In [195]: for column in corpus:
     ...:     print(1-cosine(query, corpus[column]))
     ...:     
0.980580675691
0.707106781187
0.288675134595
0.801783725737
0.5
0.89431540856

In [196]: 1-cdist(query.values[None], corpus.values.T, 'cosine')
Out[196]: array([[ 0.98058,  0.70711,  0.28868,  0.80178,  0.5    ,  0.89432]])
Runtime test:

In [225]: corpus = pd.DataFrame(np.random.rand(100,10000))

In [226]: query = pd.Series(np.random.rand(100))

# @C.Square's apply based soln
In [227]: %timeit corpus.apply(lambda x:1-cosine(query, x), axis=0)
1 loop, best of 3: 352 ms per loop

 # Proposed in this post using cdist()
In [228]: %timeit 1-cdist(query.values[None], corpus.values.T, 'cosine')
100 loops, best of 3: 3.2 ms per loop
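
Note that cdist returns a 2-D array of shape (1, n_columns) and works on raw values, so the column labels are lost and the query is matched to the corpus rows by position. A minimal sketch of wrapping the result back into a labeled Series follows; the reindex step and the names query_aligned and similarity are illustrative additions, not part of the answer above.

```python
import pandas as pd
from scipy.spatial.distance import cdist

words = ["stark", "groß", "schwach", "klein", "dick"]
corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60], [2, 2, 0, 2, 0, 20], [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1], [0, 0, 0, 0, 1, 0]],
    index=words,
    columns=["d1", "d2", "d3", "d4", "d5", "d6"],
)
query = pd.Series([1, 1, 0, 0, 0], index=words)

# Assumption: align the query to the corpus rows by label first,
# because cdist only sees positions, not index labels.
query_aligned = query.reindex(corpus.index)

# cdist expects two 2-D arrays whose rows are the vectors to compare.
query_2d = query_aligned.to_numpy(dtype=float)[None, :]  # shape (1, 5)
docs_2d = corpus.to_numpy(dtype=float).T                 # shape (6, 5)

distances = cdist(query_2d, docs_2d, "cosine")           # shape (1, 6)

# Turn cosine distance into similarity and restore the column labels.
similarity = pd.Series(1 - distances[0], index=corpus.columns)
print(similarity)
```

This keeps the speed of the vectorized call while giving the same labeled output as the apply-based version.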

Using the apply function is a concise, readable, and fast way:

import pandas as pd
from scipy.spatial.distance import cosine

corpus = pd.DataFrame([[3,1,1,1,1,60],[2,2,0,2,0,20], [0,2,1,1,0,0], [0,0,2,1,0,1],[0,0,0,0,1,0]], index=["stark","groß","schwach","klein", "dick"], columns=["d1", "d2", "d3","d4","d5","d6"])
query = pd.Series([1,1,0,0,0], index=["stark","groß","schwach","klein", "dick"])

corpus.apply(lambda x:1-cosine(query, x),  # Apply your function
             axis=0)                       # For each column

# d1    0.980581
# d2    0.707107
# d3    0.288675
# d4    0.801784
# d5    0.500000
# d6    0.894315
# dtype: float64

You can also use the definition of cosine and implement it yourself
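
For reference, the definition in question can be written as below. The symbols are introduced here only for illustration: q is the query vector and d_j is column j of corpus, with entries d_{ij}.

```latex
\[
\operatorname{sim}(q, d_j)
  = \frac{q \cdot d_j}{\lVert q \rVert \, \lVert d_j \rVert}
  = \frac{\sum_i q_i \, d_{ij}}{\sqrt{\sum_i q_i^2} \, \sqrt{\sum_i d_{ij}^2}}
\]
```

scipy's cosine returns the distance 1 - sim, which is why the other answers subtract its result from 1.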

pandas

corpus.T.dot(query) / (corpus ** 2).sum() ** .5 / (query ** 2).sum() ** .5

d1    0.980581
d2    0.707107
d3    0.288675
d4    0.801784
d5    0.500000
d6    0.894315
dtype: float64

numpy

c = corpus.values
q = query.values

r = c.T.dot(q) / (c ** 2).sum(0) ** .5 / (q ** 2).sum() ** .5

pd.Series(r, corpus.columns)

d1    0.980581
d2    0.707107
d3    0.288675
d4    0.801784
d5    0.500000
d6    0.894315
dtype: float64
Using np.einsum, as suggested by @Divakar
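
The einsum code itself did not survive on this page. As a rough sketch of what such a version might look like (not the answerer's original code), the dot products and the squared norms from the numpy variant can each be expressed as an einsum contraction. The names dots, col_norms, q_norm and einsum_similarity are made up for this example.

```python
import numpy as np
import pandas as pd

words = ["stark", "groß", "schwach", "klein", "dick"]
corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60], [2, 2, 0, 2, 0, 20], [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1], [0, 0, 0, 0, 1, 0]],
    index=words,
    columns=["d1", "d2", "d3", "d4", "d5", "d6"],
)
query = pd.Series([1, 1, 0, 0, 0], index=words)

c = corpus.to_numpy(dtype=float)  # shape (5, 6): rows are words, columns are documents
q = query.to_numpy(dtype=float)   # shape (5,)

# Dot product of the query with every column: sum over i of c[i, j] * q[i]
dots = np.einsum("ij,i->j", c, q)

# Euclidean norm of every column: sqrt of the sum over i of c[i, j] ** 2
col_norms = np.sqrt(np.einsum("ij,ij->j", c, c))

# Euclidean norm of the query: sqrt of the sum over i of q[i] ** 2
q_norm = np.sqrt(np.einsum("i,i->", q, q))

einsum_similarity = pd.Series(dots / (col_norms * q_norm), index=corpus.columns)
print(einsum_similarity)
```

Whether this is faster than the plain numpy expression depends on the array sizes, so it is worth timing on your own data.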


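The cosine function is simply imported from scipy.spatial.distance. In scipy.spatial.distance.cosine(u, v), u and v are arrays. (cosine computes the distance between two 1-D arrays.)

Thank you, you're right. I edited my question. :)

I see einsum there for (c**2).sum(0) and the other one!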