Efficient way to add values to a pandas slice in Python
Tags: python, performance, python-2.7, pandas, numpy

I would like to add values to slices of a pandas DataFrame efficiently, because this function is called very often. The structure looks like this:
import pandas as pd
import numpy as np
names = ["a", "b", "c", "d", "e", "f"]
mat = pd.DataFrame(0.0, index=names, columns=names)
# now comes the `tricky' part
positive_instances = ["a", "e", "c"]
negative_instances = ["d", "b", "f"]
p_mat = np.array([[1.,2.],[3.,4.]])
mat.loc[positive_instances, positive_instances] += p_mat[0,0]
mat.loc[positive_instances, negative_instances] += p_mat[0,1]
mat.loc[negative_instances, positive_instances] += p_mat[1,0]
mat.loc[negative_instances, negative_instances] += p_mat[1,1]
The desired new matrix mat looks like this:

mat =
   a  b  c  d  e  f
a  1  2  1  2  1  2
b  3  4  3  4  3  4
c  1  2  1  2  1  2
d  3  4  3  4  3  4
e  1  2  1  2  1  2
f  3  4  3  4  3  4
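Note: the part below the "now comes the tricky part" comment is embedded in a for loop. There are several different positive_instances and negative_instances.

Added information about the data structure:
- positive_instances and negative_instances are always disjoint and need not have the same length
- the union of positive_instances and negative_instances is always names
- positive instances are always at index 0 of p_mat
- negative instances are always at index 1 of p_mat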
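Edit2: added information about the properties of positive_instances and negative_instances.

We can use NumPy here and rely on its broadcasted indexing to assign values into an array, which emulates the .loc[row, col] behavior of pandas. Once the assignment is done, we create the output DataFrame.

So the implementation would be -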
sidx = np.argsort(names)
p_idx = sidx[np.searchsorted(names, positive_instances, sorter= sidx)]
n_idx = sidx[np.searchsorted(names, negative_instances, sorter= sidx)]
n = len(names)
arr = np.zeros((n,n),dtype=p_mat.dtype)
arr[np.ix_(p_idx, p_idx)] = +p_mat[0,0]
arr[np.ix_(p_idx, n_idx)] = +p_mat[0,1]
arr[np.ix_(n_idx, p_idx)] = +p_mat[1,0]
arr[np.ix_(n_idx, n_idx)] = +p_mat[1,1]
df = pd.DataFrame(arr, index=names, columns=names)
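As a quick sanity check, the positional NumPy version can be compared against the label-based .loc version on deliberately unsorted names. This is a minimal sketch, not part of the original answer; the variable names expected, result, and names_arr are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

names = ["a", "f", "d", "b", "c", "e"]  # deliberately not in alphabetical order
positive_instances = ["a", "e", "c"]
negative_instances = ["d", "b", "f"]
p_mat = np.array([[1., 2.], [3., 4.]])

# Reference result: label-based increments on a DataFrame, as in the question.
expected = pd.DataFrame(0.0, index=names, columns=names)
expected.loc[positive_instances, positive_instances] += p_mat[0, 0]
expected.loc[positive_instances, negative_instances] += p_mat[0, 1]
expected.loc[negative_instances, positive_instances] += p_mat[1, 0]
expected.loc[negative_instances, negative_instances] += p_mat[1, 1]

# NumPy version: translate labels to integer positions, then assign blockwise.
names_arr = np.asarray(names)
sidx = np.argsort(names_arr)  # sorter, so that names need not be sorted
p_idx = sidx[np.searchsorted(names_arr, positive_instances, sorter=sidx)]
n_idx = sidx[np.searchsorted(names_arr, negative_instances, sorter=sidx)]

n = len(names)
arr = np.zeros((n, n), dtype=p_mat.dtype)
arr[np.ix_(p_idx, p_idx)] = p_mat[0, 0]
arr[np.ix_(p_idx, n_idx)] = p_mat[0, 1]
arr[np.ix_(n_idx, p_idx)] = p_mat[1, 0]
arr[np.ix_(n_idx, n_idx)] = p_mat[1, 1]
result = pd.DataFrame(arr, index=names, columns=names)

# Both approaches should fill exactly the same cells.
assert np.array_equal(result.values, expected.values)
```

np.ix_ builds an open mesh from the two index vectors, so each assignment touches every (row, column) pair of the selected block, which is what .loc does with two label lists.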
Runtime test -

Approaches:
def func0(p_mat, names, positive_instances, negative_instances):
    mat = pd.DataFrame(0.0, index=names, columns=names)
    mat.loc[positive_instances, positive_instances] += p_mat[0,0]
    mat.loc[positive_instances, negative_instances] += p_mat[0,1]
    mat.loc[negative_instances, positive_instances] += p_mat[1,0]
    mat.loc[negative_instances, negative_instances] += p_mat[1,1]
    return mat

def func1(p_mat, names, positive_instances, negative_instances):
    sidx = np.argsort(names)
    p_idx = sidx[np.searchsorted(names, positive_instances, sorter= sidx)]
    n_idx = sidx[np.searchsorted(names, negative_instances, sorter= sidx)]
    n = len(names)
    arr = np.zeros((n,n),dtype=p_mat.dtype)
    arr[np.ix_(p_idx, p_idx)] = +p_mat[0,0]
    arr[np.ix_(p_idx, n_idx)] = +p_mat[0,1]
    arr[np.ix_(n_idx, p_idx)] = +p_mat[1,0]
    arr[np.ix_(n_idx, n_idx)] = +p_mat[1,1]
    df = pd.DataFrame(arr, index=names, columns=names)
    return df
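Since the question states that positive_instances and negative_instances are disjoint and that their union is always names, a further variant is conceivable: label each name with its group (0 or 1) and index p_mat with that label vector. The function func2 below is a hypothetical sketch under exactly that assumption. It is not part of the original answer and is not included in the timings, and it assumes a NumPy version that provides np.isin.

```python
import numpy as np
import pandas as pd

def func2(p_mat, names, positive_instances, negative_instances):
    # Hypothetical variant: relies on positive/negative instances partitioning names.
    # negative_instances is kept only so the signature matches func0 and func1.
    names_arr = np.asarray(names)
    # 0 for names in the positive group, 1 for all others (the negative group).
    group = np.where(np.isin(names_arr, positive_instances), 0, 1)
    # arr[i, j] = p_mat[group[i], group[j]]
    arr = p_mat[np.ix_(group, group)]
    return pd.DataFrame(arr, index=names, columns=names)

names = ["a", "f", "d", "b", "c", "e"]
positive_instances = ["a", "e", "c"]
negative_instances = ["d", "b", "f"]
p_mat = np.array([[1., 2.], [3., 4.]])

out = func2(p_mat, names, positive_instances, negative_instances)
print(out)
```

If a name could belong to neither list, this sketch would silently treat it as negative, so it is only appropriate when the partition property really holds.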
Timings -
In [109]: names = ["a", "f", "d","b", "c", "e"]
...:
...: # now comes the `tricky' part
...: positive_instances = ["a", "e", "c"]
...: negative_instances = ["d", "b", "f"]
...:
...: p_mat = np.array([[1.,2.],[3.,4.]])
...:
In [110]: %timeit func0(p_mat, names, positive_instances, negative_instances)
100 loops, best of 3: 4.87 ms per loop
In [111]: %timeit func1(p_mat, names, positive_instances, negative_instances)
10000 loops, best of 3: 189 µs per loop
In [112]: 4870.0/189
Out[112]: 25.767195767195766
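25x+ speedup there!

How are sup and sun defined? Can you show your desired outcome for this particular example?

@Cleb I corrected the names and added a desired output.

Is positive_instances always every other row/column starting from the first, and negative_instances the same starting from the second?

@Divakar No. That varies a lot. They don't even have to have the same length.

Would there be any overlap between positive_instances and negative_instances?

Nice one (upvoted)! Do you know why it is so much faster to first create the NumPy array and convert it to a pandas DataFrame than to create the DataFrame and modify its elements?

@Cleb One observation I have gathered from answering pandas questions is that working with data at the array level is more efficient than at the native pandas level. This seems to hold very well for number-based computations, and in this case for indexing too.

Okay, good to know, I'll keep that in mind. I still find it surprising, since I thought pandas uses NumPy and C++ under the hood, so I would not have expected such a huge performance difference.

@SWOT Is names sorted alphabetically? Edited.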
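The observation in the comments about array-level versus pandas-level work can be probed with a rough timing sketch using the standard timeit module. The helper names via_loc and via_numpy are hypothetical and only isolate a single block update. Absolute numbers depend on the machine and on the pandas and NumPy versions, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

names = ["a", "b", "c", "d", "e", "f"]
pos_labels = ["a", "e", "c"]
pos_idx = np.array([0, 4, 2])  # integer positions of "a", "e", "c" in names

def via_loc():
    # Label-based update on the DataFrame itself.
    mat = pd.DataFrame(0.0, index=names, columns=names)
    mat.loc[pos_labels, pos_labels] += 1.0
    return mat

def via_numpy():
    # Positional update on a plain array, wrapped in a DataFrame afterwards.
    arr = np.zeros((len(names), len(names)))
    arr[np.ix_(pos_idx, pos_idx)] += 1.0
    return pd.DataFrame(arr, index=names, columns=names)

t_loc = timeit.timeit(via_loc, number=200)
t_np = timeit.timeit(via_numpy, number=200)
print("loc:   %.6f s for 200 calls" % t_loc)
print("numpy: %.6f s for 200 calls" % t_np)
```

Both helpers build the same matrix, so any gap in the timings reflects the cost of label-based indexing and DataFrame bookkeeping rather than a difference in the arithmetic.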