Python: pandas vectorization with array inputs


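I want to create a sparse matrix from a dataframe in a vectorized way. Each row holds a vector of labels and a vector of values, and the full set of possible labels is known in advance.
Another constraint is that I cannot build a dense dataframe first and then convert it to a sparse one, because it is too large to fit in memory.

Example: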

List of all possible labels:

    all_labels = ['a', 'b', 'c', 'd', 'e',
                  'f', 'g', 'h', 'i', 'j',
                  'k', 'l', 'm', 'n', 'o',
                  'p', 'q', 'r', 's', 't',
                  'u', 'v', 'w', 'z']
Dataframe with values for specific labels in each row:

    import pandas as pd

    data = {'labels': [['b', 'a'], ['q'], ['n', 'j', 'v']],
            'scores': [[0.1, 0.2], [0.7], [0.3, 0.5, 0.1]]}
    df = pd.DataFrame(data)

Expected dense output:

This is how I did it in a non-vectorized way, which takes too much time:

    from scipy import sparse
    from scipy.sparse import coo_matrix

    def labels_to_sparse(input_):
        all_, lables_, scores_ = input_
        # Build one 1 x len(all_) row; every column is stored, zeros included.
        rows = [0] * len(all_)
        cols = range(len(all_))
        vals = [0] * len(all_)
        for i in range(len(lables_)):
            # Look up the column of each label and write its score there.
            vals[all_.index(lables_[i])] = scores_[i]
        return coo_matrix((vals, (rows, cols)))

    df['sparse_row'] = df.apply(
        lambda x: labels_to_sparse((all_labels, x['labels'], x['scores'])), axis=1
    )
    df
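Although this works, it is very slow on larger data because it has to go through df.apply. Is there a way to vectorize this function so that apply can be avoided?

In the end, I want to use this dataframe to create a matrix:

    my_result = sparse.vstack(df['sparse_row'].values)
    my_result.todense()  # not really needed - just for visualization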

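Edit

Summary of the accepted solution (provided by @Divakar):

    import numpy as np
    from scipy.sparse import coo_matrix

    all_labels = np.sort(all_labels)               # searchsorted expects sorted labels
    n = len(df)
    lens = list(map(len, df['labels']))            # number of labels in each row
    l_ar = np.concatenate(df['labels'].to_list())  # all labels, flattened
    d = np.concatenate(df['scores'].to_list())     # all scores, flattened
    R = np.repeat(np.arange(n), lens)              # row index of each score
    C = np.searchsorted(all_labels, l_ar)          # column index of each score
    my_result = coo_matrix((d, (R, C)), shape=(n, len(all_labels)))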
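As a quick sanity check, the summary above can be wrapped in a small function and run on the example data from the question. This is only a minimal sketch: the function name labels_to_coo is hypothetical (it does not appear in the original posts), and it assumes every label in df['labels'] is present in all_labels.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix


def labels_to_coo(df, all_labels):
    """Hypothetical helper: build an (n_rows x n_labels) COO matrix without apply."""
    all_labels = np.sort(np.asarray(all_labels))    # searchsorted expects sorted input
    n = len(df)
    lens = list(map(len, df["labels"]))             # labels per row
    l_ar = np.concatenate(df["labels"].to_list())   # flattened labels
    d = np.concatenate(df["scores"].to_list())      # flattened scores
    R = np.repeat(np.arange(n), lens)               # row index per score
    C = np.searchsorted(all_labels, l_ar)           # column index per score
    return coo_matrix((d, (R, C)), shape=(n, len(all_labels)))


# Example data from the question (24 labels: 'a'..'w' plus 'z').
all_labels = list("abcdefghijklmnopqrstuvwz")
df = pd.DataFrame({
    "labels": [["b", "a"], ["q"], ["n", "j", "v"]],
    "scores": [[0.1, 0.2], [0.7], [0.3, 0.5, 0.1]],
})

result = labels_to_coo(df, all_labels)
print(result.shape, result.nnz)
print(list(zip(result.row, result.col, result.data)))
```

Only the non-zero entries are ever materialized, so the dense intermediate that the question wants to avoid is never created. If row slicing or arithmetic is needed afterwards, result.tocsr() converts to CSR format.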

Here are some alternatives you can try.

Method 1 - using a list

Method 2 - a for loop that updates the values in place:

    my_result = pd.DataFrame(np.zeros((len(df), len(all_labels))), columns=all_labels)
    for i, (lab, val) in df.iterrows():
        my_result.loc[i, lab] = val
    my_result = my_result.values

Both should produce the same output.

[out]

    [[0.2 0.1 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.7 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.5 0. 0. 0. 0.3 0. 0. 0. 0. 0. 0. 0. 0.1 0. 0. 0. 0. ]]

Here is a vectorized approach -

    n = len(df)
    lens = list(map(len, df['labels']))
    l_ar = np.concatenate(df['labels'])
    d = np.concatenate(df['scores'])
    out = np.zeros((n, len(all_labels)), dtype=d.dtype)
    R = np.repeat(np.arange(n), lens)
    C = np.searchsorted(all_labels, l_ar)
    out[R, C] = d

Note: if all_labels is not sorted, we need to use the sorter arg with searchsorted.

To get a sparse matrix as output, use something like -

    from scipy.sparse import csr_matrix, coo_matrix
    out_sparse = coo_matrix((d, (R, C)), shape=(n, len(all_labels)))

Comments:

Is there a way to make out a sparse matrix? If I understand correctly, out contains the result, but it is a numpy array. Also, I had to add .to_list() before calling np.concatenate. The example in this question works fine without it, but on the real dataset (where the labels are words/phrases) it does not run without it (KeyError: 0).

The out_sparse command fails: ValueError: column index exceeds matrix dimensions. My real dimensions:
len(all_labels) - 9933
n - 407447
len(lens) - 407447
len(l_ar) - 3018669
d.shape - (3018669,)
R.shape - (3018669,)
C.shape - (3018669,)

@matt525252 Is all_labels sorted?

@matt525252 Then, as mentioned in the post, use the sorter arg.

Once I sorted it first (all_labels = np.sort(all_labels)), your solution worked. And it is really fast. Thanks for your help! :)