Python: Converting a DataFrame to a sparse key-item matrix based on a composite key

python pandas dictionary dataframe

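I have a DataFrame with 3 columns. Column 1 is a string order number, column 2 is an integer day, and column 3 is a product name. I would like to convert this into a matrix where each row represents a unique order/day combination and each column is a product name, with a 1/0 indicating whether that product is present for that combination.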

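My approach so far uses a dictionary of products and a dictionary keyed on the composite order#/day key. The final step is slooow: it iterates through the original DataFrame to flip bits in the matrix to 1. It takes about 10 minutes for a matrix of size 363K x 331 with a sparsity of about 97%.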

Is there a different approach I should consider?

For example:

ord_nb  day  prod
1       1    A
1       1    B
1       2    B
1       2    C
1       2    D

would become

A   B   C   D
1   1   0   0
0   1   1   1

My approach was to create a dictionary of order/day pairs:

ord_day_dict = {}
print("Making a dictionary of ord-by-day keys...")
gp = df.groupby(['day', 'ord_nb'])  # order-number column is 'ord_nb', as in the lookup step
for i, key in enumerate(gp.groups):  # each key is a (day, ord_nb) tuple
    ord_day_dict[key] = i

I append the index representation to the original DataFrame:

df['ord_day_idx'] = 0  # create a placeholder column
for i, row in df.iterrows():  # populate the column with the index
    # DataFrame.set_value was removed in pandas 1.0; .at is the scalar setter
    df.at[i, 'ord_day_idx'] = ord_day_dict[(row['day'], row['ord_nb'])]
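The dictionary and the row-by-row loop above can probably be replaced by a single grouped call. The following is a minimal sketch, assuming the column names ord_nb, day and prod and a small made-up sample frame; the group numbers follow sorted key order, which may not match the dictionary's numbering exactly, but each unique (day, ord_nb) pair still gets one integer.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
df = pd.DataFrame({
    "ord_nb": ["1", "1", "1", "1", "1"],
    "day": [1, 1, 2, 2, 2],
    "prod": ["A", "B", "B", "C", "D"],
})

# ngroup() labels every row with the integer id of its (day, ord_nb) group,
# so no Python-level loop over rows is needed.
df["ord_day_idx"] = df.groupby(["day", "ord_nb"]).ngroup()
```

Because the labelling happens inside pandas, this step should scale far better than iterrows on a few hundred thousand rows.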

Then I initialize a matrix of size ord/day x unique products:

n_items = df['prod'].unique().shape[0]  # number of unique products
n_ord_days = len(ord_day_dict)  # number of unique ord-by-day combos
df_fac_matrix = np.zeros((n_ord_days, n_items), dtype=np.float64)

I convert my products from strings to indices via a dictionary:

prod_dict = dict()
i = 0
for v in df['prod']:  # df.prod would resolve to the DataFrame.prod method, not the column
    if v not in prod_dict:
        prod_dict[v] = i
        i = i + 1
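The string-to-index mapping built by this loop can also be sketched with pd.factorize, which assigns codes in order of first appearance, just as the loop does. The sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical product column.
products = pd.Series(["A", "B", "B", "C", "D"], name="prod")

# codes[i] is the integer index of products[i]; uniques lists the distinct
# product names in order of first appearance.
codes, uniques = pd.factorize(products)

# Equivalent dictionary, if a name -> index lookup is still wanted.
prod_dict = {name: i for i, name in enumerate(uniques)}
```

Here codes can be used directly as the column index of the matrix, so the dictionary is optional.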

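Finally, I iterate through the original DataFrame and fill the matrix with 1 wherever a particular order on a particular day contains a particular product: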

for line in df.itertuples():
    # line[4] is ord_day_idx (the matrix row); line[3] is the product name,
    # which prod_dict maps to its column in the ord/day-by-prod matrix
    df_fac_matrix[line[4], prod_dict[line[3]]] = 1.0

Here is one option you can try:

df.groupby(['ord_nb', 'day'])['prod'].apply(list).apply(lambda x: pd.Series(1, x)).fillna(0)

#              A    B    C    D
#ord_nb day             
#     1   1  1.0  1.0  0.0  0.0
#         2  0.0  1.0  1.0  1.0
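As a point of comparison, a similar table can be sketched with pd.crosstab, which is a different technique from the groupby/apply one-liner above. crosstab counts occurrences, so the counts are clipped to 1 to get a 0/1 indicator. The sample frame and its column names are assumptions.

```python
import pandas as pd

# Hypothetical sample data matching the small example in the question.
df = pd.DataFrame({
    "ord_nb": [1, 1, 1, 1, 1],
    "day": [1, 1, 2, 2, 2],
    "prod": ["A", "B", "B", "C", "D"],
})

# Rows: unique (ord_nb, day) pairs; columns: product names; values: counts.
# clip(upper=1) turns repeated products within a pair into a plain 1.
table = pd.crosstab([df["ord_nb"], df["day"]], df["prod"]).clip(upper=1)
```

This produces a dense integer DataFrame, so it is mainly useful when the result fits comfortably in memory.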

Here is a NumPy-based approach that gives an array as output:

a = df[['ord_nb','day']].values.astype(int)
row = np.unique(np.ravel_multi_index(a.T,a.max(0)+1),return_inverse=1)[1]
col = np.unique(df.prd.values,return_inverse=1)[1]
out_shp = row.max()+1, col.max()+1
out = np.zeros(out_shp, dtype=int)
out[row,col] = 1

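Note that the third column's name is assumed to be 'prd', to avoid a conflict with the built-in name (df.prod is a DataFrame method).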
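To make the row computation less opaque, here is a small sketch of what np.ravel_multi_index does with the (ord_nb, day) pairs: it encodes each pair as one integer, and np.unique then relabels those integers as consecutive row ids. The array below is a made-up stand-in for df[['ord_nb','day']].values.

```python
import numpy as np

# Hypothetical (ord_nb, day) pairs, one per row of the original frame.
a = np.array([[1, 1], [1, 1], [1, 2], [1, 2], [1, 2]])

# Grid large enough to hold every pair: (max ord_nb + 1, max day + 1).
dims = tuple(a.max(0) + 1)

# Each pair becomes ord_nb * dims[1] + day, a single collision-free integer.
flat = np.ravel_multi_index(a.T, dims)

# return_inverse maps every encoded pair to its rank among the unique values,
# which serves as the row index of the output matrix.
row = np.unique(flat, return_inverse=True)[1]
```

The encoding requires non-negative integer keys, which is why the original snippet casts the two columns with astype(int).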

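Possible performance-focused improvements:

If prd only holds single-letter characters starting from A, we can compute col simply with df.prd.values.astype('S1').view('uint8')-65.

Alternatively, we can compute row with np.unique(a[:,0]*(a[:,1].max()+1) + a[:,1], return_inverse=1)[1].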
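The single-letter shortcut for col can be checked against the np.unique version on a small made-up array. One caveat worth stating: the two agree only when every letter from A up to the largest one actually occurs; with gaps, the shortcut leaves all-zero columns for the missing letters.

```python
import numpy as np

# Hypothetical product codes, single uppercase letters starting from A.
prd = np.array(["A", "B", "B", "C", "D"])

# Reinterpret each one-byte string as its ASCII code, then shift so A -> 0.
col_fast = prd.astype("S1").view("uint8") - 65

# Reference result from the general-purpose approach.
col_ref = np.unique(prd, return_inverse=True)[1]
```

The shortcut avoids the sort inside np.unique, which is where the potential speedup comes from.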


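Saving memory with a sparse array: for really huge arrays, we can save memory by storing them as sparse matrices. The final steps to get such a sparse matrix would be: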
from scipy.sparse import coo_matrix

d = np.ones(row.size,dtype=int)
out_sparse = coo_matrix((d,(row,col)), shape=out_shp)
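One detail to keep in mind with the COO construction: repeated (row, col) pairs are summed when the matrix is converted or densified, so a product listed twice for the same order/day would show up as 2 rather than 1. A minimal sketch of restoring a strict 0/1 indicator, using made-up index arrays with one deliberate duplicate:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical indices; the pair (1, 3) appears twice on purpose.
row = np.array([0, 0, 1, 1, 1, 1])
col = np.array([0, 1, 1, 2, 3, 3])
d = np.ones(row.size, dtype=int)

# Converting to CSR sums the duplicate entries into a single stored value.
counts = coo_matrix((d, (row, col)), shape=(2, 4)).tocsr()

# Overwrite every stored value with 1 to get a pure indicator matrix.
binary = counts.copy()
binary.data[:] = 1
```

CSR is also usually the more convenient format for row slicing and arithmetic once the matrix has been built.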

Sample input, output:
In [232]: df
Out[232]: 
  ord_nb day prd
0      1   1   A
1      1   1   B
2      1   2   B
3      1   2   C
4      1   2   D

In [233]: out
Out[233]: 
array([[1, 1, 0, 0],
       [0, 1, 1, 1]])

In [241]: out_sparse
Out[241]: 
<2x4 sparse matrix of type '<type 'numpy.int64'>'
    with 5 stored elements in COOrdinate format>

In [242]: out_sparse.toarray()
Out[242]: 
array([[1, 1, 0, 0],
       [0, 1, 1, 1]])
