Python: How do I create a pivot table that counts shared values?
Tags: python, pandas
I created the following DataFrame:
import pandas as pd
df = pd.DataFrame({
    'Product ID': ['shirt', 'dress', 'shirt', 'pants', 'jacket', 'jacket', 'dress', 'hat'],
    'Discount Group': [1, 2, 3, 2, 1, 3, 4, 5]
})
  Product ID  Discount Group
0      shirt               1
1      dress               2
2      shirt               3
3      pants               2
4     jacket               1
5     jacket               3
6      dress               4
7        hat               5
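I want to create a pivot table in which both the rows and the columns are "Discount Group" and the values are the counts of items in "Product ID" that the two groups share. For example, 1 (column) and 3 (row) both have "shirt" as a common item, so their value should be 1.
It should look like this: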
   1  2  3  4  5
1  1  0  1  0  0
2  0  1  0  1  0
3  1  0  1  1  0
4  0  1  0  1  0
5  0  0  0  0  1
I tried:
df.pivot_table(values='Product ID', index=['Discount Group'], columns='Discount Group', aggfunc='count')
This returned:
   1  2  3  4  5
1  1  0  0  0  0
2  0  1  0  0  0
3  0  0  1  0  0
4  0  0  0  1  0
5  0  0  0  0  1
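I am not sure pivot_table will help here, but this is what you can do.
First, we group by "Discount Group" and collect all the "Product ID" values of each group into a list: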
df2 = df.groupby('Discount Group')['Product ID'].apply(list).reset_index()
df2
We get:
Discount Group Product ID
-- ---------------- -------------------
0 1 ['shirt', 'jacket']
1 2 ['dress', 'pants']
2 3 ['shirt', 'jacket']
3 4 ['dress']
4 5 ['hat']
Next, we want the Cartesian product of df2 with itself. To get it, we do an outer merge on a constant key:
df2['key'] = 0
df3 = df2.merge(df2, on = 'key', how = 'outer').drop(columns=['key'])
df3
We get:
Discount Group_x Product ID_x Discount Group_y Product ID_y
-- ------------------ ------------------- ------------------ -------------------
0 1 ['shirt', 'jacket'] 1 ['shirt', 'jacket']
1 1 ['shirt', 'jacket'] 2 ['dress', 'pants']
2 1 ['shirt', 'jacket'] 3 ['shirt', 'jacket']
3 1 ['shirt', 'jacket'] 4 ['dress']
4 1 ['shirt', 'jacket'] 5 ['hat']
5 2 ['dress', 'pants'] 1 ['shirt', 'jacket']
6 2 ['dress', 'pants'] 2 ['dress', 'pants']
7 2 ['dress', 'pants'] 3 ['shirt', 'jacket']
8 2 ['dress', 'pants'] 4 ['dress']
9 2 ['dress', 'pants'] 5 ['hat']
10 3 ['shirt', 'jacket'] 1 ['shirt', 'jacket']
11 3 ['shirt', 'jacket'] 2 ['dress', 'pants']
12 3 ['shirt', 'jacket'] 3 ['shirt', 'jacket']
13 3 ['shirt', 'jacket'] 4 ['dress']
14 3 ['shirt', 'jacket'] 5 ['hat']
15 4 ['dress'] 1 ['shirt', 'jacket']
16 4 ['dress'] 2 ['dress', 'pants']
17 4 ['dress'] 3 ['shirt', 'jacket']
18 4 ['dress'] 4 ['dress']
19 4 ['dress'] 5 ['hat']
20 5 ['hat'] 1 ['shirt', 'jacket']
21 5 ['hat'] 2 ['dress', 'pants']
22 5 ['hat'] 3 ['shirt', 'jacket']
23 5 ['hat'] 4 ['dress']
24 5 ['hat'] 5 ['hat']
Note how every pair of "Discount Group" values, together with the corresponding "Product ID" lists, now sits on its own row.
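As a side note, the constant-key merge above can probably be replaced by a cross join. This is a minimal sketch that assumes pandas 1.2 or later, where merge accepts how='cross'; the name df3_cross is made up for illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Product ID': ['shirt', 'dress', 'shirt', 'pants', 'jacket', 'jacket', 'dress', 'hat'],
    'Discount Group': [1, 2, 3, 2, 1, 3, 4, 5]
})

# One row per discount group, with its products collected into a list
df2 = df.groupby('Discount Group')['Product ID'].apply(list).reset_index()

# how='cross' (assumed available: pandas >= 1.2) pairs every row of df2
# with every row of df2, so no helper 'key' column is needed
df3_cross = df2.merge(df2, how='cross')

print(df3_cross.shape)
```

The result should have the same _x/_y column names as df3, so the remaining steps apply unchanged.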
Next, for each row we count the products that appear in both the "Product ID_x" and "Product ID_y" lists and put the result in a "count" column:
df3['count'] = df3.apply(lambda row: len(set(row['Product ID_x']) & set(row['Product ID_y'])), axis=1)
df3
It looks like this:
Discount Group_x Product ID_x Discount Group_y Product ID_y count
-- ------------------ ------------------- ------------------ ------------------- -------
0 1 ['shirt', 'jacket'] 1 ['shirt', 'jacket'] 2
1 1 ['shirt', 'jacket'] 2 ['dress', 'pants'] 0
2 1 ['shirt', 'jacket'] 3 ['shirt', 'jacket'] 2
3 1 ['shirt', 'jacket'] 4 ['dress'] 0
4 1 ['shirt', 'jacket'] 5 ['hat'] 0
5 2 ['dress', 'pants'] 1 ['shirt', 'jacket'] 0
6 2 ['dress', 'pants'] 2 ['dress', 'pants'] 2
7 2 ['dress', 'pants'] 3 ['shirt', 'jacket'] 0
8 2 ['dress', 'pants'] 4 ['dress'] 1
9 2 ['dress', 'pants'] 5 ['hat'] 0
10 3 ['shirt', 'jacket'] 1 ['shirt', 'jacket'] 2
11 3 ['shirt', 'jacket'] 2 ['dress', 'pants'] 0
12 3 ['shirt', 'jacket'] 3 ['shirt', 'jacket'] 2
13 3 ['shirt', 'jacket'] 4 ['dress'] 0
14 3 ['shirt', 'jacket'] 5 ['hat'] 0
15 4 ['dress'] 1 ['shirt', 'jacket'] 0
16 4 ['dress'] 2 ['dress', 'pants'] 1
17 4 ['dress'] 3 ['shirt', 'jacket'] 0
18 4 ['dress'] 4 ['dress'] 1
19 4 ['dress'] 5 ['hat'] 0
20 5 ['hat'] 1 ['shirt', 'jacket'] 0
21 5 ['hat'] 2 ['dress', 'pants'] 0
22 5 ['hat'] 3 ['shirt', 'jacket'] 0
23 5 ['hat'] 4 ['dress'] 0
24 5 ['hat'] 5 ['hat'] 1
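To make the "count" column concrete, here is a small sketch of the set intersection behind a single cell. It groups the products into sets rather than lists (a variation on the step above, not the original code), and the names sets, shared_1_3 and shared_2_4 are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product ID': ['shirt', 'dress', 'shirt', 'pants', 'jacket', 'jacket', 'dress', 'hat'],
    'Discount Group': [1, 2, 3, 2, 1, 3, 4, 5]
})

# Build each group's product set once, instead of converting lists on every row
sets = df.groupby('Discount Group')['Product ID'].apply(set)

# The count for a pair of groups is the size of the intersection of their sets
shared_1_3 = sets.loc[1] & sets.loc[3]   # groups 1 and 3
shared_2_4 = sets.loc[2] & sets.loc[4]   # groups 2 and 4

print(len(shared_1_3), len(shared_2_4))
```

Groups 1 and 3 share both "shirt" and "jacket", which is why that cell holds 2 rather than 1.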
We are almost done. Set the index and unstack:
df3.set_index(['Discount Group_x','Discount Group_y'])['count'].unstack(level = 1)
which gives:
Discount Group_y 1 2 3 4 5
Discount Group_x
1 2 0 2 0 0
2 0 2 0 1 0
3 2 0 2 0 0
4 0 1 0 1 0
5 0 0 0 0 1
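The same table can also be reached without building the pairs explicitly, by treating the problem as a matrix product. The sketch below is an alternative technique, not the method used in this answer: it builds a group-by-product membership matrix with pd.crosstab and multiplies it by its transpose. The names membership and shared are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product ID': ['shirt', 'dress', 'shirt', 'pants', 'jacket', 'jacket', 'dress', 'hat'],
    'Discount Group': [1, 2, 3, 2, 1, 3, 4, 5]
})

# Rows: discount groups, columns: products, value 1 if the product is in the group.
# The "> 0" guards against a product being listed twice in the same group.
membership = (pd.crosstab(df['Discount Group'], df['Product ID']) > 0).astype(int)

# Entry (i, j) of membership x membership^T counts the products
# that are present in both group i and group j
shared = membership.dot(membership.T)

print(shared)
```

Whether this is lighter on memory depends on how many distinct products there are, since the intermediate matrix has one column per product.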
Another answer that uses less memory
... but is a bit uglier
from itertools import product
s = df.groupby('Discount Group')['Product ID'].apply(list)
pairs = [[(p[0][0],p[1][0]),(p[0][1] ,p[1][1])] for p in product(s.items(),repeat = 2)]
count = [[p[0][0],p[0][1],len(set(p[1][0])&set(p[1][1]))] for p in pairs]
count
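This produces a list with the two discount groups in the first and second columns and the count of overlapping items in the third: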
[[1, 1, 2],
[1, 2, 0],
[1, 3, 2],
[1, 4, 0],
[1, 5, 0],
[2, 1, 0],
[2, 2, 2],
[2, 3, 0],
[2, 4, 1],
[2, 5, 0],
[3, 1, 2],
[3, 2, 0],
[3, 3, 2],
[3, 4, 0],
[3, 5, 0],
[4, 1, 0],
[4, 2, 1],
[4, 3, 0],
[4, 4, 1],
[4, 5, 0],
[5, 1, 0],
[5, 2, 0],
[5, 3, 0],
[5, 4, 0],
[5, 5, 1]]
Now we put it into a DataFrame and unstack:
pd.DataFrame(count).set_index([0,1]).unstack(level = 1)
which produces:
2
1 1 2 3 4 5
0
1 2 0 2 0 0
2 0 2 0 1 0
3 2 0 2 0 0
4 0 1 0 1 0
5 0 0 0 0 1
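The extra header level (the "2" above the column labels) appears because the DataFrame columns are unnamed. A slightly tidier variant, sketched here with made-up column names group_row, group_col and count, unpacks the pairs directly and uses pivot instead of set_index plus unstack.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({
    'Product ID': ['shirt', 'dress', 'shirt', 'pants', 'jacket', 'jacket', 'dress', 'hat'],
    'Discount Group': [1, 2, 3, 2, 1, 3, 4, 5]
})

# One set of products per discount group
s = df.groupby('Discount Group')['Product ID'].apply(set)

# Every ordered pair of groups with the size of the overlap of their product sets
rows = [(g1, g2, len(p1 & p2))
        for (g1, p1), (g2, p2) in product(s.items(), repeat=2)]

# Named columns (hypothetical names) avoid the extra header level after reshaping
result = (pd.DataFrame(rows, columns=['group_row', 'group_col', 'count'])
            .pivot(index='group_row', columns='group_col', values='count'))

print(result)
```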
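Comments:
It would help to show your expected output rather than only describing it in words, and to include the code you tried and what went wrong, so that we can better understand how to help.
@G.Anderson I have updated my code.
@piterbarg I tried this approach on a larger dataset (1000 rows)
I guess that is not entirely surprising, since the algorithm's memory usage is at least O(N^2). Which step raises the error?
@Wiseface I added another version that should not blow up memory.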