Python: finding the support of frequently occurring items


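Suppose I have a data frame in which every row holds an order id and one item. I want to know which items appear together with another item most often (perhaps the probability that an item appears in an order, given that another item is present?).

Suppose the data is: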

order_id,item
1,a
1,b
1,c
2,a
2,b
2,d
3,a
3,b
3,e

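Then the pair a, b gets the highest score. (I'm sure there is a technical name for this, but I can't find it :)

One possible output is the probability of each pair occurring together; in our example, something like: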

item1,item2,probability
a,b,1
a,c,0.3
b,c,0.3
a,d,0.3
b,d,0.3
a,e,0.3
b,e,0.3

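Compute the co-occurrence matrix.

First, cross-tabulate orders against items to get the number of occurrences of each item in each order: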

In [249]: cross = pd.crosstab(df['order_id'], df['item'])

In [250]: cross
Out[250]:
item      a  b  c  d  e
order_id
1         1  1  1  0  0
2         1  1  0  1  0
3         1  1  0  0  1

Then take the dot product of the transposed cross-tabulation with the cross-tabulation itself:

In [251]: cross.T.dot(cross)
Out[251]:
item  a  b  c  d  e
item
a     3  3  1  1  1
b     3  3  1  1  1
c     1  1  1  0  0
d     1  1  0  1  0
e     1  1  0  0  1

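This gives you, for every pair of items, the number of orders in which the two appear together. The diagonal holds the number of orders that contain each item.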
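To turn these counts into the kind of probabilities the question asks for, one option is to divide by the number of orders. Dividing each row by its diagonal entry instead gives a conditional frequency (how often the column item appears among orders that contain the row item). The following is a minimal sketch on the sample data from the question; the names support and confidence are illustrative and do not come from the answer, and it assumes each item appears at most once per order.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "item": list("abcabdabe"),
})

cross = pd.crosstab(df["order_id"], df["item"])
cooc = cross.T.dot(cross)  # item-by-item co-occurrence counts

# Fraction of all orders that contain both items
support = cooc / df["order_id"].nunique()

# Fraction of the orders containing the row item that also contain the column item
item_counts = pd.Series(np.diag(cooc.to_numpy()), index=cooc.index)
confidence = cooc.div(item_counts, axis=0)

print(support.round(3))
print(confidence.round(3))
```

Note that support is symmetric while confidence is not: every order containing c also contains a, but only a third of the orders containing a also contain c.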

You are trying to compute the support of subsets of items. I have written general code that finds the support of every combination.

Creating the dataset

This is what the resulting boolean table looks like:

item         a     b      c      d      e
order_id
1         True  True   True  False  False
2         True  True  False   True  False
3         True  True  False  False   True
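The snippet that builds this table is not shown here. One way to obtain a table of this shape, offered as an assumption rather than as the answerer's own code, is to cross-tabulate orders against items and compare the counts with zero. The name e is chosen only because the loop in this answer queries a frame with that name.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "item": list("abcabdabe"),
})

# True where the order contains the item at least once
e = pd.crosstab(df["order_id"], df["item"]) > 0

print(e)
```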

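Now, for every combination of items, with lengths from 1 up to (but not including) the number of unique items, as in the loop's range, we query the rows in which all items of the combination are True. The share of such rows is the support of that combination.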

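Main logic (the loop itself appears in the code listing at the end of this page). It gives us the following result: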

{('a',): 1.0,
 ('a', 'b'): 1.0,
 ('a', 'b', 'c'): 0.3333333333333333,
 ('a', 'b', 'c', 'd'): 0.0,
 ('a', 'b', 'c', 'e'): 0.0,
 ('a', 'b', 'd'): 0.3333333333333333,
 ('a', 'b', 'd', 'e'): 0.0,
 ('a', 'b', 'e'): 0.3333333333333333,
 ('a', 'c'): 0.3333333333333333,
 ('a', 'c', 'd'): 0.0,
 ('a', 'c', 'd', 'e'): 0.0,
 ('a', 'c', 'e'): 0.0,
 ('a', 'd'): 0.3333333333333333,
 ('a', 'd', 'e'): 0.0,
 ('a', 'e'): 0.3333333333333333,
 ('b',): 1.0,
 ('b', 'c'): 0.3333333333333333,
 ('b', 'c', 'd'): 0.0,
 ('b', 'c', 'd', 'e'): 0.0,
 ('b', 'c', 'e'): 0.0,
 ('b', 'd'): 0.3333333333333333,
 ('b', 'd', 'e'): 0.0,
 ('b', 'e'): 0.3333333333333333,
 ('c',): 0.3333333333333333,
 ('c', 'd'): 0.0,
 ('c', 'd', 'e'): 0.0,
 ('c', 'e'): 0.0,
 ('d',): 0.3333333333333333,
 ('d', 'e'): 0.0,
 ('e',): 0.3333333333333333}

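Printing only combinations of length 2

If you only want the combinations of length two, you can remove the loop

for k in range(1, len(df.item.unique())):

and set k = 2 instead. In that case the answer will be: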
pd.Series(out)

a  b    1.000000
   c    0.333333
   d    0.333333
   e    0.333333
b  c    0.333333
   d    0.333333
   e    0.333333
c  d    0.000000
   e    0.000000
d  e    0.000000
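For the flat item1, item2, probability layout sketched in the question, the pairs can also be computed directly without building query strings. This is a minimal sketch under the same assumptions as above: the boolean table e is rebuilt from the sample data, and the names rows and pairs are illustrative only.

```python
from itertools import combinations

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "item": list("abcabdabe"),
})
e = pd.crosstab(df["order_id"], df["item"]) > 0

rows = []
for item1, item2 in combinations(e.columns, 2):
    # The mean of a boolean Series is the fraction of orders where both are True
    rows.append((item1, item2, (e[item1] & e[item2]).mean()))

pairs = pd.DataFrame(rows, columns=["item1", "item2", "probability"])
pairs = pairs.sort_values("probability", ascending=False).reset_index(drop=True)

print(pairs)
```

Sorting by probability puts the most frequent pair first, which for the sample data is a, b.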

So what is your expected output?

Updated the question, thanks!

Thanks! To get the probability, do I just need to divide the result of [251] by df['order_id'].nunique()?

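from itertools import combinations

# Support of every item subset: the fraction of orders that contain all of its items.
# e is the boolean order-by-item table shown above.
out = {}
items = df['item'].unique()
# Note: range(1, len(items)) stops one short of the full item set,
# which matches the output listed above.
for k in range(1, len(items)):
    for c in combinations(items, k):
        # e.g. ('a', 'b') becomes the query string "a & b"
        q = " & ".join(c)
        out[c] = len(e.query(q)) / len(e)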
