Python: counting matching combinations in a DataFrame


python pandas


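I need a more efficient solution to the following problem:

I am given a DataFrame in which every row has 4 variables. I need to find the list of 8 elements that contains all the variables of a row for the largest possible number of rows.

A working but very slow solution is to build a second DataFrame containing every possible combination (essentially a permutation without repetition), then loop over each combination and compare it with the original DataFrame. The number of matching rows is counted and stored in the second DataFrame.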

import numpy as np
import pandas as pd
from itertools import combinations

# 100 rows of 4 variables, each drawn from 20 possible values ('x0'..'x19')
df = pd.DataFrame(np.random.randint(0, 20, size=(100, 4)), columns=list('ABCD'))
df = 'x' + df.astype(str)

# unique values across all four columns, in order of first appearance
listofvalues = df['A'].tolist()
listofvalues.extend(df['B'].tolist())
listofvalues.extend(df['C'].tolist())
listofvalues.extend(df['D'].tolist())
listofvalues = list(dict.fromkeys(listofvalues))

# every 6-element combination of the unique values
possiblecombinations = list(combinations(listofvalues, 6))
dfcombi = pd.DataFrame(possiblecombinations, columns=['M', 'N', 'O', 'P', 'Q', 'R'])
dfcombi['List'] = dfcombi.M.map(str) + ',' + dfcombi.N.map(str) + ',' + dfcombi.O.map(str) + ',' + dfcombi.P.map(str) + ',' + dfcombi.Q.map(str) + ',' + dfcombi.R.map(str)
dfcombi['Count'] = 0

# for each combination, count the rows whose four values all belong to it
for x, row in dfcombi.iterrows():
    comparelist = row['List'].split(',')
    pointercounter = df.index[df['A'].isin(comparelist) & df['B'].isin(comparelist) & df['C'].isin(comparelist) & df['D'].isin(comparelist)].tolist()
    # write through the frame: `row` is a copy, so assigning to it would be lost
    dfcombi.at[x, 'Count'] = len(pointercounter)

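I assume there must be a way to avoid the for loop and replace it with some kind of index-based lookup; I just don't know how.

Thanks

Your code can be rewritten as: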

# working with integers is much faster than working with strings
# note: factorize() returns (integer codes, unique values), so `enums` holds
# the integer code of every cell and `codes` holds the unique values
enums, codes = df.stack().factorize()

# each row of df as a set of integer codes
s = [set(x) for x in enums.reshape(-1, 4)]

# all possible 6-element combinations of the integer codes
from itertools import combinations, product
possiblecombinations = np.array([set(x) for x in combinations(range(len(codes)), 6)])

# for every combination, count the rows that are a subset of it
ret = [0] * len(possiblecombinations)
for a, (i, b) in product(s, enumerate(possiblecombinations)):
    ret[i] += a.issubset(b)

# the combination with the maximum count
max_combination = possiblecombinations[np.argmax(ret)]
# as integer codes, e.g. {0, 3, 4, 5, 17, 18} (depends on the random data)

# and as the original values:
codes[list(max_combination)]
# e.g. Index(['x5', 'x15', 'x12', 'x8', 'x0', 'x6'], dtype='object')

All of this took about 2 seconds, whereas your code took about 1.5 minutes.
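
The counting step above still iterates in Python over every (row, combination) pair. As a rough sketch of one way to move that step into NumPy, each row and each combination can be encoded as an integer bitmask, so that "row is a subset of combination" becomes a single bitwise test. This is a swapped-in technique, not the method used in the answer. The names `row_masks`, `combo_masks` and `covered`, the random seed, and the reduced sizes (50 rows, 12 possible values) are assumptions made for illustration, and the approach only works while there are at most 63 unique values so that every mask fits into an `int64`.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Small reproducible stand-in for the question's frame (sizes and seed are assumptions).
rng = np.random.default_rng(0)
values = np.char.add("x", rng.integers(0, 12, size=(50, 4)).astype(str))
df = pd.DataFrame(values.astype(object), columns=list("ABCD"))

# Integer code for every cell, plus the array of unique values.
cell_codes, uniques = pd.factorize(df.to_numpy().ravel())
row_codes = cell_codes.reshape(-1, df.shape[1]).astype(np.int64)

# One bitmask per row: bit j is set when value j occurs in that row.
row_masks = np.bitwise_or.reduce(np.int64(1) << row_codes, axis=1)

# One bitmask per k-element combination of the integer codes.
k = 6
combos = np.array(list(combinations(range(len(uniques)), k)), dtype=np.int64)
combo_masks = np.bitwise_or.reduce(np.int64(1) << combos, axis=1)

# A row is covered when it has no bit outside the combination's mask.
# Broadcasting builds a boolean array of shape (n_combinations, n_rows).
covered = (row_masks[None, :] & ~combo_masks[:, None]) == 0
counts = covered.sum(axis=1)

best = int(counts.argmax())
best_values = uniques[combos[best]]
print(best_values, counts[best])
```

The intermediate boolean array holds one entry per (combination, row) pair, so for much larger inputs the combinations would need to be processed in chunks.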

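How long is your actual data, and how many unique values does it contain? Are they the same as in the sample, i.e. 100 and 20?

It varies. What I posted is the worst case. Yes, the data are similar to the example; the only difference is that the values can also be strings (such as 12X5DE). There are more columns, but they carry no necessary information.

It works! And it is fast! I will try to understand exactly what you did there. Thank you very much.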
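
To make the steps of the answer easier to follow, here is a minimal sketch on a tiny hypothetical frame. The three rows and the two candidate sets are invented for illustration. It uses arbitrary string values such as 12X5DE to show that the integer encoding does not depend on what the values look like.

```python
import pandas as pd

# Tiny hypothetical frame; the values are arbitrary strings.
df = pd.DataFrame(
    [["12X5DE", "a", "b", "c"],
     ["a", "b", "c", "d"],
     ["d", "12X5DE", "a", "b"]],
    columns=list("ABCD"),
)

# factorize assigns an integer code to each distinct value,
# in order of first appearance (read row by row here).
codes, uniques = pd.factorize(df.to_numpy().ravel())
print(codes.tolist())    # [0, 1, 2, 3, 1, 2, 3, 4, 4, 0, 1, 2]
print(list(uniques))     # ['12X5DE', 'a', 'b', 'c', 'd']

# Each row becomes a set of integer codes.
rows = [set(r) for r in codes.reshape(-1, 4).tolist()]

# A candidate combination "matches" a row when the row is a subset of it.
small = {0, 1, 2, 3}
large = {0, 1, 2, 3, 4}
small_count = sum(r.issubset(small) for r in rows)
large_count = sum(r.issubset(large) for r in rows)
print(small_count, large_count)   # 1 3
```

Only the first row fits inside `small`, while `large` contains every code and therefore covers all three rows. The answer applies the same subset test to every 6-element combination and keeps the one with the highest count.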