Python: efficiently removing partial duplicates from a list of tuples

python list performance

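I have a list of tuples. Depending on the length of the tuples, the length of the list varies between 8 and 1000. Every tuple in the list is unique. Each tuple has length N, and each entry is a generic word.

An example tuple of length N:

(Word 1, Word 2, Word 3, ..., Word N)

For any tuple in the list, element j of that tuple is either '' or Word j.

A very simplified example with letters: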

l = [('A', 'B', '', ''), ('A', 'B', 'C', ''), 
     ('', '', '', 'D'), ('A', '', '', 'D'), 
     ('', 'B', '', '')]

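At each position, every tuple holds either the same value or ''. I want to remove every tuple whose non-'' values all appear at the same positions in another tuple. For example, ('A', 'B', '', '') has all of its non-'' values in ('A', 'B', 'C', ''), so it should be removed.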

filtered_l = [('A', 'B', 'C', ''), ('A', '', '', 'D')]

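The tuples always have the same length (not necessarily 4). The tuple length is between 2 and 10.

What is the fastest way to do this?

I'm not sure this is the most efficient way, but it is the straightforward approach (again, someone else may come up with a more sophisticated list-comprehension approach):

Take a look at this: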

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]

def item_in_list(item, l):
    for item2comp in l:
        if item!=item2comp:
            found = True
            for part,rhs_part in zip(item, item2comp):
                if part!='' and part!=rhs_part:
                    found = False
                    break
            if found:
                return True
    return False
            
                
            
new_arr = []
for item in l:
    if not item_in_list(item, l):
        new_arr.append(item)
print(new_arr)

Output:

[('A', 'B', 'C', ''), ('A', '', '', 'D')]

As I see it, the time complexity is O((N**2) * M), where

N is the number of elements in the list

M is the number of parts in each element
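The answer above mentions that a list-comprehension version is possible. Here is a minimal sketch of one, with the same O((N**2) * M) cost. The helper names `covers` and `remove_covered` are made up for illustration and are not part of the original answer.

```python
def covers(big, small):
    """True if every non-empty entry of `small` matches `big` at the same position."""
    return all(s == '' or s == b for s, b in zip(small, big))


def remove_covered(tuples):
    # Keep a tuple only if no *other* tuple covers it.
    # Every pair is compared once per direction, hence O(N**2 * M).
    return [t for t in tuples
            if not any(t != other and covers(other, t) for other in tuples)]


l = [('A', 'B', '', ''), ('A', 'B', 'C', ''), ('', '', '', 'D'),
     ('A', '', '', 'D'), ('', 'B', '', '')]
print(remove_covered(l))
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]
```

This relies on the question's guarantee that the tuples are unique, so `t != other` is enough to skip the self-comparison.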

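Let's conceptualize each tuple as a binary array, where 1 means "contains something" and 0 means "contains an empty string". Since the item at each position is always the same, we don't need to care what is at each position, only that something is there.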

import collections

L = [('A', 'B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
keys = collections.defaultdict(lambda: collections.defaultdict(set))

# maintain a record of tuple-indices that contain each character in each position
for i,t in enumerate(L):
    for c,e in enumerate(t):
        if not e: continue
        keys[e][c].add(i)

delme = set()
for i,t in enumerate(L):
    collocs = set.intersection(*[keys[e][c] for c,e in enumerate(t) if e])
    if len(collocs)>1:  # if all characters appear in this position in >1 index
        # ignore the collocation with the most non-empty characters
        # mark the rest for deletion
        C = max(collocs, key=lambda i: sum(bool(e) for e in L[i]))
        for c in collocs:
            if c!=C: delme.add(c)

filtered = [t for i,t in enumerate(L) if i not in delme]

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
l_bin = [sum(2**i if k else 0 for i,k in enumerate(tup)) for tup in l]
# [3, 7, 8, 9, 2]
# [0b0011, 0b0111, 0b1000, 0b1001, 0b0010]
# that it's backwards doesn't really matter, since it's consistent

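Now we can iterate through that list and build a new data structure without the "duplicates". Since the tuples are encoded as binary numbers, we can detect a duplicate with a bitwise operation: given a and b, if a | b == a, then a must contain b.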

codes = {}
for tup, b in zip(l, l_bin):
    # check if any existing code contains the potential new one
    # in this case, skip adding the new one
    if any(a | b == a for a in codes):
        continue
    # check if the new code contains a potential existing one or more
    # in which case, replace the existing code(s) with the new code
    for a in list(codes):
        if b | a == b:
            codes.pop(a)
    # and finally, add this code to our datastructure
    codes[b] = tup

Now we can extract the "filtered" list of tuples:

output = list(codes.values())
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]

Note that ('A', 'B', 'C', '') contains both ('A', 'B', '', '') and ('', 'B', '', ''), and that ('A', '', '', 'D') contains ('', '', '', 'D'), so this should be correct.
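To make the containment test concrete, here is a small sketch of the `a | b == a` check on three of the example tuples. The helper names `to_code` and `contains` are made up for illustration. The encoding matches the `l_bin` comprehension used in this answer.

```python
def to_code(tup):
    # Bit i is set when position i holds a non-empty string.
    return sum(1 << i for i, value in enumerate(tup) if value)


def contains(a, b):
    # a contains b when OR-ing b into a adds no new bits.
    return a | b == a


abc = to_code(('A', 'B', 'C', ''))   # 0b0111
ab = to_code(('A', 'B', '', ''))     # 0b0011
ad = to_code(('A', '', '', 'D'))     # 0b1001

print(contains(abc, ab))  # True
print(contains(abc, ad))  # False
print(contains(ab, abc))  # False
```

An equivalent way to write the same test is `a & b == b`. Either form only compares which positions are filled, which is enough here because the question guarantees that a filled position always holds the same word.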

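Since Python 3.7, dict is guaranteed to preserve insertion order, so the output should be in the same order in which the tuples originally appeared in the list.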

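This solution won't be extremely efficient, since the number of codes can pile up, but it should land somewhere between O(n) and O(n^2), depending on how many unique codes are left at the end (and since the length of each tuple is significantly smaller than the length of l, it should be closer to O(n) than to O(n^2)).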

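Especially given these constraints, the obvious solution is to convert each tuple to a bit mask, accumulate the masks in a counter array, perform a subset-sum transform, and then filter the list.

See the comments in the code for a detailed explanation.

The time complexity is clearly n + m * 2^m, where n is the number of tuples and m is the length of each tuple. For n == 1000 and m == 10, that is 1000 + 10 * 1024 = 11240 steps, which is clearly faster than n^2 = 1000000.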

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
# assumes that l is not empty. (to access l[0])
# The case where l is empty is trivial to handle.

def tuple_to_mask(tuple_):
    # convert the information whether each value in (tuple_) is empty to a bit mask
    # (1 is empty, 0 is not empty)
    return sum((value == '') << index for index, value in enumerate(tuple_))


count = [0] * (1 << len(l[0]))
for tuple_ in l:
    # tuple_ is a tuple.
    count[tuple_to_mask(tuple_)] += 1

# now count[mask] is the number of tuples in l with that mask

# transform the count array.
for dimension in range(len(l[0])):
    for mask in range(len(count)):
        if mask >> dimension & 1:
            count[mask] += count[mask - (1 << dimension)]

# now count[mask] is the number of tuples in l with a mask (mask_) such that (mask) contains (mask_)
# (i.e. all the bits that are set in mask_ are also set in mask)


filtered_l = [tuple_ for tuple_ in l if count[tuple_to_mask(tuple_)] == 1]
print(filtered_l)
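To see what the transform loop does on its own, here is a small sketch that applies the same double loop to an arbitrary 3-bit counter array and checks it against a brute-force sum over submasks. The function names `subset_sums` and `brute_force`, and the sample counts, are made up for illustration.

```python
def subset_sums(count, bits):
    """Return a copy of `count` where entry `mask` sums count over all submasks of `mask`."""
    out = list(count)
    for dimension in range(bits):
        for mask in range(len(out)):
            if mask >> dimension & 1:
                # Add the entry that differs only by having this bit cleared.
                out[mask] += out[mask ^ (1 << dimension)]
    return out


def brute_force(count):
    # sub & mask == sub means every bit set in sub is also set in mask.
    return [sum(c for sub, c in enumerate(count) if sub & mask == sub)
            for mask in range(len(count))]


count = [1, 0, 2, 0, 0, 3, 0, 1]   # arbitrary counts for masks 0b000..0b111
print(subset_sums(count, 3))
# [1, 1, 3, 3, 1, 4, 3, 7]
assert subset_sums(count, 3) == brute_force(count)
```

The brute-force version costs about 4^m operations, while the loop from the answer needs only m * 2^m, which is where the stated complexity comes from.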

The strings are always in the same position, so I replaced them with booleans to make them easier to compare. First I sort, then I keep an element only if, compared with the next element, it is either true everywhere or equal to it. Once the comparison is done, I remove it from the list.
f = sorted(map(lambda x: list(map(bool, x)), l), key=sum, reverse=True)

to_keep = []

while len(f) > 1:
    if all(map(lambda x, y: True if x == y or x else False, f[0], f[1])):
        to_keep.append(len(l) - len(f) + 1)
    f = f[1:]

print([l[i] for i in to_keep])

At 43.7 µs, it is also twice as fast as the original.
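As far as I can tell, the loop above only compares neighbouring entries of the sorted list and then uses positions in the sorted list as indices into `l`, so it may not generalize beyond this example. Below is a sketch of the same sort-first idea that keeps track of the original indices and compares each candidate against every tuple kept so far. The function name `filter_by_size` is made up for illustration, and this is a variant, not the answerer's code.

```python
def filter_by_size(tuples):
    # Visit tuples from most to least filled, remembering original indices.
    order = sorted(range(len(tuples)),
                   key=lambda i: sum(map(bool, tuples[i])),
                   reverse=True)
    kept = []  # indices of tuples that no kept tuple covers
    for i in order:
        t = tuples[i]
        covered = any(all(not x or x == y for x, y in zip(t, tuples[j]))
                      for j in kept)
        if not covered:
            kept.append(i)
    # Restore the original list order.
    return [tuples[i] for i in sorted(kept)]


l = [('A', 'B', '', ''), ('A', 'B', 'C', ''), ('', '', '', 'D'),
     ('A', '', '', 'D'), ('', 'B', '', '')]
print(filter_by_size(l))
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]
```

Sorting first helps because any tuple that covers another one has strictly more filled positions, so it has already been visited by the time the smaller tuple is checked. The worst case is still quadratic in the number of kept tuples.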
Treat each sequence as a set. Now we simply discard all subsets.

Given
import itertools as it


expected = {("A", "B", "C", ""), ("A", "", "", "D")}
data = [
    ("A", "B", "", ""),
    ("A", "B", "C", ""), 
    ("", "", "", "D"), 
    ("A", "", "", "D"), 
    ("", "B", "", "")
]

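Code

An iterative solution that converts and compares sets:

def discard_subsets(pool: list) -> set:
    """Return a set without subsets."""
    discarded = set()
    for n, k in it.product(pool, repeat=2):    # 1
        if set(k) < set(n):                    # 2
            discarded.add(k)
    return set(pool) - discarded               # 3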

A similar one-line solution:
set(data) - {k for n, k in it.product(data, repeat=2) if set(k) < set(n)}


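Details

The iterative function is annotated with numbers to help explain each part:

1. Compare all elements with each other (or use nested loops).
2. If an element is a proper subset (see below), discard it.
3. Remove the discarded elements from the pool.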
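Step 1 mentions nested loops as an alternative to `it.product`. Here is a minimal sketch of that variant, which also converts each tuple to a set only once. The function name `discard_subsets_nested` is made up for illustration.

```python
def discard_subsets_nested(pool):
    """Nested-loop variant: return the pool as a set, minus proper subsets."""
    as_sets = [set(item) for item in pool]   # convert each tuple once
    discarded = set()
    for n in as_sets:                        # step 1: compare every pair
        for item, k in zip(pool, as_sets):
            if k < n:                        # step 2: k is a proper subset of n
                discarded.add(item)
    return set(pool) - discarded             # step 3


data = [
    ("A", "B", "", ""),
    ("A", "B", "C", ""),
    ("", "", "", "D"),
    ("A", "", "", "D"),
    ("", "B", "", ""),
]
print(discard_subsets_nested(data) == {("A", "B", "C", ""), ("A", "", "", "D")})
# True
```

Like the original function, this compares the values as sets and ignores positions, so it depends on each word only ever appearing in one position.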
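Why use sets

Each element of the pool can be treated as a set because the relevant sub-elements are unique, i.e. "A", "B", "C", "D" and "".

Sets have membership properties. For example, all the values in ("A", "B", "", "") are in ("A", "B", "C", "").

It can also be said that the set {"A", "B", "", ""} is a subset of {"A", "B", "C", ""}.

All that remains is to compare all elements and reject every proper subset.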
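a, a_, ac = {"a"}, {"a"}, {"a", "c"}
# Subsets
assert a.issubset(a_)
assert a <= ac
# Proper subsets
assert a < ac
assert not a < a_

Comments:

It looks like you have some good answers, but you may also want to look into set operations.

@kepr, I think I got it down to one line (below). See if it meets your requirements.

It should be sum(bool(e) for e in L[i]).

This doesn't work for me. For example, with L = [...] your solution returns ...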
discard_subsets(data)
# {('A', '', '', 'D'), ('A', 'B', 'C', '')}