How do you remove duplicates from a list in Python while preserving order?


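Is there a built-in that removes duplicates from a list in Python while preserving order? I know I can use a set to remove duplicates, but that destroys the original order. I also know I can roll my own like this: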

def uniq(input):
  output = []
  for x in input:
    if x not in output:
      output.append(x)
  return output


(Thanks for your help.)

But if possible, I'd like to use a built-in or a more Pythonic idiom.


Here are some alternatives:

Fastest one:

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

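Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add on each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to check the object each time.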
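To see whether pre-binding matters on a given interpreter, a rough timing sketch like the one below can help. The function names and the test data are made up for illustration, and the numbers will vary by Python version and machine, so no expected timings are shown.

```python
import timeit

def dedupe_attr_lookup(seq):
    # Looks up seen.add through the set object on every iteration.
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]

def dedupe_prebound(seq):
    # Binds seen.add to a local name once, as f7 does.
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

# Hypothetical test data with many repeats.
data = list(range(1000)) * 5

if __name__ == "__main__":
    for fn in (dedupe_attr_lookup, dedupe_prebound):
        elapsed = timeit.timeit(lambda: fn(data), number=200)
        print(f"{fn.__name__}: {elapsed:.3f}s")
```

Both functions return the same list; only the cost of the method lookup differs.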

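If you plan on using this function a lot on the same dataset, you might be better off with an ordered set:

O(1) insertion, deletion and membership check per operation.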
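The ordered-set implementation this answer pointed to is not reproduced here. As a minimal sketch of the idea, assuming Python 3.7+ where dict keeps insertion order, a hypothetical OrderedSet class can simply wrap a dict:

```python
class OrderedSet:
    """Hypothetical minimal ordered set backed by an insertion-ordered dict."""

    def __init__(self, iterable=()):
        self._items = dict.fromkeys(iterable)

    def add(self, item):
        # Re-adding an existing item keeps its original position.
        self._items[item] = None

    def discard(self, item):
        # Removing a missing item is a no-op.
        self._items.pop(item, None)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


s = OrderedSet([3, 1, 3, 2])
s.add(1)      # already present, order unchanged
s.add(5)
s.discard(3)
print(list(s))  # [1, 2, 5]
```

Insertion, deletion and membership checks are all dict operations, so they run in O(1) on average.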

(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)
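Stepping through the expression by hand may make the note clearer. This is just an illustration of how `x in seen or seen.add(x)` evaluates for a new and a repeated element:

```python
seen = set()

# set.add mutates the set in place and returns None.
print(seen.add(1))                     # None

# First time: `2 in seen` is False, so seen.add(2) runs and returns None.
# `False or None` is None, and `not None` is True, so the element is kept.
print(not (2 in seen or seen.add(2)))  # True

# Second time: `2 in seen` is True, so `or` short-circuits and add is not called.
# `not True` is False, so the element is filtered out.
print(not (2 in seen or seen.add(2)))  # False
```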

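The list doesn't even have to be sorted; the sufficient condition is that equal values are grouped together.

Edit: I assumed that "preserving order" implies that the list is actually ordered. If this is not the case, then the solution from MizardX is the right one.

Community edit: This is, however, the most elegant way to "compress duplicate consecutive elements into a single element".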
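The snippet this answer refers to is missing here. One common way to compress runs of equal adjacent elements is itertools.groupby; the sketch below assumes that technique and uses a made-up function name, so it may differ from what the answer originally showed.

```python
from itertools import groupby

def compress_runs(seq):
    # groupby yields one (key, group) pair per run of equal adjacent items.
    return [key for key, _group in groupby(seq)]

print(compress_runs([1, 1, 2, 2, 2, 3, 1, 1]))  # [1, 2, 3, 1]
print(compress_runs(sorted([3, 1, 3, 2, 1])))   # [1, 2, 3]
```

The first call shows why grouping matters: the trailing 1s form a separate run, so 1 appears twice in the result unless equal values are already adjacent.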

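If you need a one-liner, then maybe this would help (on Python 3, reduce has to be imported from functools first):

from functools import reduce
reduce(lambda x, y: x + y if y[0] not in x else x, map(lambda x: [x], lst))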

... should work, but correct me if I'm wrong. For unhashable types (e.g. a list of lists), based on MizardX's:

def f7_noHash(seq):
    seen = set()
    return [x for x in seq if str(x) not in seen and not seen.add(str(x))]
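Using str(x) as a stand-in key has a catch worth seeing in action. The sketch below restates the same idea as an explicit loop (the name dedupe_by_str is made up) and shows one case where it works and one where string forms and equality disagree:

```python
def dedupe_by_str(seq):
    # Same idea as f7_noHash: use str(x) as a hashable stand-in for x.
    seen = set()
    result = []
    for x in seq:
        key = str(x)
        if key not in seen:
            seen.add(key)
            result.append(x)
    return result

# Works for a list of lists, which a plain set cannot hold.
print(dedupe_by_str([[1, 2], [3], [1, 2]]))  # [[1, 2], [3]]

# Caveat: "1" is dropped although 1 != "1" (same string form),
# and 1.0 is kept although 1 == 1.0 (different string form).
print(dedupe_by_str([1, "1", 1.0]))          # [1, 1.0]
```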

MizardX's answer gives a good collection of multiple approaches.

This is what I came up with while thinking aloud:

mylist = [x for i,x in enumerate(mylist) if x not in mylist[i+1:]]
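One detail of the comprehension above is easy to miss: because it checks the rest of the list, it keeps the last occurrence of each value rather than the first. The short comparison below uses a made-up sample list to show both behaviours.

```python
mylist = [1, 2, 1, 3, 2]

# Original form: drop x if it appears again later, so the LAST occurrence survives.
keep_last = [x for i, x in enumerate(mylist) if x not in mylist[i + 1:]]
print(keep_last)   # [1, 3, 2]

# Variant: drop x if it already appeared earlier, so the FIRST occurrence survives.
keep_first = [x for i, x in enumerate(mylist) if x not in mylist[:i]]
print(keep_first)  # [1, 2, 3]
```

Both forms scan a slice for every element, so they are O(n²) and best kept for short lists.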

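You can reference a list comprehension while it is being built through the symbol '_[1]' (this relies on an undocumented detail of old CPython 2 releases and does not work in Python 3).
For example, the following function unique-ifies a list of elements without changing their order by referencing its own list comprehension.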

def unique(my_list): 
    return [x for x in my_list if x not in locals()['_[1]']]

Demo:

Output:

[1, 2, 3, 4, 5]

unique → ['1', '2', '3', '6', '4', '5']

I think if you want to maintain the order, you can try this:

Or similarly, you can do this:

You can also do this:

It can also be written as this:

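Edit 2020

As of CPython/PyPy 3.6 (and as a language guarantee in 3.7), plain dict is insertion-ordered, and even more efficient than the (also C-implemented) collections.OrderedDict. So the fastest solution, by far, is also the simplest: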

>>> items = [1, 2, 0, 1, 3, 2]
>>> list(dict.fromkeys(items))
[1, 2, 0, 3]

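Like list(set(items)), this pushes all the work to the C layer (on CPython), but since dicts are insertion-ordered, dict.fromkeys doesn't lose ordering. It's slower than list(set(items)) (typically takes 50-100% longer), but much faster than any other order-preserving solution (takes about half the time).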
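dict.fromkeys compares the items themselves. If uniqueness should be judged by a derived key instead, the same insertion-order property can be reused. The helper below is a sketch with a made-up name (dedupe_by_key), assuming Python 3.7+ for the ordering guarantee:

```python
def dedupe_by_key(items, key):
    # setdefault stores an item only the first time its key is seen, and the
    # dict remembers keys in insertion order (guaranteed from Python 3.7).
    first_seen = {}
    for item in items:
        first_seen.setdefault(key(item), item)
    return list(first_seen.values())

words = ["Apple", "banana", "apple", "BANANA", "cherry"]
print(dedupe_by_key(words, str.lower))  # ['Apple', 'banana', 'cherry']
```

The keys must be hashable, but the items themselves need not be.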

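Edit 2016

As Raymond pointed out, in Python 3.5+, where OrderedDict is implemented in C, the list comprehension approach will be slower than OrderedDict (unless you actually need the list at the end, and even then only if the input is very short). So the best solution for 3.5+ is OrderedDict.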

Important Edit 2015

As has been noted, the more_itertools library (pip install more_itertools) contains a unique_everseen function that is built to solve this problem without any unreadable (not seen.add) mutations in list comprehensions. This is also the fastest solution:

>>> from more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]

Just one simple library import and no hacks. This comes from an implementation of the itertools recipe unique_everseen, which looks like:

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

In Python 2.7+, the accepted common idiom for this (it works, but isn't optimized for speed; I would now use unique_everseen) uses:

Runtime: O(N)

This looks much nicer than:

seen = set()
[x for x in seq if x not in seen and not seen.add(x)]

and doesn't use the ugly hack:

not seen.add(x)

which relies on the fact that set.add is an in-place method that always returns None, so not None evaluates to True.

Note, however, that the hack solution is faster in raw speed, though it has the same runtime complexity O(N).

Borrowing the recursive idea used in defining Haskell's nub function for lists, this would be a recursive approach:

def unique(lst):
    # list(...) keeps this working on Python 3, where filter returns an iterator
    return [] if lst == [] else [lst[0]] + unique(list(filter(lambda x: x != lst[0], lst[1:])))

e.g.:

In [118]: unique([1,5,1,1,4,3,4])
Out[118]: [1, 5, 4, 3]

I tried it with growing data sizes and saw sub-linear time complexity (not definitive, but it suggests this should be fine for normal data).

In [122]: %timeit unique(np.random.randint(5, size=(1)))
10000 loops, best of 3: 25.3 us per loop

In [123]: %timeit unique(np.random.randint(5, size=(10)))
10000 loops, best of 3: 42.9 us per loop

In [124]: %timeit unique(np.random.randint(5, size=(100)))
10000 loops, best of 3: 132 us per loop

In [125]: %timeit unique(np.random.randint(5, size=(1000)))
1000 loops, best of 3: 1.05 ms per loop

In [126]: %timeit unique(np.random.randint(5, size=(10000)))
100 loops, best of 3: 11 ms per loop

I also think it's interesting that this could be readily generalized to uniqueness by other operations. Like this:

import operator

def unique(lst, cmp_op=operator.ne):
    return [] if lst == [] else [lst[0]] + unique(list(filter(lambda x: cmp_op(x, lst[0]), lst[1:])), cmp_op)

For example, you could pass in a function that treats rounding to the same integer as "equality" for uniqueness purposes, like this:

def test_round(x, y):
    return round(x) != round(y)

Then unique(some_list, test_round) would provide the unique elements of the list, where uniqueness no longer means traditional equality (which is implied by using any set-based or dict-key-based approach to this problem) but instead means taking only the first element that rounds to K, for each possible integer K that the elements might round to, e.g.:

In [6]: unique([1.2, 5, 1.9, 1.1, 4.2, 3, 4.8], test_round)
Out[6]: [1.2, 5, 1.9, 4.2, 3]

A relatively effective approach with sorted numpy arrays:

import numpy as np

b = np.array([1, 3, 3, 8, 12, 12, 12])
np.hstack([b[0], [x[0] for x in zip(b[1:], b[:-1]) if x[0] != x[1]]])

Outputs:

array([ 1,  3,  8, 12])

Another very late answer to another very old question: there is a function that does this using the seen set technique, but it handles a standard key function, uses no unseemly hacks, and optimizes the loop by pre-binding seen.add instead of looking it up N times. (f7 also does this, but …)

import itertools

def unique(iterable):
    seen = set()
    seen_add = seen.add
    # itertools.filterfalse is named ifilterfalse on Python 2
    for element in itertools.filterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element

Other approaches from the remaining answers:

[l[i] for i in range(len(l)) if l.index(l[i]) == i]

l = [1,2,2,3,3,...]
n = []
n.extend(ele for ele in l if ele not in set(n))

>>> from functools import reduce
>>> l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]

default = (list(), set())
# use list to keep order
# use set to make lookup faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

>>> reduce(reducer, l, default)[0]
[5, 6, 1, 2, 3, 4]

def uniquefy_list(a):
    return uniquefy_list(a[1:]) if a[0] in a[1:] else [a[0]] + uniquefy_list(a[1:]) if len(a) > 1 else [a[0]]

import pandas as pd
import numpy as np

uniquifier = lambda alist: pd.Series(alist).drop_duplicates().tolist()

# from the chosen answer
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

alist = np.random.randint(low=0, high=1000, size=10000).tolist()

print(uniquifier(alist) == f7(alist))  # True

In [104]: %timeit f7(alist)
1000 loops, best of 3: 1.3 ms per loop

In [110]: %timeit uniquifier(alist)
100 loops, best of 3: 4.39 ms per loop

def deduplicate(l):
    # In-place: compacts the first occurrences to the front of l
    count = {}
    (read, write) = (0, 0)
    while read < len(l):
        if l[read] in count:
            read += 1
            continue
        count[l[read]] = True
        l[write] = l[read]
        read += 1
        write += 1
    return l[0:write]

text = "ask not what your country can do for you ask what you can do for your country"
sentence = text.split(" ")
noduplicates = [(sentence[i]) for i in range(0, len(sentence)) if sentence[i] not in sentence[:i]]
print(noduplicates)
['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

>>> from iteration_utilities import unique_everseen
>>> lst = [1,1,1,2,3,2,2,2,1,3,4]
>>> list(unique_everseen(lst))
[1, 2, 3, 4]

%matplotlib notebook

from iteration_utilities import unique_everseen
from collections import OrderedDict
from more_itertools import unique_everseen as mi_unique_everseen

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def iteration_utilities_unique_everseen(seq):
    return list(unique_everseen(seq))

def more_itertools_unique_everseen(seq):
    return list(mi_unique_everseen(seq))

def odict(seq):
    return list(OrderedDict.fromkeys(seq))

from simple_benchmark import benchmark

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: list(range(2**i)) for i in range(1, 20)},
              'list size (no duplicates)')
b.plot()

import random

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(1, 20)},
              'list size (lots of duplicates)')
b.plot()

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [1]*(2**i) for i in range(1, 20)},
              'list size (only duplicates)')
b.plot()

>>> lst = [{1}, {1}, {2}, {1}, {3}]
>>> list(unique_everseen(lst))
[{1}, {2}, {3}]

import pandas as pd

my_list = [0, 1, 2, 3, 4, 1, 2, 3, 5]

>>> pd.Series(my_list).drop_duplicates().tolist()
# Output:
# [0, 1, 2, 3, 4, 5]

>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> list(dict.fromkeys(lst))
[1, 2, 3, 4]

for i in range(len(l)-1,0,-1):
    if l[i] in l[:i]: del l[i]

In [91]: from random import randint, seed
In [92]: seed('20080808') ; l = [randint(1,6) for _ in range(12)] # Beijing Olympics
In [93]: for i in range(len(l)-1,0,-1):
    ...:     print(l)
    ...:     print(i, l[i], l[:i], end='')
    ...:     if l[i] in l[:i]:
    ...:          print( ': remove', l[i])
    ...:          del l[i]
    ...:     else:
    ...:          print()
    ...: print(l)
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5, 2]
11 2 [6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5]: remove 2
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5]
10 5 [6, 5, 1, 4, 6, 1, 6, 2, 2, 4]: remove 5
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4]
9 4 [6, 5, 1, 4, 6, 1, 6, 2, 2]: remove 4
[6, 5, 1, 4, 6, 1, 6, 2, 2]
8 2 [6, 5, 1, 4, 6, 1, 6, 2]: remove 2
[6, 5, 1, 4, 6, 1, 6, 2]
7 2 [6, 5, 1, 4, 6, 1, 6]
[6, 5, 1, 4, 6, 1, 6, 2]
6 6 [6, 5, 1, 4, 6, 1]: remove 6
[6, 5, 1, 4, 6, 1, 2]
5 1 [6, 5, 1, 4, 6]: remove 1
[6, 5, 1, 4, 6, 2]
4 6 [6, 5, 1, 4]: remove 6
[6, 5, 1, 4, 2]
3 4 [6, 5, 1]
[6, 5, 1, 4, 2]
2 1 [6, 5]
[6, 5, 1, 4, 2]
1 5 [6]
[6, 5, 1, 4, 2]

# for hashable sequence
def remove_duplicates(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

a = [1, 5, 2, 1, 9, 1, 5, 10]
list(remove_duplicates(a))
# [1, 5, 2, 9, 10]

# for unhashable sequence
def remove_duplicates(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)

a = [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 1, 'y': 2}, {'x': 2, 'y': 4}]
list(remove_duplicates(a, key=lambda d: (d['x'], d['y'])))
# [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]

def DelDupes(aseq):
    seen = set()
    return [x for x in aseq if (x.lower() not in seen) and (not seen.add(x.lower()))]

def HasDupes(aseq):
    s = set()
    return any(((x.lower() in s) or s.add(x.lower())) for x in aseq)

def GetDupes(aseq):
    s = set()
    return set(x for x in aseq if ((x.lower() in s) or s.add(x.lower())))

list1 = ["hello", " ", "w", "o", "r", "l", "d"]
sorted(set(list1), key=lambda x: list1.index(x))

["hello", " ", "w", "o", "r", "l", "d"]

>>> import pandas as pd
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> pd.unique(lst)
array([1, 2, 3, 4])

def solve(arr):
    # Reversing twice keeps the last occurrence of each value
    return list(dict.fromkeys(arr[::-1]))[::-1]

x = [1, 2, 1, 3, 1, 4]

# brute force method
arr = []
for i in x:
    if not i in arr:
        arr.append(i)  # append keeps first-seen order; insert(x[i], i) used a value as an index

# recursive method
tmp = []
def remove_duplicates(j=0):
    if j < len(x):
        if not x[j] in tmp:
            tmp.append(x[j])
        i = j+1
        remove_duplicates(i)

remove_duplicates()
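With this many variants in one thread, a quick cross-check can confirm that the first-occurrence approaches agree with each other. The harness below is a sketch: the wrapper names (via_dict, via_index, via_sorted_set) and the random sample are made up for illustration, and it only covers hashable items.

```python
import random

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def via_dict(seq):
    # Relies on dict insertion order (Python 3.7+).
    return list(dict.fromkeys(seq))

def via_index(seq):
    # Keeps x only at the position of its first occurrence; O(n^2).
    return [x for i, x in enumerate(seq) if seq.index(x) == i]

def via_sorted_set(seq):
    # Sorts the unique values by the index of their first occurrence; O(n^2).
    return sorted(set(seq), key=seq.index)

random.seed(0)  # fixed seed so the run is repeatable
sample = [random.randint(0, 20) for _ in range(200)]

results = [fn(sample) for fn in (f7, via_dict, via_index, via_sorted_set)]
print(all(r == results[0] for r in results))  # True
```

Approaches that keep the last occurrence instead, such as solve above, would fail this check on purpose, since they answer a slightly different question.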