Python: how to optimize combining two lists of tuples and removing their duplicates?


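From here, I am able to remove duplicates based on the 2nd element of the tuples in a single list of tuples.

Suppose I have two lists of tuples: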

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]
If the 2nd element is the same in both lists, I need to combine the scores and get the desired output:

clist = [(0.51,'this is a foo bar sentence'), # 0.51 = 0.789 * 0.646
(0.105, 'this is not really a foo bar')] # 0.105 = 0.325 * 0.323
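Currently I build clist as shown, but it takes more than 5 seconds when alist and blist each have around 5500+ tuples, where the 2nd element is about 20-40 words long. Is there a way to make the following function faster?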


I would use a dictionary/set to eliminate the duplicates and provide fast lookups:

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]

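# Map sentence -> score for blist; reversed() makes the first occurrence win.
bdict = {k: v for v, k in reversed(blist)}
clist = []
cset = set()  # sentences from alist that have already been combined
for v, k in alist:
    if k not in cset:
        b = bdict.get(k)
        if b is not None:
            clist.append((v * b, k))
            cset.add(k)
print(clist)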
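Here, bdict is used to drop all but the first occurrence of each sentence in blist and to provide fast lookup by sentence; cset does the same deduplication for alist.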
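The reason `reversed` keeps the first occurrence can be shown in isolation: in a dict comprehension, a later entry overwrites an earlier one with the same key, so iterating backwards lets the earliest tuple be written last. A minimal sketch, assuming made-up sample data (`pairs`, `last_wins` and `first_wins` are hypothetical names, not from the question):

```python
# Hypothetical sample data: (score, sentence) tuples with a repeated sentence.
pairs = [(0.9, 'a'), (0.5, 'b'), (0.1, 'a')]

# Forward iteration: the last tuple for each sentence overwrites earlier ones.
last_wins = {k: v for v, k in pairs}

# Reversed iteration: the first tuple of the original order is written last,
# so it is the one that survives in the dict.
first_wins = {k: v for v, k in reversed(pairs)}

print(last_wins['a'])   # 0.1
print(first_wins['a'])  # 0.9
```

This only yields the highest score per sentence when the input list is already sorted by score in descending order.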

If you don't care about the order of clist, you can simplify the structure a little:

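bdict = {k: v for v, k in reversed(blist)}
cdict = {}  # sentence -> combined score
for v, k in alist:
    if k not in cdict:
        b = bdict.get(k)
        if b is not None:
            cdict[k] = v * b
# cdict maps sentence -> score, so swap back to (score, sentence) tuples
print([(score, sentence) for sentence, score in cdict.items()])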

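Assuming the tuples are sorted by their first item in descending order, this keeps the tuple with the highest first item when a single list contains duplicates, and combines the scores from both lists when the corresponding second items are the same: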

# remove duplicates (take the 1st item among duplicates)
a, b = [{sentence: score for score, sentence in reversed(lst)}
        for lst in [alist, blist]]

# merge (keep only tuples whose 2nd items (sentences) occur in both lists)
# Python 3: dict.keys() is set-like; on Python 2.7 use viewkeys() instead
clist = [(a[s]*b[s], s) for s in a.keys() & b.keys()]
clist.sort(reverse=True) # sort by (score, sentence) in descending order
print(clist)
Output:

[(0.510496368389, 'this is a foo bar sentence'),
 (0.10523121352499999, 'this is not really a foo bar')]
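The approach above relies on each list already being sorted by score in descending order. If that cannot be guaranteed, one possible variation is to track the maximum score per sentence explicitly. This is a sketch under that assumption, written for Python 3; `best_scores`, `merge_best` and the shuffled sample lists are hypothetical names introduced here for illustration:

```python
def best_scores(pairs):
    """Map each sentence to its highest score, regardless of input order."""
    best = {}
    for score, sentence in pairs:
        if sentence not in best or score > best[sentence]:
            best[sentence] = score
    return best


def merge_best(first, second):
    """Multiply the best scores of sentences that occur in both lists."""
    a, b = best_scores(first), best_scores(second)
    merged = [(a[s] * b[s], s) for s in a.keys() & b.keys()]
    merged.sort(reverse=True)  # highest combined score first
    return merged


# Sample data from the question, deliberately out of order.
alist_unsorted = [(0.353234, 'this is a foo bar sentence'),
                  (0.7897897, 'this is a foo bar sentence'),
                  (0.325345, 'this is not really a foo bar')]
blist_unsorted = [(0.453234, 'this is a foo bar sentence'),
                  (0.534234, 'i am going to foo bar this sentence'),
                  (0.64637, 'this is a foo bar sentence'),
                  (0.323445, 'this is not really a foo bar')]

result = merge_best(alist_unsorted, blist_unsorted)
print(result)
```

Each list is still traversed only once, so the cost stays linear in the number of tuples.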

Have you considered a dictionary?

There is a problem with this: bdict = {k:v for v,k in reversed(blist)}. When there are two sentences 'this is a foo bar sentence', only the latest one will have its value in the dict. In a dict comprehension or constructor you cannot assign multiple values to the same key; you need to iterate over the list of tuples and append, preferably using defaultdict(list).

@InbarRose: The first one is retained, and that is by design, because of reversed. If I understand the OP correctly, he/she only wants to use the first entry, since the list is pre-ordered by reversed before the k,v => v,k conversion.

I don't see that wish written anywhere in the OP, only that duplicate sentences from alist and blist should have their values combined.

Managed to bring this function down to 0.015 seconds. Thanks for the tip on using dict/set instead of list; it was the sequential looping through the list that caused a lot of wasted cycles.
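The defaultdict(list) idea raised in the comments could look roughly like the sketch below. It collects every score per sentence instead of keeping only one. The question does not say how several scores for the same sentence should be combined, so the last step simply takes the first collected score from each list, which is an assumption that mirrors the first-occurrence behaviour of the answers; `group_scores` and `first_only` are hypothetical names.

```python
from collections import defaultdict


def group_scores(pairs):
    """Collect all scores for each sentence, preserving input order."""
    grouped = defaultdict(list)
    for score, sentence in pairs:
        grouped[sentence].append(score)
    return grouped


alist = [(0.7897897, 'this is a foo bar sentence'),
         (0.653234, 'this is a foo bar sentence'),
         (0.353234, 'this is a foo bar sentence'),
         (0.325345, 'this is not really a foo bar'),
         (0.323234, 'this is a foo bar sentence')]

blist = [(0.64637, 'this is a foo bar sentence'),
         (0.534234, 'i am going to foo bar this sentence'),
         (0.453234, 'this is a foo bar sentence'),
         (0.323445, 'this is not really a foo bar')]

a_scores = group_scores(alist)
b_scores = group_scores(blist)

# Every score seen for a sentence is kept, not just one of them.
print(a_scores['this is a foo bar sentence'])

# Assumption: use only the first collected score from each list.
first_only = [(a_scores[s][0] * b_scores[s][0], s)
              for s in a_scores.keys() & b_scores.keys()]
first_only.sort(reverse=True)
print(first_only)
```

Swapping `[0]` for `max(...)` or another aggregate changes the combination rule without touching the grouping step.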