Python: fastest nested loops over a single list (with or without removing elements)

python performance list

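I'm looking for advice on the fastest way to process a single list with two nested loops, avoiding len(list)^2 comparisons and avoiding putting the same file into more than one group.

More precisely: I have a list of 'file' objects, each of which has a timestamp. I want to group the files by their timestamp and a time offset. For example, starting from a file X, I want to create a group containing all the files with timestamp < (timestamp(X) + offset).

For this, I did: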

for file_a in list:
   temp_group = group()
   temp_group.add(file_a)
   list.remove(file_a)
   for file_b in list:
      if (file_b.timestamp < (file_a.timestamp + offset)):
         temp_group.add(file_b)
         list.remove(file_b)

   groups.add(temp_group)

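(OK, the real code is more complicated, but this is the main idea.)

This obviously doesn't work, because I'm modifying the list while looping over it, and strange things happen :)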
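The "strange things" can be reproduced on a plain list of integers. This is only an illustrative sketch (the names `items` and `visited` are made up for the demo): the list iterator advances by index, so every removal shifts the remaining elements left and the loop silently skips the element that slid into the current slot.

```python
# Demo: removing from a list while iterating over it skips elements.
items = [1, 2, 3, 4, 5, 6]
visited = []

for x in items:
    visited.append(x)
    items.remove(x)  # shifts everything after x one slot to the left

print(visited)  # only every other element was visited
print(items)    # the skipped elements are still in the list
```

Half of the elements are never visited, which is why groups built this way come out incomplete.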

I thought I had to loop over a copy of 'list', but this doesn't work either:

for file_a in list[:]:
   temp_group = group()
   temp_group.add(file_a)
   list.remove(file_a)
   for file_b in list[:]:
      if (file_b.timestamp < (file_a.timestamp + offset)):
         temp_group.add(file_b)
         list.remove(file_b)

   groups.add(temp_group)
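A likely reason the copy-based version fails: the outer loop walks a snapshot taken before anything was removed, so it later reaches a file that the inner loop has already grouped and removed, and removing it a second time raises an error. The sketch below assumes plain lists in place of the question's `group()` and `groups` containers and uses a hypothetical `File` namedtuple as a stand-in for the file objects.

```python
from collections import namedtuple

# Hypothetical stand-in for the question's file objects.
File = namedtuple("File", "name timestamp")

def group_with_copies(files, offset):
    groups = []
    for file_a in files[:]:          # snapshot taken once, before any removal
        temp_group = [file_a]
        files.remove(file_a)         # raises ValueError if file_a was already grouped
        for file_b in files[:]:
            if file_b.timestamp < file_a.timestamp + offset:
                temp_group.append(file_b)
                files.remove(file_b)
        groups.append(temp_group)
    return groups

files = [File("a", 0), File("b", 1), File("c", 10)]
try:
    group_with_copies(files, 5)
    error = None
except ValueError as exc:
    error = exc

print(repr(error))  # "b" was grouped with "a", then visited again by the outer loop
```

Guarding the outer loop with `if file_a not in files: continue` would avoid the error, but that membership test is itself a linear scan of the list.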

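Well... I know I could do this without removing elements from the list, but then I would need to mark the elements that have already been 'processed' and check them every time, which is a speed penalty.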
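For comparison, the "mark as processed" variant might look like the sketch below. It assumes a parallel list of booleans (`done`) and a hypothetical `File` namedtuple; neither comes from the original code. Each check is a constant-time index lookup, although the number of comparisons is still quadratic in the worst case.

```python
from collections import namedtuple

# Hypothetical stand-in for the question's file objects.
File = namedtuple("File", "name timestamp")

def group_with_flags(files, offset):
    done = [False] * len(files)      # done[i] is True once files[i] is in a group
    groups = []
    for i, file_a in enumerate(files):
        if done[i]:
            continue
        done[i] = True
        temp_group = [file_a]
        # Everything before index i is already grouped, so start at i + 1.
        for j in range(i + 1, len(files)):
            file_b = files[j]
            if not done[j] and file_b.timestamp < file_a.timestamp + offset:
                done[j] = True
                temp_group.append(file_b)
        groups.append(temp_group)
    return groups

files = [File("a", 0), File("b", 1), File("c", 10), File("d", 3), File("e", 12)]
groups = group_with_flags(files, 5)
print([[f.name for f in g] for g in groups])
```

The input list is left untouched, so nothing shifts during iteration.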

Can anyone give me some advice on the fastest/best way to do this?

Thanks,

Alex

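EDIT: I have found another solution. It doesn't exactly answer the question, but it is what I actually needed (my mistake for asking the question that way). I'm posting it here because it may help people looking for issues related to looping over lists in Python.

It may not be the fastest (considering the number of 'passes' over the list), but it is easy to understand and implement, and it does not require sorting the list.

The reason I avoid sorting is that it may take more time, and because after I create the first set of groups, some of them will be 'locked', while the unlocked groups will be 'dissolved' and regrouped using a different time offset. (When groups are dissolved, the file order may change, and the list would need sorting again.)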

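Anyway, the solution was to control the loop index myself. If I remove a file from the list, I skip incrementing the index (e.g. when I remove index "3", the element previously at index "4" is now at "3", and I don't want to increment the loop counter, because I would skip over it). If no item is removed in that iteration, the index is incremented normally. The code is the `regroup` method listed further below (it includes some extras; ignore all the 'bucket' stuff).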
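Stripped of the grouping logic, the index bookkeeping described above comes down to a few lines. The helper name `remove_matching` and the predicate argument are made up for this sketch.

```python
def remove_matching(items, should_remove):
    """Remove matching elements in place and return them, managing the index by hand."""
    removed = []
    i = 0
    while i < len(items):
        if should_remove(items[i]):
            # Later elements shift left, so the same index now holds the next element.
            removed.append(items.pop(i))
        else:
            i += 1  # nothing removed: advance normally
    return removed

numbers = [1, 2, 2, 3, 4, 4, 5]
evens = remove_matching(numbers, lambda n: n % 2 == 0)
print(evens)    # the removed elements, in their original order
print(numbers)  # what is left in the list
```

Note that `list.pop(i)` itself shifts the tail of the list, so this pattern is still quadratic in the worst case; it fixes correctness, not complexity.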

A simple solution: sort the list, then use a generator to create the groups (the `time_offsets` function listed below).

Here is a complete script that you can run to test it (the listing below that starts with `class File`).

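# The asker's index-managed solution. FilesGroup and the c_bucket_* settings
# are defined elsewhere in the asker's code.
import copy  # needed for copy.copy below

def regroup(self, time_offset):
    # create the list of files to be used for regrouping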
    regroup_files_list = []

    if len(self.groups) == 0:
        #on first 'regroup', we start with a copy of jpeg_list, so that we do not change it further on
        regroup_files_list = copy.copy(self.jpeg_list) 

    else:
        i = 0
        while True:
            try:
                group = self.groups[i]
            except IndexError:
                break

            if group.is_locked == False:
                regroup_files_list.extend(group)                    
                self.groups.remove(group)
                continue
            else:
                i += 1

    bucket_group = FilesGroup()
    bucket_group.name = c_bucket_group_name

    while len(regroup_files_list) > 0: #we create groups until there are no files left
        file_a = regroup_files_list[0]
        regroup_files_list.remove(file_a)

        temp_group = FilesGroup()
        temp_group.start_time = file_a._iso_time
        temp_group.add(file_a)

        #manually manage the list index when iterating for file_b, because we're removing files
        i = 0

        while True:
            try:
                file_b = regroup_files_list[i]
            except IndexError:
                break

            # abs() of a timedelta gives the distance regardless of which file is older
            timediff = abs(file_a._iso_time - file_b._iso_time)

            if timediff < time_offset:
                temp_group.add(file_b)
                regroup_files_list.remove(file_b)
                continue # :D we reuse the old position, because all elements were shifted to the left

            else:
                i += 1 #the index is increased normally

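        # keep the group unless bucketing is enabled and the group is too small,
        # in which case its files move to the bucket group instead
        if c_bucket_group_enabled and len(temp_group) < c_bucket_group_min_count:
            # iterate over a snapshot: removing from temp_group while iterating it skips files
            for file in list(temp_group):
                bucket_group.add(file)
                temp_group.remove(file)
        else:
            self.groups.append(temp_group)  # appended exactly once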

    if len(bucket_group) > 0:
        self.groups.append(bucket_group)

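# Sort-based generator from the answer: yields one group (a list of files) at a time.
def time_offsets(files, offset):
    # Sort by timestamp so that every group is a contiguous run.
    files = sorted(files, key=lambda x: x.timestamp)

    group = []
    timestamp = None  # timestamp of the first file in the current group

    for f in files:
        if group and f.timestamp < timestamp + offset:
            group.append(f)
        else:
            if group:  # never yield an empty group
                yield group
            timestamp = f.timestamp
            group = [f]  # start the new group with the file itself
    if group:
        yield group

# Now you can do this (files is any iterable of objects with a .timestamp attribute)...
for group in time_offsets(files, 86400):
    print(group)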

import random

class File:
    def __init__(self, timestamp):
        self.timestamp = timestamp

    def __repr__(self):
        return "File: <%d>" % self.timestamp

def gen_files(num=100):
    # Generate num files with random timestamps for testing.
    files = []
    for i in range(num):
        timestamp = random.randint(0, 1000000)
        files.append(File(timestamp))
    return files

def time_offsets(files, offset):
    # Sort by timestamp so that every group is a contiguous run.
    files = sorted(files, key=lambda x: x.timestamp)

    group = []
    timestamp = None  # timestamp of the first file in the current group

    for f in files:
        if group and f.timestamp < timestamp + offset:
            group.append(f)
        else:
            if group:  # never yield an empty group
                yield group
            timestamp = f.timestamp
            group = [f]  # start the new group with the file itself
    if group:
        yield group

# Now you can do this to group files by day (assuming timestamp in seconds)
files = gen_files()
for group in time_offsets(files, 86400):
    print(group)

# Another answer, as a pseudocode outline: sort by timestamp, then group adjacent elements.
listA = getListOfFiles()
listB = stableMergesort(listA, lambda el: el.timestamp)
listC = groupAdjacentElementsByTimestampRange(listB, offset)
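One possible Python reading of the three steps above, offered as a sketch rather than the answerer's actual implementation. It assumes the built-in `sorted` (which is stable) in place of `stableMergesort`, a hypothetical `File` namedtuple for the file objects, and a binary search from the standard `bisect` module to find where each group ends. Each group is measured from its first file, matching the `timestamp < timestamp(X) + offset` rule in the question.

```python
from bisect import bisect_left
from collections import namedtuple

# Hypothetical stand-in for the question's file objects.
File = namedtuple("File", "name timestamp")

def group_sorted_by_range(files, offset):
    if offset <= 0:
        raise ValueError("offset must be positive")
    ordered = sorted(files, key=lambda f: f.timestamp)  # sorted() is stable
    stamps = [f.timestamp for f in ordered]
    groups = []
    start = 0
    while start < len(ordered):
        # First index at or after start whose timestamp is >= group start + offset.
        end = bisect_left(stamps, stamps[start] + offset, start)
        groups.append(ordered[start:end])
        start = end
    return groups

files = [File("a", 10), File("b", 0), File("c", 4), File("d", 5), File("e", 30)]
groups = group_sorted_by_range(files, 5)
print([[f.name for f in g] for g in groups])
```

After the O(n log n) sort, each group costs one binary search, so no file is compared against every other file.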

#This is O(n^2)
while lst:
    file_a=lst.pop()
    temp_group = group()
    temp_group.add(file_a)
    while lst:
        file_b = lst[-1]
        if file_b.timestamp < file_a.timestamp + offset:
            temp_group.add(lst.pop())
        else:
            # stop at the first file outside the window; otherwise this loop never terminates
            break
    groups.add(temp_group)

# This is O(n)
from collections import defaultdict
groups=defaultdict(list)  # This is why you shouldn't use `list` as a variable name
for item in lst:
    groups[item.timestamp // offset].append(item)  # floor division gives the same bucket key on Python 2 and 3
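A small runnable version of the bucketing idea, with a hypothetical `File` namedtuple and integer timestamps assumed. Note that it puts files into fixed windows of width `offset` counted from zero, which is not quite the same as measuring the offset from the first file of each group: two files one second apart can land in different buckets.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the question's file objects.
File = namedtuple("File", "name timestamp")

def bucket_by_window(files, offset):
    buckets = defaultdict(list)
    for f in files:
        # All timestamps in [k * offset, (k + 1) * offset) share the key k.
        buckets[f.timestamp // offset].append(f)
    return buckets

files = [File("a", 10), File("b", 0), File("c", 4), File("d", 5), File("e", 30)]
buckets = bucket_by_window(files, 5)
for key in sorted(buckets):
    print(key, [f.name for f in buckets[key]])
```

Here "c" (timestamp 4) and "d" (timestamp 5) end up in different buckets even though they are only one unit apart, so this single-pass approach fits best when fixed windows are acceptable.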