Python: longest equally-spaced subsequence

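I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example

1, 4, 5, 7, 8, 12
has a subsequence

   4,       8, 12
My naive method is greedy and just checks how far a subsequence can be extended from each point. This seems to take O(n²) time per point.

Is there a faster way to solve this problem?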
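
For reference, the naive approach described above can be sketched roughly as follows. This is only an illustration in Python 3: the function name `naive_longest` is made up here, and the input is assumed to be sorted with no repeated values.

```python
def naive_longest(a):
    """Brute-force baseline for a sorted list of distinct integers.

    Tries every (first, second) pair and extends the run greedily
    for as long as the next value is present.
    """
    if len(a) < 2:
        return list(a)
    present = set(a)  # O(1) membership tests
    best = a[:2]
    for i, first in enumerate(a):
        for second in a[i + 1:]:
            step = second - first
            run = [first, second]
            while run[-1] + step in present:
                run.append(run[-1] + step)
            if len(run) > len(best):
                best = run
    return best


print(naive_longest([1, 4, 5, 7, 8, 12]))  # prints [1, 4, 7]
```

On the example list it reports 1, 4, 7, which has the same length as 4, 8, 12; the first run of maximal length wins. It is far too slow for a million elements but can serve as a correctness check for faster code on small inputs.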

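Update. I will test the code given in the answers as soon as possible (thank you). However, it is already clear that using n^2 memory will not work. So far no code terminates with the input
[random.randint(0,100000) for r in xrange(200000)]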

Timings. I tested with the following input data on my 32-bit system.

a= [random.randint(0,10000) for r in xrange(20000)] 
a.sort()
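  • ZelluX's dynamic programming method uses 1.6G of RAM and takes 2 minutes and 14 seconds. With pypy it takes only 9 seconds! However, it crashes with a memory error on large inputs.
  • Armin's O(nd) time method took 9 seconds with pypy but only 20MB of RAM. Of course this would be much worse if the range were much larger. The low memory usage meant I could also test it with a = [random.randint(0,100000) for r in xrange(200000)], but it did not finish in the few minutes I gave it with pypy.
To be able to test Kluev's method, I reran with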

a= [random.randint(0,40000) for r in xrange(28000)] 
a = list(set(a))
a.sort()
to make a list of length roughly 20000. All timings with pypy:

  • ZelluX, 9 seconds
  • Kluev, 20 seconds
  • Armin, 52 seconds
It seems that if the ZelluX method could be made linear space, it would be the clear winner.
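
A comparison like the one above can be reproduced with a small harness. The sketch below is Python 3 and is not the code used for the timings quoted here; `make_input` and `time_call` are hypothetical helper names, and the seed is arbitrary.

```python
import random
import time


def make_input(count, upper, seed=0):
    """Sorted list of distinct random integers in [0, upper]."""
    rng = random.Random(seed)  # fixed seed so that runs are comparable
    return sorted({rng.randint(0, upper) for _ in range(count)})


def time_call(func, data):
    """Run func(data) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(data)
    return result, time.perf_counter() - start


data = make_input(28000, 40000)
# `len` is only a stand-in; pass one of the solver functions instead.
result, seconds = time_call(len, data)
print(len(data), result, "%.6f s" % seconds)
```

Using the same seed for every solver keeps the inputs identical, so differences in the measured times come from the algorithms rather than from the data.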

Your solution is O(N^3) now (you said O(N^2) per index). Here is a solution with O(N^2) time and O(N^2) memory.
Idea. If we know a subsequence that goes through indices i[0], i[1], i[2], i[3], we should not try any subsequence that starts with i[1] and i[2], or with i[2] and i[3].

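Note that I edited the code to make it a bit simpler by using the fact that a is sorted, but it will not work for equal elements. You can easily check the maximum number of equal elements in O(N).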
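
The O(N) check for equal elements could look roughly like this. It is a sketch in Python 3, not part of the original answer; the function name `longest_equal_run` is made up, and it relies on the list being sorted so that equal values are adjacent.

```python
from itertools import groupby


def longest_equal_run(a):
    """Length of the longest run of equal values in a sorted list."""
    # groupby yields one group per block of adjacent equal values.
    return max((sum(1 for _ in group) for _, group in groupby(a)), default=0)


print(longest_equal_run([1, 4, 4, 4, 5, 7, 7]))  # prints 3
```

A run of k equal values is an equally-spaced subsequence with step 0, so the final answer would be the larger of this value and the result for nonzero steps.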

Code. I only look for the maximum length, but that does not change anything.

def max_len_equally_spaced(a):
    # a is sorted and contains no equal elements
    n = len(a)
    # value -> index; it doesn't matter which of several equal elements it points to
    whereInA = {value: i for i, value in enumerate(a)}
    usedPairs = [[False] * n for _ in range(n)]
    maxLen = min(n, 2)

    for i in range(n):
        for j in range(i + 1, n):
            if usedPairs[i][j]:
                continue  # do nothing: this pair was in one of the previous sequences
            usedPairs[i][j] = True

            # a quite simple-minded extension step:
            diff = a[j] - a[i]
            if diff == 0:
                continue  # we can't work with equal elements
            lastIndex = j
            currentLen = 2
            while a[lastIndex] + diff in whereInA:
                nextIndex = whereInA[a[lastIndex] + diff]
                usedPairs[lastIndex][nextIndex] = True
                currentLen += 1
                lastIndex = nextIndex

            # you may store all indices here
            maxLen = max(maxLen, currentLen)
    return maxLen
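Thoughts about memory usage
O(n^2) time is very slow for 1000000 elements. But if you are going to run this code on that many elements, the biggest problem will be memory usage.
What can be done to reduce it?

  • Change the boolean arrays to bitfields so that each byte stores eight booleans.
  • Make each successive boolean array shorter, because we only use usedPairs[i][j] when i < j.
Some heuristics:

  • Store only the pairs of used indices. (Conflicts with the first idea.)
  • Remove usedPairs entries that will never be used again (those for i, j that the loop has already passed).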
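
The two array-shrinking ideas (bit packing and keeping only the pairs with i < j) can be combined. The following is a minimal sketch in Python 3, not code from the original answer; the class `PairFlags` and its method names are hypothetical.

```python
class PairFlags:
    """One bit per pair (i, j) with i < j, packed into a bytearray."""

    def __init__(self, n):
        self.n = n
        total_pairs = n * (n - 1) // 2
        self.bits = bytearray((total_pairs + 7) // 8)  # all flags start False

    def _position(self, i, j):
        # Row i holds the pairs (i, i+1) .. (i, n-1); rows 0..i-1 before it
        # hold i * (2n - i - 1) / 2 pairs in total.
        return i * (2 * self.n - i - 1) // 2 + (j - i - 1)

    def set(self, i, j):
        pos = self._position(i, j)
        self.bits[pos >> 3] |= 1 << (pos & 7)

    def get(self, i, j):
        pos = self._position(i, j)
        return bool(self.bits[pos >> 3] & (1 << (pos & 7)))


flags = PairFlags(5)
flags.set(1, 3)
print(flags.get(1, 3), flags.get(0, 4))  # prints True False
```

For n = 20000 this needs about 25 MB instead of a full n × n table of booleans. For a million elements it would still be on the order of tens of gigabytes, so it only postpones the memory problem.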

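    • Algorithm

      • Main loop traversing the list.
      • If the number is found in the precalculated list, then it belongs to all the sequences stored for it there; recalculate all those sequences with count + 1.
      • Remove everything precalculated for the current element.
      • Calculate new sequences whose first element comes from the range 0 to current and whose second element is the current element of the traversal. (Actually not all of 0 to current: we can use the fact that the next element must not exceed max(a), and that the new sequence must be able to become longer than the best one already found.)
      So for the list [1,2,4,5,7] the output would be (it is a little messy; try the code yourself and see):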

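      • index 0, element 1
        • Is 1 in precalc? No, do nothing.
        • Do nothing.
      • index 1, element 2
        • Is 2 in precalc? No, do nothing.
        • Check whether 3 = 1 + (2 - 1) * 2 is in our set. No, do nothing.
      • index 2, element 4
        • Is 4 in precalc? No, do nothing.
        • Check whether 6 = 2 + (4 - 2) * 2 is in our set. No.
        • Check whether 7 = 1 + (4 - 1) * 2 is in our set. Yes: add the new entry {7: {3: {'count': 2, 'start': 1}}}, where 7 is the element of the list and 3 is the step.
      • index 3, element 5
        • Is 5 in precalc? No, do nothing.
        • Do not check 4, because 6 = 4 + (5 - 4) * 2 is less than the already calculated element 7.
        • Check whether 8 = 2 + (5 - 2) * 2 is in our set. No.
        • Check 9 = 1 + (5 - 1) * 2: more than max(a) == 7.
      • index 4, element 7
        • Is 7 in precalc? Yes, put it into the result.
        • Do not check 5, because 9 = 5 + (7 - 5) * 2 is more than max(a) == 7.
      result = (3, {'count': 3, 'start': 1})  # step 3, count 3, start 1; turn it into a sequence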

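      Complexity

      It should not exceed O(N^2), and I think it is lower because the search for new sequences terminates early. I will try to provide a detailed analysis later.

      Code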

      def add_precalc(precalc, start, step, count, res, N):
          if step == 0: return True
          if start + step * res[1]["count"] > N: return False
      
          x = start + step * count
          if x > N or x < 0: return False
      
          if precalc[x] is None: return True
      
          if step not in precalc[x]:
              precalc[x][step] = {"start":start, "count":count}
      
          return True
      
      def work(a):
          precalc = [None] * (max(a) + 1)
          for x in a: precalc[x] = {}
          N, m = max(a), 0
          ind = {x:i for i, x in enumerate(a)}
      
          res = (0, {"start":0, "count":0})
          for i, x in enumerate(a):
              for el in precalc[x].iteritems():
                  el[1]["count"] += 1
                  if el[1]["count"] > res[1]["count"]: res = el
                  add_precalc(precalc, el[1]["start"], el[0], el[1]["count"], res, N)
                  t = el[1]["start"] + el[0] * el[1]["count"]
                  if t in ind and ind[t] > m:
                      m = ind[t]
              precalc[x] = None
      
              for y in a[i - m - 1::-1]:
                  if not add_precalc(precalc, y, x - y, 2, res, N): break
      
          return [x * res[0] + res[1]["start"] for x in range(res[1]["count"])]
      
      A = [1, 4, 5, 7, 8, 12]    # in sorted order
      Aset = set(A)
      
      for d in range(1, 12):
          already_seen = set()
          for a in A:
              if a not in already_seen:
                  b = a
                  count = 1
                  while b + d in Aset:
                      b += d
                      count += 1
                      already_seen.add(b)
                  print "found %d items in %d .. %d" % (count, a, b)
                  # collect here the largest 'count'
      
      import random
      import timeit
      import sys
      
      #s = [1,4,5,7,8,12]
      #s = [2, 6, 7, 10, 13, 14, 17, 18, 21, 22, 23, 25, 28, 32, 39, 40, 41, 44, 45, 46, 49, 50, 51, 52, 53, 63, 66, 67, 68, 69, 71, 72, 74, 75, 76, 79, 80, 82, 86, 95, 97, 101, 110, 111, 112, 114, 115, 120, 124, 125, 129, 131, 132, 136, 137, 138, 139, 140, 144, 145, 147, 151, 153, 157, 159, 161, 163, 165, 169, 172, 173, 175, 178, 179, 182, 185, 186, 188, 195]
      #s = [0, 6, 7, 10, 11, 12, 16, 18, 19]
      
      m = [random.randint(1,40000) for r in xrange(20000)]
      s = list(set(m))
      s.sort()
      
      lenS = len(s)
      halfRange = (s[lenS-1] - s[0]) // 2
      
      while s[lenS-1] - s[lenS-2] > halfRange:
          s.pop()
          lenS -= 1
          halfRange = (s[lenS-1] - s[0]) // 2
      
      while s[1] - s[0] > halfRange:
          s.pop(0)
          lenS -=1
          halfRange = (s[lenS-1] - s[0]) // 2
      
      n = lenS
      
      largest = (s[n-1] - s[0]) // 2
      #largest = 1000 #set the maximum size of d searched
      
      maxS = s[n-1]
      maxD = 0
      maxSeq = 0
      hCount = [None]*(largest + 1)
      hLast = [None]*(largest + 1)
      best = {}
      
      start = timeit.default_timer()
      
      for i in range(1,n):
      
          sys.stdout.write(repr(i)+"\r")
      
          for j in range(i-1,-1,-1):
              d = s[i] - s[j]
              numLeft = n - i
              if d != 0:
                  maxPossible = (maxS - s[i]) // d + 2
              else:
                  maxPossible = numLeft + 2
              ok = numLeft + 2 > maxSeq and maxPossible > maxSeq
      
              if d > largest or (d > maxD and not ok):
                  break
      
              if hLast[d] != None:
                  found = False
                  for k in range (len(hLast[d])-1,-1,-1):
                      tmpLast = hLast[d][k]
                      if tmpLast == j:
                          found = True
                          hLast[d][k] = i
                          hCount[d][k] += 1
                          tmpCount = hCount[d][k]
                          if tmpCount > maxSeq:
                              maxSeq = tmpCount
                              best = {'len': tmpCount, 'd': d, 'last': i}
                      elif s[tmpLast] < s[j]:
                          del hLast[d][k]
                          del hCount[d][k]
                  if not found and ok:
                      hLast[d].append(i)
                      hCount[d].append(2)
              elif ok:
                  if d > maxD: 
                      maxD = d
                  hLast[d] = [i]
                  hCount[d] = [2]
      
      
      end = timeit.default_timer()
      seconds = (end - start)
      
      #print (hCount)
      #print (hLast)
      print(best)
      print(seconds)
      
      input = [1, 4, 5, 7, 8, 12]
      
      [1, 4, 5, 7, 8, 12]
       x  3  4  6  7  11   # distance from point i to point 0
       x  x  1  3  4   8   # distance from point i to point 1
       x  x  x  2  3   7   # distance from point i to point 2
       x  x  x  x  1   5   # distance from point i to point 3
       x  x  x  x  x   4   # distance from point i to point 4
      
      def build_columns(l):
          columns = {}
          for x in l[1:]:
              col = []
              for y in l[:l.index(x)]:
                  col.append(x - y)
              columns[x] = col
          return columns
      
      def algo(input, columns):
          seqs = []
          for index1, number in enumerate(input[1:]):
              index1 += 1 #first item was sliced
              for index2, distance in enumerate(columns[number]):
                  seq = []
                  seq.append(input[index2]) # k-th pred
                  seq.append(number)
                  matches = 1
                  for successor in input[index1 + 1 :]:
                      column = columns[successor]
                      if column[index1] == distance * matches:
                          matches += 1
                          seq.append(successor)
                  if (len(seq) > 2):
                      seqs.append(seq)
          return seqs
      
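      # `sequences` was not defined in the snippet; assumed to be the result of algo()
      sequences = algo(input, build_columns(input))
      print max(sequences, key=len)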
      
      def findLESS(A):
        Aset = set(A)
        lmax = 2
        d = 1
        minStep = 0
      
        while (lmax - 1) * minStep <= A[-1] - A[0]:
          minStep = A[-1] - A[0] + 1
          for j, b in enumerate(A):
            if j+d < len(A):
              a = A[j+d]
              step = a - b
              minStep = min(minStep, step)
              if a + step in Aset and b - step not in Aset:
                c = a + step
                count = 3
                while c + step in Aset:
                  c += step
                  count += 1
                if count > lmax:
                  lmax = count
          d += 1
      
        return lmax
      
      print(findLESS([1, 4, 5, 7, 8, 12]))
      
      def findLESS(src):
        r = [False for i in range(src[-1]+1)]
        for x in src:
          r[x] = True
      
        d = 1
        best = 1
      
        while best * d < len(r):
          for s in range(d):
            l = 0
      
            for i in range(s, len(r), d):
              if r[i]:
                l += 1
                best = max(best, l)
              else:
                l = 0
      
          d += 1
      
        return best
      
      
      print(findLESS([1, 4, 5, 7, 8, 12]))
      
      def findLESS(src):
        r = 0
        for x in src:
          r |= 1 << x
      
        d = 1
        best = 1
      
        while best * d < src[-1] + 1:
          c = best
          rr = r
      
          while c & (c-1):
            cc = c & -c
            rr &= rr >> (cc * d)
            c &= c-1
      
          while c != 1:
            c = c >> 1
            rr &= rr >> (c * d)
      
          rr &= rr >> d
      
          while rr:
            rr &= rr >> d
            best += 1
      
          d += 1
      
        return best
      
      random.seed(42)
      s = sorted(list(set([random.randint(0,200000) for r in xrange(140000)])))
      
      s = sorted(list(set([random.randint(0,2000000) for r in xrange(1400000)])))
      
      Size:                         100000   1000000
      Second answer by Armin Rigo:     634         ?
      By Armin Rigo, optimized:         64     >5000
      O(M^2) algorithm:                 53      2940
      O(M^2*L) algorithm:                7       711
      
      lmax = 2
      l = [[2 for i in xrange(n)] for j in xrange(n)]
      for mid in xrange(n - 1):
          prev = mid - 1
          succ = mid + 1
          while (prev >= 0 and succ < n):
              if a[prev] + a[succ] < a[mid] * 2:
                  succ += 1
              elif a[prev] + a[succ] > a[mid] * 2:
                  prev -= 1
              else:
                  l[mid][succ] = l[prev][mid] + 1
                  lmax = max(lmax, l[mid][succ])
                  prev -= 1
                  succ += 1
      
      print lmax
      
      A = [1, 4, 5, 7, 8, 12]    # in sorted order
      Aset = set(A)
      
      lmax = 2
      for j, b in enumerate(A):
          for i in range(j):
              a = A[i]
              step = b - a
              if b + step in Aset and a - step not in Aset:
                  c = b + step
                  count = 3
                  while c + step in Aset:
                      c += step
                      count += 1
                  #print "found %d items in %d .. %d" % (count, a, c)
                  if count > lmax:
                      lmax = count
      
      print lmax