Efficiently finding duplicates in an unsorted sequence

performance algorithm f#

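I need a very efficient way to find duplicates in an unsorted sequence. This is what I came up with, but it has a few shortcomings, namely that it

unnecessarily counts occurrences beyond 2

consumes the entire sequence before yielding any duplicates

creates several intermediate sequences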

module Seq =
  let duplicates items =
    items
    |> Seq.countBy id
    |> Seq.filter (snd >> ((<) 1))
    |> Seq.map fst

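Despite the shortcomings, I don't see a reason to replace this with twice as much code. Is it possible to improve on it with comparably concise code?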

Assuming your sequence is finite, this solution requires a single pass over the sequence:

open System.Collections.Generic
let duplicates items =
   let dict = Dictionary()
   items |> Seq.fold (fun acc item -> 
                             match dict.TryGetValue item with
                             | true, 2 -> acc
                             | true, 1 -> dict.[item] <- 2; item::acc
                             | _ -> dict.[item] <- 1; acc) []
         |> List.rev
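
As a quick check of the single-pass approach above, here is a minimal sketch that repeats the function and runs it on a small list. The input `sample` is made up for illustration. Each duplicate should be reported once, in the order in which its second occurrence appears.

```fsharp
open System.Collections.Generic

// Same single-pass fold as above: the dictionary stores 1 after the first
// sighting of an item and 2 once it has been reported as a duplicate.
let duplicates items =
    let dict = Dictionary()
    items
    |> Seq.fold (fun acc item ->
        match dict.TryGetValue item with
        | true, 2 -> acc
        | true, 1 ->
            dict.[item] <- 2
            item :: acc
        | _ ->
            dict.[item] <- 1
            acc) []
    |> List.rev

// Hypothetical sample input, chosen only for illustration.
let sample = [1; 1; 1; 2; 3; 4; 4; 5]
printfn "%A" (duplicates sample)   // prints [1; 4]
```

Because the result is built by a fold, nothing is returned until the whole input has been consumed, which is why this only suits finite sequences.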

Here's an imperative solution (admittedly slightly longer):

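let duplicates items =
    seq {
        let d = System.Collections.Generic.Dictionary()
        for i in items do
            match d.TryGetValue(i) with
            | false, _    -> d.[i] <- false          // first observance
            | true, false -> d.[i] <- true; yield i  // second observance
            | true, true  -> ()                      // already seen at least twice
    }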

This is the best "functional" solution I could come up with that doesn't consume the entire sequence up front.

let duplicates =
    Seq.scan (fun (out, yielded:Set<_>, seen:Set<_>) item -> 
        if yielded.Contains item then
            (None, yielded, seen)
        else
            if seen.Contains item then
                (Some(item), yielded.Add item, seen.Remove item)
            else
                (None, yielded, seen.Add item)
    ) (None, Set.empty, Set.empty)
    >> Seq.choose (fun (x,_,_) -> x)

A more elegant functional solution:

let duplicates items = 
  let test (unique, result) v =
    if not(unique |> Set.contains v) then (unique |> Set.add v ,result) 
    elif not(result |> Set.contains v) then (unique,result |> Set.add v) 
    else (unique, result)
  items |> Seq.fold test (Set.empty, Set.empty) |> snd |> Set.toSeq

let duplicates xs =
  Seq.scan (fun xs x -> Set.add x xs) Set.empty xs
  |> Seq.zip xs
  |> Seq.choose (fun (x, xs) -> if Set.contains x xs then Some x else None)

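Use scan to accumulate the sets of all elements seen so far. Then use zip to pair each element with the set of elements that came before it. Finally, use choose to keep only the elements that are already in their set of previously seen elements, i.e. the duplicates.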
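
To make the three steps concrete, here is a small sketch that runs the same pipeline on a made-up list (the accumulator is renamed to `seen` for readability; `input` and `running` are names introduced only for this illustration). Since the test is only "has this element appeared before", an element that occurs three times is reported twice.

```fsharp
// The scan/zip/choose pipeline from above, with the accumulator renamed.
let duplicates xs =
    Seq.scan (fun seen x -> Set.add x seen) Set.empty xs
    |> Seq.zip xs
    |> Seq.choose (fun (x, seen) -> if Set.contains x seen then Some x else None)

// Hypothetical input for illustration.
let input = [1; 1; 1; 2; 3; 4; 4; 5]

// Step 1 on its own: the running sets of elements seen before each position.
// The list starts with: set [], set [1], set [1], set [1], set [1; 2], ...
let running =
    Seq.scan (fun seen x -> Set.add x seen) Set.empty input |> Seq.toList

printfn "%A" (duplicates input |> Seq.toList)   // prints [1; 1; 4]
```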

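EDIT

Actually, my original answer was completely wrong. Firstly, you don't want repeats in your output. Secondly, you want performance.

Here is a purely functional solution that implements the algorithm you're after: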

let duplicates xs =
  (Map.empty, xs)
  ||> Seq.scan (fun xs x ->
      match Map.tryFind x xs with
      | None -> Map.add x false xs
      | Some false -> Map.add x true xs
      | Some true -> xs)
  |> Seq.zip xs
  |> Seq.choose (fun (x, xs) ->
      match Map.tryFind x xs with
      | Some false -> Some x
      | None | Some true -> None)

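This uses a map to track whether each element has been seen once or many times before, and emits an element only when it has been seen exactly once before, i.e. the first time it is duplicated.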
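
For illustration, the sketch below repeats the map-based pipeline (again with the accumulator renamed to `seen`) and applies it to a made-up list. Under this scheme an element that occurs three times should still be emitted only once, at its second occurrence.

```fsharp
// Map values: false = seen exactly once so far, true = seen more than once.
let duplicates xs =
    (Map.empty, xs)
    ||> Seq.scan (fun seen x ->
        match Map.tryFind x seen with
        | None -> Map.add x false seen
        | Some false -> Map.add x true seen
        | Some true -> seen)
    |> Seq.zip xs
    |> Seq.choose (fun (x, seen) ->
        // `seen` describes the elements strictly before x.
        match Map.tryFind x seen with
        | Some false -> Some x
        | None | Some true -> None)

// Hypothetical input for illustration.
printfn "%A" (duplicates [1; 1; 1; 2; 3; 4; 4; 5] |> Seq.toList)   // prints [1; 4]
```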

Here is a faster imperative version:

let duplicates (xs: _ seq) =
  seq { let d = System.Collections.Generic.Dictionary(HashIdentity.Structural)
        use e = xs.GetEnumerator()  // dispose the enumerator when iteration ends
        while e.MoveNext() do
          let x = e.Current
          let mutable seen = false
          if d.TryGetValue(x, &seen) then
            if not seen then
              d.[x] <- true
              yield x
          else
            d.[x] <- false }
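
One property of the sequence-expression versions worth seeing in action is that they yield duplicates as they are found instead of consuming the whole input first. The sketch below shows this with a variant that pattern matches on TryGetValue rather than using a mutable out-variable (a simplification assumed here, not the code above verbatim); the infinite input `cycling` is hypothetical.

```fsharp
open System.Collections.Generic

// Dictionary values: false = seen once, true = already reported as a duplicate.
let duplicates (xs: seq<_>) =
    seq {
        let d = Dictionary(HashIdentity.Structural)
        for x in xs do
            match d.TryGetValue x with
            | true, false ->
                d.[x] <- true       // second occurrence: report it once
                yield x
            | true, true -> ()      // already reported
            | _ -> d.[x] <- false   // first occurrence
    }

// Hypothetical infinite input: 0, 1, 2, 0, 1, 2, ...
let cycling = Seq.initInfinite (fun i -> i % 3)

// Only as much of the input is consumed as is needed for two results.
printfn "%A" (cycling |> duplicates |> Seq.take 2 |> Seq.toList)   // prints [0; 1]
```

A fold-based version would never return on this input, since it has to reach the end of the sequence before producing anything.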

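Comments:

Possible duplicate

Actually, it's the opposite. I want only the duplicates.

Hmm, how do you want to store the values you've already visited? Set? Dictionary?

Dictionary/Set is fine.

I kind of thought this was fine as it is, but I figured it was worth asking.

[1; 1; 1; 2; 3; 4; 4; 5] causes the value to be printed twice.

Our algorithms are very similar, except that your sets intersect whereas mine are disjoint. I wonder which would be faster?

Why the Seq.skip? You can replace the Seq.filter and Seq.map combination with Seq.choose.

Nice catch, I forgot about choose. The skip was an artifact of earlier code.

You could get rid of seen.Remove, possibly gaining a little speed, and then your solution would be like mine (the sets would intersect), except that mine consumes the sequence up front, so I think yours is better (hence +1).

+1 for cleverness, but it performs significantly worse than my original solution.

@Daniel Oops, I forgot it was supposed to be efficient! :-)

Very nice micro-improvements on the imperative version. BTW, I'm pretty sure Keith (kvb) is a "he". :-)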