C#: How can I compare two string arrays, find all consecutive matches, and save the indices?

c# arrays

For example, suppose I have the following two arrays:

string[] userSelect = new string[] {"the", "quick", "brown", "dog", "jumps", "over"};
string[] original = new string[] {"the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"};

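I'm trying to compare the userSelect array against the original array and get all the consecutive matches by index. The userSelect array will always be made up of strings from the original array. So the output would look like this: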

int[] match0 = new int[] {0, 1, 2}; // indices for "the quick brown"
int[] match2 = new int[] {4, 5}; // indices for "jumps over"
int[] match1 = new int[] {3}; // index for "dog"

The userSelect array will never be longer than the original array, but it can be shorter, and the words can be in any order. How would I go about this?

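This doesn't do exactly what you want, but it is a very clean and simple way to get a new array containing all the common strings (that is, the intersection of the two arrays).

After it executes, the resultls array will contain every string that occurs in both array1 and array2 (ignoring case).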
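A minimal sketch of such an intersection, assuming the inputs are named array1 and array2 as in the text. The sample values, the variable name common, and the choice of StringComparer.OrdinalIgnoreCase for the case-insensitive match are assumptions made for illustration.

```csharp
using System;
using System.Linq;

string[] array1 = { "the", "quick", "brown", "dog", "jumps", "over" };
string[] array2 = { "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog" };

// Intersect yields the distinct elements of array1 that also occur in array2,
// in the order they first appear in array1. The comparer makes the comparison
// case-insensitive (an assumption; any IEqualityComparer<string> would do).
string[] common = array1
    .Intersect(array2, StringComparer.OrdinalIgnoreCase)
    .ToArray();

Console.WriteLine(string.Join(", ", common));
// prints: the, quick, brown, dog, jumps, over
```

The result carries no positions and no notion of adjacency, which is why this approach does not cover the consecutive-match part of the question.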

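If you want to read up on some theory, the Intersect method is based on the intersection operation on sets. LINQ gives C# collections all the common set operations, so it is worth getting familiar with them.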

Here's what I came up with:

var matches = 
    (from l in userSelect.Select((s, i) => new { s, i })
     join r in original.Select((s, i) => new { s, i }) 
     on l.s equals r.s 
     group l by r.i - l.i into g
     from m in g.Select((l, j) => new { l.i, j = l.i - j, k = g.Key })
     group m by new { m.j, m.k } into h
     select h.Select(t => t.i).ToArray())
    .ToArray();

This outputs:

matches[0] // { 0, 1, 2 } the quick brown
matches[1] // { 4, 5 } jumps over
matches[2] // { 0 } the 
matches[3] // { 3 } dog

With the input { "the", "quick", "brown", "the", "lazy", "dog" } for userSelect, it produces:

matches[0] // { 0, 1, 2 } the quick brown
matches[1] // { 0 } the 
matches[2] // { 3 } the
matches[3] // { 3, 4, 5 } the lazy dog

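Note that the calls to ToArray are optional. If you don't actually need the results in arrays, you can leave them out and save a little processing time.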

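To filter out any sequence that is completely contained in another, larger sequence, sort the groups with an orderby in the modified query and then drop the contained ones.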
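One way such a filter could look, as a sketch only: it is a separate post-processing loop over the result arrays, not the modified query the answer refers to. The sample matches array mirrors the second output listed above, and the names kept and candidate are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample result shaped like the query output for the second input above.
int[][] matches =
{
    new[] { 0, 1, 2 },
    new[] { 0 },
    new[] { 3 },
    new[] { 3, 4, 5 },
};

var kept = new List<int[]>();

// Visit longer sequences first, so every sequence that could contain the
// current candidate has already been considered.
foreach (int[] candidate in matches.OrderByDescending(m => m.Length))
{
    // A candidate is contained if it has no index outside some kept sequence.
    bool contained = kept.Any(k => !candidate.Except(k).Any());
    if (!contained)
        kept.Add(candidate);
}

foreach (int[] m in kept)
    Console.WriteLine("{ " + string.Join(", ", m) + " }");
// prints:
// { 0, 1, 2 }
// { 3, 4, 5 }
```

Sorting by length first plays the same role as the orderby mentioned above: a shorter sequence is only ever compared against sequences at least as long as itself.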

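This would be easier if words couldn't repeat.

The general idea is to create a dictionary from the original word list that maps each word to the list of positions where it occurs. That tells you which words are used at which positions. The dictionary for your example would look like this: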

key="the", value={0, 6}
key="quick", value={1}
key="brown", value={2}
... etc

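Now, when you get the user's input, you step through it in order, looking each word up in the dictionary to get its list of positions.

So you look up a word and it's in the dictionary. Save the positions returned from the dictionary. Look up the next word. There are three conditions you need to handle:

The word isn't in the dictionary. Save the previous consecutive group and move on to the next word, where you may start a new group.

The word is in the dictionary, but none of the returned positions match the expected position (the expected position is one more than the saved position of the last word). Save the previous consecutive group and move on to the next word, where you may start a new group.

The word is in the dictionary, and one of the returned positions matches the expected position. Save those positions and move on to the next word.

I hope you get the idea.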
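The steps above can be sketched as follows. This is one reading of the description, not the answerer's own code: in the second case the sketch starts the new group at the current word itself, which is what reproduces the output asked for in the question. The names positions, groups, current and lastPositions are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] userSelect = { "the", "quick", "brown", "dog", "jumps", "over" };
string[] original = { "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog" };

// word -> every position at which it occurs in original
var positions = new Dictionary<string, List<int>>();
for (int i = 0; i < original.Length; i++)
{
    if (!positions.TryGetValue(original[i], out var list))
        positions[original[i]] = list = new List<int>();
    list.Add(i);
}

var groups = new List<int[]>();       // finished groups, as indices into userSelect
var current = new List<int>();        // group currently being built
var lastPositions = new List<int>();  // positions in original where the group may end

for (int i = 0; i < userSelect.Length; i++)
{
    if (!positions.TryGetValue(userSelect[i], out var found))
    {
        // Condition 1: unknown word, so close the current group.
        if (current.Count > 0) groups.Add(current.ToArray());
        current = new List<int>();
        lastPositions = new List<int>();
        continue;
    }

    // Positions that directly follow a position where the group could end.
    List<int> continuing = found.Where(p => lastPositions.Contains(p - 1)).ToList();

    if (continuing.Count > 0)
    {
        // Condition 3: the word extends the current group.
        current.Add(i);
        lastPositions = continuing;
    }
    else
    {
        // Condition 2: known word at an unexpected position.
        // Close the current group and start a new one at this word.
        if (current.Count > 0) groups.Add(current.ToArray());
        current = new List<int> { i };
        lastPositions = new List<int>(found); // copy, so the dictionary is never mutated
    }
}
if (current.Count > 0) groups.Add(current.ToArray());

foreach (int[] g in groups)
    Console.WriteLine(string.Join(", ", g));
// prints:
// 0, 1, 2
// 3
// 4, 5
```

Keeping every candidate position in lastPositions, rather than a single one, is what lets a repeated word such as "the" be checked against both of its occurrences at once.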

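Adding fun with LINQ

After a few tries I came up with a pure LINQ solution that could theoretically be a one-liner. I did try to make it efficient, but of course a functional solution results in duplicate calculations because you cannot keep state.

We start with a little preprocessing to avoid duplicate calculations later on. Yes, I know that what I'm doing with index is a questionable practice, but if you're careful it works and it is quick: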

var index = 0;
var lookup = original.ToLookup(s => s, s => index++);

The monster

Used with:

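foreach (var occurrence in occurrences) {
  Console.WriteLine(
    "Maximal match starting with '{0}': [{1}]",
    // Start comes from the lookup, so the indexes are positions in original;
    // indexing userSelect here can go out of range (e.g. "dog" sits at 8).
    original[occurrence[0]],
    string.Join(", ", occurrence)
  );
}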

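This prints one line per maximal match.

Obviously you would not want to use this code in production; the other (procedural) solution so far is preferable. What distinguishes this solution is that, apart from the lookup, it is purely functional. Of course, the lookup can be written functionally as well: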

// Pair each word with its index, then group the indexes by word.
var lookup = original.Select((s, i) => Tuple.Create(s, i))
                     .ToLookup(t => t.Item1, t => t.Item2);

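How it works

As a warm-up, it creates a dictionary-like structure that associates each word in original with the indexes at which it appears in that same collection. This is used later to create as many matching sequences as possible from each word in userSelect (for example, "the" produces two matching sequences because it appears twice in original).

Then comes a filtering step. This one is easy: it removes from consideration every word in userSelect that does not appear in original.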
 // For each place where the word s appears in original...
.SelectMany((s, i) => lookup[s]
  // Define the two subsequences of userSelect and original to work on.
  // We are trying to find the number of identical elements until first mismatch.
  .Select(j => new { User = userSelect.Skip(i), Original = original.Skip(j), Skipped = j })

  // Use .Zip to find this subsequence
  .Select(t => t.User.Zip(t.Original, (u, v) => Tuple.Create(u, v, t.Skipped)).TakeWhile(tuple => tuple.Item1 == tuple.Item2))

  // Note the index in original where the subsequence started and its length
  .Select(u => new { Word = s, Start = u.Select(v => v.Item3).Min(), Length = u.Count() })
)

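At this point we have projected each matching word in userSelect to an anonymous object with Start and Length properties. However, a matching sequence of length N also produces shorter matching sequences of length N-1, N-2, ..., 1. The key here is to realize that for all subsequences in such a set, Start + Length is the same; moreover, subsequences from different sets have different Start + Length sums. So let's take advantage of this to trim down the results: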
// Obvious from the above
.GroupBy(v => v.Start + v.Length)

// We want to keep the longest subsequence. Since Start + Length is constant for
// all, it follows the one with the largest Length has the smallest Start:
.Select(g => g.OrderBy(u => u.Start).First())

This still leaves us with as many matches for each word in userSelect as the number of times that word appears in original. So let's cut that down to the longest match:
.GroupBy(v => v.Word)
.Select(g => g.OrderByDescending(u => u.Length).First())

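We now have objects like { Word = "the", Start = 0, Length = 3 }. Let's convert each one to an array of indexes (because Start comes from the lookup, these are positions in original):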
.Select(w => Enumerable.Range(w.Start, w.Length).ToArray())

Finally, put all of these arrays in the same collection, and mission accomplished.
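Putting the steps together, the whole query might look roughly like the sketch below. It is a condensed variant rather than the answer's exact listing. It assumes the filtering step is a Where on the lookup, and it pairs each word with its userSelect index before filtering so that the index stays correct even when a word is dropped. The Zip and TakeWhile steps are folded into a single Length computation.

```csharp
using System;
using System.Linq;

string[] userSelect = { "the", "quick", "brown", "dog", "jumps", "over" };
string[] original = { "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog" };

// word -> positions in original
var lookup = original
    .Select((s, i) => Tuple.Create(s, i))
    .ToLookup(t => t.Item1, t => t.Item2);

int[][] occurrences = userSelect
    // Remember each word's index in userSelect, then drop unknown words.
    .Select((s, i) => new { Word = s, Index = i })
    .Where(x => lookup.Contains(x.Word))
    // For each place j where the word appears in original, count how many
    // consecutive elements agree from (Index, j) onwards.
    .SelectMany(x => lookup[x.Word].Select(j => new
    {
        x.Word,
        Start = j,
        Length = userSelect.Skip(x.Index)
            .Zip(original.Skip(j), (u, v) => u == v)
            .TakeWhile(same => same)
            .Count()
    }))
    // Sub-matches of one run share Start + Length; keep the earliest start.
    .GroupBy(v => v.Start + v.Length)
    .Select(g => g.OrderBy(u => u.Start).First())
    // Keep only the longest match per word.
    .GroupBy(v => v.Word)
    .Select(g => g.OrderByDescending(u => u.Length).First())
    // Expand to index arrays (positions in original).
    .Select(w => Enumerable.Range(w.Start, w.Length).ToArray())
    .ToArray();

foreach (int[] occurrence in occurrences)
    Console.WriteLine(string.Join(", ", occurrence));
```

For the sample arrays this yields [0, 1, 2], [8] and [4, 5]. The 8 is the position of "dog" in original, not in userSelect.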
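This isn't very elegant, but it is efficient. When it comes to indexes, LINQ is often more complicated and less efficient than a simple loop.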
string[] userSelect = new string[] { "the", "quick", "brown", "dog", "jumps", "over" };
string[] original = new string[] { "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog" };
var consecutiveGroups = new Dictionary<int, IList<string>>();
IList<Tuple<int, string>> uniques = new List<Tuple<int, string>>();

int maxIndex = Math.Min(userSelect.Length, original.Length);
if (maxIndex > 0)
{
    int minIndex = 0;
    int lastMatch = int.MinValue;
    for (int i = 0; i < maxIndex; i++)
    {
        var us = userSelect[i];
        var o = original[i];
        if (us == o)
        {
            if (lastMatch == i - 1)
                consecutiveGroups[minIndex].Add(us);
            else
            {
                minIndex = i;
                consecutiveGroups.Add(minIndex, new List<string>() { us });
            }
            lastMatch = i;
        }
        else
            uniques.Add(Tuple.Create(i, us));
    }
} 

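Have you tried anything? This doesn't seem too complicated.

I did try a bit, but it wasn't as simple as I thought, because certain words can be used more than once and I'm looking for the longest consecutive matches. For example, in the sentence above "the" can appear twice, and both occurrences have to be checked.

You could convert the arrays to delimited strings and use any algorithm that solves that problem.

From your comments I infer that you're really only looking for the longest one and not every combination.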

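// Continues the loop-based listing above (uses consecutiveGroups and uniques).
// Turn each group into its index range, longest group first.
var consecutiveGroupsIndices = consecutiveGroups
    .OrderByDescending(kv => kv.Value.Count)
    .Select(kv => Enumerable.Range(kv.Key, kv.Value.Count).ToArray())
    .ToArray();
foreach (var consIndexGroup in consecutiveGroupsIndices)
    Console.WriteLine(string.Join(",", consIndexGroup));
// Indexes in userSelect whose word differs from original at the same position.
Console.WriteLine(string.Join(",", uniques.Select(u => u.Item1)));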