Regex subsequence search

regex language-agnostic

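I have a large number of lists (35 MB in total) that I want to search for subsequences: each term must appear in order, but not necessarily consecutively. So 1, 2, 3 matches each of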

1, 2, 3, 4, 5, 6
1, 2, 2, 3, 3, 3

but not

6, 5, 4, 3, 2, 1
123, 4, 5, 6, 7

(The , is a separator, not a character to match.)

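Short of running a regex (/1, ([^,]+, )*2, ([^,]+, )*3/) over tens of thousands of sequences, how can I determine which sequences match? I can preprocess the sequences, though memory usage needs to stay reasonable (say, within a constant factor of the existing sequence size). The longest sequence is short, under a kilobyte, so you can assume queries are short as well.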
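For reference, the regex baseline from the question can be sketched in Python as below. This is only an illustration under assumptions: `build_pattern` is a hypothetical helper name, items are assumed to be separated by exactly ", ", and the pattern adds anchoring around each item (so that 1 cannot match inside 123), which the regex in the question does not spell out.

```python
import re

def build_pattern(query):
    """Compile a regex matching lines that contain `query` as a subsequence.

    Assumes items are separated by exactly ", " (an assumption, as in the
    examples in the question).
    """
    before = r"(?:^|, )"   # item starts at the line start or after a separator
    after = r"(?=,|$)"     # item ends at a separator or at the line end
    parts = [before + re.escape(term) + after for term in query]
    # Between two wanted items, lazily skip any amount of text.
    return re.compile(".*?".join(parts))

pattern = build_pattern(["1", "2", "3"])
for line in ["1, 2, 3, 4, 5, 6", "6, 5, 4, 3, 2, 1", "123, 4, 5, 6, 7"]:
    print(line, "->", bool(pattern.search(line)))
```

Every line still has to be scanned by the regex engine, which is the cost the question is trying to avoid.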

Maybe I'm misunderstanding, but isn't this straightforward?

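search = [1, 2, 3]
for sequence in sequences:  # sequences: the already-parsed lists of items
    sidx = 0
    for item in sequence:
        if item == search[sidx]:
            sidx += 1
            if sidx >= len(search):
                break  # every search term has been found in order
    if sidx >= len(search):
        print(sequence, "matches")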

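This seems to be O(N) for N sequences and O(M) for a search subsequence of length M; each sequence is scanned at most once, so the total work is linear in the size of the input.

I'm not sure whether this would be that much faster than the regex, though.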
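The same scan can be written compactly with a shared iterator. This is a minimal sketch, not a tuned implementation; `parse_line` and `is_subsequence` are illustrative names, and items are kept as strings on the assumption that some numbers may not fit in a machine word.

```python
def parse_line(line):
    """Split a comma-separated line into stripped string tokens."""
    return [tok.strip() for tok in line.split(",")]

def is_subsequence(query, sequence):
    """Return True if the terms of `query` occur in `sequence` in order."""
    it = iter(sequence)
    # `term in it` advances the iterator until it finds `term`, so the next
    # term can only be found after the position of the previous one.
    return all(term in it for term in query)

query = ["1", "2", "3"]
for line in ["1, 2, 3, 4, 5, 6", "1, 2, 2, 3, 3, 3",
             "6, 5, 4, 3, 2, 1", "123, 4, 5, 6, 7"]:
    print(line, "->", is_subsequence(query, parse_line(line)))
```

Because the items are compared as whole tokens, 123 is never mistaken for 1.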

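This reminds me of sequence alignment in bioinformatics, where you try to match a small snippet of DNA against a large database. The differences are your probably larger alphabet and your greater tolerance for arbitrarily long gaps.

Looking at existing tools and algorithms, notably Smith-Waterman and BLAST, may give you some inspiration.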

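If the individual numbers are spread throughout the file and do not occur on the majority of lines, then a simple index of the line numbers they occur on could speed things up. It will, however, be slower if your data consists of lines of the same numbers repeated in different orders.

Building the index requires only a single pass over the data, along these lines: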

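Hash<int, List<int>> index

line_number = 0
foreach(line in filereader)
{
    line_number += 1   // the first line is line 1
    foreach(parsed_number in line)
        // store the line number once per line, so each list stays strictly ascending
        if(index[parsed_number].empty() || index[parsed_number].last() != line_number)
            index[parsed_number].append(line_number)
}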
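A runnable version of this single-pass index build might look as follows. It is a sketch under assumptions: `build_index` is an illustrative name, tokens are kept as strings, and each line number is recorded once per token so the lists stay strictly ascending.

```python
from collections import defaultdict

def build_index(lines):
    """Map each token to the ascending list of 1-based line numbers it occurs on."""
    index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        # A set records each token once per line; empty tokens are skipped.
        tokens = {tok.strip() for tok in line.split(",") if tok.strip()}
        for token in tokens:
            index[token].append(line_number)
    return index

lines = ["1, 2, 3, 4, 5, 6", "1, 2, 2, 3, 3, 3",
         "6, 5, 4, 3, 2, 1", "123, 4, 5, 6, 7"]
index = build_index(lines)
print(index["1"], index["123"])
```

The index holds at most one entry per distinct token per line, so its size stays within a constant factor of the input.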


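That index can be stored and reused for the dataset. Searching it then only requires something like the following. Please excuse the mixed code; I have tried to keep it as clear as possible. It "returns" when it has run out of possible matches, and "yields" a line number when all of the elements of the substring occur on that line.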

// prefilled hash linking number searched for to a list of line numbers
// the lines should be in ascending order
Hash<int, List<int>> index

// The subsequence we're looking for
List<int> subsequence = {1, 2, 3}
int len = subsequence.length()

// Take all the lists from the index that match the numbers we're looking for
List<List<int>> lines = index[number] for number in subsequence

// holder for our current search row
// has the current lowest line number each element occurs on 
int[] search = new int[len]
for(i = 0; i < len; i++)
    search[i] = lines[i].pop()

while(true)
{
    // minimum line number, substring position and whether they're equal
    min, pos, eq = search[0], 0, true

    // find the lowest line number and whether they all match
    for(i = 0; i < len; i++)
    {
        if(search[i] < min)
            min, pos, eq = search[i], i, false
        else if (search[i] > min)
            eq = false
    }

    // if they do all match every one of the numbers occurs on that row
    if(eq)
    {
        yield min; // line has all the elements

        foreach(list in lines)
            if(list.empty())  // one of the numbers isn't in any more lines
                 return

        // update the search to the next lowest line number for every substring element
        for(i = 0; i < len; i++)
            search[i] = lines[i].pop()
    }
    else
    {
        // the lowest line number for each element is not the same, so discard the lowest one
        if(lines[pos].empty()) // there are no more lines for the element we'd be updating
            return

        search[pos] = lines[pos].pop();
    }
}
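The search above can be sketched in Python without destroying the index, using one cursor per list instead of pop. This is an illustration, not the answer's exact code: `candidate_lines`, `search` and `is_subsequence` are hypothetical names, the index is assumed to map string tokens to strictly ascending lists of 1-based line numbers, and each candidate line is re-checked for order because the index alone only shows that all terms are present.

```python
def is_subsequence(query, sequence):
    """Return True if the terms of `query` occur in `sequence` in order."""
    it = iter(sequence)
    return all(term in it for term in query)

def candidate_lines(index, query):
    """Yield, in ascending order, line numbers on which every query term occurs."""
    postings = [index.get(term, []) for term in set(query)]
    if not postings or any(not p for p in postings):
        return  # empty query, or some term never occurs
    cursors = [0] * len(postings)
    while True:
        current = [p[c] for p, c in zip(postings, cursors)]
        target = max(current)
        if all(v == target for v in current):
            yield target  # every term occurs on this line
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the cursors that lag behind the largest line number.
            cursors = [c + 1 if p[c] < target else c
                       for p, c in zip(postings, cursors)]
        if any(c >= len(p) for p, c in zip(postings, cursors)):
            return  # one of the terms has no more lines

def search(lines, index, query):
    """Return the 1-based line numbers whose items contain `query` in order."""
    hits = []
    for n in candidate_lines(index, query):
        items = [tok.strip() for tok in lines[n - 1].split(",")]
        if is_subsequence(query, items):
            hits.append(n)
    return hits
```

On data where most lines contain all the query terms, nearly every line becomes a candidate and the index saves little, which matches the caveat given earlier in this answer.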



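Notes:
This could trivially be extended to store the position within the line as well as the line number; then only a little extra logic at the "yield" point would be needed to determine an actual match rather than just that all the items are present.
I have used "pop" to show how it moves through the line numbers, but you would not actually want to destroy your index on every search.
I have assumed the numbers all fit into ints. If you have really massive numbers, extend it to long, or even map the string representation of each number to an int.
There are some speedups available, especially in skipping lines during the "pop" stages, but I have gone for the clearer explanation.
Whether you use this or another method, you may also be able to cut down on computation depending on the data. A single pass can work out whether each line is ascending, descending, all odd or all even, and what its highest and lowest numbers are, which could be used to cut down the search space for each substring. Whether any of this is useful depends entirely on your dataset.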
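The first note, storing positions as well as line numbers, could be sketched as below. The names `build_positional_index` and `matches_on_line` are illustrative, and the layout (token to line number to ascending positions) is one assumption among several possible ones. The check is greedy: for each term it takes the earliest occurrence after the previous term's position, which is sufficient for subsequence matching.

```python
from bisect import bisect_right
from collections import defaultdict

def build_positional_index(lines):
    """Map token -> {line_number: ascending positions of the token in that line}."""
    index = defaultdict(dict)
    for line_number, line in enumerate(lines, start=1):
        for pos, tok in enumerate(line.split(",")):
            index[tok.strip()].setdefault(line_number, []).append(pos)
    return index

def matches_on_line(index, query, line_number):
    """Return True if `query` is a subsequence of the given line, using only the index."""
    last = -1
    for term in query:
        positions = index.get(term, {}).get(line_number, [])
        i = bisect_right(positions, last)  # first occurrence strictly after `last`
        if i == len(positions):
            return False
        last = positions[i]
    return True

lines = ["1, 2, 3, 4, 5, 6", "1, 2, 2, 3, 3, 3",
         "6, 5, 4, 3, 2, 1", "123, 4, 5, 6, 7"]
pindex = build_positional_index(lines)
print([n for n in range(1, len(lines) + 1)
       if matches_on_line(pindex, ["1", "2", "3"], n)])
```

With positions in the index, a match can be confirmed without re-reading the line, at the cost of one stored position per item.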
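What about 1,3,2,3? Is that a match?
@JacobEggers: Yes, its first, third, and fourth members are 1, 2, 3, so it is a match.
Why not parse the sequences into lists of numbers and check them?
@cheeken: I can do that if that is what ends up working. Note that not all of the numbers fit in a machine word.
@Charles Apart from skipping the space (one byte) after each comma, you have to read every byte of the file regardless of which approach you take. Also, 35 MB is not much data; this will take a few seconds to run. How often do you need to run it?
Forgot to mention: this is also a single pass, so you can read through the file as you parse the numbers. You could easily search for multiple substrings in the same pass.
That is the thing I am trying to avoid, if possible: going through every byte of the lists. I was hoping I could do some preprocessing that would make the search faster. I don't expect an exponential speedup like linear search -> binary search, but I would hope to beat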