C#: Has anyone implemented a Regex and/or Xml parser around StringBuilders or Streams?


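I'm building a stress-testing client that hammers servers and analyzes the responses, using as many threads as the client can muster. I constantly find myself throttled by garbage collection (and/or the lack of it), and in most cases it comes down to strings that I instantiate only to hand them off to a Regex or an Xml parsing routine.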

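If you decompile the Regex class, you'll see that internally it uses StringBuilders for nearly everything, but you can't pass it a StringBuilder; it helpfully dives down into private methods before it starts using them, so extension methods won't solve the problem either. You are in a similar situation if you want to get an object graph out of the parser in System.Xml.Linq.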

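This is not a case of pedantic over-optimization in advance. I have already looked at this question and others. I have also profiled my app to see where the ceilings come from, and using Regex.Replace() now does introduce significant overhead in a method chain where I'm trying to hit a server with millions of requests per hour and examine the XML responses for errors and embedded diagnostic codes. I have already gotten rid of just about every other inefficiency that throttles throughput, and I have even cut out a lot of the Regex overhead by extending StringBuilder to do wildcard find/replace when I don't need capture groups or backreferences. Still, it seems to me that by now someone would have wrapped up a custom StringBuilder-based (or better yet, Stream-based) Regex and Xml parsing utility.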
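To make the allocation pattern described above concrete, here is a minimal sketch of the round-trip that a string-only Regex API forces on a StringBuilder buffer. The class name, the `<payload>` element, and the pattern are hypothetical and chosen only for illustration; they do not come from the question.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexRoundTrip
{
    // Hypothetical pattern: matches a made-up <payload> element and everything inside it.
    private static readonly Regex Payload = new Regex(
        "<payload>.*?</payload>", RegexOptions.Compiled | RegexOptions.Singleline);

    public static StringBuilder Scrub(StringBuilder response)
    {
        // Regex only accepts strings, so the whole buffer is copied out first.
        string snapshot = response.ToString();

        // The replacement produces the cleaned text as yet another string.
        string cleaned = Payload.Replace(snapshot, "<payload></payload>");

        // The result is then copied back into the builder.
        return response.Clear().Append(cleaned);
    }
}
```

At 10KB per document and hundreds of documents per second, each of these temporary strings is short-lived garbage, which is the pressure the question is trying to avoid.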

OK, rant over, but am I going to have to do this myself?

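Update: I found a workaround that lowered peak memory consumption from several GB to a few hundred MB, so I'm posting it below. I'm not adding it as an answer because a) I generally don't like doing that, and b) I still want to find out whether someone has taken the time to customize StringBuilder to do regexes (or vice versa) before I do.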

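In my case, I could not use XmlReader because the stream I'm ingesting contains invalid binary content in certain elements. To parse the XML, I have to empty out those elements. I was previously using a single static compiled Regex instance to do the replacement, and it consumed memory like mad (I'm trying to process about 300 10KB documents per second). The changes that drastically reduced consumption were: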

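I added the code from here for the handy IndexOf method.

I added a (very) crude WildcardReplace method that allows one wildcard character (* or ?) per invocation.

I replaced the Regex usage with a WildcardReplace() call to empty the contents of the offending elements.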

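This is very unpretty and tested only as far as my own purposes required. I would have made it more elegant and more powerful, but YAGNI and all that, and I'm in a hurry. Here's the code: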

/// <summary>
/// Performs basic wildcard find and replace on a string builder, observing one of two 
/// wildcard characters: * matches any number of characters, or ? matches a single character.
/// Operates on only one wildcard per invocation; 2 or more wildcards in <paramref name="find"/>
/// will cause an exception.
/// All characters in <paramref name="replaceWith"/> are treated as literal parts of 
/// the replacement text.
/// </summary>
/// <param name="sb">The builder to modify in place.</param>
/// <param name="find">Search pattern with literal text on both sides of a single * or ?.</param>
/// <param name="replaceWith">Literal text that replaces each match.</param>
/// <returns>The same StringBuilder instance, to allow chaining.</returns>
public static StringBuilder WildcardReplace(this StringBuilder sb, string find, string replaceWith) {
    if (find.Split(new char[] { '*' }).Length > 2 || find.Split(new char[] { '?' }).Length > 2 || (find.Contains("*") && find.Contains("?"))) {
        throw new ArgumentException("Only one wildcard is supported, but more than one was supplied.", "find");
    } 
    // are we matching one character, or any number?
    bool matchOneCharacter = find.Contains("?");
    string[] parts = matchOneCharacter ? 
        find.Split(new char[] { '?' }, StringSplitOptions.RemoveEmptyEntries) 
        : find.Split(new char[] { '*' }, StringSplitOptions.RemoveEmptyEntries);
    // The loop below indexes parts[0] and parts[1], so literal text is required
    // on both sides of the wildcard.
    if (parts.Length != 2) {
        throw new ArgumentException("The wildcard must have literal text on both sides.", "find");
    }
    int startItemIdx; 
    int endItemIdx;
    int newStartIdx = 0;
    int length;
    // IndexOf is the StringBuilder extension mentioned above; it is assumed to return -1
    // when nothing is found, so compare with >= 0 to keep matches that start at index 0.
    while ((startItemIdx = sb.IndexOf(parts[0], newStartIdx)) >= 0 
        && (endItemIdx = sb.IndexOf(parts[1], startItemIdx + parts[0].Length)) >= 0) {
        length = (endItemIdx + parts[1].Length) - startItemIdx;
        newStartIdx = startItemIdx + replaceWith.Length;
        // With "?" wildcard, find parameter length should equal the length of its match.
        // Note that this stops the whole scan at the first over-long candidate.
        if (matchOneCharacter && length > find.Length)
            break;
        sb.Remove(startItemIdx, length);
        sb.Insert(startItemIdx, replaceWith);
    }
    return sb;
}
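WildcardReplace depends on an IndexOf extension for StringBuilder that is not reproduced in the post. The following is only a minimal sketch of what such a helper might look like, assuming an ordinal, case-sensitive search that returns -1 when nothing is found; the class name and the exact signature are assumptions, not the code the author actually borrowed.

```csharp
using System;
using System.Text;

public static class StringBuilderSearchExtensions
{
    // Hypothetical stand-in for the borrowed IndexOf helper.
    // Returns the index of the first ordinal occurrence of value at or after
    // startIndex, or -1 when there is none.
    public static int IndexOf(this StringBuilder sb, string value, int startIndex = 0)
    {
        if (sb == null) throw new ArgumentNullException(nameof(sb));
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (startIndex < 0 || startIndex > sb.Length)
            throw new ArgumentOutOfRangeException(nameof(startIndex));
        if (value.Length == 0) return startIndex;

        // Last position at which value could still fit inside the builder.
        int last = sb.Length - value.Length;
        for (int i = startIndex; i <= last; i++)
        {
            int j = 0;
            while (j < value.Length && sb[i + j] == value[j]) j++;
            if (j == value.Length) return i;   // every character matched
        }
        return -1;
    }
}
```

The search reads characters through the StringBuilder indexer, so it never materializes a string. On builders that have grown into many internal chunks the indexer can be slow, so this naive scan is a starting point rather than a tuned implementation.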


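XmlReader is a stream-based XML parser. As for regular expressions, the Mono project has an open-source implementation: if you need a regex library tuned to the performance needs of your specific application, you should be able to start from the latest code in that implementation.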
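As a rough illustration of the stream-based approach, the sketch below pulls the text of selected elements straight from a Stream with XmlReader, without first turning the response into a string. The class name, method name, and the `code` element used in the usage note are hypothetical. It also assumes well-formed input and text-only content in the target elements, so it would not cope with the invalid binary content described in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class DiagnosticScanner
{
    // Streams through the XML and collects the text of every element whose
    // name equals elementName. Assumes those elements contain text only.
    public static List<string> ReadElementValues(Stream xml, string elementName)
    {
        var values = new List<string>();
        using (XmlReader reader = XmlReader.Create(xml))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // Reads the text and moves past the end tag, so no extra
                    // Read() call is needed on this branch.
                    values.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return values;
    }
}
```

Calling `DiagnosticScanner.ReadElementValues(responseStream, "code")` would walk the response once and return only the values of interest, leaving the rest of the document unmaterialized.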

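Here, try this. Everything is char-based and relatively low-level for efficiency. You can use any number of *s or ?s. However, your * is now ✪ and your ? is now ★. About three days of work went into making it as clean as possible. You can even pass multiple queries in a single sweep.

Example usage: wildcard(new StringBuilder("Hello and welcome"), "hello✪w★l", "be") results in "become".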
// Searches 'text' for one or more 'find' patterns and replaces each match with the corresponding 'replace' string.
// ✪ represents multiple wildcard characters (non-greedy)
// ★ represents a single wildcard character
public StringBuilder wildcard(StringBuilder text, string find, string replace, bool caseSensitive = false)
{
    return wildcard(text, new string[] { find }, new string[] { replace }, caseSensitive);
}
public StringBuilder wildcard(StringBuilder text, string[] find, string[] replace, bool caseSensitive = false)
{
    if (text.Length == 0) return text;          // Degenerate case

    StringBuilder sb = new StringBuilder();     // The new adjusted string with replacements
    for (int i = 0; i < text.Length; i++)   {   // Go through every letter of the original large text

        bool foundMatch = false;                // Assume match hasn't been found to begin with
        for(int q=0; q< find.Length; q++) {     // Go through each query in turn
            if (find[q].Length == 0) continue;  // Ignore empty queries

            int f = 0;  int g = 0;              // Query cursor and text cursor
            bool multiWild = false;             // multiWild is ✪ symbol which represents many wildcard characters
            int multiWildPosition = 0;          

            while(true) {                       // Loop through query characters
                if (f >= find[q].Length || (i + g) >= text.Length) break;       // Bounds checking
                char cf = find[q][f];                                           // Character in the query (f is the offset)
                char cg = text[i + g];                                          // Character in the text (g is the offset)
                if (!caseSensitive) { cf = char.ToLowerInvariant(cf); cg = char.ToLowerInvariant(cg); }   // Fold both sides so patterns may contain upper-case letters
                if (cf != '★' && cf != '✪' && cg != cf && !multiWild) break;        // Break search, and thus no match is found
                if (cf == '✪') { multiWild = true; multiWildPosition = f; f++; continue; }              // Multi-char wildcard activated. Move query cursor, and reloop
                if (multiWild && cg != cf && cf != '★') { f = multiWildPosition + 1; g++; continue; }   // Match since MultiWild has failed, so return query cursor to MultiWild position
                f++; g++;                                                           // Reaching here means that a single character was matched, so move both query and text cursor along one
            }

            if (f == find[q].Length) {          // If true, query cursor has reached the end of the query, so a match has been found!!!
                sb.Append(replace[q]);          // Append replacement
                foundMatch = true;
                if (find[q][f - 1] == '✪') { i = text.Length; break; }      // If the MultiWild is the last char in the query, then the rest of the string is a match, and so close off
                i += g - 1;                                                 // Move text cursor along by the amount equivalent to its found match
                break;                                                      // Stop testing further queries at this position once one has matched
            }
        }
        if (!foundMatch) sb.Append(text[i]);    // If a match wasn't found at that point in the text, then just append the original character
    }
    return sb;
}