C# String.Compare performance (with trimming)
I need to do a lot of high-performance, case-insensitive string comparisons, and I realised that my approach of calling .ToLower().Trim() was really stupid because of all the new strings being allocated. So I dug around a little, and this way seems preferable:
String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)
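The only problem here is that I want to ignore leading and trailing spaces, i.e. Trim(), but if I use Trim I have the same problem with string allocations. I guess I could check each string to see whether it starts with (" ") or ends with (" ") and only then trim. Or I could work out the index and length for each string and pass them to this String.Compare overload: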
public static int Compare
(
    string strA,
    int indexA,
    string strB,
    int indexB,
    int length,
    StringComparison comparisonType
)
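But that looks rather messy, and I would probably have to juggle some integers if I don't want to write a really big if-else statement for every combination of leading and trailing blanks on both strings... So, are there any ideas for an elegant solution?
Here is my current suggestion: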
public bool IsEqual(string a, string b)
{
    return (string.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0);
}
public bool IsTrimEqual(string a, string b)
{
    // Shortcut: assumes each string has at most one leading and one trailing
    // space, so lengths that differ by more than 2 cannot be equal.
    if (Math.Abs(a.Length - b.Length) > 2)
    {
        return false;
    }
    else if (IsEqual(a, b))
    {
        return true;
    }
    else
    {
        return (string.Compare(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase) == 0);
    }
}
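First make sure that you really need to optimise this code. Maybe creating copies of the strings won't noticeably affect your program.
If you really do need to optimise, you could try processing the strings when you first store them rather than when you compare them (assuming those happen at different stages of the program). For example, store trimmed and lower-cased versions of the strings, so that when you compare them you can use a simple equality check.
Can't you just trim (and possibly lower-case) each string once, when you obtain it? Doing more than that sounds like premature optimisation...
I would use the code you have: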
String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)
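and add any .Trim() calls as you need them. Most of the time this saves the four strings that your initial option allocates (.ToLower().Trim() on each side), and it always saves two strings (the .ToLower() calls).
If you still have performance problems after that, your "messy" option is probably the best choice.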
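The earlier suggestion of processing each string once, when it is first stored, could look roughly like the sketch below. The names NormalizedStore and Normalize are made up for illustration, and it assumes that lower-casing with ToLowerInvariant is an acceptable stand-in for OrdinalIgnoreCase on your data (the two may not agree for every character).

```csharp
using System;
using System.Collections.Generic;

public static class NormalizedStore
{
    // Hypothetical helper: pay for Trim() and lower-casing once per string.
    public static string Normalize(string s) => s.Trim().ToLowerInvariant();

    public static void Main()
    {
        // Normalize each string once, when it is first stored.
        var terms = new HashSet<string>(StringComparer.Ordinal);
        foreach (string raw in new[] { "  Apple ", "BANANA", "\tCherry\n" })
        {
            terms.Add(Normalize(raw));
        }

        // Stored values are already normalized, so lookups are plain ordinal checks.
        // Only the incoming query is normalized (one allocation per query).
        Console.WriteLine(terms.Contains(Normalize(" apple")));   // True
        Console.WriteLine(terms.Contains(Normalize("banana ")));  // True
        Console.WriteLine(terms.Contains(Normalize("durian")));   // False
    }
}
```

This moves the allocation cost from every comparison to every insertion and query, which only helps if each stored string is compared many times.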
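The warnings about premature optimisation are correct, but I'll assume you have tested this and found that a lot of time is being wasted copying strings. In that case, I would try the following:
int startIndex1, length1, startIndex2, length2;
FindStartAndLength(txt1, out startIndex1, out length1);
FindStartAndLength(txt2, out startIndex2, out length2);
// Compare over the longer trimmed length so that a trimmed string is never
// treated as equal to a mere prefix of the other one.
int compareLength = Math.Max(length1, length2);
int result = string.Compare(txt1, startIndex1, txt2, startIndex2, compareLength,
    StringComparison.OrdinalIgnoreCase);
FindStartAndLength is a function that finds the starting index and length of the "trimmed" string (untested, but it should give the general idea):
static void FindStartAndLength(string text, out int startIndex, out int length)
{
    startIndex = 0;
    // Check the bound first so that empty or all-white-space input is safe.
    while (startIndex < text.Length && char.IsWhiteSpace(text[startIndex]))
        startIndex++;
    length = text.Length - startIndex;
    while (length > 0 && char.IsWhiteSpace(text[startIndex + length - 1]))
        length--;
}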
Something like this should do it:
public static int TrimCompareIgnoreCase(string a, string b) {
    int indexA = 0;
    int indexB = 0;
    // Skip leading white space.
    while (indexA < a.Length && Char.IsWhiteSpace(a[indexA])) indexA++;
    while (indexB < b.Length && Char.IsWhiteSpace(b[indexB])) indexB++;
    int lenA = a.Length - indexA;
    int lenB = b.Length - indexB;
    // Shrink the lengths to drop trailing white space.
    while (lenA > 0 && Char.IsWhiteSpace(a[indexA + lenA - 1])) lenA--;
    while (lenB > 0 && Char.IsWhiteSpace(b[indexB + lenB - 1])) lenB--;
    if (lenA == 0 && lenB == 0) return 0;
    // An empty (all white space) string sorts before any non-empty one,
    // consistent with the length tie-break below.
    if (lenA == 0) return -1;
    if (lenB == 0) return 1;
    int result = String.Compare(a, indexA, b, indexB, Math.Min(lenA, lenB), true);
    if (result == 0) {
        // Equal over the common prefix: the shorter string sorts first.
        if (lenA < lenB) result--;
        if (lenA > lenB) result++;
    }
    return result;
}
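Output:
0
You should profile this against a simple Trim-and-compare using some real data, to see whether it actually makes any difference for what you are going to use it for.
You could implement your own StringComparer. Here is a basic implementation: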
public class TrimmingStringComparer : StringComparer
{
    private StringComparison _comparisonType;
    public TrimmingStringComparer()
        : this(StringComparison.CurrentCulture)
    {
    }
    public TrimmingStringComparer(StringComparison comparisonType)
    {
        _comparisonType = comparisonType;
    }
    public override int Compare(string x, string y)
    {
        int indexX;
        int indexY;
        int lengthX = TrimString(x, out indexX);
        int lengthY = TrimString(y, out indexY);
        if (lengthX <= 0 && lengthY <= 0)
            return 0; // both strings contain only white space
        if (lengthX <= 0)
            return -1; // x contains only white space, y doesn't
        if (lengthY <= 0)
            return 1; // y contains only white space, x doesn't
        if (lengthX < lengthY)
            return -1; // x is shorter than y
        if (lengthY < lengthX)
            return 1; // y is shorter than x
        return String.Compare(x, indexX, y, indexY, lengthX, _comparisonType);
    }
    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }
    public override int GetHashCode(string obj)
    {
        throw new NotImplementedException();
    }
    private int TrimString(string s, out int index)
    {
        index = 0;
        while (index < s.Length && Char.IsWhiteSpace(s, index)) index++;
        int last = s.Length - 1;
        while (last >= 0 && Char.IsWhiteSpace(s, last)) last--;
        return last - index + 1;
    }
}
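As an aside, on .NET versions that provide ReadOnlySpan<char> (an assumption; none of the answers in this thread rely on it), the same "start index plus length" idea can be written without hand-rolled loops. This is only a sketch; the class and method names are made up for illustration, and a null argument is treated like an empty string because AsSpan() returns an empty span for null.

```csharp
using System;

public static class SpanTrimCompare
{
    // Hypothetical helper name. AsSpan() and Trim() on a span only adjust
    // an offset and a length; they do not allocate a new string.
    public static bool TrimmedEqualsIgnoreCase(string a, string b)
    {
        ReadOnlySpan<char> x = a.AsSpan().Trim();
        ReadOnlySpan<char> y = b.AsSpan().Trim();
        return x.Equals(y, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(TrimmedEqualsIgnoreCase("  asdf ", " ASDF \t "));  // True
        Console.WriteLine(TrimmedEqualsIgnoreCase("asdf", "as df"));          // False
    }
}
```

Whether this beats the manual loops is something to measure on real data rather than assume.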
I notice that your first suggestion only compares for equality rather than ordering, which allows some further efficiency savings:
public static bool TrimmedOrdinalIgnoreCaseEquals(string x, string y)
{
    //Always check for identity (same reference) first for
    //any comparison (equality or otherwise) that could take some time.
    //Identity always entails equality, and equality always entails
    //equivalence.
    if(ReferenceEquals(x, y))
        return true;
    //We already know they aren't both null as ReferenceEquals(null, null)
    //returns true.
    if(x == null || y == null)
        return false;
    int startX = 0;
    //note we keep this one further than the last char we care about.
    int endX = x.Length;
    int startY = 0;
    //likewise, one further than we care about.
    int endY = y.Length;
    while(startX != endX && char.IsWhiteSpace(x[startX]))
        ++startX;
    while(startY != endY && char.IsWhiteSpace(y[startY]))
        ++startY;
    if(startX == endX) //Empty when trimmed.
        return startY == endY;
    if(startY == endY)
        return false;
    //lack of bounds checking is safe as we would have returned
    //already in cases where endX and endY can fall below zero.
    while(char.IsWhiteSpace(x[endX - 1]))
        --endX;
    while(char.IsWhiteSpace(y[endY - 1]))
        --endY;
    //From this point on I am assuming you do not care about
    //the complications of case-folding, based on your example
    //referencing the ordinal version of string comparison
    if(endX - startX != endY - startY)
        return false;
    while(startX != endX)
    {
        //trade-off: with some data a case-sensitive
        //comparison first
        //could be more efficient.
        if(
            char.ToLowerInvariant(x[startX++])
            != char.ToLowerInvariant(y[startY++])
        )
            return false;
    }
    return true;
}
And of course, what is an equality checker without a matching hash code generator:
public static int TrimmedOrdinalIgnoreCaseHashCode(string str)
{
    //Higher CMP_NUM (or get rid of it altogether) gives
    //better hash, at cost of taking longer to compute.
    const int CMP_NUM = 12;
    if(str == null)
        return 0;
    int start = 0;
    int end = str.Length;
    while(start != end && char.IsWhiteSpace(str[start]))
        ++start;
    if(start != end)
        while(char.IsWhiteSpace(str[end - 1]))
            --end;
    int skipOn = (end - start) / CMP_NUM + 1;
    int ret = 757602046; // no harm matching native .NET with empty string.
    while(start < end)
    {
        //prime numbers are our friends.
        ret = unchecked(ret * 251 + (int)(char.ToLowerInvariant(str[start])));
        start += skipOn;
    }
    return ret;
}
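An equality check paired with a matching hash code is what Dictionary and HashSet need, so the two ideas can be wrapped in an IEqualityComparer<string>. The sketch below is a compact re-implementation for illustration, not the exact functions above: the class name and the Bounds helper are hypothetical, and the hash visits every character instead of sampling with CMP_NUM.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper; not part of the original answer.
public sealed class TrimmedOrdinalIgnoreCaseComparer : IEqualityComparer<string>
{
    // Finds the [start, end) range of s with surrounding white space excluded.
    private static void Bounds(string s, out int start, out int end)
    {
        start = 0;
        end = s.Length;
        while (start < end && char.IsWhiteSpace(s[start])) start++;
        while (end > start && char.IsWhiteSpace(s[end - 1])) end--;
    }

    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        Bounds(x, out int sx, out int ex);
        Bounds(y, out int sy, out int ey);
        if (ex - sx != ey - sy) return false;
        while (sx < ex)
        {
            if (char.ToLowerInvariant(x[sx++]) != char.ToLowerInvariant(y[sy++]))
                return false;
        }
        return true;
    }

    // Hashes the same characters that Equals compares, so equal keys
    // always produce equal hash codes.
    public int GetHashCode(string s)
    {
        if (s == null) return 0;
        Bounds(s, out int start, out int end);
        int hash = 17;
        for (int i = start; i < end; i++)
            hash = unchecked(hash * 251 + char.ToLowerInvariant(s[i]));
        return hash;
    }
}

public static class ComparerDemo
{
    public static void Main()
    {
        var index = new Dictionary<string, int>(new TrimmedOrdinalIgnoreCaseComparer());
        index["  Hello World "] = 1;
        Console.WriteLine(index.ContainsKey("hello world"));   // True
        Console.WriteLine(index.ContainsKey("hello  world"));  // False: inner white space is not trimmed
    }
}
```

With a comparer like this, lookups never allocate trimmed or lower-cased copies of the keys.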
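What makes you think there is a problem? Premature optimisation is a bad idea - there is no need to optimise until the application becomes "too slow". In the meantime, focus on clear code rather than fast code.
Can you be sure the compiler isn't already optimising cases like this for you?
I would also ask whether this really needs micro-optimisation. Do you actually have a performance problem here? I imagine you could get much bigger performance improvements elsewhere.
It's for a search engine over a very large collection of strings, so I think optimising is relevant in this case. Besides, having a good way to compare strings in the toolbox is not a bad thing.
@Anon: I don't think this is premature optimisation. With a large number of strings, creating new strings for every comparison could take noticeably longer. Just run some tests and see for yourself...
Well, there is nothing wrong with using a more efficient approach in this case. Use String.Com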