Objective-C NSString: convert to plain letters only (i.e. remove accents and punctuation)
Tags: objective-c, regex, cocoa, string, nsstring
I'm trying to compare names without any punctuation, spaces, accents, etc. At the moment I am doing the following:
-(NSString*) prepareString:(NSString*)a {
//remove any accents and punctuation;
a=[[[NSString alloc] initWithData:[a dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES] encoding:NSASCIIStringEncoding] autorelease];
a=[a stringByReplacingOccurrencesOfString:@" " withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"'" withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"`" withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"-" withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"_" withString:@""];
a=[a lowercaseString];
return a;
}
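However, I need to do this for hundreds of strings, and I need to make it more efficient. Any ideas?
Consider using a regular-expression library that adds a method such as stringByReplacingOccurrencesOfRegex:withString: to NSString. You could do something like this: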
NSString *searchString = @"This is neat.";
NSString *regexString = @"[\\W]"; // the backslash must be escaped inside an Objective-C string literal
NSString *replaceWithString = @"";
NSString *replacedString = [searchString stringByReplacingOccurrencesOfRegex:regexString withString:replaceWithString];
NSLog (@"%@", replacedString);
//... Thisisneat
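The same idea can be sketched with Foundation alone, assuming no third-party regex category is available. The helper name StripNonAlphanumerics is hypothetical, and the pattern is a deliberate variation: it keeps Unicode letters and numbers only, whereas \W would also keep the underscore.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the answers above): deletes every character
// that is not a Unicode letter (\p{L}) or number (\p{N}).
static NSString *StripNonAlphanumerics(NSString *input) {
    // NSRegularExpressionSearch makes the search string an ICU regex pattern.
    return [input stringByReplacingOccurrencesOfString:@"[^\\p{L}\\p{N}]"
                                            withString:@""
                                               options:NSRegularExpressionSearch
                                                 range:NSMakeRange(0, input.length)];
}
```

Called with @"This is neat." this should give the same "Thisisneat" result as the example above.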
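Consider using a string scanner, specifically its method that accepts an NSCharacterSet and its method that accepts a string and returns the scanned string by reference.
You may also want to combine this with a locale-aware comparison, or with a comparison that takes a diacritic-insensitive option. That can simplify removing or replacing the accents, so you can concentrate on removing punctuation, whitespace, etc.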
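One way this could look, assuming the scanner in question is NSScanner and the option is NSDiacriticInsensitiveSearch (both are assumptions, since the names are not given above). LettersOnly and NamesMatch are hypothetical helper names.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: collects only the letters of `input`, letting the
// scanner skip everything else (spaces, punctuation, digits).
static NSString *LettersOnly(NSString *input) {
    NSCharacterSet *letters = [NSCharacterSet letterCharacterSet];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    scanner.charactersToBeSkipped = [letters invertedSet];

    NSMutableString *result = [NSMutableString stringWithCapacity:input.length];
    NSString *chunk = nil;
    // Each call skips the non-letters, then returns the next run of letters.
    while ([scanner scanCharactersFromSet:letters intoString:&chunk]) {
        [result appendString:chunk];
    }
    return result;
}

// Hypothetical helper: compares the letters of two names, ignoring case
// and diacritics, without building accent-free copies of the strings.
static BOOL NamesMatch(NSString *a, NSString *b) {
    NSStringCompareOptions options = NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch;
    return [LettersOnly(a) compare:LettersOnly(b) options:options] == NSOrderedSame;
}
```

The comparison options take care of case and accents, so the scanner only has to deal with the characters that should be dropped entirely.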
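If you must use the approach you proposed in your question, at least use an NSMutableString and
replaceOccurrencesOfString:withString:options:range:
so that you are not creating a large number of nearly identical autoreleased strings; that will be far more efficient. Simply reducing the number of allocations may improve performance "enough" for the time being.
Before using any of these solutions, don't forget to use decomposedStringWithCanonicalMapping to decompose any accented letters. For example, this will turn é (U+00E9) into e followed by a combining acute accent (U+0065 U+0301). Then, when you strip out the non-alphanumeric characters, the unaccented letters will remain.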
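A minimal sketch of decompose-then-strip, under one assumption worth stating loudly: letterCharacterSet and alphanumericCharacterSet are documented as including Unicode category M (combining marks), so filtering against them alone would keep the decomposed accents. The sketch therefore excludes nonBaseCharacterSet explicitly. BaseAlphanumerics is a hypothetical helper name.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: decomposes accented letters, then keeps only the
// base letters and digits.
static NSString *BaseAlphanumerics(NSString *input) {
    // "ä" (U+00E4) becomes "a" followed by the combining diaeresis U+0308.
    NSString *decomposed = [input decomposedStringWithCanonicalMapping];

    // Keep letters and digits, but not combining marks (category M),
    // which alphanumericCharacterSet would otherwise let through.
    NSMutableCharacterSet *keep = [NSMutableCharacterSet alphanumericCharacterSet];
    [keep formIntersectionWithCharacterSet:[[NSCharacterSet nonBaseCharacterSet] invertedSet]];

    return [[decomposed componentsSeparatedByCharactersInSet:[keep invertedSet]]
            componentsJoinedByString:@""];
}
```

With this, an accented vowel is reduced to its base letter instead of disappearing, so two names that differ only in their vowels stay distinct.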
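The reason this matters is that you probably don't want, say, "dän" and "dün"* to be treated as the same thing. If you strip out all accented letters, as some of these solutions do, you end up with "dn" for both, so those strings will compare as equal.
So you should decompose them first, so that you can strip the accents and keep the letters.
*German example. Thanks to Joris Weimar for providing it.
Just came across this. It may be too late, but here is what worked for me: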
// text is the input string; this just removes accents from the letters.
// The lossy encoding turns accented letters into plain letters.
// dataUsingEncoding: returns an immutable NSData, so take a mutable copy.
NSMutableData *sanitizedData = [[[text dataUsingEncoding:NSASCIIStringEncoding
allowLossyConversion:YES] mutableCopy] autorelease];
// Increasing the length by 1 appends a 0 byte (increaseLengthBy:
// guarantees that the new space is filled with zeros), effectively turning
// sanitizedData into a C string.
[sanitizedData increaseLengthBy:1];
// Now create a string from the C string in sanitizedData.
NSString *final = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];
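To give a complete example combining Luiz's and Peter's answers, with a few added lines, you get the code below. The code does the following:
The output of both examples is: BuverE_-48
One important clarification on BillyTheKid18756's answer (Luiz corrected it, but this is not obvious from the code explanation): do not use
stringWithCString
as the second step when removing accents. It can add unwanted characters to the end of your string, because NSData is not NULL-terminated (as stringWithCString expects).
Alternatively, use it and append an extra null byte to the NSData, as Luiz did in his code.
I think a simpler fix is to replace: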
NSString *sanitizedText = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];
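with:
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];
Taking BillyTheKid18756's code again, here is the complete, corrected code: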
// The input text
NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";
// Defining what characters to accept
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];
// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
// Corrected back-conversion from NSData to NSString
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];
// Removing unaccepted characters
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];
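@interface NSString (Filtering)
- (NSString *)stringByFilteringCharacters:(NSCharacterSet *)charSet;
@end
@implementation NSString (Filtering)
- (NSString *)stringByFilteringCharacters:(NSCharacterSet *)charSet {
NSMutableString *mutString = [NSMutableString stringWithCapacity:[self length]];
for (NSUInteger i = 0; i < [self length]; i++) {
// characterAtIndex: returns a unichar; storing it in a char would truncate non-ASCII characters
unichar c = [self characterAtIndex:i];
if (![charSet characterIsMember:c]) [mutString appendFormat:@"%C", c];
}
return [NSString stringWithString:mutString];
}
@end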
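I believe this is the best solution:
Depending on the nature of the strings you want to convert, you may want to set a fixed locale (e.g. English) instead of using the user's current locale. That way you can be sure of getting the same result on every machine.
If you want to compare strings, use one of these methods. Don't try to change the data.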
- (NSComparisonResult)localizedCompare:(NSString *)aString
- (NSComparisonResult)localizedCaseInsensitiveCompare:(NSString *)aString
- (NSComparisonResult)compare:(NSString *)aString options:(NSStringCompareOptions)mask range:(NSRange)range locale:(id)locale
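You need to take the user's locale into account when doing things with strings, especially things like names. In most languages, characters like ä and å are not the same character; they merely look similar. They are inherently distinct characters with a meaning distinct from other characters, but the actual rules and semantics differ for each locale.
The correct way to compare and sort strings is to take the user's locale into account. Anything else is naive, wrong and very 1990s. Stop doing it.
If you are trying to pass data to a system that cannot support non-ASCII, well, that is just the wrong thing to do. Pass it as data blobs.
Also, normalize your strings first (see Peter Hosey's post), precomposed or decomposed; basically, pick one normalized form:
- (NSString *)decomposedStringWithCanonicalMapping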
- (NSString *)decomposedStringWithCompatibilityMapping
- (NSString *)precomposedStringWithCanonicalMapping
- (NSString *)precomposedStringWithCompatibilityMapping
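As a small sketch of what "pick one normalized form" means in practice: the same visible character can be stored as one code unit or as two, and normalizing both sides to the same form before a literal comparison removes that difference. EqualAfterNormalizing is a hypothetical helper name, and the choice of the precomposed canonical form is arbitrary here.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: normalizes both strings to the precomposed canonical
// form, then compares them literally.
static BOOL EqualAfterNormalizing(NSString *a, NSString *b) {
    return [[a precomposedStringWithCanonicalMapping]
            isEqualToString:[b precomposedStringWithCanonicalMapping]];
}

// Two spellings of "é": one precomposed code unit, or "e" plus U+0301.
static NSString *PrecomposedEAcute(void) { return @"\u00e9"; }
static NSString *DecomposedEAcute(void)  { return @"e\u0301"; }
```

The two spellings have lengths 1 and 2 respectively, yet EqualAfterNormalizing treats them as the same string.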
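No, it is not nearly as simple and easy as we tend to think.
Yes, it requires informed and careful decision-making. (A bit of experience with a non-English language helps.)
These answers didn't work as expected for me. Specifically,
decomposedStringWithCanonicalMapping
didn't strip accents/umlauts as I had expected.
Here's a variation on what I used that answers the brief: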
// replace accents, umlauts etc with equivalent letter i.e 'é' becomes 'e'.
// Always use en_GB (or a locale without the characters you wish to strip) as locale, no matter which language we're taking as input
NSString *processedString = [string stringByFoldingWithOptions: NSDiacriticInsensitiveSearch locale: [NSLocale localeWithLocaleIdentifier: @"en_GB"]];
// remove non-letters
processedString = [[processedString componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
// trim whitespace
processedString = [processedString stringByTrimmingCharactersInSet: [NSCharacterSet whitespaceCharacterSet]];
return processedString;
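The snippet above could be wrapped into a reusable function along these lines. This is only a sketch: ComparableName is a hypothetical name, and adding NSCaseInsensitiveSearch to the folding options is my assumption, made because the code in the question also lowercases its input.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: reduces a name to a key that ignores accents, case,
// spaces and punctuation, suitable for equality checks between names.
static NSString *ComparableName(NSString *name) {
    // Fixed locale so that every machine folds the same way.
    NSLocale *fixedLocale = [NSLocale localeWithLocaleIdentifier:@"en_GB"];
    NSStringCompareOptions folding = NSDiacriticInsensitiveSearch | NSCaseInsensitiveSearch;
    NSString *folded = [name stringByFoldingWithOptions:folding locale:fixedLocale];

    // Drop everything that is not a letter (spaces, apostrophes, hyphens, digits).
    NSCharacterSet *nonLetters = [[NSCharacterSet letterCharacterSet] invertedSet];
    return [[folded componentsSeparatedByCharactersInSet:nonLetters]
            componentsJoinedByString:@""];
}
```

Two names can then be compared with [ComparableName(a) isEqualToString:ComparableName(b)], and the keys can be computed once and cached when hundreds of strings are involved.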
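I wanted to filter out everything except letters and numbers, so I adapted Lorean's implementation of a category on NSString to work a little differently. In this example, you specify a string containing only the characters you want to keep, and everything else is filtered out: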
@interface NSString (PraxCategories)
+ (NSString *)lettersAndNumbers;
- (NSString*)stringByKeepingOnlyLettersAndNumbers;
- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string;
@end
@implementation NSString (PraxCategories)
+ (NSString *)lettersAndNumbers { return @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; }
- (NSString*)stringByKeepingOnlyLettersAndNumbers {
return [self stringByKeepingOnlyCharactersInString:[NSString lettersAndNumbers]];
}
- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string {
NSCharacterSet *characterSet = [NSCharacterSet characterSetWithCharactersInString:string];
NSMutableString * mutableString = @"".mutableCopy;
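for (NSUInteger i = 0; i < [self length]; i++){
// characterAtIndex: returns a unichar; a char would truncate non-ASCII characters and could match the wrong letter
unichar character = [self characterAtIndex:i];
if([characterSet characterIsMember:character]) [mutableString appendFormat:@"%C", character];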
}
return mutableString.copy;
}
@end
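Usage:
NSString *string = someStringValueThatYouWantToFilter;
string = [string stringByKeepingOnlyLettersAndNumbers];
Or, for example, if you wanted to strip out everything except vowels:
string = [string stringByKeepingOnlyCharactersInString:@"aeiouAEIOU"];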
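If you are still learning Objective-C and aren't using categories, I encourage you to try them out. They are the best place to put things like this, because they give more functionality to all objects of the class you categorize.
Categories simplify and encapsulate the code you are adding, making it easy to reuse in all of your projects. It's a great feature of Objective-C.
Peter's solution in Swift: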
let newString = oldString.components(separatedBy: CharacterSet.letters.inverted).joined()
For example:
let oldString = "Jo_ - h !. nn y"
// "Jo_ - h !. nn y"
oldString.components(separatedBy: CharacterSet.letters.inverted)
// ["Jo", "", "", "", "h", "", "", "", "nn", "y"] (adjacent separators produce empty strings)
oldString.components(separatedBy: CharacterSet.letters.inverted).joined()
// "Johnny"