Let’s start with two exotic strings (console output is in the code comments):
NSString* apples = NSGetFrenchWord();
NSString* oranges = NSGetFrenchWord();
NSLog(@"apples == '%@'", apples);
//apples == 'café'
NSLog(@"oranges == '%@'", oranges);
//oranges == 'café'
They look identical, but looks can be deceiving.
NSLog(@"isEqual? %@", [apples isEqual:oranges] ? @"YES" : @"NO");
//isEqual? NO
NSLog(@"[apples length] == %lu", [apples length]);
//[apples length] == 4
NSLog(@"[oranges length] == %lu", [oranges length]);
//[oranges length] == 5
But if you were to sort them they should be the same, right?
NSLog(@"NSOrderedSame? %@", [apples compare:oranges] == NSOrderedSame ? @"YES" : @"NO");
//NSOrderedSame? YES
Well at least sorting works. Let’s inspect the strings one character at a time.
NSString* CodePoints(NSString* str)
{
NSMutableString* codePoints = [NSMutableString string];
for(NSUInteger i = 0; i < [str length]; ++i){
long ch = (long)[str characterAtIndex:i];
[codePoints appendFormat:@"%0.4lX ", ch];
}
return codePoints;
}
NSLog(@"apples == %@", CodePoints(apples));
//apples == 0063 0061 0066 00E9
NSLog(@"oranges == %@", CodePoints(oranges));
//oranges == 0063 0061 0066 0065 0301
So they are, in fact, different strings.
If you were to look up the above Unicode characters (a.k.a code points) you would find that:
-
0063
is ‘c’ -
0061
is ‘a’ -
0066
is ‘f’ -
0065
is ‘e’ -
00E9
is ‘é’ (LATIN SMALL LETTER E WITH ACUTE) -
0301
is ‘´’ (COMBINING ACUTE ACCENT)
There are at least two ways to represent the glyph ‘é’ in Unicode. One way
is with the single code point 00E9
. The other is with two code points: an ‘e’
code point (0065
) followed by a combining acute accent code point (0301
).
Unicode sort of works like ASCII, but not quite.
This is where Unicode normalization/equivalence comes into play.
“Normalizing” a Unicode string simply involves taking all the glyphs that look
the same and giving them the same code point sequence. You can “compose” all
the glyphs, which will translate the two code points 0065
0301
into a single code point 00E9
. You can also “decompose” all the glyphs, which
will do the opposite. There are four types of Unicode normalization, and
NSString
provides methods for all of them:
- decomposedStringWithCanonicalMapping
- decomposedStringWithCompatibilityMapping
- precomposedStringWithCanonicalMapping
- precomposedStringWithCompatibilityMapping
Mad props to Simon for the tip.