International Components for Unicode
Collation Framework
Contents
- What is collation?
- The rule symbols and their usage
- Interesting Examples
- Implementation Details
- Building the Collation Table
- Incremental Comparison Diagram
- Generating a Collation Key
- Q & A
What is collation?
The collation framework performs locale-sensitive string comparison. You can
use it to build searching and sorting routines for natural-language text, to
build tables of contents for large documents, or to create efficient index
lookups for database entries.
The ICU Collator classes provide the following services:
- Simple, data-driven, table-based collation.
- Easy customization for your needs.
- Merging of different resources.
- Behind-the-scenes transformation of the ASCII data file into a binary file for efficiency.
- Both incremental comparison for simple comparisons and collation keys for batch processing.
The Collator classes provide four comparison levels, so that different levels
of difference can be considered significant:
- Primary: a letter difference. For example, 'a' and 'b'.
- Secondary: an accent difference. For example, 'ä' and 'å'.
- Tertiary: a case difference. For example, 'a' and 'A'.
- Identical: no difference. For example, 'a' and 'a'.
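As a rough illustration of how the levels interact, the following C++ sketch compares two sequences of collation elements level by level and stops at the requested strength. The Strength enum, the Weights struct and compareAtStrength are hypothetical names invented for this example, not part of the ICU API. The weights are made-up values, and the Identical level is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only; not the ICU API.
enum class Strength { Primary = 1, Secondary = 2, Tertiary = 3 };

// One collation element carries one weight per level.
struct Weights {
    int primary;
    int secondary;
    int tertiary;
};

static int weightAt(const Weights& w, int level) {
    return level == 1 ? w.primary : level == 2 ? w.secondary : w.tertiary;
}

// Compares two element sequences and ignores every level beyond `strength`.
// Returns a negative value, zero, or a positive value.
int compareAtStrength(const std::vector<Weights>& a,
                      const std::vector<Weights>& b,
                      Strength strength) {
    const int maxLevel = static_cast<int>(strength);
    for (int level = 1; level <= maxLevel; ++level) {
        const std::size_t n = std::min(a.size(), b.size());
        for (std::size_t i = 0; i < n; ++i) {
            const int d = weightAt(a[i], level) - weightAt(b[i], level);
            if (d != 0) return d;  // first difference at this level decides
        }
        // A sequence that is a prefix of the other sorts first.
        if (a.size() != b.size()) return a.size() < b.size() ? -1 : 1;
    }
    return 0;  // no significant difference at the requested strength
}
```

With weights chosen so that 'a' and 'A' differ only in their tertiary weight, the function reports them as equal at primary or secondary strength and as different at tertiary strength.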
The rule symbols and their usage
A string is decomposed into one or more collation elements when it is used
with the collation classes. The collation rules specify the order of these
collation elements. The collation table is composed of a list of collation
rules, where each rule takes one of three forms:
- <modifier>
- <relation> <text-argument>
- <reset> <text-argument1> <relation> <text-argument2>
<modifier>
- '@': French secondary; accent weights are sorted backwards.
<text-argument>
A text-argument is any sequence of characters, excluding special characters
(that is, common whitespace characters [0009-000D, 0020] and rule syntax
characters [0021-002F, 003A-0040, 005B-0060, 007B-007E]). If those characters
are desired, you can put them in single quotes (e.g. ampersand => '&').
Note that unquoted white space characters are ignored; e.g. "b c" is
treated as "bc".
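The quoting and whitespace behaviour described above can be sketched as a small C++ helper. The function name normalizeTextArgument is hypothetical and this is not the parser in ptnentry.cpp; it only handles single-byte characters and does not treat a doubled quote specially.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the literal characters of a text-argument.
// Unquoted whitespace (0009-000D, 0020) is dropped; characters between
// single quotes are kept verbatim, and the quotes themselves are removed.
std::string normalizeTextArgument(const std::string& raw) {
    std::string out;
    bool quoted = false;
    for (char c : raw) {
        if (c == '\'') {
            quoted = !quoted;  // toggle quoting, drop the quote itself
            continue;
        }
        const bool isSpace = (c >= 0x09 && c <= 0x0D) || c == 0x20;
        if (!quoted && isSpace) continue;  // unquoted whitespace is ignored
        out.push_back(c);
    }
    return out;
}
```

For instance, the input b c yields bc, and the input Question'-'mark yields Question-mark.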
<relation>
- '<' : Greater, as a letter difference (primary)
- ';' : Greater, as an accent difference (secondary)
- ',' : Greater, as a case difference (tertiary)
- '=' : Equal
<reset>
- '&': Indicates that text-argument2 is placed immediately after the position where the reset text-argument1 would be sorted.
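To make the rule syntax concrete, here is a minimal C++ sketch that splits a rule string into (symbol, text-argument) pairs. RuleToken and tokenizeRules are hypothetical names, not the ICU parser. The sketch assumes that leading text with no preceding relation is treated as a primary entry, and it performs no error checking.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical token: the rule symbol and the text-argument that follows it.
struct RuleToken {
    char op;           // '@', '&', '<', ';', ',' or '='
    std::string text;  // text-argument with quotes and unquoted spaces removed
};

std::vector<RuleToken> tokenizeRules(const std::string& rules) {
    std::vector<RuleToken> tokens;
    char op = '<';  // assumption: leading text without a relation is primary
    std::string text;
    bool quoted = false;
    auto flush = [&]() {
        if (!text.empty()) tokens.push_back(RuleToken{op, text});
        text.clear();
    };
    for (char c : rules) {
        if (c == '\'') { quoted = !quoted; continue; }
        if (!quoted) {
            if (c == ' ' || (c >= '\t' && c <= '\r')) continue;  // skip whitespace
            if (c == '@') {  // modifier: takes no text-argument
                flush();
                tokens.push_back(RuleToken{'@', ""});
                continue;
            }
            if (c == '&' || c == '<' || c == ';' || c == ',' || c == '=') {
                flush();  // finish the previous entry
                op = c;   // the next text-argument belongs to this symbol
                continue;
            }
        }
        text.push_back(c);
    }
    flush();
    return tokens;
}
```

Applied to the rule a, A < b, B & Question'-'mark = '?', the sketch produces six tokens, the last two being a reset on Question-mark and an equality with ?.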
Interesting Examples
The following examples show some rules and the string comparison results they
produce. In the results, "<" denotes a primary less-than difference, "<<" a
secondary less-than difference, "<<<" a tertiary less-than difference, and
"==" equality:
- Rule " a, A < b, B < c, C < ch, cH, Ch, CH < d, D < e, E": this rule sorts the letters 'a', 'b', 'c', 'd' and 'e' in that order with primary differences. 'ch' is sorted as a letter in its own right between 'c' and 'd' with a primary difference, and upper-case letters sort after their lower-case counterparts with a tertiary difference. For example, "abc" <<< "ABC" and "achb" < "adb".
- Rule " a, A < b, B < c, C < d, D < e, E & AE; ä ": this rule sorts the letters 'a', 'b', 'c', 'd' and 'e' in that order with primary differences. 'ä' sorts after the sequence 'A' followed by 'E' with a secondary difference. For example, "aeb" << "äb" and "acb" < "äb".
- Rule ".... q, Q & Question'-'mark = '?' ....": this rule shows how to make a symbol sort as equivalent to its corresponding text. In this example, "?" == "Question-mark". Note that special symbols must be quoted in the rule.
- Rule ".... & aa ; a- & ee ; e- & ii ; i- & oo ; o- & uu ; u- ....": this rule demonstrates how to specify prolonged vowels in Japanese. In this case, "a-" sorts after "aa" with a secondary difference. For example, "baab" << "ba-b".
Implementation Details
Three parts of the code are examined here:
- Building the collation rule table. (See mergecol.cpp, ptnentry.cpp and tblcoll.cpp.)
- The incremental comparison algorithm for simple string comparison. (RuleBasedCollator.compare() in tblcoll.cpp)
- Collation key generation and the key format. (RuleBasedCollator.getCollationKey() in tblcoll.cpp)
Building the Collation Table
The process of building a collation table is as follows:
- Parse the rule text into a list of pattern entries. Each pattern entry holds its core characters, its extension characters and its strength relation. (In ptnentry.cpp)
- Insert each entry at the correct position based on the <reset> arguments. (In mergecol.cpp)
- Build the compacted, highly efficient look-up table from the list of pattern entries. (In tblcoll.cpp)
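The insertion step can be pictured with the following C++ sketch. It is a deliberate simplification and not the code in mergecol.cpp: Entry and mergeEntries are hypothetical names, a reset must name an entry that is already in the list (so expansions such as "& AE" are not handled), and new entries are placed directly after the reset position without regard to their strength.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical pattern entry: the rule symbol and its characters.
struct Entry {
    char relation;     // '<', ';', ',', '=' or '&' for a reset
    std::string text;
};

// Builds the ordered entry list. Entries are appended in rule order until a
// reset moves the insertion point to just after the named entry.
std::vector<Entry> mergeEntries(const std::vector<Entry>& parsed) {
    std::vector<Entry> table;
    std::size_t pos = 0;  // index at which the next entry is inserted
    for (const Entry& e : parsed) {
        if (e.relation == '&') {
            auto it = std::find_if(table.begin(), table.end(),
                                   [&](const Entry& t) { return t.text == e.text; });
            if (it == table.end())
                throw std::invalid_argument("reset target not found: " + e.text);
            pos = static_cast<std::size_t>(it - table.begin()) + 1;
            continue;  // the reset itself adds no entry
        }
        table.insert(table.begin() + static_cast<std::ptrdiff_t>(pos), e);
        ++pos;  // later entries follow the one just inserted
    }
    return table;
}
```

For the parsed rule "a < b < d & b < c", the resulting order is a, b, c, d: the reset on b causes c to be inserted between b and d.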
Incremental Comparison Diagram
Generating a Collation Key
The control flow for generating a collation key is as follows:
1. Retrieve the next collation element of the source string. When the end of the string is reached, go to step 5.
2. Append the primary weight of the element to the primary weight buffer.
3. Check whether secondary weights need to be processed. If so, append the secondary weight to the secondary weight buffer. If the collator is marked to process French secondary ordering, reverse the order of all the secondary weights accumulated before the next primary weight is encountered.
4. Check whether tertiary weights need to be processed. If so, append the tertiary weight to the tertiary weight buffer. Then return to step 1.
5. Concatenate the primary weight buffer, the secondary weight buffer and the tertiary weight buffer, with a null delimiter between the buffers. Return the concatenated buffer as the collation key.
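The control flow above can be sketched in C++ as follows. This is an illustration under stated assumptions, not the implementation of getCollationKey() in tblcoll.cpp: Element and makeCollationKey are hypothetical names, every weight is assumed to be a single non-zero byte so that the null delimiter is unambiguous, and French secondary handling is simplified to one reversal of the whole secondary buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical collation element: one non-zero byte per level.
struct Element {
    unsigned char primary;
    unsigned char secondary;
    unsigned char tertiary;
};

std::string makeCollationKey(const std::vector<Element>& elements,
                             bool useSecondary, bool useTertiary,
                             bool frenchSecondary) {
    std::string primary, secondary, tertiary;
    for (const Element& e : elements) {  // walk the collation elements in order
        primary.push_back(static_cast<char>(e.primary));
        if (useSecondary) secondary.push_back(static_cast<char>(e.secondary));
        if (useTertiary) tertiary.push_back(static_cast<char>(e.tertiary));
    }
    // Simplification: French secondary reverses the whole secondary buffer.
    if (useSecondary && frenchSecondary)
        std::reverse(secondary.begin(), secondary.end());
    // Concatenate the buffers with a null delimiter between them.
    std::string key = primary;
    key.push_back('\0');
    key += secondary;
    key.push_back('\0');
    key += tertiary;
    return key;
}
```

Because the primary weights come first, two keys built this way can be compared byte by byte: a primary difference is found before any secondary or tertiary byte is examined.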
Q & A
- How do I customize the collation sequence?
A: Use the RuleBasedCollator constructor to create your own Collator from a customized set of rules.
- Will the collation framework support surrogate and private use characters?
A: This is one of our future work items. However, no firm schedule has been set for it yet.
- How does turning on French secondary affect the generation of collation keys?
A: In French, secondary differences are sorted backwards, so turning on this option causes the secondary weights in the generated keys to be reversed.
- Is there any support for composing characters? If so, how does it work?
A: Yes, it is based on the Normalizer interface. When an expanding character is detected, the rule builder internally constructs collation entries for the precomposed version so that composed characters are handled correctly.
- Is there any plan for performance improvement, for instance in contracting/expanding character lookup?
A: Yes, performance enhancement is an ongoing work item.