<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>XML Collation Specification</title>
<style>
<!--
th { background-color: #9999CC; border-style: solid; border-width: 1px; padding: 4 }
td { background-color: #CCCCFF; border-style: solid; border-width: 1px; padding: 4 }
table { border-style: solid; border-width: 1px }
-->
</style>
</head>

<body style="margin:2em">

<h1 align="center">XML Collation Specification</h1>
<p align="center"><i><font size="4"><b><font color="#FF0000">Early Draft:</font></b>
MED 2002-06-21</font></i></p>
<p>This document defines an XML vocabulary for exchanging tailoring rules and
specifying comparison options. It allows any two implementations to exchange a
specification of collation. Using the same specification, the two
implementations will achieve the same results when comparing strings.</p>
<p>The rules are defined by correspondence with the <i>basic</i> <a href="http://oss.software.ibm.com/icu/userguide/Collate_Customization.html">ICU
rule syntax</a> (used in ICU and Java) and/or the ICU parameterizations. You
should be familiar with the UCA (Unicode Collation Algorithm) and the ICU
implementation of it before continuing with the rest of this document.</p>
<blockquote>
<p><b>Note: </b>ICU provides a concise format for specifying orderings, based
on tailorings to the UCA. For example, to specify that 'k' and 'q' follow 'c', one
can use the rule: "&amp; c &lt; k &lt; q". The rules also allow
people to set default general parameter values, such as whether uppercase
sorts before lowercase.</p>
<p>Java contains an earlier version of ICU that has not been updated recently.
It does not support any of the basic syntax marked with [...], and its default
table is not the UCA.</p>
<p>It is not necessary for ICU to be used in the underlying implementation.
The features are simply described here in terms of the ICU capabilities, since
that is easier than duplicating the text.</p>
</blockquote>
<p>Like the ICU rules, the tailoring syntax is designed to be independent of the
actual weights used in any particular UCA table. That way the same rules can be
applied to UCA versions over time, even if the underlying weights change.</p>
<h3><a name="Document_Structure">Document Structure</a></h3>
<p>The following describes the overall document structure used to specify a
collation in XML.</p>
<p><code>&lt;collation name="somename"&gt;<br>
&nbsp;&nbsp;&lt;base .../&gt;<br>
&nbsp;&nbsp;&lt;settings .../&gt;<br>
&nbsp;&nbsp;&lt;rules&gt;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&lt;!-- rules go here, if there are any --&gt;<br>
&nbsp;&nbsp;&lt;/rules&gt;<br>
&lt;/collation&gt;</code></p>
<table border="1" width="100%">
<tr>
<td width="100%"><b>TBD:</b>
<ul>
<li><b>Add DTD</b></li>
<li><b>Clarify how versions work.</b></li>
<li><b>Add Namespace</b></li>
</ul>
</td>
</tr>
</table>
<h3><a name="Base">Base</a></h3>
<p>There must be exactly one base element. The base element indicates the
collation ordering that is to be used as a foundation. This base collation
ordering can be modified (tailored) by a rules element, and the settings in the
base can be overridden by the settings element. The rules are treated as if they
were appended to the rules in the document that the URL refers to. When xml:lang
is used, the rules in the ICU repository with that version are specified. The
base is identified by one of two alternative attributes:</p>
<table>
<tr>
<th>Attribute</th>
<th>Options</th>
<th>XML Example</th>
<th>Description</th>
</tr>
<tr>
<td>uca</td>
<td><i>uca version/unicode version</i></td>
<td>uca="3.1.1d1/3.2.0"</td>
<td>Specifies the UCA version</td>
</tr>
<tr>
<td>src</td>
<td><i>URL</i></td>
<td>src="http://www.foo.com/sort_en_us.xml"</td>
<td>Points to a different collation specification.</td>
</tr>
</table>
<p>The first attribute, uca, is used for a direct table: one that either uses the
UCA alone, or modifies it with settings and/or rules. The second, src, is used to
refer to a pre-existing document in this format, which can also be modified with
settings and/or rules.</p>
<p><i>Example 1:<br>
The following specifies a German phonebook ordering, by making each umlauted
letter sort as its base letter + e, with only a tertiary difference.</i></p>
<blockquote>
<pre>&lt;collation name="German Phonebook Ordering"&gt;
  &lt;base uca="3.1.1d1/3.2.0"/&gt;
  &lt;rules&gt;
    &lt;reset/&gt; ae &lt;t/&gt; &auml;
    &lt;reset/&gt; AE &lt;t/&gt; &Auml;
    &lt;reset/&gt; oe &lt;t/&gt; &ouml;
    &lt;reset/&gt; OE &lt;t/&gt; &Ouml;
    &lt;reset/&gt; ue &lt;t/&gt; &uuml;
    &lt;reset/&gt; UE &lt;t/&gt; &Uuml;
  &lt;/rules&gt;
&lt;/collation&gt;</pre>
</blockquote>
<p><i>Example 2:<br>
Supposing the above is on the web at <a href="http://www.foo.com/de_de_phonebook.xml">http://www.foo.com/de_de_phonebook.xml</a>,
the following modifies that to sort uppercase first, and to sort the character '@'
as if it were spelled out.</i></p>
<blockquote>
<pre>&lt;collation name="German Phonebook Ordering, Uppercase First with At Sign"&gt;
  &lt;base src="http://www.foo.com/de_de_phonebook.xml"/&gt;
  &lt;settings caseFirst="upper"/&gt;
  &lt;rules&gt;
    &lt;reset/&gt; @ &lt;t/&gt; Affenschwanz
  &lt;/rules&gt;
&lt;/collation&gt;</pre>
</blockquote>
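<p>As a rough illustration of the constraints on the base element, the following
Python sketch checks that a collation document has exactly one base element
carrying exactly one of the two attributes. The function name
<code>read_base</code> is hypothetical and not part of this specification, and
the sketch does not fetch or resolve a src URL.</p>

```python
import xml.etree.ElementTree as ET


def read_base(xml_text):
    """Return (attribute_name, value) for the single <base> element.

    Hypothetical helper: only checks the "exactly one base" and
    "uca or src" constraints described in the Base section.
    """
    root = ET.fromstring(xml_text)
    bases = root.findall("base")
    if len(bases) != 1:
        raise ValueError("expected exactly one <base>, found %d" % len(bases))
    attrs = bases[0].attrib
    present = [name for name in ("uca", "src") if name in attrs]
    if len(present) != 1:
        raise ValueError("<base> must carry exactly one of uca= or src=")
    return present[0], attrs[present[0]]


doc = '<collation name="demo"><base uca="3.1.1d1/3.2.0"/><rules/></collation>'
print(read_base(doc))
```

<p>A real implementation would presumably go on to load the UCA table or the
referenced document; that step is left out here.</p>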
<h3><a name="Setting_Options">Setting Options</a></h3>
<p>There must be exactly one settings element. It contains global settings on
the collation sequence. For example, &lt;settings
strength="secondary"/&gt; will only compare strings based on their
primary and secondary weights, ignoring any weaker weights.</p>
<p>The following table provides a list of valid attributes. If any of the
attributes is not present, the default for the base is used. The default for the
UCA is listed in italics below, but it may be modified by the base. The effect
of these attributes is defined by reference to the effect of the <a href="http://oss.software.ibm.com/icu/apiref/ucol_8h.html#a69">setAttributes</a>
API (except for variableTop, which corresponds to the <a href="http://oss.software.ibm.com/icu/apiref/classCollator.html#a21">setVariableTop</a>
API). <i>[Ed. Note: This is temporary, until the textual description is brought
in here]. </i>A basic example is given wherever the setting can also be
expressed with rules in the basic syntax.</p>
<table>
<tbody>
<tr>
<th>Attribute</th>
<th>Options</th>
<th>Basic Example </th>
<th>XML Example</th>
</tr>
<tr>
<td>alternate</td>
<td><i>non-ignorable</i><br>
shifted</td>
<td><font color="#000000"><code>[alternate non-ignorable]</code></font></td>
<td><code>alternate="non-ignorable"</code></td>
</tr>
<tr>
<td>backwards</td>
<td>on<br>
<i>off</i></td>
<td><font color="#000000"><code>[backwards on] </code></font></td>
<td><code>backwards="on"</code></td>
</tr>
<tr>
<td>normalization</td>
<td>on<br>
off</td>
<td><font color="#000000"><code>[normalization on] </code></font></td>
<td><code>normalization="off"</code></td>
</tr>
<tr>
<td>caseLevel</td>
<td>on<br>
off</td>
<td><font color="#000000"><code>[caseLevel on]</code></font></td>
<td><code>caseLevel="off"</code></td>
</tr>
<tr>
<td>caseFirst</td>
<td>upper<br>
lower<br>
off</td>
<td><font color="#000000"><code>[caseFirst off]</code></font></td>
<td><code>caseFirst="off"</code></td>
</tr>
<tr>
<td>hiraganaQ</td>
<td>on<br>
off</td>
<td><code>[hiraganaQ on]</code></td>
<td><code>hiraganaQuarternary="on"</code></td>
</tr>
<tr>
<td><font color="#000000">strength</font></td>
<td>primary (1)<br>
secondary (2)<br>
tertiary (3)<br>
quarternary (4)<br>
identical (5)</td>
<td><code>[strength 1]</code></td>
<td><code>strength="primary"</code></td>
</tr>
<tr>
<td>variableTop<sup>1</sup></td>
<td><font color="#000000">at character(s)<br>
before character(s)<br>
after character(s)</font></td>
<td><code>&amp; x = [variable top]</code></td>
<td><code>variableTopAfter="x"</code></td>
</tr>
</tbody>
</table>
<blockquote>
<p><b>Issue:</b> This syntax might limit the characters in variableTop, since
attributes can't handle all characters. Perhaps this needs to be a separate
element.</p>
<ol>
<li>The default value for variableTop depends on the UCA setting. For
example, in 3.1.1d1, the value is:<br>
U+1D7C3 MATHEMATICAL SANS-SERIF BOLD ITALIC PARTIAL DIFFERENTIAL. See
below for the layout.</li>
</ol>
</blockquote>
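<p>The correspondence between the XML attributes and the bracketed basic syntax
in the table above can be sketched in Python as follows. This is only an
illustration: the function name <code>settings_to_basic</code> is hypothetical,
the attribute names are taken from the XML Example column (so
<code>hiraganaQuarternary</code> is assumed to be the XML spelling of
hiraganaQ), and variableTop is left out because its attribute form is still an
open issue.</p>

```python
# Numeric levels as listed in the Options column for "strength".
STRENGTH_LEVELS = {
    "primary": 1,
    "secondary": 2,
    "tertiary": 3,
    "quarternary": 4,
    "identical": 5,
}

# Attributes whose basic form is simply "[name value]".
BRACKETED = ("alternate", "backwards", "normalization", "caseLevel", "caseFirst")


def settings_to_basic(attrs):
    """Translate a dict of settings attributes into basic-syntax options.

    Hypothetical helper; attributes that are absent are skipped, which
    corresponds to "use the default of the base".
    """
    parts = []
    for name in BRACKETED:
        if name in attrs:
            parts.append("[%s %s]" % (name, attrs[name]))
    if "hiraganaQuarternary" in attrs:
        parts.append("[hiraganaQ %s]" % attrs["hiraganaQuarternary"])
    if "strength" in attrs:
        parts.append("[strength %d]" % STRENGTH_LEVELS[attrs["strength"]])
    return parts


print(settings_to_basic({"strength": "secondary", "caseFirst": "upper"}))
```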
<h2><a name="Rules">Rules</a></h2>
<p>The rules section, if there is one, contains rules that tailor whatever was
in the base. The rule syntax, while valid XML, is somewhat unusual. The goal is
to have clearly expressed rules in a concise format that parallels the basic
syntax as much as possible.</p>
<h3><a name="Orderings">Orderings</a></h3>
<p>The following are the normal orderings used for the bulk of characters.</p>
<table>
<tr>
<th>Basic Symbol</th>
<th>Basic Example</th>
<th>XML Symbol</th>
<th>XML Example</th>
<th>Description</th>
</tr>
<tr>
<td align="center"><code>&lt; </code></td>
<td><code>a &lt; b </code></td>
<td><code>&lt;p/&gt;</code></td>
<td><code>a &lt;p/&gt; b</code></td>
<td>Make 'b' sort after 'a', as a <i>primary</i> (base-character) difference</td>
</tr>
<tr>
<td align="center"><code>&lt;&lt; </code></td>
<td><code>a &lt;&lt; &auml; </code></td>
<td><code>&lt;s/&gt;</code></td>
<td><code>a &lt;s/&gt; &auml;</code></td>
<td>Make '&auml;' sort after 'a' as a <i>secondary</i> (accent) difference</td>
</tr>
<tr>
<td align="center"><code>&lt;&lt;&lt; </code></td>
<td><code>a &lt;&lt;&lt; A </code></td>
<td><code>&lt;t/&gt;</code></td>
<td><code>a &lt;t/&gt; A</code></td>
<td>Make 'A' sort after 'a' as a <i>tertiary</i> (case) difference</td>
</tr>
<tr>
<td align="center"><code>= </code></td>
<td><code>v = w </code></td>
<td><code>&lt;eq/&gt;</code></td>
<td><code>v &lt;eq/&gt; w</code></td>
<td>Make 'w' sort exactly the same as 'v'</td>
</tr>
<tr>
<td align="center"><code>&amp; </code></td>
<td><code>&amp; Z </code></td>
<td><code>&lt;reset/&gt;</code></td>
<td><code>&lt;reset/&gt; Z</code></td>
<td>Don't change the ordering of Z, but place subsequent characters relative
to it.</td>
</tr>
</table>
<p>Note that each character is placed relative to the characters <i>before</i>
it. Thus the following means "change the weight of W so that it comes after
Z, with a primary difference".</p>
<blockquote>
<pre>&lt;reset/&gt; Z &lt;p/&gt; W</pre>
</blockquote>
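<p>Because the operators are empty elements and the characters are plain text
between them, a rules element can be translated back to the basic syntax by
walking the child elements and reading the text that follows each one. The
Python sketch below illustrates this for the five symbols in the table above
only. The function name <code>rules_to_basic</code> is hypothetical, and
whitespace around each run of characters is assumed to be insignificant.</p>

```python
import xml.etree.ElementTree as ET

# XML symbol -> basic symbol, as listed in the Orderings table.
OPERATORS = {"reset": "&", "p": "<", "s": "<<", "t": "<<<", "eq": "="}


def rules_to_basic(rules_xml):
    """Convert a <rules> element (given as text) to a basic-syntax rule string.

    Hypothetical helper covering only reset, p, s, t and eq.
    """
    root = ET.fromstring(rules_xml)
    tokens = []
    for child in root:
        if child.tag not in OPERATORS:
            raise ValueError("unsupported rule element: " + child.tag)
        tokens.append(OPERATORS[child.tag])
        # The characters being ordered are the text after the empty element.
        operand = (child.tail or "").strip()
        if operand:
            tokens.append(operand)
    return " ".join(tokens)


print(rules_to_basic("<rules><reset/> Z <p/> W</rules>"))  # prints: & Z < W
```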
<h3><a name="Escaping_Characters">Escaping Characters</a></h3>
<p>Unfortunately, XML cannot contain all Unicode code points. Extra syntax is
therefore required to represent those code points that cannot otherwise be
represented. This corresponds to the quoting mechanism used in the basic
syntax. It must also be used where spaces are significant (otherwise they are
stripped).</p>
<table>
<tr>
<th>Basic Example</th>
<th>XML Example</th>
</tr>
<tr>
<td><code>'\u0000'</code></td>
<td><code>&lt;cp hex="0"/&gt;</code></td>
</tr>
</table>
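<p>A reader of the XML form has to turn each cp element back into the code point
it names. The following Python sketch shows one way to do that for a single run
of characters. The wrapper element <code>operand</code> and the function name
<code>expand_cp</code> are hypothetical and used only to make the fragment
parseable on its own.</p>

```python
import xml.etree.ElementTree as ET


def expand_cp(fragment_xml):
    """Return the text of a fragment with each <cp hex="..."/> expanded.

    Hypothetical helper: literal text is kept as-is, and each cp element
    is replaced by the character whose code point is given in hex.
    """
    root = ET.fromstring(fragment_xml)
    pieces = [root.text or ""]
    for child in root:
        if child.tag != "cp":
            raise ValueError("unexpected element: " + child.tag)
        pieces.append(chr(int(child.attrib["hex"], 16)))
        pieces.append(child.tail or "")
    return "".join(pieces)


# A significant space between 'a' and 'b', written as an escaped code point.
print(repr(expand_cp('<operand>a<cp hex="20"/>b</operand>')))
```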
<h3><a name="Contractions">Contractions</a></h3>
<p>To sort a sequence as a single item (contraction), just use the sequence,
e.g.</p>
<table>
<tr>
<th>Basic Example</th>
<th>XML Example</th>
<th>Description</th>
</tr>
<tr>
<td><code>&amp; k &lt; ch</code></td>
<td><code>&lt;reset/&gt; k &lt;p/&gt; ch</code></td>
<td>Make the sequence 'ch' sort after 'k', as a primary (base-character)
difference</td>
</tr>
</table>
<h3><a name="Expansions">Expansions</a></h3>
<p>There are two ways to handle expansions (where a character sorts as a
sequence) with both the basic syntax and the XML syntax. The first method is to
reset to the sequence of characters. The second is to use the extension
sequence. Both are equivalent in practice (unless the reset sequence happens to
be a contraction).</p>
<table>
<tr>
<th>Basic</th>
<th>XML</th>
<th>Description</th>
</tr>
<tr>
<td><code>&amp; ae &lt; &auml;</code></td>
<td><code>&lt;reset/&gt; ae &lt;p/&gt; &auml;</code></td>
<td>Make '&auml;' sort after the sequence 'ae'; thus '&auml;' will behave as if it
expands to a character after 'a' followed by an 'e' (unless 'ae' is
defined beforehand as a contraction).</td>
</tr>
<tr>
<td><code>&amp; a &lt; &auml; / e</code></td>
<td><code>&lt;reset/&gt; a &lt;p/&gt; &auml; &lt;x/&gt; e</code></td>
<td>Make '&auml;' sort after 'a', with the extension 'e'; thus '&auml;' will behave
as if it expands to a character after 'a' followed by an 'e'.</td>
</tr>
</table>
<p>In the basic syntax, you can reset variable top by treating it as if it were
a character. In XML, it is always an option on settings, as described above.</p>
<h3><a name="Context_Before">Context Before</a></h3>
<p>The context before a character can affect how it is ordered, such as in
Japanese. This could be expressed with a combination of contractions and
expansions, but is faster using a context. (The actual weights produced are
different, but the resulting string comparisons are the same.)</p>
<table>
<tr>
<th>Basic</th>
<th>XML</th>
</tr>
<tr>
<td><code>&amp; ァ<br>
&lt;&lt;&lt; ァ | ー<br>
= ァ | ー<br>
= ぁ | ー</code></td>
<td><code>&lt;reset/&gt; ァ<br>
&lt;t/&gt; ァ &lt;context/&gt; ー<br>
&lt;eq/&gt; ァ &lt;context/&gt; ー<br>
&lt;eq/&gt; ぁ &lt;context/&gt; ー</code></td>
</tr>
</table>
<h3><a name="Placing_Characters_Before_Others">Placing Characters Before Others</a></h3>
<p>There are certain circumstances where characters need to be placed before a
given character, rather than after. This is the case with Pinyin, for example,
where certain accented letters are positioned before the base letter. That is
accomplished with the following syntax.</p>
<table>
<tbody>
<tr>
<th>Item</th>
<th>Options</th>
<th>Basic Example </th>
<th>XML Example</th>
</tr>
<tr>
<td>before </td>
<td>primary<br>
secondary<br>
tertiary<br>
identical</td>
<td><code>&amp; [before 1] a<br>
&lt;&lt; &agrave;</code></td>
<td><code>&lt;reset before="primary"/&gt; a<br>
&lt;s/&gt; &agrave;</code></td>
</tr>
</tbody>
</table>
<h3><a name="Logical_Reset_Positions">Logical Reset Positions</a></h3>
<p>The UCA has the following structure for primary weights, going from low to
high.</p>
<table>
<tr>
<th valign="top" align="center" bgcolor="#CCCCFF">Items</th>
<th valign="top" align="center" bgcolor="#CCCCFF">Description</th>
<th valign="top" align="center" bgcolor="#CCCCFF">UCA Examples</th>
</tr>
<tr>
<td>first tertiary ignorable<br>
...<br>
last tertiary ignorable</td>
<td>primary, secondary, tertiary weights = ignore</td>
<td>Control Codes<br>
Format Characters<br>
Hebrew Points<br>
Tibetan Signs<br>
...</td>
</tr>
<tr>
<td>first secondary ignorable<br>
...<br>
last secondary ignorable</td>
<td>primary, secondary weights = ignore</td>
<td>None in UCA</td>
</tr>
<tr>
<td>first primary ignorable<br>
...<br>
last primary ignorable</td>
<td>primary weights = ignore</td>
<td>Most combining marks</td>
</tr>
<tr>
<td>first variable<br>
...<br>
last variable</td>
<td>primary weights != ignore,<br>
<i> <b>if</b> alternate = non-ignorable<br>
</i><br>
primary, secondary, tertiary weights = ignore,<br>
<i><b>if</b> alternate = shifted</i></td>
<td>Whitespace,<br>
Punctuation,<br>
Symbols</td>
</tr>
<tr>
<td>first non-ignorable<br>
...<br>
last non-ignorable</td>
<td>primary weights != ignore</td>
<td>Small number of exceptional symbols<br>
[e.g. U+02D0 MODIFIER LETTER TRIANGULAR COLON]<br>
Numbers<br>
Latin<br>
Greek<br>
...</td>
</tr>
<tr>
<td><i>implicits</i></td>
<td>primary weights != ignore,<br>
<i>assigned automatically</i></td>
<td>CJK, CJK compatibility (that are not decomposed)<br>
CJK Extension A, B<br>
Unassigned</td>
</tr>
<tr>
<td>first trailing<br>
...<br>
last trailing</td>
<td>primary weights != ignore,<br>
<i>used for trailing syllable components</i></td>
<td>Jamo Trailing<br>
Jamo Leading</td>
</tr>
</table>
<p>Each of the above values (except <i>implicits</i>) can be used with a reset
to position characters after (or before) that logical position. That allows
characters to be ordered before or after a logical position rather than a
specific character.</p>
<blockquote>
<p>The reason for this is so that tailorings can be more stable. A future
version of the UCA might add characters at any point in the above list.
Suppose that you set character X to be after Y. It could be that you want X to
come after Y, no matter what future characters are added; or it could be that
you just want X to come after a given logical position, e.g. after the last
primary ignorable.</p>
</blockquote>
<p>Here is an example of the syntax:</p>
<table>
<tr>
<th>Basic</th>
<th>XML</th>
</tr>
<tr>
<td><code>&amp; [first tertiary ignorable]<br>
&lt;&lt; &agrave;</code></td>
<td><code>&lt;reset/&gt;&lt;position at="first tertiary
ignorable"/&gt;<br>
&lt;s/&gt; &agrave;</code></td>
</tr>
</table>
<p>For example, to make a character a secondary ignorable, one can place it
immediately after (at a secondary level) a specific character (like a combining
dieresis), or one can place it immediately after the last secondary ignorable.</p>

</body>

</html>