227 lines
9.0 KiB
HTML
227 lines
9.0 KiB
HTML
|
<html>
|
||
|
|
||
|
<head>
|
||
|
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
|
||
|
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
|
||
|
<meta name="ProgId" content="FrontPage.Editor.Document">
|
||
|
<title>Unicode Character Database</title>
|
||
|
<style>
|
||
|
<!--
|
||
|
table { padding: 4 }
|
||
|
td { padding: 4 }
|
||
|
-->
|
||
|
</style>
|
||
|
</head>
|
||
|
|
||
|
<body>
|
||
|
|
||
|
<span class="cb" id style="DISPLAY: block">
|
||
|
<h1 align="center">Unicode Character Database (UCD) in XML Format</h1>
|
||
|
<h1 align="center"><b><font color="#FF0000">WARNING: FORMAT IS DRAFT!</font></b></h1>
|
||
|
<p align="center">MD 2000.10.16</p>
|
||
|
<table border="1" width="40%" align="right" cellspacing="4" cellpadding="0">
|
||
|
<tr>
|
||
|
<td width="100%" bgcolor="#C0C0C0"><span class="cb" id
|
||
|
style="DISPLAY: block">
|
||
|
<h4 align="center">Using Internet Explorer</h4>
|
||
|
<p>The UCD-Main.xml file can be read in Internet Explorer (5.0 and above).
|
||
|
However:</p>
|
||
|
<ul>
|
||
|
<li>It may take a few minutes to load completely.</li>
|
||
|
<li>The XML parser in IE does not appear to be conformant: it seems to
|
||
|
break on</span> the following valid code points (and others):
|
||
|
<ul>
|
||
|
<li><IEbugs<br>
|
||
|
c1='&#xFFF9;'<br>
|
||
|
c2='&#xFFFA;'<br>
|
||
|
c3='&#xFFFB;'<br>
|
||
|
c4='&#xFFFC;'<br>
|
||
|
c5='&#xFFFD;'<br>
|
||
|
c6='&#xF0000;'<br>
|
||
|
c7='&#xFFFFD;'<br>
|
||
|
c8='&#x100000;'<br>
|
||
|
c9='&#x10FFFD;'/></li>
|
||
|
</ul>
|
||
|
</li>
|
||
|
</ul>
|
||
|
</td>
|
||
|
</tr>
|
||
|
</table>
|
||
|
<p><a href="UCD-Main.xml">UCD-Main.xml</a> provides an XML format for the main
|
||
|
files in the Unicode Character Database. These include:</p>
|
||
|
<ul>
|
||
|
<li><code>UnicodeData.txt</code></li>
|
||
|
<li><code>ArabicShaping.txt</code></li>
|
||
|
<li><code>Jamo.txt</code></li>
|
||
|
<li><code>SpecialCasing.txt</code></li>
|
||
|
<li><code>CompositionExclusions.txt</code></li>
|
||
|
<li><code>EastAsianWidth.txt</code></li>
|
||
|
<li><code>LineBreak.txt</code></li>
|
||
|
<li><code>BidiMirroring.txt</code></li>
|
||
|
<li><code>CaseFolding.txt</code></li>
|
||
|
<li><code>Blocks.txt</code></li>
|
||
|
<li><code>PropList.alpha.txt</code></li>
|
||
|
</ul>
|
||
|
<p>Other files in the UCD have very different structure or purpose, and are best
|
||
|
expressed with separate files. Some annotational data, such as that in
|
||
|
NamesList.txt or the 10646 comment in UnicodeData, is also best served with
|
||
|
separate files. The current UCD files not yet in XML format are:</p>
|
||
|
<ul>
|
||
|
<li><code>Unihan.txt</code></li>
|
||
|
<li><code>NamesList.txt</code></li>
|
||
|
<li><code>Index.txt</code></li>
|
||
|
<li><code>NormalizationTest.txt</code></li>
|
||
|
</ul>
|
||
|
<h3>Format</h3>
|
||
|
<p>The Unicode blocks are provided as a list of <block .../> elements,
|
||
|
with attributes providing the start, end, and name.</p>
|
||
|
<p>Each assigned code point is a <e .../> element, with attributes
|
||
|
supplying specific properties. The meaning of the attributes is specified below.
|
||
|
There is one exception: large ranges of code points for characters such as
|
||
|
Hangul Syllables are abbreviated by indicating the start and end of the range.</p>
|
||
|
<p>Because of the volume of data, the attribute names are abbreviated. A <a
|
||
|
href="#AttributeAbbreviations">key</a> explains the abbreviations, and relates
|
||
|
them to the fields and values of the original UCD semicolon-delimited files.
|
||
|
With few exceptions, the values in the XML are directly copied from data in the
|
||
|
original UCD semicolon-delimited files. Those exceptions are described <a
|
||
|
href="http://www.unicode.org/Public/3.0-Update1/UnicodeCharacterDatabase-3.0.1.html#DataModifications">below</a>.</p>
|
||
|
<p>Numeric character references (NCRs) are used to encode the Unicode code
|
||
|
points. Some Unicode code points cannot be transmitted in XML, even as NCRs (see
|
||
|
<a href="http://www.w3.org/TR/REC-xml#charsets">http://www.w3.org/TR/REC-xml#charsets</a>),
|
||
|
or would not be visibly distinct (TAB, CR, LF) in the data. Such code points are
|
||
|
represented by '#xX;', where X is a hex number.</p>
|
||
|
<h3><a name="AttributeAbbreviations">Attribute Abbreviations</a></h3>
|
||
|
<p>To reduce the size of the document, the following attribute abbreviations are
|
||
|
used. If an attribute is missing, that means it gets a default value. The
|
||
|
defaults are listed in parentheses below. If there is no specific default, then
|
||
|
a missing attribute should be read as N/A (not applicable). A default with '='
|
||
|
means the default is the value of another other field (recursively!). Thus if
|
||
|
the titlecase attribute is missing, then the value is the same as the uppercase.
|
||
|
If that in turn is missing, then the value is the same as the code point itself.</p>
|
||
|
<p>For a description of the source files, see <a
|
||
|
href="http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html">UnicodeCharacterDatabase.html</a>.
|
||
|
That file also has links to the descriptions of the fields within the files.
|
||
|
Since the PropList values are so long, they will probably also be abbreviated in
|
||
|
the future.</p>
|
||
|
<table border="1" width="100%">
|
||
|
<tr>
|
||
|
<td width="50%" valign="top"><span class="cb" id style="DISPLAY: block">
|
||
|
<h4>UnicodeData</h4>
|
||
|
<p> c: code point<br>
|
||
|
n: name<br>
|
||
|
gc: general category (Lo)<br>
|
||
|
cc: combining class (0)<br>
|
||
|
bc: bidi category (L)<br>
|
||
|
dm: decomposition mapping<br>
|
||
|
dt: decomposition type (canonical)<br>
|
||
|
nt: numeric type<br>
|
||
|
nv: numeric value<br>
|
||
|
bm: bidi mirrored (N)<br>
|
||
|
uc: uppercase (=c)<br>
|
||
|
lc: lowercase (=c)<br>
|
||
|
tc: titlecase (=uc)</p>
|
||
|
<h4>SpecialCasing:</h4>
|
||
|
<p> sl: special lower (=lc)<br>
|
||
|
su: special upper (=uc)<br>
|
||
|
st: special title (=su)<br>
|
||
|
sc: special case condition</p>
|
||
|
<h4>CaseFolding:</h4>
|
||
|
<p> fc: foldcase (=sl)</span></td>
|
||
|
<td width="50%" valign="top"><span class="cb" id style="DISPLAY: block">
|
||
|
<h4>CompositionExclusions:</h4>
|
||
|
<p> ce: composition exclusion (N)</p>
|
||
|
<h4>EastAsianWidth:</h4>
|
||
|
<p> ea: east asian width (N)</p>
|
||
|
<h4>Jamo:</h4>
|
||
|
<p> jn: jamo name</p>
|
||
|
<h4>LineBreak:</h4>
|
||
|
<p> lb: line break class (AL)</p>
|
||
|
<h4>ArabicShaping:</h4>
|
||
|
<p> jt: joining type<br>
|
||
|
jg: joining group</p>
|
||
|
<h4>BidiMirroring:</h4>
|
||
|
<p> bg: bidi mirroring glyph (=c)</p>
|
||
|
<p><b>PropList:</b></p>
|
||
|
<p> xs: space-delimited list of properties from the file</p>
|
||
|
<p><b><i>WARNING: these values are likely to change!</i></b></span></td>
|
||
|
</tr>
|
||
|
</table>
|
||
|
<br>
|
||
|
<h3><a name="DataModifications">Data Modifications</a></h3>
|
||
|
</span>
|
||
|
<p>The XML format is generated from the original semicolon-delimited UCD files.
|
||
|
In general, all fields and values are direct copies. However, there are some
|
||
|
changes, detailed below.</p>
|
||
|
<h4>1. Some redundant or annotational fields are omitted</h4>
|
||
|
<table border="1" width="100%">
|
||
|
<tr>
|
||
|
<td width="50%" valign="top"><b>UnicodeData<br>
|
||
|
</b>1.0 Name<br>
|
||
|
10646 comment<br>
|
||
|
<br>
|
||
|
<b>CaseFolding<br>
|
||
|
</b>Type (since it is computable from whether the fold equals the normal
|
||
|
lowercase)
|
||
|
<p><b>ArabicShaping<br>
|
||
|
</b>Name<br>
|
||
|
<br>
|
||
|
<b>EastAsianWidth<br>
|
||
|
</b>Name<br>
|
||
|
<br>
|
||
|
<b>LineBreak<br>
|
||
|
</b>Name</p>
|
||
|
</td>
|
||
|
<td width="50%" valign="top"><b>PropList</b><font face="Times New Roman"
|
||
|
color="#000000">
|
||
|
<p>The fields are based on the proposed PropList.alpha, which changes the
|
||
|
fields considerably.</p>
|
||
|
</font>
|
||
|
<p><span class="cb" id style="display: block"><b><i>WARNING: other values
|
||
|
are also likely to change!</i></b></span></p>
|
||
|
</td>
|
||
|
</tr>
|
||
|
</table>
|
||
|
<h4>2. Some fields are broken into several fields; others may be combined into a
|
||
|
single field</h4>
|
||
|
<ul>
|
||
|
<li><b>dt: </b>decomposition tag
|
||
|
<ul>
|
||
|
<li>the 'tag' field extracted from the decomposition mapping. If there is
|
||
|
no tag, the value is "canonical". Only has meaning if there is
|
||
|
a decomposition (<b>dm</b>).</li>
|
||
|
</ul>
|
||
|
</li>
|
||
|
<li><b>nt: </b>numeric type
|
||
|
<ul>
|
||
|
<li>an enumeration [decimal, digit, numeric] for the type of number. It
|
||
|
replaces having duplicate field values for numbers</li>
|
||
|
</ul>
|
||
|
</li>
|
||
|
<li><b>rg: </b>range
|
||
|
<ul>
|
||
|
<li>used for ranges of values that share characteristics, instead of
|
||
|
having to do a substring check.<br>
|
||
|
"START" corresponds to "<..., First>"<br>
|
||
|
"END" corresponds to "<..., Last>"</li>
|
||
|
</ul>
|
||
|
</li>
|
||
|
<li><b>nc: </b>name computed
|
||
|
<ul>
|
||
|
<li>if "COMPUTED", indicates that the name must be computed:
|
||
|
e.g. Hangul Syllables, Ideographs</li>
|
||
|
</ul>
|
||
|
</li>
|
||
|
<li><b>na: </b>name annotation
|
||
|
<ul>
|
||
|
<li>used for code points that do not really have associated names, like
|
||
|
control characters and private use characters. The data in that case is
|
||
|
either extracted from the "<...>" style name in the old
|
||
|
format, or gotten from the "1.0 Unicode name".</li>
|
||
|
</ul>
|
||
|
</li>
|
||
|
</ul>
|
||
|
|
||
|
</body>
|
||
|
|
||
|
</html>
|