MD 2000.10.16
Using Internet Explorer
The UCD-Main.xml file can be read in Internet Explorer (5.0 and above). However:
UCD-Main.xml provides an XML format for the main files in the Unicode Character Database. These include:
UnicodeData.txt
ArabicShaping.txt
Jamo.txt
SpecialCasing.txt
CompositionExclusions.txt
EastAsianWidth.txt
LineBreak.txt
BidiMirroring.txt
CaseFolding.txt
Blocks.txt
PropList.alpha.txt
Other files in the UCD have very different structure or purpose, and are best expressed with separate files. Some annotational data, such as that in NamesList.txt or the 10646 comment in UnicodeData, is also best served with separate files. The current UCD files not yet in XML format are:
Unihan.txt
NamesList.txt
Index.txt
NormalizationTest.txt
The Unicode blocks are provided as a list of <block .../> elements, with attributes providing the start, end, and name.
Each assigned code point is a <e .../> element, with attributes supplying specific properties. The meaning of the attributes is specified below. There is one exception: large ranges of code points for characters such as Hangul Syllables are abbreviated by indicating the start and end of the range.
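As a rough illustration of the structure just described, the following Python sketch reads <block .../> and <e .../> elements from a small made-up sample. The element names block and e and the attribute c come from this document; the root element name (ucd), the block attribute names (start, end, name), and the sample data are assumptions made only for the example and may not match the real UCD-Main.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The root element and the block attribute names
# (start, end, name) are assumptions, not copied from UCD-Main.xml.
SAMPLE = """<ucd>
  <block start="0000" end="007F" name="Basic Latin"/>
  <block start="0080" end="00FF" name="Latin-1 Supplement"/>
  <e c="A"/>
  <e c="&#xE9;"/>
</ucd>"""


def load_blocks(root):
    """Return a list of (start, end, name) tuples with numeric bounds."""
    return [
        (int(b.get("start"), 16), int(b.get("end"), 16), b.get("name"))
        for b in root.iter("block")
    ]


def block_of(cp, blocks):
    """Return the name of the block containing code point cp, or None."""
    for start, end, name in blocks:
        if start <= cp <= end:
            return name
    return None


root = ET.fromstring(SAMPLE)
blocks = load_blocks(root)
for e in root.iter("e"):
    # The XML parser has already resolved the NCR, so c is one character here.
    cp = ord(e.get("c"))
    print(f"U+{cp:04X}", block_of(cp, blocks))
```

The sketch does not handle the abbreviated range entries, because this document does not show which attributes carry the start and end of a range.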
Because of the volume of data, the attribute names are abbreviated. A key explains the abbreviations, and relates them to the fields and values of the original UCD semicolon-delimited files. With few exceptions, the values in the XML are directly copied from data in the original UCD semicolon-delimited files. Those exceptions are described below.
Numeric character references (NCRs) are used to encode the Unicode code points. Some Unicode code points cannot be transmitted in XML, even as NCRs (see http://www.w3.org/TR/REC-xml#charsets), or would not be visibly distinct (TAB, CR, LF) in the data. Such code points are represented by '#xX;', where X is a hex number.
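An XML parser resolves ordinary NCRs on its own, so only the '#xX;' form needs extra handling by the reader. A minimal Python sketch of that step follows; the function name decode_escapes is hypothetical, and the sketch assumes that a literal '#x...;' sequence never appears in the data with any other meaning.

```python
import re

# Matches the '#xX;' form described above, where X is a hex number.
_ESCAPE = re.compile(r"#x([0-9A-Fa-f]+);")


def decode_escapes(value):
    """Replace each '#xX;' escape in value with the code point it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)
```

For example, decode_escapes("#x9;") yields a TAB character, while a value with no escape is returned unchanged.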
To reduce the size of the document, the following attribute abbreviations are used. If an attribute is missing, it takes a default value. The defaults are listed in parentheses below. If there is no specific default, then a missing attribute should be read as N/A (not applicable). A default with '=' means the default is the value of another field (recursively!). Thus if the titlecase attribute is missing, then the value is the same as the uppercase. If that in turn is missing, then the value is the same as the code point itself.
For a description of the source files, see UnicodeCharacterDatabase.html. That file also has links to the descriptions of the fields within the files. Since the PropList values are so long, they will probably also be abbreviated in the future.
UnicodeData
  c: code point
SpecialCasing
  sl: special lower (=lc)
CaseFolding
  fc: foldcase (=sl)
CompositionExclusions
  ce: composition exclusion (N)
EastAsianWidth
  ea: east asian width (N)
Jamo
  jn: jamo name
LineBreak
  lb: line break class (AL)
ArabicShaping
  jt: joining type
BidiMirroring
  bg: bidi mirroring glyph (=c)
PropList
  xs: space-delimited list of properties from the file
  WARNING: these values are likely to change!
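The default rules can be applied mechanically. The Python sketch below is one possible way to do it, not a reference implementation: the DEFAULTS table is transcribed from the key above, the function name resolve is hypothetical, and the entry for lc is an assumption added only so that the chain fc, sl, lc has somewhere to end (the key above does not show a default for lc).

```python
# Defaults transcribed from the key above; "=x" means "same as attribute x".
DEFAULTS = {
    "sl": "=lc",
    "fc": "=sl",
    "bg": "=c",
    "ce": "N",
    "ea": "N",
    "lb": "AL",
    # Assumption for illustration only: not stated in the key above.
    "lc": "=c",
}


def resolve(attrs, name):
    """Return the effective value of attribute name, or None for N/A.

    attrs is a dict of the attributes actually present on one <e> element.
    """
    seen = set()
    while name not in attrs:
        if name in seen:
            raise ValueError("circular default for " + name)
        seen.add(name)
        default = DEFAULTS.get(name)
        if default is None:
            return None  # no specific default: read as N/A
        if not default.startswith("="):
            return default  # fixed default such as N or AL
        name = default[1:]  # follow the '=' reference recursively
    return attrs[name]
```

With only c="A" present, resolve(attrs, "fc") follows fc to sl to lc to c and returns "A"; with lc="a" also present, it returns "a".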
The XML format is generated from the original semicolon-delimited UCD files. In general, all fields and values are direct copies. However, there are some changes, detailed below.
UnicodeData
  1.0 Name
  10646 comment
CaseFolding
  Type (since it is computable from whether the fold equals the normal lowercase)
ArabicShaping
PropList
  The fields are based on the proposed PropList.alpha, which changes the fields considerably. WARNING: other values are also likely to change!