Unicode Character Database (UCD) in XML Format

WARNING: FORMAT IS DRAFT!

MD 2000.10.16

Using Internet Explorer

The UCD-Main.xml file can be read in Internet Explorer (5.0 and above). However:

It may take a few minutes to load completely.
The XML parser in IE does not appear to be conformant: it seems to break on the following valid code points (and others):
<IEbugs
  c1='￹'
  c2='￺'
  c3='￻'
  c4=''
  c5='�'
  c6='󰀀'
  c7='󿿽'
  c8='􀀀'
  c9='􏿽'/>

UCD-Main.xml provides an XML format for the main files in the Unicode Character Database. These include:

UnicodeData.txt
ArabicShaping.txt
Jamo.txt
SpecialCasing.txt
CompositionExclusions.txt
EastAsianWidth.txt
LineBreak.txt
BidiMirroring.txt
CaseFolding.txt
Blocks.txt
PropList.alpha.txt

Other files in the UCD have very different structure or purpose, and are best expressed with separate files. Some annotational data, such as that in NamesList.txt or the 10646 comment in UnicodeData, is also best served with separate files. The current UCD files not yet in XML format are:

Unihan.txt
NamesList.txt
Index.txt
NormalizationTest.txt

Format

The Unicode blocks are provided as a list of <block .../> elements, with attributes providing the start, end, and name.

Each assigned code point is an <e .../> element, with attributes supplying specific properties. The meaning of the attributes is specified below. There is one exception: large ranges of code points for characters such as Hangul Syllables are abbreviated by indicating the start and end of the range.

Because of the volume of data, the attribute names are abbreviated. A key explains the abbreviations, and relates them to the fields and values of the original UCD semicolon-delimited files. With few exceptions, the values in the XML are directly copied from data in the original UCD semicolon-delimited files. Those exceptions are described below.

Numeric character references (NCRs) are used to encode the Unicode code points. Some Unicode code points cannot be transmitted in XML, even as NCRs (see http://www.w3.org/TR/REC-xml#charsets), or would not be visibly distinct (TAB, CR, LF) in the data. Such code points are represented by '#xX;', where X is a hex number.
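As an illustration of the '#xX;' convention, the following minimal Python sketch turns such escapes back into code points after the XML has been parsed. The function name decode_escapes is hypothetical and not part of the UCD. The sketch assumes that any '#x...;' sequence in an attribute value is an escape rather than literal data.

```python
import re

# Matches the '#xX;' form described above, where X is one or more hex digits.
_ESCAPE = re.compile(r"#x([0-9A-Fa-f]+);")

def decode_escapes(value: str) -> str:
    """Replace each '#xX;' escape in an attribute value with its code point.

    Hypothetical helper: assumes every '#x...;' sequence is an escape.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)
```

Ordinary NCRs such as &#x41; are resolved by the XML parser itself, so only the '#xX;' form needs this extra step.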

Attribute Abbreviations

To reduce the size of the document, the following attribute abbreviations are used. If an attribute is missing, it takes a default value. The defaults are listed in parentheses below. If there is no specific default, then a missing attribute should be read as N/A (not applicable). A default with '=' means the default is the value of another field (recursively!). Thus if the titlecase attribute is missing, then the value is the same as the uppercase. If that in turn is missing, then the value is the same as the code point itself.

For a description of the source files, see UnicodeCharacterDatabase.html. That file also has links to the descriptions of the fields within the files. Since the PropList values are so long, they will probably also be abbreviated in the future.

UnicodeData:

c: code point
n: name
gc: general category (Lo)
cc: combining class (0)
bc: bidi category (L)
dm: decomposition mapping
dt: decomposition type (canonical)
nt: numeric type
nv: numeric value
bm: bidi mirrored (N)
uc: uppercase (=c)
lc: lowercase (=c)
tc: titlecase (=uc)

SpecialCasing:

sl: special lower (=lc)
su: special upper (=uc)
st: special title (=su)
sc: special case condition

CaseFolding:

fc: foldcase (=sl)

CompositionExclusions:

ce: composition exclusion (N)

EastAsianWidth:

ea: east asian width (N)

Jamo:

jn: jamo name

LineBreak:

lb: line break class (AL)

ArabicShaping:

jt: joining type
jg: joining group

BidiMirroring:

bg: bidi mirroring glyph (=c)

PropList:

xs: space-delimited list of properties from the file

WARNING: these values are likely to change!
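The default rules in the key above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the DEFAULTS table is transcribed from the parenthesized defaults in the key, the function name resolve is hypothetical, and an element's attributes are assumed to be available as a plain dictionary.

```python
# Defaults transcribed from the key: either a literal value, or "=x",
# meaning "use the value of attribute x" (applied recursively).
DEFAULTS = {
    "gc": "Lo", "cc": "0", "bc": "L", "dt": "canonical", "bm": "N",
    "uc": "=c", "lc": "=c", "tc": "=uc",
    "sl": "=lc", "su": "=uc", "st": "=su",
    "fc": "=sl",
    "ce": "N", "ea": "N", "lb": "AL",
    "bg": "=c",
}

def resolve(attrs: dict, name: str):
    """Return the effective value of attribute `name` for one element.

    `attrs` holds the attributes actually present on the element.
    Returns None when the attribute is absent and has no default (N/A).
    """
    if name in attrs:
        return attrs[name]
    default = DEFAULTS.get(name)
    if default is None:
        return None  # no specific default: not applicable
    if default.startswith("="):
        # The default is another attribute's value; follow the chain.
        return resolve(attrs, default[1:])
    return default
```

For example, resolving tc on an element that carries only c follows tc to uc to c and returns the code point itself. Note that the dt default only has meaning when a decomposition (dm) is present.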

Data Modifications

The XML format is generated from the original semicolon-delimited UCD files. In general, all fields and values are direct copies. However, there are some changes, detailed below.

1. Some redundant or annotational fields are omitted

UnicodeData
1.0 Name
10646 comment

CaseFolding
Type (since it is computable from whether the fold equals the normal lowercase)

ArabicShaping
Name

EastAsianWidth
Name

LineBreak
Name

PropList

The fields are based on the proposed PropList.alpha, which changes the fields considerably.

WARNING: other values are also likely to change!

2. Some fields are broken into several fields; others may be combined into a single field

dt: decomposition type
- the 'tag' field extracted from the decomposition mapping. If there is no tag, the value is "canonical". The attribute has meaning only if there is a decomposition (dm).
nt: numeric type
- an enumeration [decimal, digit, numeric] for the type of number. It replaces the duplicate field values that the original format uses for numbers.
rg: range
- used for ranges of values that share characteristics, so that a range does not have to be detected with a substring check on the name.
  "START" corresponds to "<..., First>"
  "END" corresponds to "<..., Last>"
nc: name computed
- if "COMPUTED", indicates that the name must be computed: e.g. Hangul Syllables, Ideographs
na: name annotation
- used for code points that do not really have associated names, like control characters and private use characters. The data in that case is either extracted from the "<...>" style name in the old format, or taken from the "1.0 Unicode name".
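
The rg attribute described above can be consumed along the following lines. This is a sketch under stated assumptions: the elements are taken to arrive in document order as (code point, attributes) pairs, each "START" element is assumed to be followed by its matching "END" element, and the function name collapse_ranges is hypothetical.

```python
def collapse_ranges(entries):
    """Yield (first, last, attrs) spans from (code_point, attrs) pairs.

    A pair of elements marked rg="START" and rg="END" becomes one span
    carrying the START element's attributes; any other element becomes
    a span of length one.
    """
    start = None
    for cp, attrs in entries:
        rg = attrs.get("rg")
        if rg == "START":
            start = (cp, attrs)
        elif rg == "END":
            if start is None:
                raise ValueError("END without a preceding START")
            yield start[0], cp, start[1]
            start = None
        else:
            yield cp, cp, attrs
```

A caller can then test whether a code point falls inside a span with an ordinary comparison, and compute the name when the span's attributes include nc="COMPUTED".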