Unicode Character Database (UCD) in XML Format

WARNING: FORMAT IS DRAFT!

MD 2000.10.16

Using Internet Explorer

The UCD-Main.xml file can be read in Internet Explorer (5.0 and above). However:

  • It may take a few minutes to load completely.
  • The XML parser in IE does not appear to be conformant: it seems to break on the following valid code points (and others):
    • <IEbugs
      c1='&#xFFF9;'
      c2='&#xFFFA;'
      c3='&#xFFFB;'
      c4='&#xFFFC;'
      c5='&#xFFFD;'
      c6='&#xF0000;'
      c7='&#xFFFFD;'
      c8='&#x100000;'
      c9='&#x10FFFD;'/>

UCD-Main.xml provides an XML format for the main files in the Unicode Character Database. These include:

Other files in the UCD have very different structure or purpose, and are best expressed with separate files. Some annotational data, such as that in NamesList.txt or the 10646 comment in UnicodeData, is also best served with separate files. The current UCD files not yet in XML format are:

Format

The Unicode blocks are provided as a list of <block .../> elements, with attributes providing the start, end, and name.

Each assigned code point is an <e .../> element, with attributes supplying specific properties. The meaning of the attributes is specified below. There is one exception: large ranges of code points for characters such as Hangul Syllables are abbreviated by indicating the start and end of the range.

Because of the volume of data, the attribute names are abbreviated. A key explains the abbreviations, and relates them to the fields and values of the original UCD semicolon-delimited files. With few exceptions, the values in the XML are directly copied from data in the original UCD semicolon-delimited files. Those exceptions are described below.
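As a rough illustration of the layout described above, the following Python sketch reads <block .../> and <e .../> elements from a tiny inline sample. The root element name (ucd), the attribute spellings on <block> (start, end, name), and the sample values are assumptions made for the example; only the element names and the abbreviated attributes c, n, gc, uc, and lc come from this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the root element and the block attribute
# spellings are assumed, not taken from UCD-Main.xml.
SAMPLE = """<ucd>
  <block start="0000" end="007F" name="Basic Latin"/>
  <e c="&#x41;" n="LATIN CAPITAL LETTER A" gc="Lu" lc="&#x61;"/>
  <e c="&#x61;" n="LATIN SMALL LETTER A" gc="Ll" uc="&#x41;"/>
</ucd>"""

root = ET.fromstring(SAMPLE)

# One (start, end, name) tuple per <block .../> element.
blocks = [(b.get("start"), b.get("end"), b.get("name"))
          for b in root.iter("block")]

# Map each code point (the parser has already resolved the NCR
# to the character itself) to its dictionary of attributes.
chars = {e.get("c"): dict(e.attrib) for e in root.iter("e")}
```

Keying the dictionary on the c attribute keeps later lookups simple, but it does not handle the abbreviated ranges mentioned above, whose exact representation is not shown here.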

Numeric character references (NCRs) are used to encode the Unicode code points. Some Unicode code points cannot be transmitted in XML, even as NCRs (see http://www.w3.org/TR/REC-xml#charsets), or would not be visibly distinct (TAB, CR, LF) in the data. Such code points are represented by '#xX;', where X is a hex number.
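A reader of the file would need to turn the '#xX;' representation back into the code point. The Python sketch below shows one possible way, assuming the escape appears literally inside an attribute value and that no attribute value contains the same pattern as ordinary text. The function name decode_escapes is hypothetical.

```python
import re

# Matches the '#xX;' form described above, where X is a hex number.
_ESCAPE = re.compile(r"#x([0-9A-Fa-f]+);")

def decode_escapes(value):
    """Replace each '#xX;' in an attribute value with the code point X."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)
```

For example, decode_escapes("#x9;") yields a TAB character, while a value with no escape is returned unchanged.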

Attribute Abbreviations

To reduce the size of the document, the following attribute abbreviations are used. A missing attribute takes its default value; the defaults are listed in parentheses below. If there is no specific default, then a missing attribute should be read as N/A (not applicable). A default with '=' means the default is the value of another field (recursively!). Thus if the titlecase attribute is missing, then the value is the same as the uppercase. If that in turn is missing, then the value is the same as the code point itself.
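The recursive default rule can be sketched in Python as follows. The table holds a subset of the defaults from the key in this section; the names DEFAULTS and resolve are hypothetical, and attrs is assumed to be a plain dictionary of the attributes present on one element.

```python
# A subset of the defaults from the key in this section.
# A value starting with "=" means "use the value of that attribute".
DEFAULTS = {
    "gc": "Lo",
    "cc": "0",
    "bc": "L",
    "bm": "N",
    "uc": "=c",
    "lc": "=c",
    "tc": "=uc",
    "sl": "=lc",
    "su": "=uc",
    "st": "=su",
    "fc": "=sl",
}

def resolve(attrs, name):
    """Return the effective value of attribute `name`, or None for N/A."""
    while name not in attrs:
        default = DEFAULTS.get(name)
        if default is None:
            return None          # no specific default: not applicable
        if not default.startswith("="):
            return default       # literal default value
        name = default[1:]       # follow the reference to another attribute
    return attrs[name]
```

With attrs = {"c": "a", "uc": "A"}, resolve(attrs, "tc") follows tc to uc and returns "A"; with only {"c": "1"}, it follows tc to uc to c and returns "1".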

For a description of the source files, see UnicodeCharacterDatabase.html. That file also has links to the descriptions of the fields within the files. Since the PropList values are so long, they will probably also be abbreviated in the future.

UnicodeData:

  c: code point
  n: name
  gc: general category (Lo)
  cc: combining class (0)
  bc: bidi category (L)
  dm: decomposition mapping
  dt: decomposition type (canonical)
  nt: numeric type
  nv: numeric value
  bm: bidi mirrored (N)
  uc: uppercase (=c)
  lc: lowercase (=c)
  tc: titlecase (=uc)

SpecialCasing:

  sl: special lower (=lc)
  su: special upper (=uc)
  st: special title (=su)
  sc: special case condition

CaseFolding:

  fc: foldcase (=sl)

CompositionExclusions:

  ce: composition exclusion (N)

EastAsianWidth:

  ea: east asian width (N)

Jamo:

  jn: jamo name

LineBreak:

  lb: line break class (AL)

ArabicShaping:

  jt: joining type
  jg: joining group

BidiMirroring:

  bg: bidi mirroring glyph (=c)

PropList:

  xs: space-delimited list of properties from the file

WARNING: these values are likely to change!
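Since xs is a space-delimited list, it can be split into a set for membership tests. This is a minimal Python sketch; the function name properties is hypothetical, the property names in the usage note are placeholders rather than values taken from PropList, and a missing xs attribute is assumed to mean no properties.

```python
def properties(attrs):
    """Return the set of property names in an element's xs attribute."""
    # A missing xs attribute is treated as an empty list (an assumption).
    return set(attrs.get("xs", "").split())
```

For instance, properties({"c": "+", "xs": "Math Symbol"}) gives a set containing the two placeholder names "Math" and "Symbol".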


Data Modifications

The XML format is generated from the original semicolon-delimited UCD files. In general, all fields and values are direct copies. However, there are some changes, detailed below.

1. Some redundant or annotational fields are omitted

UnicodeData
  1.0 Name
  10646 comment

CaseFolding
  Type (since it is computable from whether the fold equals the normal lowercase)

ArabicShaping
  Name

EastAsianWidth
  Name

LineBreak
  Name

PropList

The fields are based on the proposed PropList.alpha, which changes the fields considerably.

WARNING: other values are also likely to change!

2. Some fields are broken into several fields; others may be combined into a single field