International Components for Unicode

ICU Codepage Conversion

The ICU conversion API is a set of C functions for converting text between Unicode and various character sets (codepages, encodings, character encoding schemes).

Conversion-related files:

Converter types

To handle many kinds of character encoding schemes, ICU has a number of converter implementations, one per type. Some of these types are for purely algorithmic conversions that do not need to load data. For example, the UTF converters calculate Unicode code points from the input bytes, and vice versa. Similarly, the ISO_2022 converter starts without any specific conversion data table and loads one only when it needs to; handling escape sequences and the general structure of ISO 2022 is done with static data.
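To illustrate what "purely algorithmic" means here, the following is a minimal sketch of a UTF-8 decoder in C. It is not ICU's implementation; the function name `utf8_decode` and its return convention are assumptions made for this example only. The point is that the code point is computed directly from the input bytes, with no mapping table involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an ICU function): decode one UTF-8 sequence
 * from s[0..len). Returns the number of bytes consumed and stores the
 * code point in *cp, or returns 0 if the input is malformed or truncated. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t c, min;

    assert(s != NULL && cp != NULL);
    if (len == 0) return 0;

    /* The lead byte alone determines the sequence length. */
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray trail byte or invalid lead byte */

    if (len < n) return 0;          /* truncated sequence */

    /* Each trail byte contributes six more bits. */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;

    *cp = c;
    return n;
}
```

For instance, the three bytes E2 82 AC decode to U+20AC by bit manipulation alone, which is why converters of this kind need no .cnv data file.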

Many other encodings share common characteristics and, by definition, need tables to convert text between them and Unicode. A converter object for such an encoding is instantiated by loading a .cnv data file (typically from the single, common ICU data file) and associating it with a converter type implementation chosen according to the type information in the data.

The following describes specifics about each converter type:

MBCS

The MBCS converter is a data-based converter for Multi-Byte Character Sets. It has been reimplemented for ICU 1.6 to handle a wider range of such encodings. Its current capabilities and limitations are:

The conversion to Unicode uses a state machine to achieve the above capabilities with reasonable data file sizes. The state machine information itself is loaded with the conversion data and defines the structure of the codepage, including which byte sequences are valid, unassigned, and illegal. This data cannot (or not easily) be computed from the pure mapping data. Instead, the .ucm files for MBCS encodings have additional entries that are specific to ICU's makeconv and this converter type. They are additional header lines that start with <icu:state>. Each such line defines one state of the state machine. The state machine uses a table of as many rows as there are states (= as many as there are <icu:state> lines). Each row has 256 entries, one for each possible byte value.

The state table lines in the .ucm header use the following EBNF-like grammar (whitespace is allowed between all tokens):

    row=[firstentry ','] entry (',' entry)*
    firstentry="initial" | "surrogates"
               (initial state (default for state 0), output is all surrogate pairs)

Each state table row description (that follows the <icu:state>) begins with an optional initial or surrogates keyword and is followed by one or more column entries. For the purpose of MBCS state tables, the states=rows in the table are numbered beginning with 0 at the first such line in the .ucm file header. The numbers are assigned implicitly by makeconv in order of the <icu:state> lines.

    entry=range [':' nextstate] ['.' [action]]
    range=number ['-' number]
    nextstate=number
              (0..7f)
    action='u' | 's' | 'p' | 'i'
           (unassigned, state change only, surrogate pair, illegal)
    number=(1- or 2-digit hexadecimal number)

Each column entry consists at least of a hexadecimal byte value or value range and is separated from the following column entry by a comma. The column entry specifies how to interpret an input byte in the row's state. If neither a next state nor an action is explicitly specified (only the byte value or range is given), then the byte value terminates the byte sequence, results in a valid mapping to a Unicode BMP character, and the state number is reset to 0.

The next state can be explicitly specified with a separating colon (:) followed by the number of the state (=number/index of the row, starting at 0). This is mostly used for intermediate byte values, i.e., for bytes that are not the last ones in a sequence. The state machine needs to proceed to the next state and read another byte. In this case, no other action is specified.

If the byte value(s) terminate(s) a byte sequence, then the byte sequence results in the following depending on the action that is announced with a period (.) followed by a letter:

If an action is specified but no next state, then the next state number defaults to 0. In other words, a byte value (range) terminates a sequence if there is an action specified for it, or when there is neither an action nor a next state; in the latter case, it defaults to "valid, next state is 0" (equivalent to :0.).

If a byte value is not specified in any column entry of a row, then it is illegal in the current state. If a byte value is specified in more than one column entry of the same row, then the last one is used. This makes it possible to specify common properties for a wide byte value range followed by a few exceptions, which is easier than having to specify mutually exclusive ranges, especially if many of them have the same properties.
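The rules above (grammar, defaults, "last entry wins", and "unlisted bytes are illegal") can be sketched as a small parser that expands one row description into 256 entries. This is only an illustration and is not how makeconv is implemented; the names `Entry`, `Action`, and `parse_state_row`, and the in-memory layout, are assumptions made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum {
    ACT_ILLEGAL = 0,  /* 'i'; also the default for bytes a row does not list */
    ACT_CONTINUE,     /* not final: go to the next state, read another byte */
    ACT_VALID,        /* final byte, valid mapping (no action letter) */
    ACT_UNASSIGNED,   /* 'u' */
    ACT_STATE_ONLY,   /* 's' */
    ACT_SURR_PAIR     /* 'p' */
} Action;

typedef struct {
    unsigned char next;  /* next state, 0..0x7f */
    Action action;
} Entry;

static const char *skip_ws(const char *p)
{
    while (*p == ' ' || *p == '\t') p++;
    return p;
}

/* Parse a 1- or 2-digit hexadecimal number; returns NULL if none is found. */
static const char *parse_hex(const char *p, unsigned *out)
{
    unsigned v = 0;
    int digits = 0;
    p = skip_ws(p);
    while (digits < 2 && isxdigit((unsigned char)*p)) {
        int c = tolower((unsigned char)*p++);
        v = v * 16 + (unsigned)(c <= '9' ? c - '0' : c - 'a' + 10);
        digits++;
    }
    if (digits == 0) return NULL;
    *out = v;
    return skip_ws(p);
}

/* Expand the text after <icu:state> into row[256]. *keyword receives 'i'
 * for "initial", 's' for "surrogates", or 0. Returns 0 on success, -1 on
 * a syntax error. */
static int parse_state_row(const char *line, Entry row[256], int *keyword)
{
    const char *p;
    assert(line != NULL && row != NULL && keyword != NULL);
    memset(row, 0, 256 * sizeof row[0]);   /* every byte starts as illegal */
    *keyword = 0;
    p = skip_ws(line);
    if (strncmp(p, "initial", 7) == 0) { *keyword = 'i'; p += 7; }
    else if (strncmp(p, "surrogates", 10) == 0) { *keyword = 's'; p += 10; }
    if (*keyword != 0) {
        p = skip_ws(p);
        if (*p++ != ',') return -1;
    }
    for (;;) {
        unsigned lo, hi, b, next = 0;
        Action act = ACT_VALID;            /* default: valid, next state 0 */
        if ((p = parse_hex(p, &lo)) == NULL) return -1;
        hi = lo;
        if (*p == '-') {
            if ((p = parse_hex(p + 1, &hi)) == NULL || hi < lo) return -1;
        }
        if (*p == ':') {                   /* explicit next state */
            if ((p = parse_hex(p + 1, &next)) == NULL || next > 0x7f) return -1;
            act = ACT_CONTINUE;
        }
        if (*p == '.') {                   /* final byte, optional action */
            p = skip_ws(p + 1);
            switch (*p) {
            case 'u': act = ACT_UNASSIGNED; p++; break;
            case 's': act = ACT_STATE_ONLY; p++; break;
            case 'p': act = ACT_SURR_PAIR;  p++; break;
            case 'i': act = ACT_ILLEGAL;    p++; break;
            default:  act = ACT_VALID;      break;
            }
            p = skip_ws(p);
        }
        for (b = lo; b <= hi; b++) {       /* later entries overwrite earlier */
            row[b].next = (unsigned char)next;
            row[b].action = act;
        }
        if (*p == '\0') return 0;
        if (*p++ != ',') return -1;
    }
}
```

With a made-up row such as `initial, 0-7f, 81-fe:1, a0.i, ff.u`, bytes 00-7f become valid final bytes, 81-fe lead to state 1 except a0 (overridden to illegal by the later entry), ff is an unassigned final byte, and 80 stays illegal because no entry mentions it.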

The optional keyword at the beginning of a state line has the following effect:

When converting to Unicode, the state machine starts in state number 0. In each iteration, it reads one input (codepage) byte and either just goes to the next state as specified, or treats it as a final byte with the specified action and an optional non-0 next (initial) state. This means that a state table needs to have at least as many state rows as the maximum number of bytes per character, which is the maximum length of any byte sequence.
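The iteration described above can be sketched as a loop over a table with one 256-entry row per state. The sketch below only segments the input into byte sequences and does not perform any mapping to Unicode; the `Cell` layout, the function names, and the two-state demo table (corresponding to the made-up rows `0-7f, 81-fe:1` and `40-fe`) are assumptions for illustration, not ICU's data format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned char legal;  /* 0: byte is illegal in this state */
    unsigned char final;  /* 1: byte terminates a byte sequence */
    unsigned char next;   /* state to enter after this byte */
} Cell;

/* Hypothetical double-byte layout:
 *   state 0: 00-7f are single-byte characters, 81-fe are lead bytes -> state 1
 *   state 1: 40-fe are trail bytes that end the sequence, back to state 0 */
static void build_demo_table(Cell table[2][256])
{
    unsigned b;
    assert(table != NULL);
    memset(table, 0, 2 * sizeof table[0]);   /* everything illegal at first */
    for (b = 0x00; b <= 0x7f; b++) { table[0][b].legal = 1; table[0][b].final = 1; }
    for (b = 0x81; b <= 0xfe; b++) { table[0][b].legal = 1; table[0][b].next = 1; }
    for (b = 0x40; b <= 0xfe; b++) { table[1][b].legal = 1; table[1][b].final = 1; }
}

/* Walk the state machine over s[0..len). Returns the number of complete
 * byte sequences, or -1 on an illegal byte or a truncated final sequence. */
static long count_sequences(Cell table[][256], const unsigned char *s, size_t len)
{
    unsigned state = 0;      /* conversion always starts in state 0 */
    int in_sequence = 0;
    long count = 0;
    size_t i;

    assert(table != NULL && (s != NULL || len == 0));
    for (i = 0; i < len; i++) {
        const Cell *c = &table[state][s[i]];
        if (!c->legal) return -1;
        state = c->next;     /* for a final byte this is the next initial state */
        if (c->final) { count++; in_sequence = 0; }
        else          { in_sequence = 1; }
    }
    return in_sequence ? -1 : count;
}
```

Because each non-final byte moves the machine into a further state, the longest byte sequence here (two bytes) requires the two rows of the demo table, in line with the minimum row count stated above.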

Examples for MBCS state tables