diff --git a/icu4c/docs/conversion_interface.htm b/icu4c/docs/conversion_interface.htm
deleted file mode 100644
index 55611d3c0a..0000000000
--- a/icu4c/docs/conversion_interface.htm
+++ /dev/null
@@ -1,231 +0,0 @@
- - - -ICU Codepage Conversion - - - -

International Components for Unicode

- -

ICU Codepage Conversion

- -

The ICU conversion API is a set of C functions used to convert to and from -Unicode and various character sets (codepages, encodings, character encoding schemes).

- -

Conversion-related files:

- - - -

Converter types

- -

In order to handle many kinds of character encoding schemes, ICU has a number of -converter implementations, one per type. Some of these types are for purely algorithmic -conversions that do not need to load data. For example, the UTF converters calculate -Unicode code points from the input bytes, and vice versa. Also, the ISO_2022 converter -starts without any specific conversion data table until it needs one; the handling of escape -sequences and the general structure of ISO 2022 is done with static data.
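To illustrate what "purely algorithmic" means here, the following is a minimal sketch of computing a code point from one UTF-8 byte sequence using only bit arithmetic, with no mapping table. It is not ICU's implementation; the function name `decode_utf8_sequence` is hypothetical, and the validation is deliberately simplified (it does not reject overlong 3- and 4-byte forms or surrogate code points).

```python
def decode_utf8_sequence(data: bytes) -> int:
    """Compute one code point from a complete UTF-8 byte sequence.

    Simplified sketch: checks lead and trail byte patterns only.
    """
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:                  # 0xxxxxxx: single byte
        length, code_point = 1, lead
    elif 0xC2 <= lead <= 0xDF:       # 110xxxxx: two bytes
        length, code_point = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:       # 1110xxxx: three bytes
        length, code_point = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF4:       # 11110xxx: four bytes
        length, code_point = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte 0x%02x" % lead)
    if len(data) != length:
        raise ValueError("expected %d bytes, got %d" % (length, len(data)))
    for trail in data[1:]:
        if trail & 0xC0 != 0x80:     # trail bytes must look like 10xxxxxx
            raise ValueError("invalid trail byte 0x%02x" % trail)
        code_point = (code_point << 6) | (trail & 0x3F)
    return code_point
```

For example, `decode_utf8_sequence(b"\xe2\x82\xac")` yields `0x20AC` without consulting any conversion data.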

- -

Many other encodings share common characteristics and by definition need tables -to convert text between them and Unicode. A converter object for such an encoding -is instantiated by loading a (.cnv) data file (typically from the single, common -ICU data file) and associating it with a converter type implementation depending -on the type information in the data.

- -

The following describes specifics about each converter type:

- -

MBCS

- -

The MBCS converter is a data-based converter for Multi-Byte Character Sets. -It has been reimplemented for ICU 1.6 to handle a wider range of such encodings. -Its current capabilities and limitations are: -

-

- -

The conversion to Unicode uses a state machine to achieve the above capabilities with -reasonable data file sizes. The state machine information itself is loaded with the -conversion data and defines the structure of the codepage, including which byte sequences -are valid, unassigned, and illegal. This data cannot (or cannot easily) be computed from -the pure mapping data. Instead, the .ucm files for MBCS encodings have additional entries -that are specific to ICU's makeconv and this converter type. They are additional header lines -that start with <icu:state>. Each such line defines one state of the state machine. -The state machine uses a table with as many rows as there are states (= as many as there are -<icu:state> lines). Each row has 256 entries, one for each possible byte value.

- -

The state table lines in the .ucm header use the following EBNF-like grammar -(whitespace is allowed between all tokens): -

-    row=[firstentry ','] entry (',' entry)*
-    firstentry="initial" | "surrogates"
-               (initial state (default for state 0), output is all surrogate pairs)
-
-Each state table row description (that follows the <icu:state>) -begins with an optional initial or surrogates keyword -and is followed by one or more column entries. -For the purpose of MBCS state tables, the states (= rows of the table) are numbered -beginning with 0 at the first such line in the .ucm file header. -The numbers are assigned implicitly by makeconv in order of the <icu:state> -lines. -
-    entry=range [':' nextstate] ['.' [action]]
-    range=number ['-' number]
-    nextstate=number
-              (0..7f)
-    action='u' | 's' | 'p' | 'i'
-           (unassigned, state change only, surrogate pair, illegal)
-    number=(1- or 2-digit hexadecimal number)
-
-Each column entry consists of at least a hexadecimal byte value or value range -and is separated from the following column entry by a comma. -The column entry specifies how to interpret an input byte in the row's state. -If neither a next state nor an action is explicitly specified - only the byte -value (range) is given - then the byte value terminates the byte sequence, -results in a valid mapping to a Unicode BMP character, and the state number is -reset to 0.

- -

The next state can be explicitly specified with a separating -colon (:) followed by the number of the state (=number/index of the row, -starting at 0). This is mostly used for intermediate byte values, i.e., for -bytes that are not the last ones in a sequence. The state machine needs to -proceed to the next state and read another byte. In this case, no other action -is specified.

- -

If the byte value(s) terminate(s) a byte sequence, then the byte -sequence results in the following depending on the action that is announced with -a period (.) followed by a letter: -

-If an action is specified but no next state, then the next state number defaults to 0. -In other words, a byte value (range) terminates a sequence if there is an action -specified for it, or when there is neither an action nor a next state - in this case, -it defaults to "valid, next state is 0" (equivalent to :0.).

- -

If a byte value is not specified in any column entry of a row, then it is -illegal in the current state. If a byte value is specified in more than one column -entry of the same row, then the last one is used. This makes it possible to specify common -properties for a wide byte value range followed by a few exceptions, which is easier than -having to specify mutually exclusive ranges, especially if many of them have the -same properties.
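The row rules described so far (the grammar, the defaults for next state and action, illegal-by-default bytes, and "last entry wins") can be sketched as a small parser. This is an illustrative reading of the text above, not makeconv's actual code. The function name `parse_state_row`, the `(next_state, action)` cell layout, and the action labels such as "transition" are assumptions made for this example. The input is the row description that follows the <icu:state> tag.

```python
import re

# Action letters from the grammar; an empty letter means a valid mapping.
ACTIONS = {"": "valid", "u": "unassigned", "s": "state change only",
           "p": "surrogate pair", "i": "illegal"}

# entry = range [':' nextstate] ['.' [action]]
ENTRY_RE = re.compile(
    r"([0-9a-f]{1,2})(?:-([0-9a-f]{1,2}))?"  # byte value or range
    r"(?::([0-9a-f]{1,2}))?"                 # optional next state
    r"(\.[uspi]?)?",                         # optional '.' plus action letter
    re.IGNORECASE)

def parse_state_row(description):
    """Return (keyword, row); row holds 256 (next_state, action) cells."""
    # Whitespace is allowed between all tokens, so drop it entirely.
    tokens = ["".join(t.split()).lower() for t in description.split(",")]
    keyword = None
    if tokens and tokens[0] in ("initial", "surrogates"):
        keyword = tokens.pop(0)
    row = [(0, "illegal")] * 256             # unspecified bytes are illegal
    for token in tokens:
        m = ENTRY_RE.fullmatch(token)
        if m is None:
            raise ValueError("bad column entry: %r" % token)
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        if last < first:
            raise ValueError("empty range: %r" % token)
        next_state = int(m.group(3), 16) if m.group(3) else 0
        if next_state > 0x7F:
            raise ValueError("next state out of range: %r" % token)
        if m.group(3) is not None and m.group(4) is None:
            cell = (next_state, "transition")  # intermediate byte, no action
        else:
            letter = m.group(4)[1:] if m.group(4) else ""
            cell = (next_state, ACTIONS[letter])
        for byte in range(first, last + 1):    # later entries override earlier
            row[byte] = cell
    return keyword, row
```

With this sketch, `parse_state_row("0-ff.u, 41")` marks every byte as unassigned except 0x41, which becomes a valid single-byte mapping because the later entry overrides the earlier range.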

- -

The optional keyword at the beginning of a state line has the following effect: -

-

- -

When converting to Unicode, the state machine starts in state number 0. -In each iteration, it reads one input (codepage) byte and either just goes to -the next state as specified, or treats it as a final byte with the specified action -and an optional non-0 next (initial) state. This means that a state table needs to -have at least as many state rows as the maximum number of bytes per character, -which is the maximum length of any byte sequence.
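The iteration described above can be sketched as a loop that walks the input and splits it into byte sequences. This is a simplified illustration, not ICU's converter: it only segments the input and reports each sequence's action, because looking up the actual Unicode value needs mapping data not covered here. The two-state table is hypothetical (loosely modeled on a double-byte layout), and the names `make_row`, `STATE_TABLE`, `segment`, and the "truncated" label are assumptions for this example.

```python
ILLEGAL = (0, "illegal")
VALID = (0, "valid")        # final byte, valid mapping, back to state 0

def make_row(*ranges):
    """Build a 256-entry row; bytes not covered by a range are illegal."""
    row = [ILLEGAL] * 256
    for first, last, cell in ranges:
        for byte in range(first, last + 1):
            row[byte] = cell
    return row

# Hypothetical table: state 0 reads single bytes or lead bytes,
# state 1 reads the trail byte of a two-byte sequence.
# A cell is (next_state, action); action None means "just go to next state".
STATE_TABLE = [
    make_row((0x00, 0x7F, VALID), (0x81, 0x9F, (1, None)),
             (0xA0, 0xDF, VALID), (0xE0, 0xFC, (1, None))),
    make_row((0x40, 0x7E, VALID), (0x80, 0xFC, VALID)),
]

def segment(data, table):
    """Split data into (byte_sequence, action) pairs using the state table."""
    state = 0                        # conversion starts in state 0
    start = 0                        # index where the current sequence began
    result = []
    for index, byte in enumerate(data):
        next_state, action = table[state][byte]
        if action is not None:       # final byte: emit the whole sequence
            result.append((bytes(data[start:index + 1]), action))
            start = index + 1
        state = next_state           # may be a non-0 state after a final byte
    if start != len(data):           # input ended in the middle of a sequence
        result.append((bytes(data[start:]), "truncated"))
    return result
```

In this table the longest byte sequence has two bytes, so two state rows are enough, which matches the lower bound stated above.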

- -

Examples for MBCS state tables

- - - - -