From 75ce8d5384be40c6cbeaa3afd15714ae5c076163 Mon Sep 17 00:00:00 2001
From: Markus Scherer The ICU conversion API is a set of C functions used to convert to and from
-Unicode and various character sets (codepages, encodings, character encoding schemes).
-
-In order to handle many kinds of character encoding schemes, ICU has a number of
-converter implementations, one per type. Some of these types are for purely algorithmic
-conversions that do not need to load data. For example, the UTF converters calculate
-Unicode code points from the input bytes, and vice versa. Also, the ISO_2022 converter
-starts without any specific conversion data table until it needs one; handling escape
-sequences and the general structure of ISO 2022 is done with static data.
-
-Many other encodings share common characteristics and by definition need tables
-to convert text between them and Unicode. A converter object for such an encoding
-is instantiated by loading a (.cnv) data file (typically from the single, common
-ICU data file) and associating it with a converter type implementation depending
-on the type information in the data.
-
-The following describes specifics about each converter type:
-
-The MBCS converter is a data-based converter for Multi-Byte Character Sets.
-It has been reimplemented for ICU 1.6 to handle a wider range of such encodings.
-Its current capabilities and limitations are:
-International Components for Unicode
-
-ICU Codepage Conversion
-
-Conversion-related files:
-
-
-
-
-
- For C, the API is defined in ucnv.h;
- advanced functionality is also defined in ucnv_err.h (callbacks) and
- in ucnv_cb.h (output functions for custom callbacks).
- For C++ the API is defined in convert.h (the C++ class is a wrapper around the C implementation).
- In addition, the file icu/data/convrtrs.txt contains information about "aliases", i.e.,
- alternative names for converters. It is read by gencnval (in icu/source/tools/gencnval)
-    which writes the binary file cnvalias.dat that also gets packaged into the common data file.
-Converter types
-
-MBCS
-
-
-
-
-The conversion to Unicode uses a state machine to achieve the above capabilities with
-reasonable data file sizes. The state machine information itself is loaded with the
-conversion data and defines the structure of the codepage, including which byte sequences
-are valid, unassigned, and illegal. This data cannot (or not easily) be computed from
-the pure mapping data. Instead, the .ucm files for MBCS encodings have additional entries
-that are specific to ICU's makeconv and this converter type. They are additional header lines
-that start with <icu:state>. Each such line defines one state of the state machine.
-The state machine uses a table of as many rows as there are states (= as many as there are
-<icu:state> lines). Each row has 256 entries, one for each possible byte value.
-The state table lines in the .ucm header use the following EBNF-like grammar
-(whitespace is allowed between all tokens):
-
-    row=[firstentry ','] entry (',' entry)*
-    firstentry="initial" | "surrogates"
-                (initial state (default for state 0), output is all surrogate pairs)
-
-Each state table row description (that follows the <icu:state>)
-begins with an optional initial or surrogates keyword
-and is followed by one or more column entries.
-For the purpose of MBCS state tables, the states=rows in the table are numbered
-beginning with 0 at the first such line in the .ucm file header.
-The numbers are assigned implicitly by makeconv in order of the <icu:state>
-lines.
-
-    entry=range [':' nextstate] ['.' [action]]
-    range=number ['-' number]
-    nextstate=number
-               (0..7f)
-    action='u' | 's' | 'p' | 'i'
-            (unassigned, state change only, surrogate pair, illegal)
-    number=(1- or 2-digit hexadecimal number)
-
-Each column entry consists at least of a hexadecimal byte value or value range
-and is separated from the following column entry by a comma.
-The column entry specifies how to interpret an input byte in the row's state.
-If neither a next state nor an action is explicitly specified (only the byte
-value or value range is given), then the byte value terminates the byte sequence,
-results in a valid mapping to a Unicode BMP character, and the state number is
-reset to 0.
-
-The next state can be explicitly specified with a separating
-colon (:) followed by the number of the state (=number/index of the row,
-starting at 0). This is mostly used for intermediate byte values, i.e., for
-bytes that are not the last ones in a sequence. The state machine needs to
-proceed to the next state and read another byte. In this case, no other action
-is specified.
-
-If the byte value(s) terminate(s) a byte sequence, then the byte
-sequence results in the following depending on the action that is announced with
-a period (.) followed by a letter:
-
-u   Unassigned. The byte sequence is valid but does not encode a character.
-p   surrogate Pair. The byte sequence is valid and may result in
-i   Illegal. The byte sequence is illegal. This is the default for
-    all byte values in a row that are not otherwise specified with
-    column entries.
-s   State change only. The byte sequence does not encode any character
-    but may change the state number. This could be used with simple, stateful
-    encodings (using, for example, SI/SO codes),
-    but ICU currently does not take advantage of it.
-:0.).
-
-If a byte value is not specified in any column entry of a row, then it is
-illegal in the current state. If a byte value is specified in more than one column
-entry of the same row, then the last one is used. This makes it possible to specify common
-properties for a wide byte value range followed by a few exceptions and is easier than
-having to specify mutually exclusive ranges, especially if many of them have the
-same properties.
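The "last entry wins" rule can be illustrated with a small row parser. This is only a sketch under stated assumptions, not makeconv's actual code: the `Cell` struct, the `parse_row` and `skip_ws` names, and the marker letters 'v' (valid final byte) and 'n' (non-final transition) are hypothetical. It also assumes that a period without an action letter means a valid final byte, and it does not handle the optional initial/surrogates keyword.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory form of one state-table cell (not ICU's real layout). */
typedef struct {
    int  next_state; /* row to continue in; 0 when none is given */
    char action;     /* 'u', 's', 'p', 'i' as in the grammar, plus our own
                        markers: 'v' = valid final byte, 'n' = non-final */
} Cell;

static const char *skip_ws(const char *s) {
    while (isspace((unsigned char)*s)) s++;
    return s;
}

/* Parse one row description such as "a1-fe:1, a1:4" into 256 cells.
 * Returns 0 on success and -1 on a syntax error. */
static int parse_row(const char *s, Cell row[256]) {
    for (int b = 0; b < 256; b++) {   /* unspecified bytes are illegal */
        row[b].next_state = 0;
        row[b].action = 'i';
    }
    s = skip_ws(s);
    while (*s != '\0') {
        char *end;
        long lo, hi;
        Cell c = { 0, 'v' };          /* bare range: valid final byte, back to state 0 */

        lo = strtol(s, &end, 16);     /* first (or only) byte value of the range */
        if (end == s || lo < 0 || lo > 0xff) return -1;
        hi = lo;
        s = skip_ws(end);
        if (*s == '-') {              /* optional end of the range */
            hi = strtol(s + 1, &end, 16);
            if (end == s + 1 || hi < lo || hi > 0xff) return -1;
            s = skip_ws(end);
        }
        if (*s == ':') {              /* optional next state, 0..7f */
            long n = strtol(s + 1, &end, 16);
            if (end == s + 1 || n < 0 || n > 0x7f) return -1;
            c.next_state = (int)n;
            c.action = 'n';           /* non-final unless a period follows */
            s = skip_ws(end);
        }
        if (*s == '.') {              /* optional action for a final byte */
            s = skip_ws(s + 1);
            c.action = (*s != '\0' && strchr("uspi", *s) != NULL) ? *s++ : 'v';
            s = skip_ws(s);
        }
        /* A later entry overwrites an earlier one for the same byte value. */
        for (long b = lo; b <= hi; b++) row[b] = c;
        if (*s == ',') s = skip_ws(s + 1);
        else if (*s != '\0') return -1;
    }
    return 0;
}
```

For example, parsing "a1-fe:1, a1:4" first sends the whole range a1-fe to state 1 and then overrides byte a1 alone to continue in state 4, while every byte outside a1-fe stays illegal.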
-
-The optional keyword at the beginning of a state line has the following effect:
-
-initial: The state machine can start reading byte sequences
-    in this state. State 0 is always an initial state. Only initial states can be
-    next states for final byte values. In an initial state, the Unicode mappings
-    for all final bytes are also stored directly in the state table.
-surrogates: All Unicode mappings for final bytes in non-initial
-    states are stored in a separate table of 16-bit Unicode (UTF-16) code units.
-    Since most legacy codepages map only to Unicode code points up to U+ffff
-    (the Basic Multilingual Plane, BMP), the default allocation per mapping
-    result is one 16-bit unit. Individual byte values can be specified to map
-    to surrogate pairs (= two 16-bit units) with action letter p.
-    The surrogates keyword specifies this for the entire state (row).
-    Surrogate pair mapping entries can still hold single units depending on the
-    actual mapping data, but single-unit mapping entries cannot hold a pair of units.
-    Mapping to single-unit entries is the default because the mapping is faster,
-    uses half as much memory in the code units table, and is sufficient for most
-    legacy codepages.
-
-When converting to Unicode, the state machine starts in state number 0.
-In each iteration, it reads one input (codepage) byte and either just goes to
-the next state as specified, or treats it as a final byte with the specified action
-and an optional non-0 next (initial) state. This means that a state table needs to
-have at least as many state rows as the maximum number of bytes per character,
-which is the maximum length of any byte sequence.
-
-    0-7f
-
-This single-row state table describes US-ASCII.
-Byte values from 0 to 0x7f are valid and map to Unicode characters up to U+ffff.
-Byte values from 0x80 to 0xff are illegal.
-
-    0-7f, 81-9f:1, a0-df, e0-fc:1
-    40-7e, 80-fc
-
-This two-row state table describes the structure of Shift-JIS, which encodes some characters
-with one byte each, and others with two bytes each.
-Bytes 0 to 0x7f and 0xa0 to 0xdf are valid single-byte encodings.
-Bytes 0x81 to 0x9f and 0xe0 to 0xfc are lead bytes, i.e., they are followed by one of
-the bytes that are specified as valid in state 1.
-A byte sequence of 0x85 0x61 is valid, while a single byte of 0x80 or 0xff is illegal.
-Similarly, a byte sequence of 0x85 0x31 is illegal.
-
-    0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
-    a1-fe
-    a1-e4
-    a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
-    a1-fe.u
-
-This fairly complicated state table describes EUC-JP.
-Valid byte sequences are one, two, or three bytes long.
-Two-byte sequences have lead byte 0x8e and end in state 2, or
-lead bytes 0xa1 to 0xfe and end in state 1.
-Three-byte sequences have a lead byte of 0x8f and continue in state 3.
-Some final byte value ranges are entirely unassigned, therefore they end in state 4
-with an action letter of u for "unassigned" to save significant memory
-for the code units table.
-Assigned three-byte sequences end in state 1 like most two-byte sequences.
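The Shift-JIS example can be traced with a few lines of C. This is a minimal sketch that only checks the structure of a byte sequence and looks up no Unicode mappings. The names `build_sjis`, `fill` and `sjis_sequence_length`, the `ILLEGAL`/`FINAL` markers, and the flat `int` table are hypothetical and are not ICU's internal representation.

```c
#include <stddef.h>

/* Cell values: ILLEGAL, FINAL (valid last byte), or the index of the next state. */
enum { ILLEGAL = -1, FINAL = -2 };

static void fill(int row[256], int lo, int hi, int value) {
    for (int b = lo; b <= hi; b++) row[b] = value;
}

/* Hand-built copy of the two-row Shift-JIS state table from the text. */
static void build_sjis(int table[2][256]) {
    for (int s = 0; s < 2; s++) fill(table[s], 0x00, 0xff, ILLEGAL);
    /* state 0: 0-7f, 81-9f:1, a0-df, e0-fc:1 */
    fill(table[0], 0x00, 0x7f, FINAL);
    fill(table[0], 0x81, 0x9f, 1);
    fill(table[0], 0xa0, 0xdf, FINAL);
    fill(table[0], 0xe0, 0xfc, 1);
    /* state 1: 40-7e, 80-fc */
    fill(table[1], 0x40, 0x7e, FINAL);
    fill(table[1], 0x80, 0xfc, FINAL);
}

/* Walk the state machine over the first byte sequence in p[0..n).
 * Returns the sequence length if it is structurally valid, 0 if an
 * illegal byte is hit, and -1 if the input ends in the middle of a sequence. */
static int sjis_sequence_length(const unsigned char *p, size_t n) {
    int table[2][256];
    int state = 0;                    /* conversion always starts in state 0 */
    build_sjis(table);
    for (size_t i = 0; i < n; i++) {
        int cell = table[state][p[i]];
        if (cell == ILLEGAL) return 0;
        if (cell == FINAL) return (int)(i + 1);
        state = cell;                 /* lead byte: continue in the next state */
    }
    return -1;
}
```

With this table, the sequence 0x85 0x61 is accepted with length 2, while 0x80, 0xff, and 0x85 0x31 are all rejected, which matches the description of the Shift-JIS example above.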