ICU-795 move this documentation into the user guide
X-SVN-Rev: 3414
parent 4b2a7a5603
commit 75ce8d5384
@ -1,231 +0,0 @@
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>ICU Codepage Conversion</title>
</head>

<body>

<h1>International Components for Unicode</h1>

<h2>ICU Codepage Conversion</h2>

<p>The ICU conversion API is a set of C functions used to convert text between
Unicode and various character sets (codepages, encodings, character encoding schemes).</p>

<h3>Conversion-related files:</h3>

<ul>
<li>API: The API header files are in icu/source/common/unicode:<br>
For C, the API is defined in ucnv.h;
advanced functionality is also defined in ucnv_err.h (callbacks) and
in ucnv_cb.h (output functions for custom callbacks).<br>
For C++, the API is defined in convert.h (the C++ class is a wrapper around the C implementation).</li>
<li>Implementation: The converter implementation files are in icu/source/common;
their names all begin with "ucnv". The C++ wrapper implementation is in convert.cpp.</li>
<li>Conversion table generation tool: The makeconv tool, which generates binary conversion files
from text files, is in icu/source/tools/makeconv.
It reads .ucm text files in a format that is close to the one used by the AIX tool uconvdef.
makeconv writes one binary, memory-mappable .cnv file per .ucm file.</li>
<li>Conversion data: The .ucm text files with the conversion table data are all in
the icu/data folder. During the build process, makeconv generates a binary .cnv file from
each of them, and the pkgdata tool packages them into the common data file.<br>
In addition, the file icu/data/convrtrs.txt contains information about "aliases", i.e.,
alternative names for converters. It is read by gencnval (in icu/source/tools/gencnval),
which writes the binary file cnvalias.dat that also gets packaged into the common data file.</li>
</ul>

<h2>Converter types</h2>

<p>In order to handle many kinds of character encoding schemes, ICU has a number of
converter implementations, one per type. Some of these types are for purely algorithmic
conversions that do not need to load data. For example, the UTF converters calculate
Unicode code points from the input bytes, and vice versa. Also, the ISO_2022 converter
starts without any specific conversion data table and loads one only when it needs to;
escape sequences and the general structure of ISO 2022 are handled with static data.</p>

<p>Many other encodings share common characteristics and, by definition, need tables
to convert text between them and Unicode. A converter object for such an encoding
is instantiated by loading a (.cnv) data file (typically from the single, common
ICU data file) and associating it with a converter type implementation depending
on the type information in the data.</p>

<p>The following describes specifics about each converter type:</p>

<h3>MBCS</h3>

<p>The MBCS converter is a data-based converter for Multi-Byte Character Sets.
It has been reimplemented for ICU 1.6 to handle a wider range of such encodings.
Its current capabilities and limitations are:
<ul>
<li>Support for variable-length, byte-based encodings with 1 to 4 bytes per character.</li>
<li>Support for all Unicode characters (code points 0..0x10ffff).
Since ICU uses UTF-16 as its Unicode encoding form, this means that surrogate
pairs are fully supported.</li>
<li>Efficient distinction of unassigned vs. illegal byte sequences.</li>
<li>It would be possible in fromUnicode() to directly deal with simple
stateful encodings. (This is currently not used.)</li>
<li>It is possible to convert Unicode code points other than U+0000
to a single zero byte (but not as a fallback).</li>
<li>It is not otherwise possible to convert from Unicode to byte sequences
with leading zero bytes.</li>
</ul>
</p>
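
<p>As an illustration of what "surrogate pairs are fully supported" means for code points
above U+ffff, the following minimal sketch shows the standard UTF-16 arithmetic.
It is not ICU code; the function names <code>to_utf16</code> and <code>from_utf16_pair</code>
are made up for this example.</p>

```python
def to_utf16(cp):
    """Return the list of 16-bit UTF-16 code units for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp <= 0xFFFF:
        return [cp]                       # BMP: a single 16-bit unit
    v = cp - 0x10000                      # 20 bits remain
    return [0xD800 + (v >> 10),           # lead surrogate: upper 10 bits
            0xDC00 + (v & 0x3FF)]         # trail surrogate: lower 10 bits


def from_utf16_pair(lead, trail):
    """Combine a lead and a trail surrogate into one code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
```

<p>A mapping result therefore needs either one or two 16-bit units, which is why the
state table format described below distinguishes single-unit and surrogate-pair entries.</p>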

<p>The conversion to Unicode uses a state machine to achieve the above capabilities with
reasonable data file sizes. The state machine information itself is loaded with the
conversion data and defines the structure of the codepage, including which byte sequences
are valid, unassigned, and illegal. This data cannot (or not easily) be computed from
the pure mapping data. Instead, the .ucm files for MBCS encodings have additional entries
that are specific to ICU's makeconv and this converter type. They are additional header lines
that start with <code>&lt;icu:state&gt;</code>. Each such line defines one state of the state machine.
The state machine uses a table of as many rows as there are states (= as many as there are
<code>&lt;icu:state&gt;</code> lines). Each row has 256 entries, one for each possible byte value.</p>

<p>The state table lines in the .ucm header use the following EBNF-like grammar
(whitespace is allowed between all tokens):
<pre>
row=[firstentry ','] entry (',' entry)*
firstentry="initial" | "surrogates"
(initial state (default for state 0), output is all surrogate pairs)
</pre>
Each state table row description (that follows the <code>&lt;icu:state&gt;</code>)
begins with an optional <code>initial</code> or <code>surrogates</code> keyword
and is followed by one or more column entries.
For the purpose of MBCS state tables, the states (= rows in the table) are numbered
beginning with 0 at the first such line in the .ucm file header.
The numbers are assigned implicitly by makeconv in the order of the <code>&lt;icu:state&gt;</code>
lines.
<pre>
entry=range [':' nextstate] ['.' [action]]
range=number ['-' number]
nextstate=number
(0..7f)
action='u' | 's' | 'p' | 'i'
(unassigned, state change only, surrogate pair, illegal)
number=(1- or 2-digit hexadecimal number)
</pre>
Each column entry consists of at least a hexadecimal byte value or value range
and is separated from the following column entry by a comma.
The column entry specifies how to interpret an input byte in the row's state.
If neither a next state nor an action is explicitly specified (only the byte
value or range is given), then the byte value terminates the byte sequence,
results in a valid mapping to a Unicode BMP character, and the state number is
reset to 0.</p>

<p>The next state can be explicitly specified with a separating
colon (<code>:</code>) followed by the number of the state (=number/index of the row,
starting at 0). This is mostly used for intermediate byte values, i.e., for
bytes that are not the last ones in a sequence. The state machine needs to
proceed to the next state and read another byte. In this case, no other action
is specified.</p>

<p>If the byte value(s) terminate(s) a byte sequence, then the byte
sequence results in the following depending on the action that is announced with
a period (<code>.</code>) followed by a letter:
<ul>
<li><code>u</code> - Unassigned. The byte sequence is valid but does not encode a character.</li>
<li>(no letter) - valid. If no action letter is specified, then
the byte sequence is valid and encodes a Unicode character up to
U+ffff.</li>
<li><code>p</code> - surrogate Pair. The byte sequence is valid and may result in
a surrogate pair (two 16-bit units) or, depending on the mapping data, in a single 16-bit unit.</li>
<li><code>i</code> - Illegal. The byte sequence is illegal. This is the default for
all byte values in a row that are not otherwise specified with
column entries.</li>
<li><code>s</code> - State change only. The byte sequence does not encode any character
but may change the state number. This could be used with simple, stateful
encodings (using, for example, SI/SO codes),
but ICU currently does not take advantage of it.</li>
</ul>
If an action is specified but no next state, then the next state number defaults to 0.
In other words, a byte value (range) terminates a sequence if there is an action
specified for it, or if there is neither an action nor a next state; in the latter case,
it defaults to "valid, next state is 0" (equivalent to <code>:0.</code>).</p>
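
<p>The entry grammar and its defaults can be summarized in a short parser sketch.
This is only an illustration of the rules described above, not makeconv's actual code;
the name <code>parse_entry</code>, the tuple layout, and the use of <code>None</code>
for "intermediate byte" and <code>"valid"</code> for "no action letter" are assumptions
made for this example.</p>

```python
import re

# entry = range [':' nextstate] ['.' [action]], whitespace allowed between tokens
ENTRY = re.compile(r"""
    \s*([0-9a-fA-F]{1,2})\s*            # first byte value of the range
    (?:-\s*([0-9a-fA-F]{1,2})\s*)?      # optional last byte value of the range
    (?::\s*([0-9a-fA-F]{1,2})\s*)?      # optional next state
    (?:(\.)\s*([uspi])?\s*)?            # optional '.', optional action letter
    $""", re.VERBOSE)


def parse_entry(text):
    """Parse one column entry into (first, last, next_state, action).

    action is None for an intermediate byte (next state given, no '.'),
    "valid" for a final byte without an action letter, else 'u', 's', 'p' or 'i'.
    """
    m = ENTRY.match(text)
    if m is None:
        raise ValueError("bad column entry: %r" % text)
    lo, hi, nxt, dot, letter = m.groups()
    first = int(lo, 16)
    last = int(hi, 16) if hi else first
    if last < first:
        raise ValueError("empty range: %r" % text)
    next_state = int(nxt, 16) if nxt else 0   # next state defaults to 0
    if dot:
        action = letter or "valid"            # '.' alone means "valid"
    elif nxt:
        action = None                         # only a next state: intermediate byte
    else:
        action = "valid"                      # bare range: valid, next state 0
    return first, last, next_state, action
```

<p>For example, <code>parse_entry("81-9f:1")</code> describes intermediate (lead) bytes that
continue in state 1, while <code>parse_entry("a1-fe.u")</code> describes final bytes that
are unassigned and return to state 0.</p>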

<p>If a byte value is not specified in any column entry of a row, then it is
illegal in the current state. If a byte value is specified in more than one column
entry of the same row, then the last one is used. This makes it possible to specify common
properties for a wide byte value range followed by a few exceptions, which is easier than
having to specify mutually exclusive ranges, especially if many of them have the
same properties.</p>
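
<p>The "default is illegal" and "last entry wins" rules fall out naturally if a row description
is expanded into its 256 entries from left to right. The following sketch shows this under
the same assumptions as the grammar above; <code>expand_row</code> and the
<code>(next_state, action)</code> cell layout are hypothetical and do not reflect the binary
.cnv format.</p>

```python
import re

ENTRY = re.compile(r"\s*([0-9a-fA-F]{1,2})\s*(?:-\s*([0-9a-fA-F]{1,2})\s*)?"
                   r"(?::\s*([0-9a-fA-F]{1,2})\s*)?(?:(\.)\s*([uspi])?\s*)?$")

ILLEGAL = (0, "i")   # default for every byte value not mentioned in the row


def expand_row(row_text):
    """Return (keyword, cells) where cells holds 256 (next_state, action) tuples."""
    parts = [p.strip() for p in row_text.split(",")]
    keyword = None
    if parts[0] in ("initial", "surrogates"):
        keyword = parts.pop(0)
    cells = [ILLEGAL] * 256
    for part in parts:
        m = ENTRY.match(part)
        if m is None:
            raise ValueError("bad column entry: %r" % part)
        lo, hi, nxt, dot, letter = m.groups()
        first = int(lo, 16)
        last = int(hi, 16) if hi else first
        next_state = int(nxt, 16) if nxt else 0
        if dot:
            action = letter or "valid"
        elif nxt:
            action = None                 # intermediate byte
        else:
            action = "valid"
        # Later entries simply overwrite earlier ones for the same byte value.
        for byte in range(first, last + 1):
            cells[byte] = (next_state, action)
    return keyword, cells
```

<p>With the row <code>a1-fe:1, a1:4, a3-af:4</code>, byte 0xa1 ends up with next state 4 and
byte 0xa2 keeps next state 1, because the narrower entries are applied after the wide range.</p>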

<p>The optional keyword at the beginning of a state line has the following effect:
<ul>
<li><code>initial</code>: The state machine can start reading byte sequences
in this state. State 0 is always an initial state. Only initial states can be
next states for final byte values. In an initial state, the Unicode mappings
for all final bytes are also stored directly in the state table.</li>
<li><code>surrogates</code>: All Unicode mappings for final bytes in non-initial
states are stored in a separate table of 16-bit Unicode (UTF-16) code units.
Since most legacy codepages map only to Unicode code points up to U+ffff
(the Basic Multilingual Plane, BMP), the default allocation per mapping
result is one 16-bit unit. Individual byte values can be specified to map
to surrogate pairs (= two 16-bit units) with action letter <code>p</code>.
The <code>surrogates</code> keyword specifies this for the entire state (row).
Surrogate pair mapping entries can still hold single units depending on the
actual mapping data, but single-unit mapping entries cannot hold a pair of units.
Mapping to single-unit entries is the default because the mapping is faster,
uses half as much memory in the code units table, and is sufficient for most
legacy codepages.</li>
</ul>
</p>

<p>When converting to Unicode, the state machine starts in state number 0.
In each iteration, it reads one input (codepage) byte and either just goes to
the next state as specified, or treats it as a final byte with the specified action
and an optional non-0 next (initial) state. This means that a state table needs to
have at least as many state rows as the maximum number of bytes per character,
which is the maximum length of any byte sequence.</p>
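
<p>The iteration described above can be sketched as a simple loop. This is a simplified model
that only splits the input into byte sequences and reports their actions; it does not look up
Unicode mappings and does not model ICU's error callbacks. The two-row table is a made-up
encoding for illustration, and <code>classify</code>, <code>TABLE</code> and the
<code>"truncated"</code> marker are hypothetical names.</p>

```python
ILLEGAL = (0, "i")

# Hypothetical two-state table: each cell is (next_state, action);
# action None marks an intermediate byte.
row0 = [ILLEGAL] * 256
for b in range(0x00, 0x80):
    row0[b] = (0, "valid")        # single-byte characters
for b in range(0x80, 0x100):
    row0[b] = (1, None)           # lead bytes: continue in state 1
row1 = [ILLEGAL] * 256
for b in range(0x40, 0x100):
    row1[b] = (0, "valid")        # trail bytes: final, back to state 0
TABLE = [row0, row1]


def classify(data, table):
    """Walk the state table over data and return [(sequence, action), ...]."""
    results = []
    state = 0                     # conversion always starts in state 0
    start = 0                     # index of the first byte of the current sequence
    for pos, byte in enumerate(data):
        next_state, action = table[state][byte]
        state = next_state
        if action is None:
            continue              # intermediate byte: read another one
        results.append((data[start:pos + 1], action))
        start = pos + 1
    if start < len(data):         # input ended in the middle of a sequence
        results.append((data[start:], "truncated"))
    return results
```

<p>Because the longest sequence in this table has two bytes, two state rows are sufficient,
in line with the rule that a table needs at least as many rows as the maximum number of bytes
per character.</p>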

<h4>Examples for MBCS state tables</h4>

<ul>
<li>US-ASCII:
<pre>
0-7f
</pre>
This single-row state table describes US-ASCII.
Byte values from 0 to 0x7f are valid and map to Unicode characters up to U+ffff.
Byte values from 0x80 to 0xff are illegal.<br>
</li>
<li>Shift-JIS:
<pre>
0-7f, 81-9f:1, a0-df, e0-fc:1
40-7e, 80-fc
</pre>
This two-row state table describes the structure of Shift-JIS, which encodes some characters
with one byte each, and others with two bytes each.
Bytes 0 to 0x7f and 0xa0 to 0xdf are valid single-byte encodings.
Bytes 0x81 to 0x9f and 0xe0 to 0xfc are lead bytes, i.e., they are followed by one of
the bytes that are specified as valid in state 1.
A byte sequence of 0x85 0x61 is valid, while a single byte of 0x80 or 0xff is illegal.
Similarly, a byte sequence of 0x85 0x31 is illegal.<br>
</li>
<li>EUC-JP:
<pre>
0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
a1-fe
a1-e4
a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
a1-fe.u
</pre>
This fairly complicated state table describes EUC-JP.
Valid byte sequences are one, two, or three bytes long.
Two-byte sequences have lead byte 0x8e and end in state 2, or
lead bytes 0xa1 to 0xfe and end in state 1.
Three-byte sequences have a lead byte of 0x8f and continue in state 3.
Some final byte value ranges are entirely unassigned; therefore they end in state 4
with an action letter of <code>u</code> for "unassigned" to save significant memory
in the code units table.
Assigned three-byte sequences end in state 1 like most two-byte sequences.<br>
<em>Note: </em>This reuse of a final or intermediate state is valid as long
as there is no cycle in the state chain. The mappings will be unique because of
the different paths to the shared state.
(Sharing a state saves some memory: Each state table row occupies 1kB in the .cnv file.)<br>
This table also shows the redefinition of byte value ranges within one state row
(number 3) as a shorthand.<br>
</li>
</ul>
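
<p>The statements made about the Shift-JIS and EUC-JP tables above can be checked with a small
simulation. This is a sketch under the rules described in this document, not ICU's implementation;
<code>build</code>, <code>action_of</code> and the result strings <code>"incomplete"</code> and
<code>"extra bytes"</code> are hypothetical names chosen for this example.</p>

```python
import re

ENTRY = re.compile(r"([0-9a-fA-F]{1,2})(?:-([0-9a-fA-F]{1,2}))?"
                   r"(?::([0-9a-fA-F]{1,2}))?(?:(\.)([uspi])?)?$")


def build(rows):
    """Expand row descriptions into a list of 256-cell rows of (next_state, action)."""
    table = []
    for text in rows:
        cells = [(0, "i")] * 256                  # unspecified bytes are illegal
        for part in text.replace(" ", "").split(","):
            if part in ("initial", "surrogates"):
                continue                          # keywords do not affect the structure
            m = ENTRY.match(part)
            if m is None:
                raise ValueError("bad column entry: %r" % part)
            lo, hi, nxt, dot, letter = m.groups()
            action = (letter or "valid") if (dot or not nxt) else None
            for byte in range(int(lo, 16), int(hi or lo, 16) + 1):
                cells[byte] = (int(nxt, 16) if nxt else 0, action)
        table.append(cells)
    return table


def action_of(table, seq):
    """Return the action that the byte sequence seq ends with, starting in state 0."""
    state = 0
    for i, byte in enumerate(seq):
        state, action = table[state][byte]
        if action is not None:                    # final byte reached
            return action if i == len(seq) - 1 else "extra bytes"
    return "incomplete"


SHIFT_JIS = build(["0-7f, 81-9f:1, a0-df, e0-fc:1",
                   "40-7e, 80-fc"])
EUC_JP = build(["0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1",
                "a1-fe",
                "a1-e4",
                "a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4",
                "a1-fe.u"])
```

<p>With these tables, <code>action_of(SHIFT_JIS, b"\x85\x61")</code> reports a valid sequence and
<code>action_of(SHIFT_JIS, b"\x85\x31")</code> an illegal one, while
<code>action_of(EUC_JP, b"\x8f\xa1\xa1")</code> follows the path through states 3 and 4 and
reports an unassigned sequence.</p>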

</body>
</html>