ICU-795 move this documentation into the user guide

X-SVN-Rev: 3414
2001-01-12 21:16:14 +00:00 · 2001-01-12 21:16:14 +00:00 · 75ce8d5384
commit 75ce8d5384
parent 4b2a7a5603
1 changed files with 0 additions and 231 deletions
--- a/icu4c/docs/conversion_interface.htm
+++ b/icu4c/docs/conversion_interface.htm
@@ -1,231 +0,0 @@
-<html lang="en">
-<head>
-<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
-<title>ICU Codepage Conversion</title>
-</head>
-
-<body>
-
-<h1>International Components for Unicode</h1>
-
-<h2>ICU Codepage Conversion</h2>
-
-<p>The ICU conversion API is a set of C functions used to convert to and from
-Unicode and various character sets (codepages, encodings, character encoding schemes).</p>
-
-<h3>Conversion-related files:</h3>
-
-<ul>
-    <li>API: The API header files are in icu/source/common/unicode:<br>
-        For C, the API is defined in ucnv.h;
-        advanced functionality is also defined in ucnv_err.h (callbacks) and
-        in ucnv_cb.h (output functions for custom callbacks).<br>
-        For C++ the API is defined in convert.h (the C++ class is a wrapper around the C implementation).</li>
-    <li>Implementation: The converter implementation files are in icu/source/common;
-        all such files begin with "ucnv". The C++ wrapper implementation is in convert.cpp.</li>
-    <li>Conversion table generation tool: The makeconv tool that generates binary conversion files
-        from text files is in icu/source/tools/makeconv.
-        It reads .ucm text files with a format that is close to what the AIX tool uconvdef uses.
-        makeconv writes one binary, memory-mappable .cnv file per .ucm file.</li>
-    <li>Conversion data: The .ucm text files with the conversion table data are all in
-        the icu/data folder. During the build process, makeconv generates binary .cnv files from
-        each of them, and the pkgdata tool includes them into the common data file.<br>
-        In addition, the file icu/data/convrtrs.txt contains information about "aliases", i.e.,
-        alternative names for converters. It is read by gencnval (in icu/source/tools/gencnval)
-        which writes the binary file cnvalias.dat that also gets packaged into the common data file.</li>
-</ul>
-
-<h2>Converter types</h2>
-
-<p>In order to handle many kinds of character encoding schemes, ICU has a number of
-converter implementations, one per type. Some of these types are for purely algorithmic
-conversions that do not need to load data. For example, the UTF converters calculate
-Unicode code points from the input bytes, and vice versa. Also, the ISO_2022 converter
-starts without any specific conversion data table and loads one only when it needs to;
-handling escape sequences and the general structure of ISO 2022 is done with static data.</p>
-
-<p>Many other encodings share common characteristics and by definition need tables
-to convert text between them and Unicode. A converter object for such an encoding
-is instantiated by loading a (.cnv) data file (typically from the single, common
-ICU data file) and associating it with a converter type implementation depending
-on the type information in the data.</p>
-
-<p>The following describes specifics about each converter type:</p>
-
-<h3>MBCS</h3>
-
-<p>The MBCS converter is a data-based converter for Multi-Byte Character Sets.
-It has been reimplemented for ICU 1.6 to handle a wider range of such encodings.
-Its current capabilities and limitations are:
-<ul>
-    <li>Support for variable-length, byte-based encodings with 1 to 4 bytes per character.</li>
-    <li>Support for all Unicode characters (code points 0..0x10ffff).
-        Since ICU uses UTF-16 as its Unicode encoding form, this means that surrogate
-        pairs are fully supported.</li>
-    <li>Efficient distinction of unassigned vs. illegal byte sequences.</li>
-    <li>It would be possible in fromUnicode() to directly deal with simple
-        stateful encodings. (This is currently not used.)</li>
-    <li>It is possible to convert Unicode code points other than U+0000
-        to a single zero byte (but not as a fallback).</li>
-    <li>It is not otherwise possible to convert from Unicode to byte sequences
-        with leading zero bytes.</li>
-</ul>
-</p>
-
-<p>The conversion to Unicode uses a state machine to achieve the above capabilities with
-reasonable data file sizes. The state machine information itself is loaded with the
-conversion data and defines the structure of the codepage, including which byte sequences
-are valid, unassigned, and illegal. This data cannot (or not easily) be computed from
-the pure mapping data. Instead, the .ucm files for MBCS encodings have additional entries
-that are specific to ICU's makeconv and this converter type. They are additional header lines
-that start with <code>&lt;icu:state></code>. Each such line defines one state of the state machine.
-The state machine uses a table of as many rows as there are states (= as many as there are
-<code>&lt;icu:state></code> lines). Each row has 256 entries, one for each possible byte value.</p>
-
-<p>The state table lines in the .ucm header use the following EBNF-like grammar
-(whitespace is allowed between all tokens):
-<pre>
-    row=[firstentry ','] entry (',' entry)*
-    firstentry="initial" | "surrogates"
-               (initial state (default for state 0), output is all surrogate pairs)
-</pre>
-Each state table row description (that follows the <code>&lt;icu:state></code>)
-begins with an optional <code>initial</code> or <code>surrogates</code> keyword
-and is followed by one or more column entries.
-For the purpose of MBCS state tables, the states (= rows in the table) are numbered
-beginning with 0 at the first such line in the .ucm file header.
-The numbers are assigned implicitly by makeconv in order of the <code>&lt;icu:state></code>
-lines.
-<pre>
-    entry=range [':' nextstate] ['.' [action]]
-    range=number ['-' number]
-    nextstate=number
-              (0..7f)
-    action='u' | 's' | 'p' | 'i'
-           (unassigned, state change only, surrogate pair, illegal)
-    number=(1- or 2-digit hexadecimal number)
-</pre>
-Each column entry consists at least of a hexadecimal byte value or value range
-and is separated from the following column entry by a comma.
-The column entry specifies how to interpret an input byte in the row's state.
-If neither a next state nor an action is explicitly specified - only the byte
-value (range) is given - then the byte value terminates the byte sequence,
-results in a valid mapping to a Unicode BMP character, and the state number is
-reset to 0.</p>
-
-<p>The next state can be explicitly specified with a separating
-colon (<code>:</code>) followed by the number of the state (=number/index of the row,
-starting at 0). This is mostly used for intermediate byte values, i.e., for
-bytes that are not the last ones in a sequence. The state machine needs to
-proceed to the next state and read another byte. In this case, no other action
-is specified.</p>
-
-<p>If the byte value(s) terminate(s) a byte sequence, then the byte
-sequence results in the following depending on the action that is announced with
-a period (<code>.</code>) followed by a letter:
-<ul>
-    <li><code>u</code> - Unassigned. The byte sequence is valid but does not encode a character.</li>
-    <li>(no letter) - valid. If no action letter is specified, then
-        the byte sequence is valid and encodes a Unicode character up to
-        U+ffff.</li>
-    <li><code>p</code> - surrogate Pair. The byte sequence is valid and may result in
-        a surrogate pair (two 16-bit UTF-16 code units) instead of a single code unit.</li>
-    <li><code>i</code> - Illegal. The byte sequence is illegal. This is the default for
-        all byte values in a row that are not otherwise specified with
-        column entries.</li>
-    <li><code>s</code> - State change only. The byte sequence does not encode any character
-        but may change the state number. This could be used with simple, stateful
-        encodings (using, for example, SI/SO codes),
-        but ICU currently does not take advantage of it.</li>
-</ul>
-If an action is specified but no next state, then the next state number defaults to 0.
-In other words, a byte value (range) terminates a sequence if there is an action
-specified for it, or when there is neither an action nor a next state - in this case,
-it defaults to "valid, next state is 0" (equivalent to <code>:0.</code>).</p>
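-<p>The defaulting rules above can be sketched in a few lines of Python. This is only an
-illustration of the grammar as described here, not makeconv's actual parser; the function
-name <code>parse_entry</code> and the dictionary keys it returns are hypothetical.</p>

```python
import re

# entry = range [':' nextstate] ['.' [action]], numbers are 1- or 2-digit hex.
ENTRY_RE = re.compile(
    r"^\s*([0-9a-fA-F]{1,2})\s*(?:-\s*([0-9a-fA-F]{1,2}))?\s*"  # range
    r"(?::\s*([0-9a-fA-F]{1,2}))?\s*"                           # next state
    r"(\.\s*([usip])?)?\s*$"                                    # action
)

ACTION_NAMES = {
    "u": "unassigned",
    "s": "state change only",
    "p": "surrogate pair",
    "i": "illegal",
}


def parse_entry(text):
    """Parse one column entry and apply the defaults described in the text."""
    m = ENTRY_RE.match(text)
    if m is None:
        raise ValueError("not a valid column entry: %r" % text)
    lo = int(m.group(1), 16)
    hi = int(m.group(2), 16) if m.group(2) else lo
    if hi < lo:
        raise ValueError("empty byte range: %r" % text)
    has_state = m.group(3) is not None
    has_action = m.group(4) is not None
    # A missing next state defaults to 0.
    next_state = int(m.group(3), 16) if has_state else 0
    if next_state > 0x7F:
        raise ValueError("next state out of range 0..7f: %r" % text)
    if has_state and not has_action:
        # Only a next state: an intermediate byte, the sequence continues.
        action = None
    else:
        # A period without a letter, or nothing at all, means "valid".
        action = ACTION_NAMES.get(m.group(5), "valid")
    return {
        "lo": lo,
        "hi": hi,
        "next_state": next_state,
        "final": action is not None,
        "action": action,
    }


if __name__ == "__main__":
    for sample in ("0-7f", "81-9f:1", "a1-fe.u", "41:0."):
        print(sample, parse_entry(sample))
```

-<p>In this sketch, <code>41</code> and <code>41:0.</code> parse to the same result, which
-mirrors the equivalence stated above.</p>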
-
-<p>If a byte value is not specified in any column entry of a row, then it is
-illegal in the current state. If a byte value is specified in more than one column
-entry of the same row, then the last one is used. This makes it possible to specify common
-properties for a wide byte value range followed by a few exceptions and is easier than
-having to specify mutually exclusive ranges, especially if many of them have the
-same properties.</p>
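-<p>A minimal sketch of this "last entry wins" rule, assuming a row is kept as a plain list
-of 256 values. The helper <code>build_row</code> and the string labels are hypothetical and
-only stand in for whatever a real implementation stores per byte value.</p>

```python
ILLEGAL = "illegal"


def build_row(entries):
    """Build one 256-entry state row.

    entries is a list of (lo, hi, value) tuples in the order in which the
    column entries appear in the row; later entries overwrite earlier ones.
    """
    row = [ILLEGAL] * 256          # unspecified byte values are illegal
    for lo, hi, value in entries:
        for byte in range(lo, hi + 1):
            row[byte] = value      # the last entry covering a byte wins
    return row


# Corresponds to a row written as "a1-fe:1, a1:4, a3-af:4":
# one wide range first, then a few exceptions.
row = build_row([
    (0xA1, 0xFE, "next state 1"),
    (0xA1, 0xA1, "next state 4"),
    (0xA3, 0xAF, "next state 4"),
])

if __name__ == "__main__":
    for byte in (0x20, 0xA1, 0xA2, 0xA3, 0xB0):
        print(hex(byte), row[byte])
```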
-
-<p>The optional keyword at the beginning of a state line has the following effect:
-<ul>
-    <li><code>initial</code>: The state machine can start reading byte sequences
-        in this state. State 0 is always an initial state. Only initial states can be
-        next states for final byte values. In an initial state, the Unicode mappings
-        for all final bytes are also stored directly in the state table.</li>
-    <li><code>surrogates</code>: All Unicode mappings for final bytes in non-initial
-        states are stored in a separate table of 16-bit Unicode (UTF-16) code units.
-        Since most legacy codepages map only to Unicode code points up to U+ffff
-        (the Basic Multilingual Plane, BMP), the default allocation per mapping
-        result is one 16-bit unit. Individual byte values can be specified to map
-        to surrogate pairs (= two 16-bit units) with action letter <code>p</code>.
-        The <code>surrogates</code> keyword specifies this for the entire state (row).
-        Surrogate pair mapping entries can still hold single units depending on the
-        actual mapping data, but single-unit mapping entries cannot hold a pair of units.
-        Mapping to single-unit entries is the default because the mapping is faster,
-        uses half as much memory in the code units table, and is sufficient for most
-        legacy codepages.</li>
-</ul>
-</p>
-
-<p>When converting to Unicode, the state machine starts in state number 0.
-In each iteration, it reads one input (codepage) byte and either just goes to
-the next state as specified, or treats it as a final byte with the specified action
-and an optional non-0 next (initial) state. This means that a state table needs to
-have at least as many state rows as the maximum number of bytes per character,
-which is the maximum length of any byte sequence.</p>
-
-<h4>Examples for MBCS state tables</h4>
-
-<ul>
-    <li>US-ASCII:
-    <pre>
-    0-7f
-    </pre>
-    This single-row state table describes US-ASCII.
-    Byte values from 0 to 0x7f are valid and map to Unicode characters up to U+ffff.
-    Byte values from 0x80 to 0xff are illegal.<br>
-    &nbsp;</li>
-    <li>Shift-JIS:
-    <pre>
-    0-7f, 81-9f:1, a0-df, e0-fc:1
-    40-7e, 80-fc
-    </pre>
-    This two-row state table describes the structure of Shift-JIS, which encodes some characters
-    with one byte each, and others with two bytes each.
-    Bytes 0 to 0x7f and 0xa0 to 0xdf are valid single-byte encodings.
-    Bytes 0x81 to 0x9f and 0xe0 to 0xfc are lead bytes, i.e., they are followed by one of
-    the bytes that are specified as valid in state 1.
-    A byte sequence of 0x85 0x61 is valid, while a single byte of 0x80 or 0xff is illegal.
-    Similarly, a byte sequence of 0x85 0x31 is illegal.<br>
-    &nbsp;</li>
-    <li>EUC-JP:
-    <pre>
-    0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
-    a1-fe
-    a1-e4
-    a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
-    a1-fe.u
-    </pre>
-    This fairly complicated state table describes EUC-JP.
-    Valid byte sequences are one, two, or three bytes long.
-    Two-byte sequences have lead byte 0x8e and end in state 2, or
-    lead bytes 0xa1 to 0xfe and end in state 1.
-    Three-byte sequences have a lead byte of 0x8f and continue in state 3.
-    Some final byte value ranges are entirely unassigned, therefore they end in state 4
-    with an action letter of <code>u</code> for "unassigned" to save significant memory
-    for the code units table.
-    Assigned three-byte sequences end in state 1 like most two-byte sequences.<br>
-    <em>Note: </em>This reuse of a final or intermediate state is valid for as long
-    as there is no cycle in the state chain. The mappings will be unique because of
-    the different path to the shared state.
-    (Sharing a state saves some memory: Each state table row occupies 1kB in the .cnv file.)<br>
-    This table also shows the redefinition of byte value ranges within one state row
-    (number 3) as a shorthand.<br>
-    &nbsp;</li>
-</ul>
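-<p>The example tables can be exercised with a small simulation. The sketch below is not
-ICU code: it only classifies byte sequences by walking the state rows and does not look up
-any Unicode mapping. The names <code>parse_row</code> and <code>classify</code> are
-hypothetical, the optional <code>initial</code>/<code>surrogates</code> keywords are simply
-skipped, and restarting in the entry's next state after an illegal byte is a simplifying
-assumption.</p>

```python
import re

ENTRY_RE = re.compile(
    r"([0-9a-f]{1,2})(?:-([0-9a-f]{1,2}))?(?::([0-9a-f]{1,2}))?(\.([usip])?)?$"
)
ACTIONS = {"u": "unassigned", "s": "state change only",
           "p": "surrogate pair", "i": "illegal"}


def parse_row(line):
    """Turn one state table row into 256 (next_state, action) pairs."""
    row = [(0, "illegal")] * 256                 # default for unlisted bytes
    for token in line.replace(" ", "").split(","):
        if token in ("initial", "surrogates"):
            continue                             # keywords ignored in this sketch
        m = ENTRY_RE.match(token)
        if m is None:
            raise ValueError("bad column entry: %r" % token)
        lo = int(m.group(1), 16)
        hi = int(m.group(2), 16) if m.group(2) else lo
        nxt = int(m.group(3), 16) if m.group(3) else 0
        if m.group(3) and not m.group(4):
            action = "continue"                  # intermediate byte
        else:
            action = ACTIONS.get(m.group(5), "valid")
        for byte in range(lo, hi + 1):
            row[byte] = (nxt, action)            # later entries win
    return row


def classify(table, data):
    """Return a (hex of byte sequence, action) pair for each sequence in data."""
    results, state, start = [], 0, 0
    for i, byte in enumerate(data):
        nxt, action = table[state][byte]
        state = nxt
        if action == "continue":
            continue
        results.append((data[start:i + 1].hex(), action))
        start = i + 1
    if start < len(data):
        results.append((data[start:].hex(), "truncated"))
    return results


SHIFT_JIS = [parse_row(r) for r in (
    "0-7f, 81-9f:1, a0-df, e0-fc:1",
    "40-7e, 80-fc",
)]
EUC_JP = [parse_row(r) for r in (
    "0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1",
    "a1-fe",
    "a1-e4",
    "a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4",
    "a1-fe.u",
)]

if __name__ == "__main__":
    print(classify(SHIFT_JIS, bytes([0x85, 0x61, 0x80, 0x85, 0x31])))
    print(classify(EUC_JP, bytes([0x8F, 0xA1, 0xA1, 0x8F, 0xA2, 0xA1])))
```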
-
-</body>
-</html>