<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>GB 18030</title>
</head>
<!-- Copyright (C) 2000, International Business Machines Corporation and others. All Rights Reserved. -->
<body>
<h1>GB 18030</h1>
<p align="right">Markus Scherer, 2000-nov-30</p>
<p>GB 18030 is a new Chinese codepage standard, published 2000-mar-17, that is designed for
<ul>
<li>Upwards compatibility with the GB 2312-1980 standard</li>
<li>Compatibility with the GBK specification, updated for Unicode 3.0</li>
<li>Full coverage of all Unicode code points similar to a UTF</li>
</ul>
After discussions between the Chinese standards agency and IT companies, GB 18030 was <em>republished</em>.
On 2000-nov-30, a modified mapping table file was released, and
the text of the standard is expected to be republished in December.</p>
<p>Byte sequence structure:
<ul>
<li>Single-byte: 00-7f</li>
<li>Two-byte: 81-fe | 40-7e, 80-fe</li>
<li>Four-byte: 81-fe | 30-39 | 81-fe | 30-39</li>
</ul></p>
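<p>As a rough illustration of the structure above, the following Python sketch returns the length of the
sequence that starts at a given position. The function name <code>gb18030_sequence_length</code> is made up for this example;
it checks only the byte ranges listed above, not whether a mapping actually exists for the sequence.</p>
<pre>
```python
def gb18030_sequence_length(data, pos=0):
    """Return 1, 2 or 4 for a structurally valid GB 18030 sequence at
    data[pos], or 0 if the bytes there do not match any of the three forms."""
    b1 = data[pos]
    if b1 <= 0x7F:
        return 1                      # single byte: 00-7f
    if not 0x81 <= b1 <= 0xFE or pos + 1 >= len(data):
        return 0                      # not a lead byte, or input is truncated
    b2 = data[pos + 1]
    if 0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE:
        return 2                      # two-byte: 81-fe | 40-7e, 80-fe
    if 0x30 <= b2 <= 0x39 and pos + 3 < len(data):
        b3, b4 = data[pos + 2], data[pos + 3]
        if 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39:
            return 4                  # four-byte: 81-fe | 30-39 | 81-fe | 30-39
    return 0


for sample in ("41", "8140", "81308130", "817f"):
    print(sample, gb18030_sequence_length(bytes.fromhex(sample)))
```
</pre>
<p>Note that the function has to look at the second byte before it can tell a two-byte sequence from a four-byte one.</p>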
<p>Special properties of GB 18030:
<ul>
<li>Huge: 1.6 million codepage code points &mdash; probably the largest codepage</li>
<li>Similar to UTF: All 1.1 million Unicode code points U+0000-U+10ffff
except for surrogates U+d800-U+dfff
map to and from GB 18030 codes.</li>
<li>Most of these mappings, except for parts of the BMP, can be done algorithmically.
This makes it an unusual mix of a Unicode encoding with a traditional codepage.</li>
<li>The length of a byte sequence cannot always be determined from its first byte:
a lead byte in 81-fe can start either a two-byte or a four-byte sequence,
and only the second byte tells them apart. This is unusual.</li>
</ul></p>
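<p>The sizes quoted above follow directly from the byte ranges. The short Python calculation below is only
a cross-check of that arithmetic; the variable names are chosen for this example.</p>
<pre>
```python
# Number of byte values in each range of the GB 18030 structure.
SINGLE = 0x7F - 0x00 + 1                            # 00-7f
LEAD = 0xFE - 0x81 + 1                              # 81-fe (also the third byte)
TRAIL2 = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)      # 40-7e and 80-fe
DIGIT = 0x39 - 0x30 + 1                             # 30-39

two_byte = LEAD * TRAIL2
four_byte = LEAD * DIGIT * LEAD * DIGIT
total_codepage = SINGLE + two_byte + four_byte

# Unicode code points U+0000-U+10ffff without the surrogates U+d800-U+dfff.
unicode_without_surrogates = 0x110000 - (0xDFFF - 0xD800 + 1)

print(two_byte, four_byte, total_codepage, unicode_without_surrogates)
```
</pre>
<p>This gives 1611668 codepage code points ("1.6 million") against 1112064 Unicode code points ("1.1 million").</p>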
<h2>Generating a GB 18030 mapping table</h2>
<p>GB 18030 is derived from existing standards and specifications,
and a mapping table can be generated from existing data with modifications.<br>
On 2000-nov-30, the Chinese standards agency released a mapping table that differs
from the original specification from 2000-mar-17. It changes all four-byte GB sequences
for Unicode BMP code points, removes any mappings for single surrogates,
removes all fallback mappings, and changes some mappings to further update the GBK portion to Unicode 3.0.</p>
<p>The following description illustrates the genesis and structure of GB 18030.
The data included here contains the one- and two-byte GB 18030 mappings
for Unicode BMP code points as released by the Chinese standards agency.
To skip the historical discussion, continue at the <a href="#officialdata">discussion of released data</a>.</p>
<p>Historical discussion based on the specification from 2000-mar-17:
<ol>
<li>GBK is a specification (not a standard) that is an extension of GB 2312-1980
to cover the ideographs in Unicode 2.0. Microsoft co-authored GBK.
Get a GBK table, e.g. the one for Microsoft Windows 2000 codepage 936
from <a href="http://oss.software.ibm.com/icu/charset/">ICU sample charsets</a>.
(There are <a href="http://oss.software.ibm.com/icu/charset/CharMaps-UCM/">.ucm files</a>
including <a href="http://oss.software.ibm.com/icu/charset/CharMaps-UCM/windows-936-2000.ucm">windows-936-2000.ucm</a>.)</li>
<li>From the Microsoft codepage table, remove all fallback mappings and the one for GB+ff.
Note that the Windows 2000 version contains the Euro sign at GB+80=U+20ac.
Leave it in there for GB 18030.</li>
<li>Get a copy of appendix E of the GB 18030 standard.
There are 79 characters with "temporary" and "new" Unicode mappings.
The temporary ones map to private-use code points because the characters were not assigned in Unicode 2.0.
In the data, change them from roundtrip mappings to fallbacks.
The new mappings are to Unicode 3.0 code points.
Add them as roundtrip mappings to your data.</li>
<li>U+0080 is not currently mapped by the standard.
Also, there is a small number of known errors, typos, and ambiguities in the original standard publication.
See <a href="ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf">this summary</a>.
I have added U+0080=GB+8432eb38 to my data, corrected GB+fe5e=U+2e97, and added fallbacks from GB to Unicode
for the four-byte codes for U+2e97 and U+303e.
This is not official at this point!</li>
<li>You should arrive at data like <a href="gbkuni30.txt">gbkuni30.txt</a>.
This file has the following simplified format on each line:<br>
<code>unicode (':' | '>' | '<') gb ['*' ['*']]</code><br>
The left column contains the Unicode code point, the right column the byte sequence in GB 18030.
The delimiter is either a colon for roundtrip mappings or a greater-than sign
for fallbacks from Unicode to the codepage, or a less-than sign for fallbacks from the codepage to Unicode.
I have marked mappings of the appendix E characters with a star.
In addition, I have marked mappings that <em>should be</em> in appendix E with a double star.</li>
</ol>
</p>
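<p>A line in the simplified format described above could be parsed roughly as follows.
This is a sketch, not code from the tools provided here. It assumes that both columns are plain hexadecimal
digits without a prefix and that whitespace is insignificant; neither is confirmed by the description above.
The names <code>Mapping</code> and <code>parse_mapping_line</code> are made up for this example.</p>
<pre>
```python
from typing import NamedTuple


class Mapping(NamedTuple):
    unicode: int   # Unicode code point
    kind: str      # ':' roundtrip, '>' Unicode-to-GB fallback, '<' GB-to-Unicode fallback
    gb: bytes      # GB 18030 byte sequence
    stars: int     # 0, 1 (appendix E) or 2 (should be in appendix E)


def parse_mapping_line(line):
    """Parse one line of the form: unicode (':' | '>' | '<') gb ['*' ['*']]"""
    text = "".join(line.split())              # assumption: whitespace carries no meaning
    stars = len(text) - len(text.rstrip("*"))
    if stars > 2:
        raise ValueError("too many stars in %r" % line)
    text = text.rstrip("*")
    # Hex digits never contain a delimiter, so the first one found is the separator.
    for index, ch in enumerate(text):
        if ch in ":><":
            break
    else:
        raise ValueError("no delimiter in %r" % line)
    return Mapping(int(text[:index], 16), ch, bytes.fromhex(text[index + 1:]), stars)


# U+0080=GB+8432eb38 is the mapping mentioned in the text; the layout of the line is assumed.
print(parse_mapping_line("0080:8432EB38"))
```
</pre>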
<p><a name="officialdata"></a>
The above description explains the state of the data from 2000-mar-17 with corrections.
On 2000-nov-30, a new mapping table was released that is now the basis for the supplied <a href="gbkuni30.txt">gbkuni30.txt</a>.
Compared with the above, it
<ul>
<li>includes explicit mappings for the ASCII characters</li>
<li>corrects the mappings for U+2e97 and U+303e</li>
<li>removes any mappings for single surrogates</li>
<li>removes all fallback mappings, specifying only roundtrip mappings between GB 18030 and Unicode 3.0</li>
<li>changes the mapping for the Euro sign and removes the fullwidth Euro sign (GB+80 is not used any more)</li>
<li>re-enumerates all four-byte GB sequences for Unicode BMP code points from U+0080</li>
</ul>
This results in a new codepage definition that removes backwards compatibility with GBK for some 80 characters
because of the lack of fallback mappings.
The re-enumeration of all four-byte mappings for the Unicode BMP means that about 40000 mappings change from
the original specification, including about 25000 mappings for characters that are assigned in Unicode 3.0.</p>
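<p>The re-enumeration can be pictured with the following Python sketch. It rests on assumptions that are my
reading of the released table rather than something stated above: four-byte codes are handed out consecutively,
starting at GB+81308130 for U+0080, in code point order, and neither surrogates nor code points that already have
a one- or two-byte mapping consume a four-byte code. The function names and the toy input set are hypothetical;
this is not gbmake4.c.</p>
<pre>
```python
def linear_to_four_byte(n):
    """Turn a linear index (0 = GB+81308130) into a four-byte GB 18030 sequence."""
    n, b4 = divmod(n, 10)     # fourth byte counts 30-39
    n, b3 = divmod(n, 126)    # third byte counts 81-fe
    b1, b2 = divmod(n, 10)    # second byte counts 30-39, first byte takes the rest
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


def enumerate_bmp_four_byte(short_mapped):
    """Assign consecutive four-byte codes to BMP code points from U+0080.

    short_mapped is the set of code points that already have a one- or
    two-byte mapping; those and the surrogates are skipped."""
    result = {}
    linear = 0
    for cp in range(0x80, 0x10000):
        if 0xD800 <= cp <= 0xDFFF or cp in short_mapped:
            continue
        result[cp] = linear_to_four_byte(linear)
        linear += 1
    return result


# Toy input: pretend that only U+00a4 has a two-byte mapping.
table = enumerate_bmp_four_byte({0xA4})
for cp in (0x80, 0x81, 0x8A, 0xA3, 0xA5):
    print("U+%04x=GB+%s" % (cp, table[cp].hex()))
```
</pre>
<p>With the real one- and two-byte data as input instead of the toy set, every code point after a skipped one moves up by one
four-byte code, which is why a re-enumeration changes so many mappings at once.</p>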
<p>This is how the data is prepared for use with ICU, as was done for the actual ICU implementation of the GB 18030 converter:
<ol>
<li>Compile <a href="gbmake4.c">gbmake4</a> and run it with the above file as stdin input.
You will get as output all the four-byte mappings for all
BMP code points that do not have a one-byte or two-byte mapping.</li>
<li>All Unicode code points on the supplementary planes, U+10000-U+10ffff, are mapped as well.
Their GB 18030 codes are four-byte sequences starting at GB+90308130.
You can enumerate them lexically by keeping the second and fourth bytes
between 0x30 and 0x39 and the third byte between 0x81 and 0xfe. For example:
<pre>
U+10000=GB+90308130
U+10001=GB+90308131
U+10002=GB+90308132
...
U+1000a=GB+90308230
U+1000b=GB+90308231
...
U+10ffff=GB+e3329a35
</pre>
You can calculate linear values and differences between GB 18030 four-byte sequences
with <a href="lineargb.c">lineargb</a>.</li>
<li>Done! The result is a set of 0x110000 mappings!</li>
<li>Of course, an economical implementation would handle the mappings for the
supplementary planes algorithmically.
Also, large parts of the BMP mappings are contiguous and can be
handled similarly. For an ICU MBCS converter, U+fffe and U+ffff should
in any case be special-cased because these values have special meaning in .cnv files.</li>
<li>You can have gbmake4 generate a list of contiguous four-byte ranges in the BMP.
Run it with the same input but specify "r" as an argument.
Sort the output descending.
Select the ranges that you deem useful, add the one including U+fffe and U+ffff.
For example, see <a href="ranges.txt">ranges.txt</a>.</li>
<li>If you concatenate gbkuni30.txt and your selected ranges including the
"ranges" line in between, you can run this through gbmake4 again and
get a mapping table without the code points in the ranges.</li>
<li>For an ICU converter, turn your data into a .ucm file and
add the header information.
Keep the roundtrip/fallback information:
roundtrip mappings (':') need a trailing "|0", fallback mappings ('>') a trailing "|1".
You can use <a href="gbtoucm.c">gbtoucm</a>.<br>
Use <a href="gbucm.bat">gbucm.bat</a> to generate such a file from the data provided here.</li>
<li>Also for an ICU MBCS converter, you need to specify a state table for the codepage
that describes its structure. For example, with the supplementary planes and the
<a href="ranges.txt">suggested ranges</a> handled algorithmically and therefore
declared as "unassigned", see this <a href="gbstates.txt">sample state table</a>.
(Tip: for determining which bytes can be marked as unassigned for the ranges,
it helps to sort the ranges by byte sequence values.)</li>
<li>All valid four-byte codepage code points that do not map to
any Unicode code point are of course unassigned.
This includes 9012 sequences with a 0x84 lead byte and 9824 with a 0xe3 lead byte,
as well as about 0.5 million with lead bytes 0x85..0x8f and 0xe4..0xfe.</li>
</ol>
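The algorithmic mapping for the supplementary planes described in the steps above can be sketched in Python as follows.
This is an illustration under the stated byte ranges, written independently of lineargb.c;
the function names are made up for this example.
<pre>
```python
def four_byte_to_linear(gb):
    """Linear index of a four-byte GB 18030 sequence (0 = GB+81308130)."""
    b1, b2, b3, b4 = gb
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def linear_to_four_byte(n):
    """Inverse of four_byte_to_linear."""
    n, b4 = divmod(n, 10)
    n, b3 = divmod(n, 126)
    b1, b2 = divmod(n, 10)
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


# U+10000 maps to GB+90308130; everything else follows by counting.
SUPPLEMENTARY_BASE = four_byte_to_linear(bytes.fromhex("90308130"))


def supplementary_to_gb(cp):
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %x" % cp)
    return linear_to_four_byte(SUPPLEMENTARY_BASE + (cp - 0x10000))


def gb_to_supplementary(gb):
    cp = 0x10000 + four_byte_to_linear(gb) - SUPPLEMENTARY_BASE
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane sequence: %s" % gb.hex())
    return cp


for cp in (0x10000, 0x10001, 0x1000A, 0x10FFFF):
    print("U+%x=GB+%s" % (cp, supplementary_to_gb(cp).hex()))

# Sequences with lead byte 0xe3 that lie beyond U+10ffff and are therefore unassigned.
used_e3 = four_byte_to_linear(bytes.fromhex("e3329a35")) - four_byte_to_linear(bytes.fromhex("e3308130")) + 1
print(10 * 126 * 10 - used_e3)
```
</pre>
The loop reproduces the example mappings listed in the steps above, and the last line reproduces the count of 9824
unassigned sequences with a 0xe3 lead byte.<br>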
For comparison with other data files, it is possible to reformat a mapping file so that it
contains only mappings from Unicode to GB or only from GB to Unicode.
<a href="gbsingle.c">gbsingle</a> takes input in the simple format of gbkuni30.txt
and outputs an even simpler format:<br>
With no command line argument, it writes<br>
<code>unicode ':' gb</code><br>
including the fallback mappings from Unicode to the codepage.
With a "gb" argument, it writes<br>
<code>gb ':' unicode</code><br>
including the fallback mappings from the codepage to Unicode.<br>
Use <a href="gbbmp.bat">gbbmp.bat</a> to generate such files (and one combined file)
for all BMP code points from the data provided here.
</p>
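<p>The reformatting described above amounts to a small filter. The Python sketch below shows the idea only;
it is not a port of gbsingle.c, the function name <code>single_direction</code> is made up,
and the sample lines are hypothetical input in the assumed layout of hexadecimal columns without a prefix.</p>
<pre>
```python
def single_direction(lines, direction="unicode"):
    """Reduce mapping lines to one direction.

    direction "unicode": write unicode:gb for roundtrips and Unicode-to-GB fallbacks.
    direction "gb":      write gb:unicode for roundtrips and GB-to-Unicode fallbacks."""
    out = []
    for line in lines:
        text = "".join(line.split()).rstrip("*")   # drop whitespace and star markers
        for delim in ":><":
            if delim in text:
                uni, gb = text.split(delim, 1)
                break
        else:
            continue                               # no mapping on this line
        if direction == "unicode" and delim in ":>":
            out.append(uni + ":" + gb)
        elif direction == "gb" and delim in ":<":
            out.append(gb + ":" + uni)
    return out


# Hypothetical sample lines, one of each kind.
sample = ["0080:8432EB38", "E000>AAA1*", "2E97<8431A439"]
print(single_direction(sample))
print(single_direction(sample, "gb"))
```
</pre>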
</body>
</html>