<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>GB 18030</title>
</head>

<!-- Copyright (C) 2000, International Business Machines Corporation and others. All Rights Reserved. -->

<body>
<h1>GB 18030</h1>
<p align="right">Markus Scherer, 2000-nov-30</p>

<p>GB 18030 is a new Chinese codepage standard, published 2000-mar-17, that is designed for
<ul>
<li>Upwards compatibility with the GB 2312-1980 standard</li>
<li>Compatibility with the GBK specification, updated for Unicode 3.0</li>
<li>Full coverage of all Unicode code points, similar to a UTF</li>
</ul>
After discussions between the Chinese standards agency and IT companies, GB 18030 is being <em>republished</em>.
On 2000-nov-30, a modified mapping table file was released, and
the text of the standard is expected to be republished in December.</p>

<p>Byte sequence structure:
<ul>
<li>Single-byte: 00-7f</li>
<li>Two-byte: 81-fe | 40-7e, 80-fe</li>
<li>Four-byte: 81-fe | 30-39 | 81-fe | 30-39</li>
</ul></p>
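
<p>The structure above can be turned into a small classifier. The following Python sketch checks
structure only, not whether a sequence is actually assigned to a character. The function name
<code>gb18030_sequence_length</code> is made up for this illustration and is not part of ICU.
Because the lead bytes 81-fe are shared by two-byte and four-byte sequences, the second byte has to be inspected.</p>

```python
def gb18030_sequence_length(data: bytes, pos: int = 0) -> int:
    """Return 1, 2 or 4 for a structurally valid sequence at data[pos:], else 0."""
    n = len(data) - pos
    if n < 1:
        return 0
    b1 = data[pos]
    if b1 <= 0x7F:
        return 1                      # single-byte: 00-7f
    if not 0x81 <= b1 <= 0xFE or n < 2:
        return 0                      # 80 and ff are not lead bytes; or input is truncated
    b2 = data[pos + 1]
    if 0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE:
        return 2                      # two-byte: 81-fe | 40-7e, 80-fe
    if 0x30 <= b2 <= 0x39 and n >= 4:
        b3, b4 = data[pos + 2], data[pos + 3]
        if 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39:
            return 4                  # four-byte: 81-fe | 30-39 | 81-fe | 30-39
    return 0                          # anything else is malformed or truncated
```

<p>For example, <code>gb18030_sequence_length(bytes.fromhex("81308130"))</code> returns 4,
while the truncated input <code>8130</code> is rejected with 0.</p>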

<p>Special properties of GB 18030:
<ul>
<li>Huge: 1.6 million codepage code points, probably the largest codepage</li>
<li>Similar to a UTF: All 1.1 million Unicode code points U+0000-U+10ffff
except for surrogates U+d800-U+dfff
map to and from GB 18030 codes.</li>
<li>Most of these mappings, except for parts of the BMP, can be done algorithmically.
This makes it an unusual mix of a Unicode encoding with a traditional codepage.</li>
<li>The length of a byte sequence cannot always be determined from its first byte:
a lead byte 81-fe starts either a two-byte or a four-byte sequence, and only the second byte tells them apart.
This is unusual for a codepage.</li>
</ul></p>
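
<p>The sizes quoted above follow from the byte sequence structure. The short Python calculation below
is a sanity check of that arithmetic; the function name and dictionary keys are illustrative only.</p>

```python
def count_gb18030_code_points() -> dict:
    """Count codepage code points implied by the byte sequence structure."""
    lead = 0xFE - 0x81 + 1                             # 126 lead (and third) bytes 81-fe
    trail2 = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)     # 63 + 127 = 190 two-byte trail bytes
    digit = 0x39 - 0x30 + 1                            # 10 values for the second and fourth bytes
    single = 0x7F + 1                                  # 00-7f
    double = lead * trail2
    quad = lead * digit * lead * digit
    return {
        "single": single,
        "double": double,
        "quad": quad,
        "total": single + double + quad,
        # Unicode code points U+0000-U+10ffff without the surrogates U+d800-U+dfff
        "unicode": 0x110000 - 0x800,
    }
```

<p>The total comes to 1,611,668 codepage code points, against 1,112,064 Unicode code points
outside the surrogate range.</p>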

<h2>Generating a GB 18030 mapping table</h2>

<p>GB 18030 is derived from existing standards and specifications,
and a mapping table can be generated from existing data with modifications.<br>
On 2000-nov-30, the Chinese standards agency released a mapping table that differs
from the original specification from 2000-mar-17. It changes all four-byte GB sequences
for Unicode BMP code points, removes any mappings for single surrogates,
removes all fallback mappings, and changes some mappings to further update the GBK portion to Unicode 3.0.</p>

<p>The following description illustrates the genesis and structure of GB 18030.
The data that is included here contains the actual one- and two-byte GB 18030 mappings
for Unicode BMP code points as released by the Chinese standards agency.
To skip the historical discussion, continue at the <a href="#officialdata">discussion of released data</a>.</p>

<p>Historical discussion based on the specification from 2000-mar-17:
<ol>
<li>GBK is a specification (not a standard) that is an extension of GB 2312-1980
to cover the ideographs in Unicode 2.0. Microsoft co-authored GBK.
Get a GBK table, e.g. the one for Microsoft Windows 2000 codepage 936
from <a href="http://oss.software.ibm.com/icu/charset/">ICU sample charsets</a>.
(There are <a href="http://oss.software.ibm.com/icu/charset/CharMaps-UCM/">.ucm files</a>
including <a href="http://oss.software.ibm.com/icu/charset/CharMaps-UCM/windows-936-2000.ucm">windows-936-2000.ucm</a>.)</li>
<li>From the Microsoft codepage table, remove all fallback mappings and the one for GB+ff.
Note that the Windows 2000 version contains the Euro sign at GB+80=U+20ac.
Leave it in there for GB 18030.</li>
<li>Get a copy of appendix E of the GB 18030 standard.
There are 79 characters with "temporary" and "new" Unicode mappings.
The temporary ones map to private-use code points because the characters were not assigned in Unicode 2.0.
In the data, change them from roundtrip mappings to fallbacks.
The new mappings are to Unicode 3.0 code points.
Add them as roundtrip mappings to your data.</li>
<li>U+0080 is not currently mapped by the standard.
Also, there is a small number of known errors, typos, and ambiguities in the original standard publication.
See <a href="ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf">this summary</a>.
I have added U+0080=GB+8432eb38 to my data, corrected GB+fe5e=U+2e97, and added fallbacks from GB to Unicode
for the four-byte codes for U+2e97 and U+303e.
This is not official at this point!</li>
<li>You should arrive at data like <a href="gbkuni30.txt">gbkuni30.txt</a>.
This file has the following simplified format on each line:<br>
<code>unicode (':' | '&gt;' | '&lt;') gb ['*' ['*']]</code><br>
The left column contains the Unicode code point, the right column the byte sequence in GB 18030.
The delimiter is a colon for roundtrip mappings, a greater-than sign
for fallbacks from Unicode to the codepage, or a less-than sign for fallbacks from the codepage to Unicode.
I have marked mappings of the appendix E characters with a star.
In addition, I have marked mappings that <em>should be</em> in appendix E with a double star.</li>
</ol>
</p>
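
<p>The simplified line format described above is easy to parse. The following Python sketch is one way to do it;
it assumes that both columns are written in hexadecimal without prefixes and that optional whitespace
may surround the delimiter and the stars, which the format description does not spell out.
The names <code>Mapping</code> and <code>parse_mapping_line</code> are hypothetical.</p>

```python
import re
from typing import NamedTuple


class Mapping(NamedTuple):
    code_point: int   # Unicode code point from the left column
    kind: str         # ':' roundtrip, '>' fallback Unicode->GB, '<' fallback GB->Unicode
    gb: bytes         # GB 18030 byte sequence from the right column
    stars: int        # 0, 1 (appendix E) or 2 (should be in appendix E)


# Assumption: hex digits without prefixes; whitespace is tolerated but not required.
_LINE = re.compile(
    r"^\s*([0-9A-Fa-f]{1,6})\s*([:<>])\s*((?:[0-9A-Fa-f]{2})+)\s*(\*{0,2})\s*$"
)


def parse_mapping_line(line: str) -> Mapping:
    """Parse one 'unicode delimiter gb [stars]' line into a Mapping."""
    match = _LINE.match(line)
    if match is None:
        raise ValueError(f"not a mapping line: {line!r}")
    unicode_hex, kind, gb_hex, stars = match.groups()
    return Mapping(int(unicode_hex, 16), kind, bytes.fromhex(gb_hex), len(stars))
```

<p>With the historical mapping mentioned above, <code>parse_mapping_line("0080:8432eb38")</code>
yields code point 0x80, a roundtrip delimiter, the four bytes 84 32 eb 38, and no stars.</p>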

<p><a name="officialdata"></a>
The above description explains the state of the data from 2000-mar-17 with corrections.
On 2000-nov-30, a new mapping table was released that is now the base for the supplied <a href="gbkuni30.txt">gbkuni30.txt</a>.
Compared with the above, it
<ul>
<li>includes explicit mappings for the ASCII characters</li>
<li>corrects the mappings for U+2e97 and U+303e</li>
<li>removes any mappings for single surrogates</li>
<li>removes all fallback mappings, specifying only roundtrip mappings between GB 18030 and Unicode 3.0</li>
<li>changes the mapping for the Euro sign and removes the fullwidth Euro sign (GB+80 is not used any more)</li>
<li>re-enumerates all four-byte GB sequences for Unicode BMP code points from U+0080</li>
</ul>
This results in a new codepage definition that removes backwards compatibility with GBK for some 80 characters
because of the lack of fallback mappings.
The re-enumeration of all four-byte mappings for the Unicode BMP means that about 40000 mappings change from
the original specification, including about 25000 mappings for characters that are assigned in Unicode 3.0.</p>
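
<p>The re-enumeration can be pictured as handing out consecutive four-byte sequences to the BMP code points
that have no shorter mapping. The Python sketch below is an illustration under stated assumptions, not the code of
any tool mentioned here: it assumes that the enumeration starts at GB+81308130 for the first eligible code point
from U+0080, proceeds in ascending code point order, and that surrogates are skipped without consuming a sequence.
The parameter <code>short_mapped</code> is a hypothetical set of code points that already have one-byte or two-byte mappings.</p>

```python
from typing import Iterable, Iterator, Tuple


def four_byte_from_linear(index: int) -> bytes:
    """Return the index-th four-byte sequence, counting from 81 30 81 30."""
    if not 0 <= index < 126 * 10 * 126 * 10:
        raise ValueError("index outside the four-byte range")
    index, b4 = divmod(index, 10)     # fourth byte: 30-39
    index, b3 = divmod(index, 126)    # third byte: 81-fe
    b1, b2 = divmod(index, 10)        # second byte: 30-39, first byte: 81-fe
    return bytes((0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4))


def enumerate_bmp_four_byte(short_mapped: Iterable[int]) -> Iterator[Tuple[int, bytes]]:
    """Yield (code point, four-byte sequence) for BMP code points without a shorter mapping."""
    taken = set(short_mapped)
    index = 0
    for code_point in range(0x80, 0x10000):
        if 0xD800 <= code_point <= 0xDFFF or code_point in taken:
            continue                  # assumption: skipped code points use up no sequence
        yield code_point, four_byte_from_linear(index)
        index += 1
```

<p>The real input would be the set of code points with one-byte and two-byte mappings taken from the released table.</p>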

<p>This is how the data is prepared for use with ICU, as was done for the actual ICU implementation of the GB 18030 converter:
<ol>
<li>Compile <a href="gbmake4.c">gbmake4</a> and run it with the above file as standard input.
The output contains the four-byte mappings for all
BMP code points that do not have a one-byte or two-byte mapping.</li>
<li>All Unicode code points on the supplementary planes, U+10000-U+10ffff, are mapped as well.
Their GB 18030 codes are four-byte sequences starting at GB+90308130.
You can enumerate them lexically by keeping the second and fourth bytes
between 0x30 and 0x39 and the third byte between 0x81 and 0xfe. For example:
<pre>
U+10000=GB+90308130
U+10001=GB+90308131
U+10002=GB+90308132
...
U+1000a=GB+90308230
U+1000b=GB+90308231
...
U+10ffff=GB+e3329a35
</pre>
You can calculate linear values and differences between GB 18030 four-byte sequences
with <a href="lineargb.c">lineargb</a>.</li>
<li>Done! The result is a set of 0x110000 mappings!</li>
<li>Of course, an economical implementation would handle the mappings for the
supplementary planes algorithmically.
Also, large parts of the BMP mappings are contiguous and can be
handled similarly. For an ICU MBCS converter, U+fffe and U+ffff should
in any case be special-cased because these values have special meaning in .cnv files.</li>
<li>You can have gbmake4 generate a list of contiguous four-byte ranges in the BMP.
Run it with the same input but specify "r" as an argument.
Sort the output in descending order.
Select the ranges that you deem useful, and add the one including U+fffe and U+ffff.
For example, see <a href="ranges.txt">ranges.txt</a>.</li>
<li>If you concatenate gbkuni30.txt and your selected ranges, including the
"ranges" line in between, you can run this through gbmake4 again and
get a mapping table without the code points in the ranges.</li>
<li>For an ICU converter, turn your data into a .ucm file and
add the header information.
Keep the roundtrip/fallback information:
roundtrip mappings (':') need a trailing "|0", fallback mappings ('&gt;') a trailing "|1".
You can use <a href="gbtoucm.c">gbtoucm</a>.<br>
Use <a href="gbucm.bat">gbucm.bat</a> to generate such a file from the data provided here.</li>
<li>Also for an ICU MBCS converter, you need to specify a state table for the codepage
that describes its structure. For example, with the supplementary planes and the
<a href="ranges.txt">suggested ranges</a> handled algorithmically and therefore
declared as "unassigned", see this <a href="gbstates.txt">sample state table</a>.
(Tip: for determining which bytes can be marked as unassigned for the ranges,
it helps to sort the ranges by byte sequence values.)</li>
<li>All valid four-byte codepage code points that do not map to
any Unicode code point are of course unassigned.
This includes 9012 sequences with a 0x84 lead byte and 9824 with a 0xe3 lead byte,
as well as about 0.5 million with lead bytes 0x85..0x8f and 0xe4..0xfe.</li>
</ol>
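The algorithmic handling of the supplementary planes mentioned in the steps above can be sketched in Python
by treating a four-byte sequence as a mixed-radix number. This is a minimal illustration, not the code of
lineargb or of the ICU converter; all function names are made up, and the input is assumed to be a structurally valid four-byte sequence.<br>

```python
def gb_linear(seq: bytes) -> int:
    """Linear value of a four-byte sequence, with 81 30 81 30 counted as 0."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def gb_from_linear(value: int) -> bytes:
    """Inverse of gb_linear."""
    value, b4 = divmod(value, 10)
    value, b3 = divmod(value, 126)
    b1, b2 = divmod(value, 10)
    return bytes((0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4))


# U+10000 maps to GB+90308130, as stated in the text.
SUPPLEMENTARY_BASE = gb_linear(bytes.fromhex("90308130"))


def supplementary_to_gb(code_point: int) -> bytes:
    """Map U+10000-U+10ffff to its four-byte sequence."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    return gb_from_linear(SUPPLEMENTARY_BASE + (code_point - 0x10000))


def gb_to_supplementary(seq: bytes) -> int:
    """Map a four-byte sequence back to U+10000-U+10ffff."""
    code_point = gb_linear(seq) - SUPPLEMENTARY_BASE + 0x10000
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("sequence is outside the supplementary range")
    return code_point
```

The sketch reproduces the examples listed above, for instance U+1000a=GB+90308230 and U+10ffff=GB+e3329a35.
The difference <code>gb_linear(bytes.fromhex("e339fe39")) - gb_linear(bytes.fromhex("e3329a35"))</code>
is 9824, which matches the number of unassigned sequences with a 0xe3 lead byte.<br>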
For comparison with other data files, it is possible to reformat a mapping file so that it
contains only mappings from Unicode to GB or only from GB to Unicode.
<a href="gbsingle.c">gbsingle</a> takes input in the simple format of gbkuni30.txt
and outputs an even simpler format:<br>
With no command line argument, it writes<br>
<code>unicode ':' gb</code><br>
including the fallback mappings from Unicode to the codepage.
With a "gb" argument, it writes<br>
<code>gb ':' unicode</code><br>
including the fallback mappings from the codepage to Unicode.<br>
Use <a href="gbbmp.bat">gbbmp.bat</a> to generate such files (and one combined file)
for all BMP code points from the data provided here.
</p>
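
<p>The reformatting described above amounts to a filter on the delimiter. The following Python sketch
mimics that behavior; it is not the gbsingle source, the function name <code>single_direction</code> is hypothetical,
and the exact spacing of the real tool's output is an assumption.</p>

```python
import re
from typing import Iterable, List

# Assumption: hex columns without prefixes, optional whitespace, optional trailing stars.
_LINE = re.compile(r"^\s*([0-9A-Fa-f]+)\s*([:<>])\s*([0-9A-Fa-f]+)")


def single_direction(lines: Iterable[str], to_unicode: bool = False) -> List[str]:
    """Keep only the mappings usable in one direction and drop the star markers.

    to_unicode=False: 'unicode:gb' lines from roundtrips (':') and fallbacks
                      from Unicode to the codepage ('>').
    to_unicode=True:  'gb:unicode' lines from roundtrips (':') and fallbacks
                      from the codepage to Unicode ('<').
    """
    wanted = ":<" if to_unicode else ":>"
    result = []
    for line in lines:
        match = _LINE.match(line)
        if match is None or match.group(2) not in wanted:
            continue                  # skip blank, malformed, or wrong-direction lines
        unicode_hex, _, gb_hex = match.groups()
        result.append(f"{gb_hex}:{unicode_hex}" if to_unicode else f"{unicode_hex}:{gb_hex}")
    return result
```

<p>Calling it once with each setting of <code>to_unicode</code> gives the two single-direction lists for the same input.</p>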

</body>
</html>