ICU-544 update for re-release mapping file and specification

X-SVN-Rev: 3077
Markus Scherer 2000-11-30 22:07:47 +00:00
parent f8024cc8bd
commit 04b0bcfbf8
7 changed files with 84253 additions and 84287 deletions

File diff suppressed because it is too large

File diff suppressed because it is too large

@ -8,14 +8,17 @@
<body>
<h1>GB 18030</h1>
<p align="right">Markus Scherer, 2000-oct-21</p>
<p align="right">Markus Scherer, 2000-nov-30</p>
<p>GB 18030 is a new Chinese codepage standard, published 2000-mar-17, that is designed for
<ul>
<li>Upwards compatibility with the GB 2312-1980 standard</li>
<li>Compatibility with the GBK specification, updated for Unicode 3.0</li>
<li>Full coverage of all Unicode code points similar to a UTF</li>
</ul></p>
</ul>
After discussions between the Chinese standards agency and IT companies, GB 18030 was <em>republished</em>.
On 2000-nov-30, a modified mapping table file was released, and
the text of the standard is expected to be republished in December.</p>
<p>Byte sequence structure:
<ul>
@ -27,9 +30,9 @@
<p>Special properties of GB 18030:
<ul>
<li>Huge: 1.6 million codepage code points &mdash; probably the largest codepage</li>
<li>Similar to UTF: All 1.1 million Unicode code points U+0000-U+10ffff map to GB 18030 codes.
All but 79 Unicode code points can be mapped from GB 18030.
(I.e., there are 79 Unicode code points with only fallback mappings to GB 18030.)</li>
<li>Similar to UTF: All 1.1 million Unicode code points U+0000-U+10ffff
except for surrogates U+d800-U+dfff
map to and from GB 18030 codes.</li>
<li>Most of these mappings, except for parts of the BMP, can be done algorithmically.
This makes it an unusual mix of a Unicode encoding with a traditional codepage.</li>
<li>It is not possible for all codepage byte sequences to determine the length of
@ -39,11 +42,18 @@
<h2>Generating a GB 18030 mapping table</h2>
<p>GB 18030 is derived from existing standards and specifications,
and a mapping table can be generated from existing data.
<em>Note: </em>Following this description does not guarantee compatibility with
the standard or any particular implementation.
This section is most useful for understanding the genesis and structure of GB 18030.<br>
### THIS IS ALPHA-VERSION DATA AND DESCRIPTION AND MAY CHANGE WITHOUT NOTICE</p>
and a mapping table can be generated from existing data with modifications.<br>
On 2000-nov-30, the Chinese standards agency released a mapping table that differs
from the original specification from 2000-mar-17. It changes all four-byte GB sequences
for Unicode BMP code points, removes any mappings for single surrogates,
removes all fallback mappings, and changes some mappings to further update the GBK portion to Unicode 3.0.</p>
<p>The following description illustrates the genesis and structure of GB 18030.
The data included here contains the one- and two-byte GB 18030 mappings
for Unicode BMP code points as released by the Chinese standards agency.
To skip the historical discussion, continue at the <a href="#officialdata">discussion of released data</a>.</p>
<p>Historical discussion based on the specification from 2000-mar-17:
<ol>
<li>GBK is a specification (not a standard) that is an extension of GB 2312-1980
to cover the ideographs in Unicode 2.0. Microsoft co-authored GBK.
@ -74,7 +84,29 @@ This section is most useful for understanding the genesis and structure of GB 18
for fallbacks from Unicode to the codepage, or a less-than sign for fallbacks from the codepage to Unicode.
I have marked mappings of the appendix E characters with a star.
In addition, I have marked mappings that <em>should be</em> in appendix E with a double star.</li>
<li>Now compile <a href="gbmake4.c">gbmake4</a> and run it with the above file as stdin input.
</ol>
</p>
<p><a name="officialdata"></a>
The above description explains the state of the data from 2000-mar-17 with corrections.
On 2000-nov-30, a new mapping table was released that is now the base for the supplied <a href="gbkuni30.txt">gbkuni30.txt</a>.
Compared with the above, it
<ul>
<li>includes explicit mappings for the ASCII characters</li>
<li>corrects the mappings for U+2e97 and U+303e</li>
<li>removes any mappings for single surrogates</li>
<li>removes all fallback mappings, specifying only roundtrip mappings between GB 18030 and Unicode 3.0</li>
<li>changes the mapping for the Euro sign and removes the fullwidth Euro sign</li>
<li>re-enumerates all four-byte GB sequences for Unicode BMP code points from U+0080</li>
</ul>
This results in a new codepage definition that removes backwards compatibility with GBK for some 80 characters
because of the lack of fallback mappings.
The re-enumeration of all four-byte mappings for the Unicode BMP means that about 40000 mappings change from
the original specification, including about 25000 mappings for characters that are assigned in Unicode 3.0.</p>
<p>This is how the data is prepared for use with ICU, as was done for the actual ICU implementation of the GB 18030 converter:
<ol>
<li>Compile <a href="gbmake4.c">gbmake4</a> and run it with the above file as stdin input.
You will get as output all the four-byte mappings for all
BMP code points that do not have a one-byte or two-byte mapping.</li>
<li>All Unicode code points on the supplementary planes, U+10000-U+10ffff, are mapped as well.
@ -116,7 +148,9 @@ This section is most useful for understanding the genesis and structure of GB 18
<li>Also for an ICU MBCS converter, you need to specify a state table for the codepage
that describes its structure. For example, with the supplementary planes and the
<a href="ranges.txt">suggested ranges</a> handled algorithmically and therefore
declared as "unassigned", see this <a href="gbstates.txt">sample state table</a>.</li>
declared as "unassigned", see this <a href="gbstates.txt">sample state table</a>.
(Tip: for determining which bytes can be marked as unassigned for the ranges,
it helps to sort the ranges by byte sequence values.)</li>
<li>All valid four-byte codepage code points that do not map to
any Unicode code point are of course unassigned.
This includes 9012 sequences with a 0x84 lead byte and 9824 with a 0xe3 lead byte,
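
A minimal sketch of the algorithmic mapping for the supplementary planes mentioned above. It assumes the usual GB 18030 layout in which U+10000 corresponds to GB+90308130 and four-byte sequences are counted linearly (second and fourth bytes 0x30-0x39, first and third bytes 0x81-0xfe). That starting point is not stated in this excerpt, and the function and macro names are hypothetical, not part of ICU or gbmake4.c.

```c
#include <assert.h>

/* Linear index of GB+90308130, counted from GB+81308130 (assumed start of the supplementary planes). */
#define SUPPLEMENTARY_BASE 189000UL

/* Convert a linear four-byte index (0 = 81 30 81 30) back into bytes. */
static void linearToGB4(unsigned long linear, unsigned char b[4]) {
    b[3]=(unsigned char)(0x30+linear%10);  linear/=10;
    b[2]=(unsigned char)(0x81+linear%126); linear/=126;
    b[1]=(unsigned char)(0x30+linear%10);  linear/=10;
    b[0]=(unsigned char)(0x81+linear);
}

/* Map a supplementary code point U+10000..U+10ffff to its four-byte sequence. */
static void supplementaryToGB4(unsigned long c, unsigned char b[4]) {
    assert(c>=0x10000UL && c<=0x10ffffUL);
    linearToGB4(SUPPLEMENTARY_BASE+(c-0x10000UL), b);
}
```

Under these assumptions U+10ffff lands on GB+E3329A35, which uses 2776 of the 12600 sequences with a 0xe3 lead byte and leaves the 9824 unassigned ones counted above.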

File diff suppressed because it is too large

@ -26,6 +26,8 @@
* This is useful for generating a partial mapping table and to handle the input
* ranges algorithmically in conversion.
*
* Single surrogates are excluded from the output.
*
* To compile, just call a C compiler/linker with this source file.
* On Windows: cl gbmake4.c
*/
@ -71,7 +73,7 @@ readRanges() {
char *s, *end;
unsigned long c1, c2;
/* parse the input file from stdin, in the format of gb18030markus2.txt */
/* parse the input file from stdin, in the format of gbkuni30.txt */
while(gets(line)!=NULL) {
/* skip empty and comment lines */
if(line[0]==0 || line[0]=='#') {
@ -108,7 +110,7 @@ readRanges() {
/* set the flags for all code points in this range */
while(c1<=c2) {
if(flags[c1]!=UNASSIGNED) {
fprintf(stderr, "error: range covers already-assigned U+%04lx\n", c1);
fprintf(stderr, "error: range covers already-assigned U+%04lX\n", c1);
return 1;
}
flags[c1++]=ROUNDTRIP|FOURBYTE;
@ -126,7 +128,7 @@ main(int argc, const char *argv[]) {
unsigned char bytes[4]={ 0x81, 0x30, 0x81, 0x30 };
char flag;
/* parse the input file from stdin, in the format of gb18030markus2.txt */
/* parse the input file from stdin, in the format of gbkuni30.txt */
while(gets(line)!=NULL) {
/* skip empty and comment lines */
if(line[0]==0 || line[0]=='#' || line[0]==0x1a) {
@ -176,7 +178,7 @@ main(int argc, const char *argv[]) {
/* set the flag for the code point, make sure the mapping from Unicode is not duplicate */
if((flags[c]&flag&FROMU)!=0) {
fprintf(stderr, "error: duplicate assignment for U+%04lx, old flags %u, new %s\n", c, flags[c], line);
fprintf(stderr, "error: duplicate assignment for U+%04lX, old flags %u, new %s\n", c, flags[c], line);
return 1;
}
flags[c]|=flag;
@ -184,9 +186,13 @@ main(int argc, const char *argv[]) {
if(argc<=1) {
/* generate all four-byte sequences that are not already in the input */
for(c=0x81; c<=0xffff; ++c) {
for(c=0x80; c<=0xffff; ++c) {
/* skip single surrogates */
if(c==0xd800) {
c=0xe000;
}
if(flags[c]==UNASSIGNED) {
printf("%04lx:%02x%02x%02x%02x\n", c, bytes[0], bytes[1], bytes[2], bytes[3]);
printf("%04lX:%02X%02X%02X%02X\n", c, bytes[0], bytes[1], bytes[2], bytes[3]);
/* increment the sequence for the next code point */
incFourGB18030(bytes);
} else if(flags[c]&FOURBYTE) {
@ -200,12 +206,17 @@ main(int argc, const char *argv[]) {
unsigned long c1, c2;
printf("ranges\n");
for(c1=0x81; c1<=0xffff;) {
for(c1=0x80; c1<=0xffff;) {
/* skip single surrogates */
if(c1==0xd800) {
c1=0xe000;
}
/* get start bytes of range */
memcpy(b1, bytes, 4);
/* look for the first non-range code point */
for(c2=c1; c2<=0xffff && flags[c2]==UNASSIGNED; ++c2) {
for(c2=c1; c2<=0xffff && flags[c2]==UNASSIGNED && c2!=0xd800; ++c2) {
/* save this sequence to avoid decrementing it after this loop */
memcpy(b2, bytes, 4);
/* increment the sequence for the next code point */
@ -214,11 +225,16 @@ main(int argc, const char *argv[]) {
/* c2 is the first code point after the range; b2 are the bytes for the last code point in the range */
/* print this range, number of codes first for easy sorting */
printf("%06lx U+%04lx-%04lx GB+%02x%02x%02x%02x-%02x%02x%02x%02x\n",
printf("%06lX U+%04lX-%04lX GB+%02X%02X%02X%02X-%02X%02X%02X%02X\n",
c2-c1, c1, c2-1,
b1[0], b1[1], b1[2], b1[3],
b2[0], b2[1], b2[2], b2[3]);
/* skip single surrogates */
if(c2==0xd800) {
c2=0xe000;
}
/* skip all assigned Unicode BMP code points */
for(c1=c2; c1<=0xffff && flags[c1]!=UNASSIGNED; ++c1) {
if(flags[c1]&FOURBYTE) {
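
The hunks above call incFourGB18030() without showing its body. As an illustration only, the following sketch shows one way such an increment could work, assuming the four-byte structure described by the state table (bytes 1 and 3 in 0x81-0xfe, bytes 2 and 4 in 0x30-0x39). The name incFourByteSketch is hypothetical and this is not the code from gbmake4.c.

```c
#include <assert.h>

/* Advance a four-byte GB 18030 sequence to the next one in byte order. */
static void incFourByteSketch(unsigned char bytes[4]) {
    /* the caller must not increment past the last lead byte */
    assert(bytes[0]>=0x81 && bytes[0]<=0xfe);

    if(bytes[3]<0x39) { ++bytes[3]; return; }   /* fourth byte: 0x30..0x39 */
    bytes[3]=0x30;
    if(bytes[2]<0xfe) { ++bytes[2]; return; }   /* third byte: 0x81..0xfe */
    bytes[2]=0x81;
    if(bytes[1]<0x39) { ++bytes[1]; return; }   /* second byte: 0x30..0x39 */
    bytes[1]=0x30;
    ++bytes[0];                                 /* carry into the lead byte */
}
```

Starting from 81 30 81 30, as main() does, ten increments carry from the fourth byte into the third and yield 81 30 82 30.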

@ -4,7 +4,7 @@
# Similarly, some of the BMP mappings are marked as unassigned for the same reason.
# Mostly assigned sequences, with branches in the lead bytes
<icu:state> 0-80, 81:7, 82:8, 83:9, 84:a, 85-fe:4
<icu:state> 0-7f, 81:7, 82:8, 83:9, 84:a, 85-fe:4
<icu:state> 30-39:2, 40-7e, 80-fe
<icu:state> 81-fe:3
<icu:state> 30-39
@ -15,7 +15,9 @@
<icu:state> 30-39.u
# Some unassigned 4-byte sequences, one state for each of the lead bytes 81-84
# Each of these states branches on the second of four bytes; for the third and fourth bytes,
# unassigned sequences continue with state 5, assigned ones with state 2
<icu:state> 30:2, 31-35:5, 36-39:2, 40-7e, 80-fe
<icu:state> 30-35:2, 36-39:5, 40-7e, 80-fe
<icu:state> 30-37:5, 38:2, 39:5, 40-7e, 80-fe
<icu:state> 30:5, 31-32:2, 33-39:5, 40-7e, 80-fe
<icu:state> 30-35:5, 36:2, 37-39:5, 40-7e, 80-fe
<icu:state> 30-31:2, 32-39:5, 40-7e, 80-fe
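
The first four states of the table encode the basic byte structure: a single byte 0x00-0x7f, a lead byte 0x81-0xfe followed by either one trail byte (0x40-0x7e or 0x80-0xfe) or by a digit byte 0x30-0x39 plus two more bytes. The following sketch expresses that structure as a length check. It is an illustration with a hypothetical function name, not ICU code, and it ignores which sequences are assigned.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length (1, 2 or 4) of the sequence at s, or 0 if it is ill-formed or truncated. */
static int gb18030SeqLength(const unsigned char *s, size_t n) {
    assert(s!=NULL);
    if(n==0) {
        return 0;
    }
    if(s[0]<=0x7f) {
        return 1;                               /* single byte */
    }
    if(s[0]<0x81 || s[0]>0xfe || n<2) {
        return 0;                               /* 0x80 and 0xff are not lead bytes */
    }
    if(s[1]>=0x30 && s[1]<=0x39) {
        /* four-byte form: the second byte decides, not the first */
        if(n>=4 && s[2]>=0x81 && s[2]<=0xfe && s[3]>=0x30 && s[3]<=0x39) {
            return 4;
        }
        return 0;
    }
    if((s[1]>=0x40 && s[1]<=0x7e) || (s[1]>=0x80 && s[1]<=0xfe)) {
        return 2;                               /* two-byte form */
    }
    return 0;
}
```

The sequence length is only known after the second byte has been read, because the same lead bytes start both two-byte and four-byte sequences.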

@ -1,13 +1,13 @@
ranges
00405a U+9fa6-dfff GB+82358f34-83389837
001bbe U+0452-200f GB+8130d239-8136a530
0010c7 U+e865-f92b GB+83389838-8431cc32
00083e U+2643-2e80 GB+8137a838-8138fd37
000406 U+fa2a-fe2f GB+8431e336-8432cc35
000375 U+3ce1-4055 GB+8231d439-8232af33
0002fd U+361b-3917 GB+8230a634-8230f238
0002bf U+49b8-4c76 GB+8234a132-8234e734
0001d7 U+4160-4336 GB+8232c938-8232f838
0001b9 U+478e-4946 GB+8233e839-82349639
000175 U+44d7-464b GB+8233a430-8233c932
00001a U+ffe6-ffff GB+8432e932-8432eb37
00385A U+9FA6-D7FF GB+82358F33-8336C738
001BBE U+0452-200F GB+8130D330-8136A531
0010C7 U+E865-F92B GB+8336D030-84308534
00083E U+2643-2E80 GB+8137A839-8138FD38
000406 U+FA2A-FE2F GB+84309C38-84318537
000375 U+3CE1-4055 GB+8231D438-8232AF32
0002FD U+361B-3917 GB+8230A633-8230F237
0002BF U+49B8-4C76 GB+8234A131-8234E733
0001D7 U+4160-4336 GB+8232C937-8232F837
0001B9 U+478E-4946 GB+8233E838-82349638
000175 U+44D7-464B GB+8233A339-8233C931
00001A U+FFE6-FFFF GB+8431A234-8431A439
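
Each range in the new list can be handled algorithmically: within a range, the offset from the first Unicode code point equals the offset from the first GB sequence when four-byte sequences are counted linearly. The sketch below illustrates this with four of the ranges listed above. The struct, the table subset, and the function names are hypothetical and chosen for the example only.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned long uStart, uEnd;   /* Unicode range, inclusive */
    unsigned long gbStart;        /* first four-byte sequence, packed big-endian */
} Range;

/* a subset of the ranges listed above */
static const Range ranges[]={
    { 0x0452, 0x200F, 0x8130D330UL },
    { 0x2643, 0x2E80, 0x8137A839UL },
    { 0x9FA6, 0xD7FF, 0x82358F33UL },
    { 0xFFE6, 0xFFFF, 0x8431A234UL }
};

/* Linear index of a packed four-byte sequence, counted from 81 30 81 30. */
static unsigned long gb4ToLinear(unsigned long gb) {
    unsigned long b0=gb>>24, b1=(gb>>16)&0xff, b2=(gb>>8)&0xff, b3=gb&0xff;
    assert(b0>=0x81 && b1>=0x30 && b2>=0x81 && b3>=0x30);
    return (((b0-0x81)*10+(b1-0x30))*126+(b2-0x81))*10+(b3-0x30);
}

/* Inverse of gb4ToLinear(). */
static unsigned long linearToGB4(unsigned long linear) {
    unsigned long b0, b1, b2, b3;
    b3=0x30+linear%10;  linear/=10;
    b2=0x81+linear%126; linear/=126;
    b1=0x30+linear%10;  linear/=10;
    b0=0x81+linear;
    return (b0<<24)|(b1<<16)|(b2<<8)|b3;
}

/* Return the packed four-byte sequence for c, or 0 if c is in none of the ranges. */
static unsigned long rangeFromUnicode(unsigned long c) {
    size_t i;
    for(i=0; i<sizeof(ranges)/sizeof(ranges[0]); ++i) {
        if(c>=ranges[i].uStart && c<=ranges[i].uEnd) {
            return linearToGB4(gb4ToLinear(ranges[i].gbStart)+(c-ranges[i].uStart));
        }
    }
    return 0;
}
```

For example, rangeFromUnicode(0x200F) reproduces the end of the second listed range, GB+8136A531, and the code point count 0x1BBE of that range is exactly the linear distance between its two GB endpoints plus one.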