ICU-544 update for re-release mapping file and specification

X-SVN-Rev: 3077
2000-11-30 22:07:47 +00:00
commit 04b0bcfbf8
parent f8024cc8bd
7 changed files with 84253 additions and 84287 deletions
--- a/icu4c/data/gb18030.ucm
+++ b/icu4c/data/gb18030.ucm
--- a/icu4c/source/data/mappings/gb18030.ucm
+++ b/icu4c/source/data/mappings/gb18030.ucm
--- a/icu4c/source/tools/makeconv/gb18030/gb18030.html
+++ b/icu4c/source/tools/makeconv/gb18030/gb18030.html
@ -8,14 +8,17 @@

 <body>
 <h1>GB 18030</h1>
-<p align="right">Markus Scherer, 2000-oct-21</p>
+<p align="right">Markus Scherer, 2000-nov-30</p>

 <p>GB 18030 is a new Chinese codepage standard, published 2000-mar-17, that is designed for 
 <ul>
    <li>Upwards compatibility with the GB 2312-1980 standard</li>
    <li>Compatibility with the GBK specification, updated for Unicode 3.0</li>
    <li>Full coverage of all Unicode code points similar to a UTF</li>
-</ul></p>
+</ul>
+After discussions between the Chinese standards agency and IT companies, GB 18030 was <em>republished</em>.
+On 2000-nov-30, a modified mapping table file was released, and
+the text of the standard is expected to be republished in December.</p>

 <p>Byte sequence structure: 
 <ul>
@ -27,9 +30,9 @@
 <p>Special properties of GB 18030: 
 <ul>
    <li>Huge: 1.6 million codepage code points &mdash; probably the largest codepage</li>
-    <li>Similar to UTF: All 1.1 million Unicode code points U+0000-U+10ffff map to GB 18030 codes.
-        All but 79 Unicode code points can be mapped from GB 18030.
-        (I.e., there are 79 Unicode code points with only fallback mappings to GB 18030.)</li>
+    <li>Similar to UTF: All 1.1 million Unicode code points U+0000-U+10ffff
+        except for surrogates U+d800-U+dfff
+        map to and from GB 18030 codes.</li>
    <li>Most of these mappings, except for parts of the BMP, can be done algorithmically.
        This makes it an unusual mix of a Unicode encoding with a traditional codepage.</li>
    <li>It is not possible for all codepage byte sequences to determine the length of
@ -39,11 +42,18 @@
 <h2>Generating a GB 18030 mapping table</h2>

 <p>GB 18030 is derived from existing standards and specifications,
-and a mapping table can be generated from existing data.
-<em>Note: </em>Following this description does not guarantee compatibility with
-the standard or any particular implementation.
-This section is most useful for understanding the genesis and structure of GB 18030.<br>
-### THIS IS ALPHA-VERSION DATA AND DESCRIPTION AND MAY CHANGE WITHOUT NOTICE</p>
+and a mapping table can be generated from existing data with modifications.<br>
+On 2000-nov-30, the Chinese standards agency released a mapping table that differs
+from the original specification from 2000-mar-17. It changes all four-byte GB sequences
+for Unicode BMP code points, removes any mappings for single surrogates,
+removes all fallback mappings, and changes some mappings to further update the GBK portion to Unicode 3.0.</p>
+
+<p>The following description illustrates the genesis and structure of GB 18030.
+The data included here contains the actual one- and two-byte GB 18030 mappings
+for Unicode BMP code points as released by the Chinese standards agency.
+To skip the historical discussion, continue at the <a href="#officialdata">discussion of released data</a>.</p>
+
+<p>Historical discussion based on the specification from 2000-mar-17:
 <ol>
    <li>GBK is a specification (not a standard) that is an extension of GB 2312-1980
        to cover the ideographs in Unicode 2.0. Microsoft co-authored GBK.
@ -74,7 +84,29 @@ This section is most useful for understanding the genesis and structure of GB 18
        for fallbacks from Unicode to the codepage, or a less-than sign for fallbacks from the codepage to Unicode.
        I have marked mappings of the appendix E characters with a star.
        In addition, I have marked mappings that <em>should be</em> in appendix E with a double star.</li>
-    <li>Now compile <a href="gbmake4.c">gbmake4</a> and run it with the above file as stdin input.
+</ol>
+</p>
+
+<p><a name="officialdata"></a>
+The above description explains the state of the data from 2000-mar-17 with corrections.
+On 2000-nov-30, a new mapping table was released that is now the base for the supplied <a href="gbkuni30.txt">gbkuni30.txt</a>.
+Compared with the above, it
+<ul>
+    <li>includes explicit mappings for the ASCII characters</li>
+    <li>corrects the mappings for U+2e97 and U+303e</li>
+    <li>removes any mappings for single surrogates</li>
+    <li>removes all fallback mappings, specifying only roundtrip mappings between GB 18030 and Unicode 3.0</li>
+    <li>changes the mapping for the Euro sign and removes the fullwidth Euro sign</li>
+    <li>re-enumerates all four-byte GB sequences for Unicode BMP code points from U+0080</li>
+</ul>
+This results in a new codepage definition that removes backwards compatibility with GBK for some 80 characters
+because of the lack of fallback mappings.
+The re-enumeration of all four-byte mappings for the Unicode BMP means that about 40000 mappings change from
+the original specification, including about 25000 mappings for characters that are assigned in Unicode 3.0.</p>
+
+<p>This is how the data is prepared for use with ICU, as was done for the actual ICU implementation of the GB 18030 converter:
+<ol>
+    <li>Compile <a href="gbmake4.c">gbmake4</a> and run it with the above file as stdin input.
        You will get as output all the four-byte mappings for all
        BMP code points that do not have a one-byte or two-byte mapping.</li>
    <li>All Unicode code points on the supplementary planes, U+10000-U+10ffff, are mapped as well.
@ -116,7 +148,9 @@ This section is most useful for understanding the genesis and structure of GB 18
    <li>Also for an ICU MBCS converter, you need to specify a state table for the codepage
        that describes its structure. For example, with the supplementary planes and the
        <a href="ranges.txt">suggested ranges</a> handled algorithmically and therefore
-        declared as "unassigned", see this <a href="gbstates.txt">sample state table</a>.</li>
+        declared as "unassigned", see this <a href="gbstates.txt">sample state table</a>.
+        (Tip: for determining which bytes can be marked as unassigned for the ranges,
+        it helps to sort the ranges by byte sequence values.)</li>
    <li>All valid four-byte codepage code points that do not map to
        any Unicode code point are of course unassigned.
        This includes 9012 sequences with a 0x84 lead byte and 9824 with a 0xe3 lead byte,
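The algorithmic handling of the supplementary planes described above can be sketched in C. This is a minimal illustration, not code from this commit: the function names and the two constants are hypothetical, and the starting sequence GB+90308130 for U+10000 is taken from the GB 18030 standard rather than from this diff. Four-byte sequences are treated as a mixed-radix number with byte ranges 81..FE, 30..39, 81..FE, 30..39.

```c
#include <assert.h>
#include <stdint.h>

/* Total number of four-byte sequences: 126 * 10 * 126 * 10 (assumed name). */
#define GB4_COUNT 1587600UL
/* Linear index of GB+90308130, assumed to be the sequence for U+10000. */
#define GB4_SUPPLEMENTARY_BASE 189000UL

/* Split a linear index into the four bytes 81..FE, 30..39, 81..FE, 30..39. */
static void gb4_from_linear(unsigned long index, uint8_t bytes[4]) {
    assert(index < GB4_COUNT);
    bytes[3] = (uint8_t)(0x30 + index % 10);
    index /= 10;
    bytes[2] = (uint8_t)(0x81 + index % 126);
    index /= 126;
    bytes[1] = (uint8_t)(0x30 + index % 10);
    index /= 10;
    bytes[0] = (uint8_t)(0x81 + index);
}

/*
 * Encode a supplementary code point U+10000..U+10FFFF.
 * Returns 0 on success, -1 if c is outside the supplementary planes.
 */
static int gb4_from_supplementary(unsigned long c, uint8_t bytes[4]) {
    if (c < 0x10000UL || c > 0x10FFFFUL) {
        return -1;
    }
    gb4_from_linear(GB4_SUPPLEMENTARY_BASE + (c - 0x10000UL), bytes);
    return 0;
}
```

Under these assumptions U+10FFFF lands on GB+E3329A35, which uses 2776 of the 12600 sequences with a 0xe3 lead byte and leaves 9824 unassigned, the figure quoted in the text above.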
--- a/icu4c/source/tools/makeconv/gb18030/gbkuni30.txt
+++ b/icu4c/source/tools/makeconv/gb18030/gbkuni30.txt
--- a/icu4c/source/tools/makeconv/gb18030/gbmake4.c
+++ b/icu4c/source/tools/makeconv/gb18030/gbmake4.c
@ -26,6 +26,8 @@
 *   This is useful for generating a partial mapping table and to handle the input
 *   ranges algorithmically in conversion.
 *
+*   Single surrogates are excluded from the output.
+*
 *   To compile, just call a C compiler/linker with this source file.
 *   On Windows: cl gbmake4.c
 */
@ -71,7 +73,7 @@ readRanges() {
    char *s, *end;
    unsigned long c1, c2;

-    /* parse the input file from stdin, in the format of gb18030markus2.txt */
+    /* parse the input file from stdin, in the format of gbkuni30.txt */
    while(gets(line)!=NULL) {
        /* skip empty and comment lines */
        if(line[0]==0 || line[0]=='#') {
@ -108,7 +110,7 @@ readRanges() {
        /* set the flags for all code points in this range */
        while(c1<=c2) {
            if(flags[c1]!=UNASSIGNED) {
-                fprintf(stderr, "error: range covers already-assigned U+%04lx\n", c1);
+                fprintf(stderr, "error: range covers already-assigned U+%04lX\n", c1);
                return 1;
            }
            flags[c1++]=ROUNDTRIP|FOURBYTE;
@ -126,7 +128,7 @@ main(int argc, const char *argv[]) {
    unsigned char bytes[4]={ 0x81, 0x30, 0x81, 0x30 };
    char flag;

-    /* parse the input file from stdin, in the format of gb18030markus2.txt */
+    /* parse the input file from stdin, in the format of gbkuni30.txt */
    while(gets(line)!=NULL) {
        /* skip empty and comment lines */
        if(line[0]==0 || line[0]=='#' || line[0]==0x1a) {
@ -176,7 +178,7 @@ main(int argc, const char *argv[]) {

        /* set the flag for the code point, make sure the mapping from Unicode is not duplicate */
        if((flags[c]&flag&FROMU)!=0) {
-            fprintf(stderr, "error: duplicate assignment for U+%04lx, old flags %u, new %s\n", c, flags[c], line);
+            fprintf(stderr, "error: duplicate assignment for U+%04lX, old flags %u, new %s\n", c, flags[c], line);
            return 1;
        }
        flags[c]|=flag;
@ -184,9 +186,13 @@ main(int argc, const char *argv[]) {

    if(argc<=1) {
        /* generate all four-byte sequences that are not already in the input */
-        for(c=0x81; c<=0xffff; ++c) {
+        for(c=0x80; c<=0xffff; ++c) {
+            /* skip single surrogates */
+            if(c==0xd800) {
+                c=0xe000;
+            }
            if(flags[c]==UNASSIGNED) {
-                printf("%04lx:%02x%02x%02x%02x\n", c, bytes[0], bytes[1], bytes[2], bytes[3]);
+                printf("%04lX:%02X%02X%02X%02X\n", c, bytes[0], bytes[1], bytes[2], bytes[3]);
                /* increment the sequence for the next code point */
                incFourGB18030(bytes);
            } else if(flags[c]&FOURBYTE) {
@ -200,12 +206,17 @@ main(int argc, const char *argv[]) {
        unsigned long c1, c2;

        printf("ranges\n");
-        for(c1=0x81; c1<=0xffff;) {
+        for(c1=0x80; c1<=0xffff;) {
+            /* skip single surrogates */
+            if(c1==0xd800) {
+                c1=0xe000;
+            }
+
            /* get start bytes of range */
            memcpy(b1, bytes, 4);

            /* look for the first non-range code point */
-            for(c2=c1; c2<=0xffff && flags[c2]==UNASSIGNED; ++c2) {
+            for(c2=c1; c2<=0xffff && flags[c2]==UNASSIGNED && c2!=0xd800; ++c2) {
                /* save this sequence to avoid decrementing it after this loop */
                memcpy(b2, bytes, 4);
                /* increment the sequence for the next code point */
@ -214,11 +225,16 @@ main(int argc, const char *argv[]) {
            /* c2 is the first code point after the range; b2 are the bytes for the last code point in the range */

            /* print this range, number of codes first for easy sorting */
-            printf("%06lx  U+%04lx-%04lx  GB+%02x%02x%02x%02x-%02x%02x%02x%02x\n",
+            printf("%06lX  U+%04lX-%04lX  GB+%02X%02X%02X%02X-%02X%02X%02X%02X\n",
                c2-c1, c1, c2-1,
                b1[0], b1[1], b1[2], b1[3],
                b2[0], b2[1], b2[2], b2[3]);

+            /* skip single surrogates */
+            if(c2==0xd800) {
+                c2=0xe000;
+            }
+
            /* skip all assigned Unicode BMP code points */
            for(c1=c2; c1<=0xffff && flags[c1]!=UNASSIGNED; ++c1) {
                if(flags[c1]&FOURBYTE) {
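The hunks above call incFourGB18030() without showing it. The sketch below is a hypothetical stand-in, assuming the helper steps through four-byte sequences in byte order with the ranges 81..FE, 30..39, 81..FE, 30..39; the names inc_four_gb18030 and count_bmp_candidates are not from the source. The second function mirrors the new loop bounds to show how many BMP code points remain once single surrogates are skipped.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for incFourGB18030(): advance bytes[] to the next
 * four-byte sequence. The caller must not step past FE 39 FE 39.
 */
static void inc_four_gb18030(unsigned char bytes[4]) {
    assert(bytes[0] >= 0x81 && bytes[0] <= 0xfe);
    if (bytes[3] < 0x39) {
        ++bytes[3];
        return;
    }
    bytes[3] = 0x30;
    if (bytes[2] < 0xfe) {
        ++bytes[2];
        return;
    }
    bytes[2] = 0x81;
    if (bytes[1] < 0x39) {
        ++bytes[1];
        return;
    }
    bytes[1] = 0x30;
    ++bytes[0];
}

/* Count the code points visited by the new loop: U+0080..U+FFFF minus surrogates. */
static unsigned long count_bmp_candidates(void) {
    unsigned long c, n = 0;
    for (c = 0x80; c <= 0xffff; ++c) {
        /* skip single surrogates, as in the patched loops */
        if (c == 0xd800) {
            c = 0xe000;
        }
        ++n;
    }
    return n;
}
```

Starting from 81 30 81 30 and stepping 820 times yields 81 30 D3 30, the start of the U+0452 range in the regenerated ranges.txt.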
--- a/icu4c/source/tools/makeconv/gb18030/gbstates.txt
+++ b/icu4c/source/tools/makeconv/gb18030/gbstates.txt
@ -4,7 +4,7 @@
 # Similarly, some of the BMP mappings are marked as unassigned for the same reason.

 # Mostly assigned sequences, with branches in the lead bytes
-<icu:state> 0-80, 81:7, 82:8, 83:9, 84:a, 85-fe:4
+<icu:state> 0-7f, 81:7, 82:8, 83:9, 84:a, 85-fe:4
 <icu:state> 30-39:2, 40-7e, 80-fe
 <icu:state> 81-fe:3
 <icu:state> 30-39
@ -15,7 +15,9 @@
 <icu:state> 30-39.u

 # Some unassigned 4-byte sequences, one state for each of the lead bytes 81-84
+# Each of these states branches on the second of four bytes; for the third and fourth bytes,
+# unassigned sequences continue with state 5, assigned ones with state 2
 <icu:state> 30:2, 31-35:5, 36-39:2, 40-7e, 80-fe
 <icu:state> 30-35:2, 36-39:5, 40-7e, 80-fe
-<icu:state> 30-37:5, 38:2, 39:5, 40-7e, 80-fe
-<icu:state> 30:5, 31-32:2, 33-39:5, 40-7e, 80-fe
+<icu:state> 30-35:5, 36:2, 37-39:5, 40-7e, 80-fe
+<icu:state> 30-31:2, 32-39:5, 40-7e, 80-fe
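The byte structure encoded by the first four states (single bytes 00-7f, lead bytes 81-fe, second byte 40-7e or 80-fe for two-byte sequences, 30-39 followed by 81-fe and 30-39 for four-byte sequences) can be sketched as a length check in C. This is an illustration only: gb18030_seq_len is a hypothetical name, and the function checks structure without deciding whether a sequence is assigned.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1, 2 or 4 for a structurally valid GB 18030 sequence at s,
 * or 0 if the bytes are ill-formed or fewer than needed (n bytes available).
 */
static int gb18030_seq_len(const unsigned char *s, size_t n) {
    assert(s != NULL);
    if (n == 0) {
        return 0;
    }
    if (s[0] <= 0x7f) {
        return 1;                       /* single byte */
    }
    if (s[0] < 0x81 || s[0] > 0xfe) {
        return 0;                       /* 0x80 and 0xff are not lead bytes */
    }
    if (n < 2) {
        return 0;
    }
    if ((s[1] >= 0x40 && s[1] <= 0x7e) || (s[1] >= 0x80 && s[1] <= 0xfe)) {
        return 2;                       /* two-byte sequence */
    }
    if (s[1] < 0x30 || s[1] > 0x39) {
        return 0;
    }
    /* the second byte 30-39 selects a four-byte sequence */
    if (n < 4 || s[2] < 0x81 || s[2] > 0xfe || s[3] < 0x30 || s[3] > 0x39) {
        return 0;
    }
    return 4;
}
```

Note that the length cannot be known from the lead byte alone; the second byte decides between two and four bytes.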
--- a/icu4c/source/tools/makeconv/gb18030/ranges.txt
+++ b/icu4c/source/tools/makeconv/gb18030/ranges.txt
@ -1,13 +1,13 @@
 ranges
-00405a  U+9fa6-dfff  GB+82358f34-83389837
-001bbe  U+0452-200f  GB+8130d239-8136a530
-0010c7  U+e865-f92b  GB+83389838-8431cc32
-00083e  U+2643-2e80  GB+8137a838-8138fd37
-000406  U+fa2a-fe2f  GB+8431e336-8432cc35
-000375  U+3ce1-4055  GB+8231d439-8232af33
-0002fd  U+361b-3917  GB+8230a634-8230f238
-0002bf  U+49b8-4c76  GB+8234a132-8234e734
-0001d7  U+4160-4336  GB+8232c938-8232f838
-0001b9  U+478e-4946  GB+8233e839-82349639
-000175  U+44d7-464b  GB+8233a430-8233c932
-00001a  U+ffe6-ffff  GB+8432e932-8432eb37
+00385A  U+9FA6-D7FF  GB+82358F33-8336C738
+001BBE  U+0452-200F  GB+8130D330-8136A531
+0010C7  U+E865-F92B  GB+8336D030-84308534
+00083E  U+2643-2E80  GB+8137A839-8138FD38
+000406  U+FA2A-FE2F  GB+84309C38-84318537
+000375  U+3CE1-4055  GB+8231D438-8232AF32
+0002FD  U+361B-3917  GB+8230A633-8230F237
+0002BF  U+49B8-4C76  GB+8234A131-8234E733
+0001D7  U+4160-4336  GB+8232C937-8232F837
+0001B9  U+478E-4946  GB+8233E838-82349638
+000175  U+44D7-464B  GB+8233A339-8233C931
+00001A  U+FFE6-FFFF  GB+8431A234-8431A439
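Each line of the regenerated ranges.txt can be cross-checked by converting its two byte sequences to linear indexes: the difference plus one should equal the count in the first column. The C sketch below shows that conversion and how a code point inside a range could be mapped algorithmically. The GBRange type and the function names are hypothetical and are not part of gbmake4.c.

```c
#include <assert.h>

/* Linear index of a four-byte sequence; GB+81308130 has index 0. */
static unsigned long gb4_linear(const unsigned char b[4]) {
    assert(b[0] >= 0x81 && b[0] <= 0xfe && b[1] >= 0x30 && b[1] <= 0x39);
    assert(b[2] >= 0x81 && b[2] <= 0xfe && b[3] >= 0x30 && b[3] <= 0x39);
    return (((b[0] - 0x81) * 10UL + (b[1] - 0x30)) * 126UL
            + (b[2] - 0x81)) * 10UL + (b[3] - 0x30);
}

/* One line of ranges.txt (hypothetical in-memory form). */
typedef struct {
    unsigned long first, last;      /* Unicode code point range, inclusive */
    unsigned char gb_first[4];      /* GB 18030 bytes for the first code point */
} GBRange;

/*
 * Linear index of the four-byte sequence for c, or -1 if c is not in the range.
 * Within a range, consecutive code points map to consecutive sequences.
 */
static long gb4_linear_for_range(const GBRange *r, unsigned long c) {
    if (c < r->first || c > r->last) {
        return -1;
    }
    return (long)(gb4_linear(r->gb_first) + (c - r->first));
}
```

For the first line, GB+82358F33 has index 19043 and GB+8336C738 has index 33468, a span of 14426 sequences, which is 0x385A as listed.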