ICU-544 information about and tools for gb 18030

X-SVN-Rev: 2762
2000-10-23 20:50:58 +00:00 · 2000-10-23 20:50:58 +00:00 · 2be0117179
commit 2be0117179
parent be45790deb
7 changed files with 24671 additions and 0 deletions
--- a/icu4c/source/tools/makeconv/gb18030/gb18030.html
+++ b/icu4c/source/tools/makeconv/gb18030/gb18030.html
@ -0,0 +1,120 @@
+<html lang="en">
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
+<title>GB 18030</title>
+</head>
+
+<!-- Copyright (C) 2000, International Business Machines Corporation and others.  All Rights Reserved. -->
+
+<body>
+<h1>GB 18030</h1>
+<p align="right">Markus Scherer, 2000-oct-21</p>
+
+<p>GB 18030 is a new Chinese codepage standard, published 2000-mar-17, that is designed for 
+<ul>
+    <li>Upwards compatibility with the GB 2312-1980 standard</li>
+    <li>Compatibility with the GBK specification, updated for Unicode 3.0</li>
+    <li>Full coverage of all Unicode code points similar to a UTF</li>
+</ul></p>
+
+<p>Byte sequence structure: 
+<ul>
+    <li>Single-byte: 00-80</li>
+    <li>Two-byte: 81-fe | 40-7e, 80-fe</li>
+    <li>Four-byte: 81-fe | 30-39 | 81-fe | 30-39</li>
+</ul></p>
+
+<p>Special properties of GB 18030: 
+<ul>
+    <li>Huge: 1.6 million codepage code points &mdash; probably the largest codepage</li>
+    <li>Similar to UTF: All 1.1 million Unicode code points U+0000-U+10ffff map to GB 18030 codes.
+        All but 79 Unicode code points can be mapped from GB 18030.
+        (I.e., there are 79 Unicode code points with only fallback mappings to GB 18030.)</li>
+    <li>Most of these mappings, except for parts of the BMP, can be done algorithmically.
+        This makes it an unusual mix of a Unicode encoding with a traditional codepage.</li>
+    <li>The length of a byte sequence cannot always be determined from its first byte:
+        lead bytes 81-fe start both two-byte and four-byte sequences, and only the second byte tells them apart. This is unusual.</li>
+</ul></p>
+
+<h2>Generating a GB 18030 mapping table</h2>
+
+<p>GB 18030 is derived from existing standards and specifications,
+and a mapping table can be generated from existing data.
+<em>Note: </em>Following this description does not guarantee compatibility with
+the standard or any particular implementation.
+This section is most useful for understanding the genesis and structure of GB 18030.</p>
+<ol>
+    <li>GBK is a specification (not a standard) that is an extension of GB 2312-1980
+        to cover the ideographs in Unicode 2.0. Microsoft co-authored GBK.
+        Get a GBK table, e.g. the one for Microsoft Windows 2000 codepage 936 from ICU sample charsets.</li>
+    <li>From the Microsoft codepage table, remove all fallback mappings and the one for GB+ff.
+        Note that the Windows 2000 version contains the Euro sign at GB+80=U+20ac.
+        Leave it in there for GB 18030.</li>
+    <li>Get a copy of appendix E of the GB 18030 standard.
+        There are 79 characters with "temporary" and "new" Unicode mappings.
+        The temporary ones map to private-use code points because the characters were not assigned in Unicode 2.0.
+        In the data, change them from roundtrip mappings to fallbacks.
+        The new mappings are to Unicode 3.0 code points.
+        Add them as roundtrip mappings to your data.</li>
+    <li>U+0080 is not currently mapped by the standard.
+        Also, there is a small number of known errors, typos, and ambiguities in the original standard publication.
+        See <a href="ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf">this summary</a>.
+        I have added U+0080=GB+8432eb38 to my data.
+        This is not official at this point!</li>
+    <li>You should arrive at data like <a href="gbkuni30.txt">gbkuni30.txt</a>.
+        This file has the following simplified format on each line:<br>
+        <code>unicode (':' | '>') gb ['*']</code><br>
+        The left column contains the Unicode code point, the right column the byte sequence in GB 18030.
+        The delimiter is either a colon for roundtrip mappings or a greater-than sign
+        for fallbacks from Unicode to the codepage.
+        I have marked mappings of the appendix E characters with a star.</li>
+    <li>Now compile <a href="gbmake4.c">gbmake4</a> and run it with the above file as stdin input.
+        You will get as output all the four-byte mappings for all
+        BMP code points that do not have a one-byte or two-byte mapping.</li>
+    <li>All Unicode code points on the supplementary planes, U+10000-U+10ffff, are mapped as well.
+        Their GB 18030 codes are four-byte sequences starting at GB+90308130.
+        You can enumerate them lexically by keeping the second and fourth bytes
+        between 0x30 and 0x39 and the third byte between 0x81 and 0xfe. For example:
+        <pre>
+        U+10000=GB+90308130
+        U+10001=GB+90308131
+        U+10002=GB+90308132
+        ...
+        U+1000a=GB+90308230
+        U+1000b=GB+90308231
+        ...
+        U+10ffff=GB+e3329a35
+        </pre>
+        You can calculate linear values and differences between GB 18030 four-byte sequences
+        with <a href="lineargb.c">lineargb</a>.</li>
+    <li>Done! The result is a set of 0x110000 mappings!</li>
+    <li>Of course, an economical implementation would handle the mappings for the
+        supplementary planes algorithmically.
+        Also, large parts of the BMP mappings are contiguous and can be
+        handled similarly. For an ICU MBCS converter, U+fffe and U+ffff should
+        in any case be special-cased because these values have special meaning in .cnv files.</li>
+    <li>You can have gbmake4 generate a list of contiguous four-byte ranges in the BMP.
+        Run it with the same input but specify "r" as an argument.
+        Sort the output in descending order (each line starts with the number of code points in the range).
+        Select the ranges that you deem useful, and add the one that includes U+fffe and U+ffff.
+        For example, see <a href="ranges.txt">ranges.txt</a>.</li>
+    <li>If you concatenate gbkuni30.txt and your selected ranges including the
+        "ranges" line in between, you can run this through gbmake4 again and
+        get a mapping table without the code points in the ranges.</li>
+    <li>For an ICU converter, turn your data into a .ucm file and
+        add the header information.
+        Keep the roundtrip/fallback information:
+        roundtrip mappings (':') need a trailing "|0", fallback mappings ('>') a trailing "|1".
+        You can use <a href="gbtoucm.c">gbtoucm</a>.</li>
+    <li>Also for an ICU MBCS converter, you need to specify a state table for the codepage
+        that describes its structure. For example, with the supplementary planes and the
+        <a href="ranges.txt">suggested ranges</a> handled algorithmically and therefore
+        declared as "unassigned", see this <a href="gbstates.txt">sample state table</a>.</li>
+    <li>All valid four-byte codepage code points that do not map to
+        any Unicode code point are of course unassigned.
+        This includes 9012 sequences with a 0x84 lead byte and 9824 with a 0xe3 lead byte,
+        as well as about 0.5 million with lead bytes 0x85..0x8f and 0xe4..0xfe.</li>
+</ol></p>
+
+</body>
+</html>
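The description above states that the supplementary planes can be handled algorithmically, starting at GB+90308130. The following is a minimal sketch of that calculation, assuming the same linear ordering that lineargb uses (fourth byte varies fastest, 10*126*10 sequences per lead byte). The function and macro names are hypothetical and not part of ICU.

```c
#include <assert.h>

/* Hypothetical names; only the constants come from the description above. */

/* Linear index of 90 30 81 30 (U+10000), counted from 81 30 81 30 */
#define GB4_SUPPLEMENTARY_BASE ((0x90L-0x81L)*10L*126L*10L)

/* Convert a supplementary code point to its four-byte sequence.
   Returns 0 on success, -1 if c is not in U+10000..U+10ffff. */
int
supplementaryToGB18030(unsigned long c, unsigned char bytes[4]) {
    long linear;

    if(c<0x10000 || c>0x10ffff) {
        return -1;
    }
    linear=GB4_SUPPLEMENTARY_BASE+(long)(c-0x10000);

    /* split the linear value into digits, last byte first */
    bytes[3]=(unsigned char)(0x30+linear%10); linear/=10;
    bytes[2]=(unsigned char)(0x81+linear%126); linear/=126;
    bytes[1]=(unsigned char)(0x30+linear%10); linear/=10;
    bytes[0]=(unsigned char)(0x81+linear);

    /* sanity check: supplementary code points use lead bytes 90..e3 */
    assert(bytes[0]>=0x90 && bytes[0]<=0xe3);
    return 0;
}

/* Convert a four-byte sequence back to a supplementary code point.
   Returns the code point, or -1 if the bytes are not a structurally valid
   four-byte sequence or do not fall into the supplementary block. */
long
supplementaryFromGB18030(const unsigned char bytes[4]) {
    long linear;

    if( bytes[0]<0x81 || bytes[0]>0xfe || bytes[1]<0x30 || bytes[1]>0x39 ||
        bytes[2]<0x81 || bytes[2]>0xfe || bytes[3]<0x30 || bytes[3]>0x39
    ) {
        return -1;
    }
    linear=(((bytes[0]-0x81L)*10+(bytes[1]-0x30))*126+(bytes[2]-0x81))*10+(bytes[3]-0x30);
    linear-=GB4_SUPPLEMENTARY_BASE;
    if(linear<0 || linear>0xfffff) {
        return -1;
    }
    return 0x10000+linear;
}
```

Called with U+10000, U+1000a and U+10ffff, the first function should reproduce the sample sequences listed in the description (90 30 81 30, 90 30 82 30 and e3 32 9a 35).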
--- a/icu4c/source/tools/makeconv/gb18030/gbkuni30.txt
+++ b/icu4c/source/tools/makeconv/gb18030/gbkuni30.txt
--- a/icu4c/source/tools/makeconv/gb18030/gbmake4.c
+++ b/icu4c/source/tools/makeconv/gb18030/gbmake4.c
@ -0,0 +1,211 @@
+/*
+*******************************************************************************
+*
+*   Copyright (C) 2000, International Business Machines
+*   Corporation and others.  All Rights Reserved.
+*
+*******************************************************************************
+*   file name:  gbmake4.c
+*   encoding:   US-ASCII
+*   tab size:   8 (not used)
+*   indentation:4
+*
+*   created on: 2000oct19
+*   created by: Markus W. Scherer
+*
+*   This tool reads and processes codepage mapping files for GB 18030.
+*   Its main function is to read a mapping table with the one- and two-byte
+*   mappings of GB 18030 and to then output a mapping table with all of the
+*   four-byte mappings for the BMP.
+*   When an "r" argument is specified, it will instead write a list of
+*   ranges of contiguous mappings where both Unicode code points and GB 18030
+*   four-byte sequences form contiguous blocks.
+*   This kind of output can be appended to a mapping table with a "ranges" line
+*   in between, and the resulting output will exclude the input ranges.
+*   This is useful for generating a partial mapping table and to handle the input
+*   ranges algorithmically in conversion.
+*
+*   To compile, just call a C compiler/linker with this source file.
+*   On Windows: cl gbmake4.c
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+/* in the printed standard, U+303e is mismapped; this sequence must be skipped */
+static const unsigned char skip303eBytes[4]={ 0x81, 0x39, 0xa6, 0x34 };
+
+/* array of flags for each Unicode BMP code point */
+static char
+flags[0x10000]={ 0 };
+/* flag values: 0: not assigned  1:one/two-byte sequence  2:four-byte sequence */
+
+static void
+incFourGB18030(unsigned char bytes[4]) {
+    if(bytes[3]<0x39) {
+        ++bytes[3];
+    } else {
+        bytes[3]=0x30;
+        if(bytes[2]<0xfe) {
+            ++bytes[2];
+        } else {
+            bytes[2]=0x81;
+            if(bytes[1]<0x39) {
+                ++bytes[1];
+            } else {
+                bytes[1]=0x30;
+                ++bytes[0];
+            }
+        }
+    }
+}
+
+static void
+incSkipFourGB18030(unsigned char bytes[4]) {
+    incFourGB18030(bytes);
+    if(0==memcmp(bytes, skip303eBytes, 4)) {
+        /* make sure to skip the mismapped sequence */
+        incFourGB18030(bytes);
+    }
+}
+
+static int
+readRanges() {
+    char line[200];
+    char *s, *end;
+    unsigned long c1, c2;
+
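+    /* parse the ranges from stdin, in the format of ranges.txt */
+    /* fgets() replaces the unsafe gets(), which was removed in C11 */
+    while(fgets(line, sizeof(line), stdin)!=NULL) {
+        /* remove the line terminator that fgets() keeps */
+        line[strcspn(line, "\r\n")]=0;
+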
+        /* skip empty and comment lines */
+        if(line[0]==0 || line[0]=='#') {
+            continue;
+        }
+
+        /* find the Unicode code point range */
+        s=strstr(line, "U+");
+        if(s==NULL) {
+            fprintf(stderr, "error parsing range from \"%s\"\n", line);
+            return 1;
+        }
+
+        /* read range */
+        s+=2;
+        c1=strtoul(s, &end, 16);
+        if(end==s || *end!='-') {
+            fprintf(stderr, "error parsing range start from \"%s\"\n", line);
+            return 1;
+        }
+
+        s=end+1;
+        c2=strtoul(s, &end, 16);
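+        /* also reject reversed ranges and ranges beyond the BMP, which would overrun flags[] */
+        if(end==s || (*end!=' ' && *end!=0) || c1>c2 || c2>0xffff) {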
+            fprintf(stderr, "error parsing range end from \"%s\"\n", line);
+            return 1;
+        }
+
+        /* set the flags for all code points in this range */
+        while(c1<=c2) {
+            flags[c1++]=2;
+        }
+    }
+
+    return 0;
+}
+
+extern int
+main(int argc, const char *argv[]) {
+    char line[200];
+    char *end;
+    unsigned long c, b;
+    unsigned char bytes[4]={ 0x81, 0x30, 0x81, 0x30 };
+
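+    /* parse the input file from stdin, in the format of gbkuni30.txt */
+    /* fgets() replaces the unsafe gets(), which was removed in C11 */
+    while(fgets(line, sizeof(line), stdin)!=NULL) {
+        /* remove the line terminator that fgets() keeps */
+        line[strcspn(line, "\r\n")]=0;
+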
+        /* skip empty and comment lines */
+        if(line[0]==0 || line[0]=='#' || line[0]==0x1a) {
+            continue;
+        }
+
+        /* end of code points, beginning of ranges? */
+        if(0==strcmp(line, "ranges")) {
+            int result=readRanges();
+            if(result!=0) {
+                return result;
+            }
+            break;
+        }
+
+        /* read Unicode code point */
+        c=strtoul(line, &end, 16);
+        if(end==line || *end!=':' && *end!='>') {
+            fprintf(stderr, "error parsing code point from \"%s\"\n", line);
+            return 1;
+        }
+
+        /* ignore non-BMP code points */
+        if(c>0xffff) {
+            continue;
+        }
+
+        /* read byte sequence as one long value */
+        b=strtoul(end+1, &end, 16);
+        if(*end!=0 && *end!='*') {
+            fprintf(stderr, "error parsing byte sequence from \"%s\"\n", line);
+            return 1;
+        }
+
+        /* set the flag for the code point */
+        flags[c]= b<=0xffff ? 1 : 2;
+    }
+
+    if(argc<=1) {
+        /* generate all four-byte sequences that are not already in the input */
+        for(c=0x81; c<=0xffff; ++c) {
+            if(flags[c]==0) {
+                printf("%04lx:%02x%02x%02x%02x\n", c, bytes[0], bytes[1], bytes[2], bytes[3]);
+            }
+            if(flags[c]!=1) {
+                incSkipFourGB18030(bytes);
+            }
+        }
+    } else if(0==strcmp(argv[1], "r")) {
+        /* generate ranges of contiguous code points with four-byte sequences for what is not covered by the input */
+        unsigned char b1[4], b2[4];
+        unsigned long c1, c2;
+
+        printf("ranges\n");
+        for(c1=0x81; c1<=0xffff;) {
+            /* get start bytes of range */
+            memcpy(b1, bytes, 4);
+
+            /* look for the first non-range code point */
+            for(c2=c1; c2<=0xffff && flags[c2]==0; ++c2) {
+                /* save this sequence to avoid decrementing it after this loop */
+                memcpy(b2, bytes, 4);
+                /* increment the sequence for the next code point */
+                incSkipFourGB18030(bytes);
+            }
+            /* c2 is the first code point after the range; b2 are the bytes for the last code point in the range */
+
+            /* print this range, number of codes first for easy sorting */
+            /* skip an empty range (first code point already assigned), for which b2 is not set */
+            if(c2>c1) {
+                printf("%06lx  U+%04lx-%04lx  GB+%02x%02x%02x%02x-%02x%02x%02x%02x\n",
+                    c2-c1, c1, c2-1,
+                    b1[0], b1[1], b1[2], b1[3],
+                    b2[0], b2[1], b2[2], b2[3]);
+            }
+
+            /* skip all assigned Unicode BMP code points */
+            for(c1=c2; c1<=0xffff && flags[c1]!=0; ++c1) {
+                if(flags[c1]==2) {
+                    incSkipFourGB18030(bytes);
+                }
+            }
+        }
+    } else {
+        fprintf(stderr, "unknown mode argument \"%s\"\n", argv[1]);
+        return 2;
+    }
+
+    return 0;
+}
--- a/icu4c/source/tools/makeconv/gb18030/gbstates.txt
+++ b/icu4c/source/tools/makeconv/gb18030/gbstates.txt
@ -0,0 +1,21 @@
+# ICU state information for the GB 18030 MBCS codepage
+# Note that the entire block for the supplementary Unicode planes is
+# marked unassigned because they are handled algorithmically.
+# Similarly, some of the BMP mappings are marked as unassigned for the same reason.
+
+# Mostly assigned sequences, with branches in the lead bytes
+<icu:state> 0-80, 81:7, 82:8, 83:9, 84:a, 85-fe:4
+<icu:state> 30-39:2, 40-7e, 80-fe
+<icu:state> 81-fe:3
+<icu:state> 30-39
+
+# All-unassigned 4-byte sequences
+<icu:state> 30-39:5, 40-7e, 80-fe
+<icu:state> 81-fe:6
+<icu:state> 30-39.u
+
+# Some unassigned 4-byte sequences, one state for each of the lead bytes 81-84
+<icu:state> 30:2, 31-35:5, 36-39:2, 40-7e, 80-fe
+<icu:state> 30-35:2, 36-39:5, 40-7e, 80-fe
+<icu:state> 30-37:5, 38:2, 39:5, 40-7e, 80-fe
+<icu:state> 30:5, 31-32:2, 33-39:5, 40-7e, 80-fe
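The state table above encodes the byte structure given in gb18030.html (single-byte 00-80, two-byte 81-fe | 40-7e, 80-fe, four-byte 81-fe | 30-39 | 81-fe | 30-39). As a rough illustration of why the first byte alone does not determine the length, here is a small sketch of a structural length check. The function name and its return convention are assumptions made for this example, and it checks structure only, not whether a sequence is assigned.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: determine the length of the GB 18030 sequence at s.
 * Returns 1, 2 or 4 for a structurally valid sequence,
 *         0 if more bytes are needed to decide,
 *        -1 if the bytes cannot start a valid sequence.
 * Trail bytes of a four-byte sequence are only checked once all four are available.
 */
int
gb18030SequenceLength(const unsigned char *s, size_t length) {
    assert(s!=NULL || length==0);

    if(length==0) {
        return 0;
    }
    if(s[0]<=0x80) {
        return 1;               /* single-byte: 00-80 */
    }
    if(s[0]==0xff) {
        return -1;              /* ff is not a lead byte */
    }

    /* lead byte 81-fe: the second byte decides between two and four bytes */
    if(length<2) {
        return 0;
    }
    if((s[1]>=0x40 && s[1]<=0x7e) || (s[1]>=0x80 && s[1]<=0xfe)) {
        return 2;               /* two-byte: 81-fe | 40-7e, 80-fe */
    }
    if(s[1]<0x30 || s[1]>0x39) {
        return -1;
    }

    /* four-byte: 81-fe | 30-39 | 81-fe | 30-39 */
    if(length<4) {
        return 0;
    }
    if(s[2]<0x81 || s[2]>0xfe || s[3]<0x30 || s[3]>0x39) {
        return -1;
    }
    return 4;
}
```

A converter built on a state table does the same work one byte at a time; the sketch only makes the branch on the second byte explicit.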
--- a/icu4c/source/tools/makeconv/gb18030/gbtoucm.c
+++ b/icu4c/source/tools/makeconv/gb18030/gbtoucm.c
@ -0,0 +1,87 @@
+/*
+*******************************************************************************
+*
+*   Copyright (C) 2000, International Business Machines
+*   Corporation and others.  All Rights Reserved.
+*
+*******************************************************************************
+*   file name:  gbtoucm.c
+*   encoding:   US-ASCII
+*   tab size:   8 (not used)
+*   indentation:4
+*
+*   created on: 2000oct19
+*   created by: Markus W. Scherer
+*
+*   This tool reads a mapping table in a very simple format and turns it into
+*   .ucm file format.
+*   The input format is as follows:
+*       unicode (':' | '>') codepage ['*']
+*   With
+*       unicode = hexadecimal number 0..10ffff
+*       codepage = hexadecimal number 0..ffffffff for big-endian bytes
+*       ':' for roundtrip mappings
+*       '>' for fallbacks from Unicode to codepage
+*       '*' ignored
+*
+*   To compile, just call a C compiler/linker with this source file.
+*   On Windows: cl gbtoucm.c
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+extern int
+main(int argc, const char *argv[]) {
+    char line[200];
+    char *end;
+    unsigned long c, b;
+    unsigned char fallback;
+
+    /* parse the input file from stdin */
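+    /* fgets() replaces the unsafe gets(), which was removed in C11 */
+    while(fgets(line, sizeof(line), stdin)!=NULL) {
+        /* remove the line terminator that fgets() keeps */
+        line[strcspn(line, "\r\n")]=0;
+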
+        /* pass through empty and comment lines */
+        if(line[0]==0 || line[0]=='#' || line[0]==0x1a) {
+            puts(line);
+            continue;
+        }
+
+        /* end of code points, beginning of ranges? */
+        if(0==strcmp(line, "ranges")) {
+            break; /* ignore the rest of the file */
+        }
+
+        /* read Unicode code point */
+        c=strtoul(line, &end, 16);
+        if(end==line || *end!=':' && *end!='>') {
+            fprintf(stderr, "error parsing code point from \"%s\"\n", line);
+            return 1;
+        }
+        if(*end==':') {
+            fallback=0;
+        } else {
+            fallback=1;
+        }
+
+        /* read byte sequence as one long value */
+        b=strtoul(end+1, &end, 16);
+        if(*end!=0 && *end!='*') {
+            fprintf(stderr, "error parsing byte sequence from \"%s\"\n", line);
+            return 1;
+        }
+
+        /* output in .ucm format; b is an unsigned long, so its bytes are printed with %02lx */
+        if(b<=0xff) {
+            printf("<U%04lx> \\x%02lx |%u\n", c, b, fallback);
+        } else if(b<=0xffff) {
+            printf("<U%04lx> \\x%02lx\\x%02lx |%u\n", c, b>>8, b&0xff, fallback);
+        } else if(b<=0xffffff) {
+            printf("<U%04lx> \\x%02lx\\x%02lx\\x%02lx |%u\n", c, b>>16, (b>>8)&0xff, b&0xff, fallback);
+        } else {
+            printf("<U%04lx> \\x%02lx\\x%02lx\\x%02lx\\x%02lx |%u\n", c, b>>24, (b>>16)&0xff, (b>>8)&0xff, b&0xff, fallback);
+        }
+    }
+
+    return 0;
+}
--- a/icu4c/source/tools/makeconv/gb18030/lineargb.c
+++ b/icu4c/source/tools/makeconv/gb18030/lineargb.c
@ -0,0 +1,70 @@
+/*
+*******************************************************************************
+*
+*   Copyright (C) 2000, International Business Machines
+*   Corporation and others.  All Rights Reserved.
+*
+*******************************************************************************
+*   file name:  lineargb.c
+*   encoding:   US-ASCII
+*   tab size:   8 (not used)
+*   indentation:4
+*
+*   created on: 2000oct03
+*   created by: Markus W. Scherer
+*
+*   This tool operates on 4-byte GB 18030 codepage sequences. It can
+*   - calculate the linear value of such a sequence, with the lowest one,
+*     81 30 81 30, getting value 0
+*   - calculate the linear difference between two sequences
+*   - calculate a sequence that is linearly offset from another
+*
+*   To compile, just call a C compiler/linker with this source file.
+*   On Windows: cl lineargb.c
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+
+#define LINEAR_18030(a, b, c, d) ((((a)*10+(b))*126L+(c))*10L+(d))
+#define LINEAR_18030_BASE LINEAR_18030(0x81, 0x30, 0x81, 0x30)
+
+static long
+getLinear(const char *argv[]) {
+    unsigned int a, b, c, d;
+
+    a=(unsigned int)strtoul(argv[0], NULL, 16);
+    b=(unsigned int)strtoul(argv[1], NULL, 16);
+    c=(unsigned int)strtoul(argv[2], NULL, 16);
+    d=(unsigned int)strtoul(argv[3], NULL, 16);
+
+    return LINEAR_18030(a, b, c, d);
+}
+
+extern int
+main(int argc, const char *argv[]) {
+    if(argc==5) {
+        printf("Linear value: %ld\n", getLinear(argv+1)-LINEAR_18030_BASE);
+        return 0;
+    } else if(argc==6) {
+        int a, b, c, d;
+        /* strtol() rather than strtoul() so that negative offsets are parsed as signed values */
+        long linear=getLinear(argv+1)-LINEAR_18030_BASE+strtol(argv[5], NULL, 0);
+        d=(int)(0x30+linear%10); linear/=10;
+        c=(int)(0x81+linear%126); linear/=126;
+        b=(int)(0x30+linear%10); linear/=10;
+        a=(int)(0x81+linear);
+        printf("Offset byte sequence: 0x%02x 0x%02x 0x%02x 0x%02x\n",
+               a, b, c, d);
+        return 0;
+    } else if(argc==9) {
+        printf("Linear difference: %ld\n", getLinear(argv+5)-getLinear(argv+1));
+        return 0;
+    } else {
+        printf("Usage: %s a b c d [offset | e f g h] calculates with hexadecimal GB 18030 byte values.\n"
+               "Just one sequence: prints linear value.\n"
+               "Two sequences: prints the linear difference.\n"
+               "One sequence and an offset (decimal or with 0x): prints offset byte sequence\n",
+               argv[0]);
+        return 1;
+    }
+}
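The counts of unassigned sequences quoted at the end of gb18030.html (9012 for lead byte 0x84, 9824 for lead byte 0xe3) can be checked with the same linear arithmetic as in lineargb.c. The sketch below assumes that the last assigned sequences are GB+8432eb37 (U+ffff according to ranges.txt, that is, without the unofficial U+0080 addition) and GB+e3329a35 (U+10ffff). The helper name and the `GB4_PER_LEAD_BYTE` macro are made up for this example.

```c
#include <assert.h>

/* same formula as in lineargb.c, with all factors as long */
#define LINEAR_18030(a, b, c, d) ((((a)*10L+(b))*126L+(c))*10L+(d))

/* number of four-byte sequences that share one lead byte */
#define GB4_PER_LEAD_BYTE (10L*126L*10L)

/*
 * Hypothetical helper: count the structurally valid four-byte sequences
 * that have the same lead byte a and come after a b c d.
 */
long
remainingWithLeadByte(int a, int b, int c, int d) {
    long last;

    assert(a>=0x81 && a<=0xfe);

    /* the last sequence with this lead byte is a 39 fe 39 */
    last=LINEAR_18030(a, 0x39, 0xfe, 0x39);
    return last-LINEAR_18030(a, b, c, d);
}
```

Under those assumptions, `remainingWithLeadByte(0x84, 0x32, 0xeb, 0x37)` gives 9012 and `remainingWithLeadByte(0xe3, 0x32, 0x9a, 0x35)` gives 9824. The 38 lead bytes 0x85..0x8f and 0xe4..0xfe contribute 38 times `GB4_PER_LEAD_BYTE`, or 478800 sequences, which is the "about 0.5 million" figure.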
--- a/icu4c/source/tools/makeconv/gb18030/ranges.txt
+++ b/icu4c/source/tools/makeconv/gb18030/ranges.txt
@ -0,0 +1,13 @@
+ranges
+00405a  U+9fa6-dfff  GB+82358f34-83389837
+001bbe  U+0452-200f  GB+8130d239-8136a530
+0010c7  U+e865-f92b  GB+83389838-8431cc32
+00083e  U+2643-2e80  GB+8137a838-8138fd37
+000406  U+fa2a-fe2f  GB+8431e336-8432cc35
+000375  U+3ce1-4055  GB+8231d439-8232af33
+0002fd  U+361b-3917  GB+8230a634-8230f238
+0002bf  U+49b8-4c76  GB+8234a132-8234e734
+0001d7  U+4160-4336  GB+8232c938-8232f838
+0001b9  U+478e-4946  GB+8233e839-82349639
+000175  U+44d7-464b  GB+8233a430-8233c932
+00001a  U+ffe6-ffff  GB+8432e932-8432eb37
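Each line of ranges.txt pairs a contiguous block of Unicode code points with a contiguous block of four-byte sequences, so a converter can map a code point by adding its offset within the range to the linear value of the range's first sequence. The following sketch shows one way to do that. The `GBRange` type, the function name and the two-row sample table are assumptions for illustration (the rows are copied from ranges.txt), and the code relies on each range being linearly contiguous, which holds for the ranges listed here because none of them spans the skipped sequence 81 39 a6 34.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical range record: one line of ranges.txt */
typedef struct {
    unsigned long first, last;  /* Unicode code points, inclusive */
    unsigned char start[4];     /* GB 18030 sequence for "first" */
} GBRange;

/* two sample rows copied from ranges.txt */
const GBRange sampleRanges[]={
    { 0x0452, 0x200f, { 0x81, 0x30, 0xd2, 0x39 } },
    { 0xffe6, 0xffff, { 0x84, 0x32, 0xe9, 0x32 } }
};

/*
 * Look up c in the ranges and compute its four-byte sequence.
 * Returns 1 if c is covered by a range, 0 otherwise (bytes is then untouched).
 */
int
rangeToGB18030(const GBRange *ranges, size_t count, unsigned long c, unsigned char bytes[4]) {
    size_t i;

    assert(ranges!=NULL);

    for(i=0; i<count; ++i) {
        const GBRange *r=&ranges[i];
        if(c>=r->first && c<=r->last) {
            /* linear value of the start sequence, relative to 81 30 81 30 */
            long linear=(((r->start[0]-0x81L)*10+(r->start[1]-0x30))*126+
                         (r->start[2]-0x81))*10+(r->start[3]-0x30);

            /* add the offset of c within the range */
            linear+=(long)(c-r->first);

            /* split the linear value back into four bytes */
            bytes[3]=(unsigned char)(0x30+linear%10); linear/=10;
            bytes[2]=(unsigned char)(0x81+linear%126); linear/=126;
            bytes[1]=(unsigned char)(0x30+linear%10); linear/=10;
            bytes[0]=(unsigned char)(0x81+linear);
            return 1;
        }
    }
    return 0;
}
```

For the end points of the sample rows this should reproduce the right-hand column of ranges.txt, for example 81 36 a5 30 for U+200f and 84 32 eb 37 for U+ffff. A production converter would more likely use a binary search over a sorted table than the linear scan shown here.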