ICU-544 information about and tools for gb 18030

X-SVN-Rev: 2762
This commit is contained in:
Markus Scherer 2000-10-23 20:50:58 +00:00
parent be45790deb
commit 2be0117179
7 changed files with 24671 additions and 0 deletions


@ -0,0 +1,120 @@
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>GB 18030</title>
</head>
<!-- Copyright (C) 2000, International Business Machines Corporation and others. All Rights Reserved. -->
<body>
<h1>GB 18030</h1>
<p align="right">Markus Scherer, 2000-oct-21</p>
<p>GB 18030 is a new Chinese codepage standard, published 2000-mar-17, that is designed for
<ul>
<li>Upwards compatibility with the GB 2312-1980 standard</li>
<li>Compatibility with the GBK specification, updated for Unicode 3.0</li>
<li>Full coverage of all Unicode code points similar to a UTF</li>
</ul></p>
<p>Byte sequence structure:
<ul>
<li>Single-byte: 00-80</li>
<li>Two-byte: 81-fe | 40-7e, 80-fe</li>
<li>Four-byte: 81-fe | 30-39 | 81-fe | 30-39</li>
</ul></p>
<p>Special properties of GB 18030:
<ul>
<li>Huge: 1.6 million codepage code points &mdash; probably the largest codepage</li>
<li>Similar to UTF: All 1.1 million Unicode code points U+0000-U+10ffff map to GB 18030 codes.
All but 79 Unicode code points can be mapped from GB 18030.
(I.e., there are 79 Unicode code points with only fallback mappings to GB 18030.)</li>
<li>Most of these mappings, except for parts of the BMP, can be done algorithmically.
This makes it an unusual mix of a Unicode encoding with a traditional codepage.</li>
<li>The length of a byte sequence cannot always be determined from its first byte:
a lead byte 81-fe starts either a two-byte or a four-byte sequence, depending on the second byte.
This is unusual.</li>
</ul></p>
<h2>Generating a GB 18030 mapping table</h2>
<p>GB 18030 is derived from existing standards and specifications,
and a mapping table can be generated from existing data.
<em>Note: </em>Following this description does not guarantee compatibility with
the standard or any particular implementation.
This section is most useful for understanding the genesis and structure of GB 18030.</p>
<ol>
<li>GBK is a specification (not a standard) that is an extension of GB 2312-1980
to cover the ideographs in Unicode 2.0. Microsoft co-authored GBK.
Get a GBK table, e.g. the one for Microsoft Windows 2000 codepage 936 from ICU sample charsets.</li>
<li>From the Microsoft codepage table, remove all fallback mappings and the one for GB+ff.
Note that the Windows 2000 version contains the Euro sign at GB+80=U+20ac.
Leave it in there for GB 18030.</li>
<li>Get a copy of appendix E of the GB 18030 standard.
There are 79 characters with "temporary" and "new" Unicode mappings.
The temporary ones map to private-use code points because the characters were not assigned in Unicode 2.0.
In the data, change them from roundtrip mappings to fallbacks.
The new mappings are to Unicode 3.0 code points.
Add them as roundtrip mappings to your data.</li>
<li>U+0080 is not currently mapped by the standard.
Also, there is a small number of known errors, typos, and ambiguities in the original standard publication.
See <a href="ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf">this summary</a>.
I have added U+0080=GB+8432eb38 to my data.
This is not official at this point!</li>
<li>You should arrive at data like <a href="gbkuni30.txt">gbkuni30.txt</a>.
This file has the following simplified format on each line:<br>
<code>unicode (':' | '>') gb ['*']</code><br>
The left column contains the Unicode code point, the right column the byte sequence in GB 18030.
The delimiter is either a colon for roundtrip mappings or a greater-than sign
for fallbacks from Unicode to the codepage.
I have marked mappings of the appendix E characters with a star.</li>
<li>Now compile <a href="gbmake4.c">gbmake4</a> and run it with the above file as standard input.
You will get as output all the four-byte mappings for all
BMP code points that do not have a one-byte or two-byte mapping.</li>
<li>All Unicode code points on the supplementary planes, U+10000-U+10ffff, are mapped as well.
Their GB 18030 codes are four-byte sequences starting at GB+90308130.
You can enumerate them lexically by keeping the second and fourth bytes
between 0x30 and 0x39 and the third byte between 0x81 and 0xfe. For example:
<pre>
U+10000=GB+90308130
U+10001=GB+90308131
U+10002=GB+90308132
...
U+1000a=GB+90308230
U+1000b=GB+90308231
...
U+10ffff=GB+e3329a35
</pre>
You can calculate linear values and differences between GB 18030 four-byte sequences
with <a href="lineargb.c">lineargb</a>.</li>
<li>Done! The result is a set of 0x110000 mappings!</li>
<li>Of course, an economical implementation would handle the mappings for the
supplementary planes algorithmically.
Also, large parts of the BMP mappings are contiguous and can be
handled similarly. For an ICU MBCS converter, U+fffe and U+ffff should
in any case be special-cased because these values have special meaning in .cnv files.</li>
<li>You can have gbmake4 generate a list of contiguous four-byte ranges in the BMP.
Run it with the same input but specify "r" as an argument.
Sort the output descending.
Select the ranges that you deem useful, and add the one that includes U+fffe and U+ffff.
For example, see <a href="ranges.txt">ranges.txt</a>.</li>
<li>If you concatenate gbkuni30.txt and your selected ranges including the
"ranges" line in between, you can run this through gbmake4 again and
get a mapping table without the code points in the ranges.</li>
<li>For an ICU converter, turn your data into a .ucm file and
add the header information.
Keep the roundtrip/fallback information:
roundtrip mappings (':') need a trailing "|0", fallback mappings ('>') a trailing "|1".
You can use <a href="gbtoucm.c">gbtoucm</a>.</li>
<li>Also for an ICU MBCS converter, you need to specify a state table for the codepage
that describes its structure. For example, with the supplementary planes and the
<a href="ranges.txt">suggested ranges</a> handled algorithmically and therefore
declared as "unassigned", see this <a href="gbstates.txt">sample state table</a>.</li>
<li>All valid four-byte codepage code points that do not map to
any Unicode code point are of course unassigned.
This includes 9012 sequences with a 0x84 lead byte and 9824 with a 0xe3 lead byte,
as well as about 0.5 million with lead bytes 0x85..0x8f and 0xe4..0xfe.</li>
</ol></p>
</body>
</html>
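
The byte sequence structure listed in the page above means that a decoder has to inspect the second byte before it knows the length of a sequence. The following is a minimal sketch in C, assuming the byte ranges exactly as listed there (including 0x80 as a single byte). The name gb18030SequenceLength is a hypothetical helper and not part of this commit; it checks structure only and says nothing about whether a sequence is assigned.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of this commit.
 * Returns the length (1, 2 or 4) of the GB 18030 sequence that starts at s,
 * or 0 if the first n bytes do not form a complete, structurally valid sequence.
 */
static int
gb18030SequenceLength(const unsigned char *s, size_t n) {
    assert(s!=NULL);
    if(n==0) {
        return 0;
    }
    if(s[0]<=0x80) {
        return 1;   /* single-byte: 00-80 */
    }
    if(s[0]==0xff || n<2) {
        return 0;   /* ff is not a lead byte, or the input is truncated */
    }
    if((0x40<=s[1] && s[1]<=0x7e) || (0x80<=s[1] && s[1]<=0xfe)) {
        return 2;   /* two-byte: 81-fe | 40-7e, 80-fe */
    }
    if(0x30<=s[1] && s[1]<=0x39) {
        /* four-byte: 81-fe | 30-39 | 81-fe | 30-39 */
        if(n>=4 && 0x81<=s[2] && s[2]<=0xfe && 0x30<=s[3] && s[3]<=0x39) {
            return 4;
        }
    }
    return 0;
}
```

Both 81 40 and 81 30 81 30 begin with the same lead byte, yet the helper reports 2 for the first and 4 for the second, which is the property called unusual above.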

File diff suppressed because it is too large


@ -0,0 +1,211 @@
/*
*******************************************************************************
*
* Copyright (C) 2000, International Business Machines
* Corporation and others. All Rights Reserved.
*
*******************************************************************************
* file name: gbmake4.c
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2000oct19
* created by: Markus W. Scherer
*
* This tool reads and processes codepage mapping files for GB 18030.
* Its main function is to read a mapping table with the one- and two-byte
* mappings of GB 18030 and to then output a mapping table with all of the
* four-byte mappings for the BMP.
* When an "r" argument is specified, it will instead write a list of
* ranges of contiguous mappings where both Unicode code points and GB 18030
* four-byte sequences form contiguous blocks.
* This kind of output can be appended to a mapping table with a "ranges" line
* in between, and the resulting output will exclude the input ranges.
* This is useful for generating a partial mapping table and to handle the input
* ranges algorithmically in conversion.
*
* To compile, just call a C compiler/linker with this source file.
* On Windows: cl gbmake4.c
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* in the printed standard, U+303e is mismapped; this sequence must be skipped */
static const unsigned char skip303eBytes[4]={ 0x81, 0x39, 0xa6, 0x34 };
/* array of flags for each Unicode BMP code point */
static char
flags[0x10000]={ 0 };
/* flag values: 0: not assigned 1:one/two-byte sequence 2:four-byte sequence */
static void
incFourGB18030(unsigned char bytes[4]) {
if(bytes[3]<0x39) {
++bytes[3];
} else {
bytes[3]=0x30;
if(bytes[2]<0xfe) {
++bytes[2];
} else {
bytes[2]=0x81;
if(bytes[1]<0x39) {
++bytes[1];
} else {
bytes[1]=0x30;
++bytes[0];
}
}
}
}
static void
incSkipFourGB18030(unsigned char bytes[4]) {
incFourGB18030(bytes);
if(0==memcmp(bytes, skip303eBytes, 4)) {
/* make sure to skip the mismapped sequence */
incFourGB18030(bytes);
}
}
static int
readRanges() {
char line[200];
char *s, *end;
unsigned long c1, c2;
/* parse the range lines from stdin, in the format of ranges.txt */
while(fgets(line, sizeof(line), stdin)!=NULL) {
/* remove the line terminator that fgets() keeps */
line[strcspn(line, "\r\n")]=0;
/* skip empty and comment lines */
if(line[0]==0 || line[0]=='#') {
continue;
}
/* find the Unicode code point range */
s=strstr(line, "U+");
if(s==NULL) {
fprintf(stderr, "error parsing range from \"%s\"\n", line);
return 1;
}
/* read range */
s+=2;
c1=strtoul(s, &end, 16);
if(end==s || *end!='-') {
fprintf(stderr, "error parsing range start from \"%s\"\n", line);
return 1;
}
s=end+1;
c2=strtoul(s, &end, 16);
if(end==s || (*end!=' ' && *end!=0)) {
fprintf(stderr, "error parsing range end from \"%s\"\n", line);
return 1;
}
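/* flags[] only covers the BMP; reject ranges that would write past its end */
if(c2>0xffff) {
fprintf(stderr, "error: range end beyond U+ffff in \"%s\"\n", line);
return 1;
}
/* set the flags for all code points in this range */
while(c1<=c2) {
flags[c1++]=2;
}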
}
return 0;
}
extern int
main(int argc, const char *argv[]) {
char line[200];
char *end;
unsigned long c, b;
unsigned char bytes[4]={ 0x81, 0x30, 0x81, 0x30 };
/* parse the input file from stdin, in the format of gbkuni30.txt */
while(fgets(line, sizeof(line), stdin)!=NULL) {
/* remove the line terminator that fgets() keeps */
line[strcspn(line, "\r\n")]=0;
/* skip empty and comment lines */
if(line[0]==0 || line[0]=='#' || line[0]==0x1a) {
continue;
}
/* end of code points, beginning of ranges? */
if(0==strcmp(line, "ranges")) {
int result=readRanges();
if(result!=0) {
return result;
}
break;
}
/* read Unicode code point */
c=strtoul(line, &end, 16);
if(end==line || (*end!=':' && *end!='>')) {
fprintf(stderr, "error parsing code point from \"%s\"\n", line);
return 1;
}
/* ignore non-BMP code points */
if(c>0xffff) {
continue;
}
/* read byte sequence as one long value */
b=strtoul(end+1, &end, 16);
if(*end!=0 && *end!='*') {
fprintf(stderr, "error parsing byte sequence from \"%s\"\n", line);
return 1;
}
/* set the flag for the code point */
flags[c]= b<=0xffff ? 1 : 2;
}
if(argc<=1) {
/* generate all four-byte sequences that are not already in the input */
for(c=0x81; c<=0xffff; ++c) {
if(flags[c]==0) {
printf("%04lx:%02x%02x%02x%02x\n", c, bytes[0], bytes[1], bytes[2], bytes[3]);
}
if(flags[c]!=1) {
incSkipFourGB18030(bytes);
}
}
} else if(0==strcmp(argv[1], "r")) {
/* generate ranges of contiguous code points with four-byte sequences for what is not covered by the input */
unsigned char b1[4], b2[4];
unsigned long c1, c2;
printf("ranges\n");
for(c1=0x81; c1<=0xffff;) {
/* get start bytes of range */
memcpy(b1, bytes, 4);
/* look for the first non-range code point */
for(c2=c1; c2<=0xffff && flags[c2]==0; ++c2) {
/* save this sequence to avoid decrementing it after this loop */
memcpy(b2, bytes, 4);
/* increment the sequence for the next code point */
incSkipFourGB18030(bytes);
}
/* c2 is the first code point after the range; b2 are the bytes for the last code point in the range */
/* print this range, number of codes first for easy sorting; an empty range leaves b2 unset and is not printed */
if(c2>c1) {
printf("%06lx U+%04lx-%04lx GB+%02x%02x%02x%02x-%02x%02x%02x%02x\n",
c2-c1, c1, c2-1,
b1[0], b1[1], b1[2], b1[3],
b2[0], b2[1], b2[2], b2[3]);
}
/* skip all assigned Unicode BMP code points */
for(c1=c2; c1<=0xffff && flags[c1]!=0; ++c1) {
if(flags[c1]==2) {
incSkipFourGB18030(bytes);
}
}
}
} else {
fprintf(stderr, "unknown mode argument \"%s\"\n", argv[1]);
return 2;
}
return 0;
}


@ -0,0 +1,21 @@
# ICU state information for the GB 18030 MBCS codepage
# Note that the entire block for the supplementary Unicode planes is
# marked unassigned because they are handled algorithmically.
# Similarly, some of the BMP mappings are marked as unassigned for the same reason.
# Mostly assigned sequences, with branches in the lead bytes
<icu:state> 0-80, 81:7, 82:8, 83:9, 84:a, 85-fe:4
<icu:state> 30-39:2, 40-7e, 80-fe
<icu:state> 81-fe:3
<icu:state> 30-39
# All-unassigned 4-byte sequences
<icu:state> 30-39:5, 40-7e, 80-fe
<icu:state> 81-fe:6
<icu:state> 30-39.u
# Some unassigned 4-byte sequences, one state for each of the lead bytes 81-84
<icu:state> 30:2, 31-35:5, 36-39:2, 40-7e, 80-fe
<icu:state> 30-35:2, 36-39:5, 40-7e, 80-fe
<icu:state> 30-37:5, 38:2, 39:5, 40-7e, 80-fe
<icu:state> 30:5, 31-32:2, 33-39:5, 40-7e, 80-fe


@ -0,0 +1,87 @@
/*
*******************************************************************************
*
* Copyright (C) 2000, International Business Machines
* Corporation and others. All Rights Reserved.
*
*******************************************************************************
* file name: gbtoucm.c
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2000oct19
* created by: Markus W. Scherer
*
* This tool reads a mapping table in a very simple format and turns it into
* .ucm file format.
* The input format is as follows:
* unicode (':' | '>') codepage ['*']
* With
* unicode = hexadecimal number 0..10ffff
* codepage = hexadecimal number 0..ffffffff for big-endian bytes
* ':' for roundtrip mappings
* '>' for fallbacks from Unicode to codepage
* '*' ignored
*
* To compile, just call a C compiler/linker with this source file.
* On Windows: cl gbtoucm.c
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
extern int
main(int argc, const char *argv[]) {
char line[200];
char *end;
unsigned long c, b;
unsigned char fallback;
/* parse the input file from stdin */
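while(fgets(line, sizeof(line), stdin)!=NULL) {
/* remove the line terminator that fgets() keeps */
line[strcspn(line, "\r\n")]=0;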
/* pass through empty and comment lines */
if(line[0]==0 || line[0]=='#' || line[0]==0x1a) {
puts(line);
continue;
}
/* end of code points, beginning of ranges? */
if(0==strcmp(line, "ranges")) {
break; /* ignore the rest of the file */
}
/* read Unicode code point */
c=strtoul(line, &end, 16);
if(end==line || (*end!=':' && *end!='>')) {
fprintf(stderr, "error parsing code point from \"%s\"\n", line);
return 1;
}
if(*end==':') {
fallback=0;
} else {
fallback=1;
}
/* read byte sequence as one long value */
b=strtoul(end+1, &end, 16);
if(*end!=0 && *end!='*') {
fprintf(stderr, "error parsing byte sequence from \"%s\"\n", line);
return 1;
}
/* output in .ucm format; b is an unsigned long, so the bytes need %lx conversions */
if(b<=0xff) {
printf("<U%04lx> \\x%02lx |%u\n", c, b, fallback);
} else if(b<=0xffff) {
printf("<U%04lx> \\x%02lx\\x%02lx |%u\n", c, b>>8, b&0xff, fallback);
} else if(b<=0xffffff) {
printf("<U%04lx> \\x%02lx\\x%02lx\\x%02lx |%u\n", c, b>>16, (b>>8)&0xff, b&0xff, fallback);
} else {
printf("<U%04lx> \\x%02lx\\x%02lx\\x%02lx\\x%02lx |%u\n", c, b>>24, (b>>16)&0xff, (b>>8)&0xff, b&0xff, fallback);
}
}
return 0;
}
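
To make the input and output formats of gbtoucm concrete, here is a minimal sketch of the per-line conversion as a standalone function. The name mappingLineToUcm and the use of a static result buffer are assumptions made for illustration; the tool itself prints directly to stdout.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of gbtoucm.c.
 * Converts one "unicode (':' | '>') codepage ['*']" line into a .ucm line
 * without the trailing newline. Returns a pointer to a static buffer
 * (not reentrant), or NULL if the line cannot be parsed.
 */
static const char *
mappingLineToUcm(const char *line) {
    /* longest result: "<U10ffff> " + 4 * "\xNN" + " |1" + NUL = 30 bytes */
    static char out[32];
    char *end;
    const char *byteText;
    unsigned long c, b;
    unsigned int fallback;
    int shift;
    size_t used;

    assert(line!=NULL);
    c=strtoul(line, &end, 16);
    if(end==line || c>0x10ffff || (*end!=':' && *end!='>')) {
        return NULL;
    }
    fallback= (*end=='>') ? 1 : 0;    /* ':' is roundtrip (|0), '>' is fallback (|1) */
    byteText=end+1;
    b=strtoul(byteText, &end, 16);
    if(end==byteText || b>0xffffffffUL || (*end!=0 && *end!='*')) {
        return NULL;
    }
    /* number of bytes follows from the magnitude of the value, as in gbtoucm.c */
    shift= b<=0xff ? 0 : b<=0xffff ? 8 : b<=0xffffff ? 16 : 24;
    snprintf(out, sizeof(out), "<U%04lx> ", c);
    for(; shift>=0; shift-=8) {
        used=strlen(out);
        snprintf(out+used, sizeof(out)-used, "\\x%02lx", (b>>shift)&0xff);
    }
    used=strlen(out);
    snprintf(out+used, sizeof(out)-used, " |%u", fallback);
    return out;
}
```

With this sketch, the input line 20ac:80 becomes <U20ac> \x80 |0, and a line using '>' instead of ':' ends in |1.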


@ -0,0 +1,70 @@
/*
*******************************************************************************
*
* Copyright (C) 2000, International Business Machines
* Corporation and others. All Rights Reserved.
*
*******************************************************************************
* file name: lineargb.c
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2000oct03
* created by: Markus W. Scherer
*
* This tool operates on 4-byte GB 18030 codepage sequences. It can
* - calculate the linear value of such a sequence, with the lowest one,
* 81 30 81 30, getting value 0
* - calculate the linear difference between two sequences
* - calculate a sequence that is linearly offset from another
*
* To compile, just call a C compiler/linker with this source file.
* On Windows: cl lineargb.c
*/
#include <stdio.h>
#include <stdlib.h>
#define LINEAR_18030(a, b, c, d) ((((a)*10+(b))*126L+(c))*10L+(d))
#define LINEAR_18030_BASE LINEAR_18030(0x81, 0x30, 0x81, 0x30)
static long
getLinear(const char *argv[]) {
unsigned int a, b, c, d;
a=(unsigned int)strtoul(argv[0], NULL, 16);
b=(unsigned int)strtoul(argv[1], NULL, 16);
c=(unsigned int)strtoul(argv[2], NULL, 16);
d=(unsigned int)strtoul(argv[3], NULL, 16);
return LINEAR_18030(a, b, c, d);
}
extern int
main(int argc, const char *argv[]) {
if(argc==5) {
printf("Linear value: %ld\n", getLinear(argv+1)-LINEAR_18030_BASE);
return 0;
} else if(argc==6) {
int a, b, c, d;
long linear=getLinear(argv+1)-LINEAR_18030_BASE+strtol(argv[5], NULL, 0);
d=(int)(0x30+linear%10); linear/=10;
c=(int)(0x81+linear%126); linear/=126;
b=(int)(0x30+linear%10); linear/=10;
a=(int)(0x81+linear);
printf("Offset byte sequence: 0x%02x 0x%02x 0x%02x 0x%02x\n",
a, b, c, d);
return 0;
} else if(argc==9) {
printf("Linear difference: %ld\n", getLinear(argv+5)-getLinear(argv+1));
return 0;
} else {
printf("Usage: %s a b c d [offset | e f g h] calculates with hexadecimal GB 18030 byte values.\n"
"Just one sequence: prints linear value.\n"
"Two sequences: prints the linear difference.\n"
"One sequence and an offset (decimal or with 0x): prints offset byte sequence\n",
argv[0]);
return 1;
}
}
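
The same linear arithmetic is enough to map the supplementary planes algorithmically, since U+10000 corresponds to GB+90308130 and the four-byte sequences then continue in lexical order. The following is a sketch under that assumption; the function names and the convention of packing the four bytes into one unsigned long (as the tools in this commit do when reading a byte sequence) are choices made for illustration, and callers are assumed to pass structurally valid four-byte sequences.

```c
#include <assert.h>

#define LINEAR_18030(a, b, c, d) ((((a)*10L+(b))*126L+(c))*10L+(d))
#define LINEAR_18030_BASE LINEAR_18030(0x81, 0x30, 0x81, 0x30)
/* linear value of 90 30 81 30, the sequence for U+10000 */
#define LINEAR_SUPPLEMENTARY (LINEAR_18030(0x90, 0x30, 0x81, 0x30)-LINEAR_18030_BASE)

/* linear value of a four-byte sequence packed big-endian into one value */
static long
packedToLinear(unsigned long gb) {
    return LINEAR_18030((long)((gb>>24)&0xff), (long)((gb>>16)&0xff),
                        (long)((gb>>8)&0xff), (long)(gb&0xff))-LINEAR_18030_BASE;
}

/* inverse of packedToLinear(), same digit extraction as in lineargb.c */
static unsigned long
linearToPacked(long linear) {
    unsigned long a, b, c, d;
    d=0x30+(unsigned long)(linear%10); linear/=10;
    c=0x81+(unsigned long)(linear%126); linear/=126;
    b=0x30+(unsigned long)(linear%10); linear/=10;
    a=0x81+(unsigned long)linear;
    return (a<<24)|(b<<16)|(c<<8)|d;
}

/* hypothetical helper: supplementary code point to packed GB 18030 bytes */
static unsigned long
supplementaryToGB18030(unsigned long c) {
    assert(0x10000<=c && c<=0x10ffff);
    return linearToPacked(LINEAR_SUPPLEMENTARY+(long)(c-0x10000));
}

/* hypothetical helper: packed GB 18030 bytes back to a supplementary code point */
static unsigned long
gb18030ToSupplementary(unsigned long gb) {
    long offset=packedToLinear(gb)-LINEAR_SUPPLEMENTARY;
    assert(0<=offset && offset<=0xfffff);
    return 0x10000+(unsigned long)offset;
}
```

The offset of U+10000 from the lowest four-byte sequence is 15 lead bytes times 12600 sequences per lead byte, which is what LINEAR_SUPPLEMENTARY evaluates to.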


@ -0,0 +1,13 @@
ranges
00405a U+9fa6-dfff GB+82358f34-83389837
001bbe U+0452-200f GB+8130d239-8136a530
0010c7 U+e865-f92b GB+83389838-8431cc32
00083e U+2643-2e80 GB+8137a838-8138fd37
000406 U+fa2a-fe2f GB+8431e336-8432cc35
000375 U+3ce1-4055 GB+8231d439-8232af33
0002fd U+361b-3917 GB+8230a634-8230f238
0002bf U+49b8-4c76 GB+8234a132-8234e734
0001d7 U+4160-4336 GB+8232c938-8232f838
0001b9 U+478e-4946 GB+8233e839-82349639
000175 U+44d7-464b GB+8233a430-8233c932
00001a U+ffe6-ffff GB+8432e932-8432eb37
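
A converter can use such a range list to map BMP code points algorithmically instead of storing them in the table. Below is a minimal sketch, assuming a small table that copies three of the lines above; the Range type, the function names, and the convention of packing four bytes into one unsigned long are illustrative and not taken from ICU.

```c
#include <assert.h>
#include <stddef.h>

#define LINEAR_18030(a, b, c, d) ((((a)*10L+(b))*126L+(c))*10L+(d))
#define LINEAR_18030_BASE LINEAR_18030(0x81, 0x30, 0x81, 0x30)

/* one contiguous range: first and last code point, bytes of the first one */
typedef struct {
    unsigned long first, last, firstBytes;
} Range;

/* three entries copied from the range list above, for illustration only */
static const Range ranges[]={
    { 0x9fa6, 0xdfff, 0x82358f34UL },
    { 0x0452, 0x200f, 0x8130d239UL },
    { 0xffe6, 0xffff, 0x8432e932UL }
};

/* linear value of a four-byte sequence packed big-endian into one value */
static long
packedToLinear(unsigned long gb) {
    return LINEAR_18030((long)((gb>>24)&0xff), (long)((gb>>16)&0xff),
                        (long)((gb>>8)&0xff), (long)(gb&0xff))-LINEAR_18030_BASE;
}

/* inverse of packedToLinear() */
static unsigned long
linearToPacked(long linear) {
    unsigned long a, b, c, d;
    d=0x30+(unsigned long)(linear%10); linear/=10;
    c=0x81+(unsigned long)(linear%126); linear/=126;
    b=0x30+(unsigned long)(linear%10); linear/=10;
    a=0x81+(unsigned long)linear;
    return (a<<24)|(b<<16)|(c<<8)|d;
}

/*
 * Hypothetical helper: returns the packed four-byte sequence for a code point
 * that falls into one of the listed ranges, or 0 if it is in none of them
 * and therefore has to be looked up in the mapping table.
 */
static unsigned long
bmpRangeToGB18030(unsigned long c) {
    size_t i;
    for(i=0; i<sizeof(ranges)/sizeof(ranges[0]); ++i) {
        if(ranges[i].first<=c && c<=ranges[i].last) {
            return linearToPacked(packedToLinear(ranges[i].firstBytes)+(long)(c-ranges[i].first));
        }
    }
    return 0;
}
```

Because both the code points and the byte sequences are contiguous within a range, only the start of each range is needed; the end bytes in the list then serve as a consistency check.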