ICU-544 update for re-release mapping file and specification
X-SVN-Rev: 3077
parent
f8024cc8bd
commit
04b0bcfbf8
60643
icu4c/data/gb18030.ucm
File diff suppressed because it is too large
@ -8,14 +8,17 @@
|
||||
|
||||
<body>
|
||||
<h1>GB 18030</h1>
|
||||
<p align="right">Markus Scherer, 2000-oct-21</p>
|
||||
<p align="right">Markus Scherer, 2000-nov-30</p>
|
||||
|
||||
<p>GB 18030 is a new Chinese codepage standard, published 2000-mar-17, that is designed for
|
||||
<ul>
|
||||
<li>Upwards compatibility with the GB 2312-1980 standard</li>
|
||||
<li>Compatibility with the GBK specification, updated for Unicode 3.0</li>
|
||||
<li>Full coverage of all Unicode code points similar to a UTF</li>
|
||||
</ul></p>
|
||||
</ul>
|
||||
After discussions between the Chinese standards agency and IT companies, GB 18030 was <em>republished</em>.
|
||||
On 2000-nov-30, a modified mapping table file was released, and
|
||||
the text of the standard is expected to be republished in December.</p>
|
||||
|
||||
<p>Byte sequence structure:
|
||||
<ul>
|
||||
@ -27,9 +30,9 @@
|
||||
<p>Special properties of GB 18030:
|
||||
<ul>
|
||||
<li>Huge: 1.6 million codepage code points — probably the largest codepage</li>
|
||||
<li>Similar to UTF: All 1.1 million Unicode code points U+0000-U+10ffff map to GB 18030 codes.
|
||||
All but 79 Unicode code points can be mapped from GB 18030.
|
||||
(I.e., there are 79 Unicode code points with only fallback mappings to GB 18030.)</li>
|
||||
<li>Similar to UTF: All 1.1 million Unicode code points U+0000-U+10ffff
|
||||
except for surrogates U+d800-U+dfff
|
||||
map to and from GB 18030 codes.</li>
|
||||
<li>Most of these mappings, except for parts of the BMP, can be done algorithmically.
|
||||
This makes it an unusual mix of a Unicode encoding with a traditional codepage.</li>
|
||||
<li>It is not possible for all codepage byte sequences to determine the length of
|
||||
@ -39,11 +42,18 @@
|
||||
<h2>Generating a GB 18030 mapping table</h2>
|
||||
|
||||
<p>GB 18030 is derived from existing standards and specifications,
|
||||
and a mapping table can be generated from existing data.
|
||||
<em>Note: </em>Following this description does not guarantee compatibility with
|
||||
the standard or any particular implementation.
|
||||
This section is most useful for understanding the genesis and structure of GB 18030.<br>
|
||||
### THIS IS ALPHA-VERSION DATA AND DESCRIPTION AND MAY CHANGE WITHOUT NOTICE</p>
|
||||
and a mapping table can be generated from existing data with modifications.<br>
|
||||
On 2000-nov-30, the Chinese standards agency released a mapping table that differs
|
||||
from the original specification from 2000-mar-17. It changes all four-byte GB sequences
|
||||
for Unicode BMP code points, removes any mappings for single surrogates,
|
||||
removes all fallback mappings, and changes some mappings to further update the GBK portion to Unicode 3.0.</p>
|
||||
|
||||
<p>The following description illustrates the genesis and structure of GB 18030.
|
||||
The data included here contains the actual one- and two-byte GB 18030 mappings
|
||||
for Unicode BMP code points as released by the Chinese standards agency.
|
||||
To skip the historical discussion, continue at the <a href="#officialdata">discussion of released data</a>.</p>
|
||||
|
||||
<p>Historical discussion based on the specification from 2000-mar-17:
|
||||
<ol>
|
||||
<li>GBK is a specification (not a standard) that is an extension of GB 2312-1980
|
||||
to cover the ideographs in Unicode 2.0. Microsoft co-authored GBK.
|
||||
@ -74,7 +84,29 @@ This section is most useful for understanding the genesis and structure of GB 18
|
||||
for fallbacks from Unicode to the codepage, or a less-than sign for fallbacks from the codepage to Unicode.
|
||||
I have marked mappings of the appendix E characters with a star.
|
||||
In addition, I have marked mappings that <em>should be</em> in appendix E with a double star.</li>
|
||||
<li>Now compile <a href="gbmake4.c">gbmake4</a> and run it with the above file as stdin input.
|
||||
</ol>
|
||||
</p>
|
||||
|
||||
<p><a name="officialdata"></a>
|
||||
The above description explains the state of the data from 2000-mar-17 with corrections.
|
||||
On 2000-nov-30, a new mapping table was released that is now the base for the supplied <a href="gbkuni30.txt">gbkuni30.txt</a>.
|
||||
Compared with the above, it
|
||||
<ul>
|
||||
<li>includes explicit mappings for the ASCII characters</li>
|
||||
<li>corrects the mappings for U+2e97 and U+303e</li>
|
||||
<li>removes any mappings for single surrogates</li>
|
||||
<li>removes all fallback mappings, specifying only roundtrip mappings between GB 18030 and Unicode 3.0</li>
|
||||
<li>changes the mapping for the Euro sign and removes the fullwidth Euro sign</li>
|
||||
<li>re-enumerates all four-byte GB sequences for Unicode BMP code points from U+0080</li>
|
||||
</ul>
|
||||
This results in a new codepage definition that removes backwards compatibility with GBK for some 80 characters
|
||||
because of the lack of fallback mappings.
|
||||
The re-enumeration of all four-byte mappings for the Unicode BMP means that about 40000 mappings change from
|
||||
the original specification, including about 25000 mappings for characters that are assigned in Unicode 3.0.</p>
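The re-enumeration described above can be pictured as counting through the four-byte sequences in a mixed-radix system. The following is a minimal sketch, not the code of gbmake4.c: the names `gb4_linear`, `gb4_from_linear` and `GB4_COUNT` are hypothetical. It assumes the four-byte structure shown in the sample state table (first and third bytes 0x81-0xFE, second and fourth bytes 0x30-0x39) and takes 81 30 81 30, the starting value used in gbmake4, as position 0.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; they are not part of gbmake4.c. */

/* Number of structurally valid four-byte sequences: 126 * 10 * 126 * 10. */
#define GB4_COUNT 1587600UL

/* Linear position of a four-byte sequence, with 81 30 81 30 at position 0. */
static unsigned long gb4_linear(const unsigned char b[4]) {
    assert(b[0] >= 0x81 && b[0] <= 0xfe);
    assert(b[1] >= 0x30 && b[1] <= 0x39);
    assert(b[2] >= 0x81 && b[2] <= 0xfe);
    assert(b[3] >= 0x30 && b[3] <= 0x39);
    return (((b[0] - 0x81) * 10UL + (b[1] - 0x30)) * 126UL
            + (b[2] - 0x81)) * 10UL + (b[3] - 0x30);
}

/* Inverse of gb4_linear: write the four bytes for a linear position. */
static void gb4_from_linear(unsigned long index, unsigned char b[4]) {
    assert(index < GB4_COUNT);
    b[3] = (unsigned char)(0x30 + index % 10);
    index /= 10;
    b[2] = (unsigned char)(0x81 + index % 126);
    index /= 126;
    b[1] = (unsigned char)(0x30 + index % 10);
    index /= 10;
    b[0] = (unsigned char)(0x81 + index);
}
```

As a sanity check against the new ranges.txt in this commit, the range U+0452-200F runs from GB+8130D330 to GB+8136A531; the difference of the two linear positions plus one is 0x1BBE, which matches the count column for that range.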
|
||||
|
||||
<p>This is how the data is prepared for use with ICU, as was done for the actual ICU implementation of the GB 18030 converter:
|
||||
<ol>
|
||||
<li>Compile <a href="gbmake4.c">gbmake4</a> and run it with the above file as stdin input.
|
||||
You will get as output all the four-byte mappings for all
|
||||
BMP code points that do not have a one-byte or two-byte mapping.</li>
|
||||
<li>All Unicode code points on the supplementary planes, U+10000-U+10ffff, are mapped as well.
|
||||
@ -116,7 +148,9 @@ This section is most useful for understanding the genesis and structure of GB 18
|
||||
<li>Also for an ICU MBCS converter, you need to specify a state table for the codepage
|
||||
that describes its structure. For example, with the supplementary planes and the
|
||||
<a href="ranges.txt">suggested ranges</a> handled algorithmically and therefore
|
||||
declared as "unassigned", see this <a href="gbstates.txt">sample state table</a>.</li>
|
||||
declared as "unassigned", see this <a href="gbstates.txt">sample state table</a>.
|
||||
(Tip: for determining which bytes can be marked as unassigned for the ranges,
|
||||
it helps to sort the ranges by byte sequence values.)</li>
|
||||
<li>All valid four-byte codepage code points that do not map to
|
||||
any Unicode code point are of course unassigned.
|
||||
This includes 9012 sequences with a 0x84 lead byte and 9824 with a 0xe3 lead byte,
|
||||
|
File diff suppressed because it is too large
@ -26,6 +26,8 @@
|
||||
* This is useful for generating a partial mapping table and to handle the input
|
||||
* ranges algorithmically in conversion.
|
||||
*
|
||||
* Single surrogates are excluded from the output.
|
||||
*
|
||||
* To compile, just call a C compiler/linker with this source file.
|
||||
* On Windows: cl gbmake4.c
|
||||
*/
|
||||
@ -71,7 +73,7 @@ readRanges() {
|
||||
char *s, *end;
|
||||
unsigned long c1, c2;
|
||||
|
||||
/* parse the input file from stdin, in the format of gb18030markus2.txt */
|
||||
/* parse the input file from stdin, in the format of gbkuni30.txt */
|
||||
while(gets(line)!=NULL) {
|
||||
/* skip empty and comment lines */
|
||||
if(line[0]==0 || line[0]=='#') {
|
||||
@ -108,7 +110,7 @@ readRanges() {
|
||||
/* set the flags for all code points in this range */
|
||||
while(c1<=c2) {
|
||||
if(flags[c1]!=UNASSIGNED) {
|
||||
fprintf(stderr, "error: range covers already-assigned U+%04lx\n", c1);
|
||||
fprintf(stderr, "error: range covers already-assigned U+%04lX\n", c1);
|
||||
return 1;
|
||||
}
|
||||
flags[c1++]=ROUNDTRIP|FOURBYTE;
|
||||
@ -126,7 +128,7 @@ main(int argc, const char *argv[]) {
|
||||
unsigned char bytes[4]={ 0x81, 0x30, 0x81, 0x30 };
|
||||
char flag;
|
||||
|
||||
/* parse the input file from stdin, in the format of gb18030markus2.txt */
|
||||
/* parse the input file from stdin, in the format of gbkuni30.txt */
|
||||
while(gets(line)!=NULL) {
|
||||
/* skip empty and comment lines */
|
||||
if(line[0]==0 || line[0]=='#' || line[0]==0x1a) {
|
||||
@ -176,7 +178,7 @@ main(int argc, const char *argv[]) {
|
||||
|
||||
/* set the flag for the code point, make sure the mapping from Unicode is not duplicate */
|
||||
if((flags[c]&flag&FROMU)!=0) {
|
||||
fprintf(stderr, "error: duplicate assignment for U+%04lx, old flags %u, new %s\n", c, flags[c], line);
|
||||
fprintf(stderr, "error: duplicate assignment for U+%04lX, old flags %u, new %s\n", c, flags[c], line);
|
||||
return 1;
|
||||
}
|
||||
flags[c]|=flag;
|
||||
@ -184,9 +186,13 @@ main(int argc, const char *argv[]) {
|
||||
|
||||
if(argc<=1) {
|
||||
/* generate all four-byte sequences that are not already in the input */
|
||||
for(c=0x81; c<=0xffff; ++c) {
|
||||
for(c=0x80; c<=0xffff; ++c) {
|
||||
/* skip single surrogates */
|
||||
if(c==0xd800) {
|
||||
c=0xe000;
|
||||
}
|
||||
if(flags[c]==UNASSIGNED) {
|
||||
printf("%04lx:%02x%02x%02x%02x\n", c, bytes[0], bytes[1], bytes[2], bytes[3]);
|
||||
printf("%04lX:%02X%02X%02X%02X\n", c, bytes[0], bytes[1], bytes[2], bytes[3]);
|
||||
/* increment the sequence for the next code point */
|
||||
incFourGB18030(bytes);
|
||||
} else if(flags[c]&FOURBYTE) {
|
||||
@ -200,12 +206,17 @@ main(int argc, const char *argv[]) {
|
||||
unsigned long c1, c2;
|
||||
|
||||
printf("ranges\n");
|
||||
for(c1=0x81; c1<=0xffff;) {
|
||||
for(c1=0x80; c1<=0xffff;) {
|
||||
/* skip single surrogates */
|
||||
if(c1==0xd800) {
|
||||
c1=0xe000;
|
||||
}
|
||||
|
||||
/* get start bytes of range */
|
||||
memcpy(b1, bytes, 4);
|
||||
|
||||
/* look for the first non-range code point */
|
||||
for(c2=c1; c2<=0xffff && flags[c2]==UNASSIGNED; ++c2) {
|
||||
for(c2=c1; c2<=0xffff && flags[c2]==UNASSIGNED && c2!=0xd800; ++c2) {
|
||||
/* save this sequence to avoid decrementing it after this loop */
|
||||
memcpy(b2, bytes, 4);
|
||||
/* increment the sequence for the next code point */
|
||||
@ -214,11 +225,16 @@ main(int argc, const char *argv[]) {
|
||||
/* c2 is the first code point after the range; b2 are the bytes for the last code point in the range */
|
||||
|
||||
/* print this range, number of codes first for easy sorting */
|
||||
printf("%06lx U+%04lx-%04lx GB+%02x%02x%02x%02x-%02x%02x%02x%02x\n",
|
||||
printf("%06lX U+%04lX-%04lX GB+%02X%02X%02X%02X-%02X%02X%02X%02X\n",
|
||||
c2-c1, c1, c2-1,
|
||||
b1[0], b1[1], b1[2], b1[3],
|
||||
b2[0], b2[1], b2[2], b2[3]);
|
||||
|
||||
/* skip single surrogates */
|
||||
if(c2==0xd800) {
|
||||
c2=0xe000;
|
||||
}
|
||||
|
||||
/* skip all assigned Unicode BMP code points */
|
||||
for(c1=c2; c1<=0xffff && flags[c1]!=UNASSIGNED; ++c1) {
|
||||
if(flags[c1]&FOURBYTE) {
|
||||
|
@ -4,7 +4,7 @@
|
||||
# Similarly, some of the BMP mappings are marked as unassigned for the same reason.
|
||||
|
||||
# Mostly assigned sequences, with branches in the lead bytes
|
||||
<icu:state> 0-80, 81:7, 82:8, 83:9, 84:a, 85-fe:4
|
||||
<icu:state> 0-7f, 81:7, 82:8, 83:9, 84:a, 85-fe:4
|
||||
<icu:state> 30-39:2, 40-7e, 80-fe
|
||||
<icu:state> 81-fe:3
|
||||
<icu:state> 30-39
|
||||
@ -15,7 +15,9 @@
|
||||
<icu:state> 30-39.u
|
||||
|
||||
# Some unassigned 4-byte sequences, one state for each of the lead bytes 81-84
|
||||
# Each of these states branches on the second of four bytes; for the third and fourth bytes,
|
||||
# unassigned sequences continue with state 5, assigned ones with state 2
|
||||
<icu:state> 30:2, 31-35:5, 36-39:2, 40-7e, 80-fe
|
||||
<icu:state> 30-35:2, 36-39:5, 40-7e, 80-fe
|
||||
<icu:state> 30-37:5, 38:2, 39:5, 40-7e, 80-fe
|
||||
<icu:state> 30:5, 31-32:2, 33-39:5, 40-7e, 80-fe
|
||||
<icu:state> 30-35:5, 36:2, 37-39:5, 40-7e, 80-fe
|
||||
<icu:state> 30-31:2, 32-39:5, 40-7e, 80-fe
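The byte structure that the state table encodes can also be sketched as a small C function. This is an illustration only, not ICU code: `gb18030_seq_length` is a hypothetical name. It reads the updated initial state as 0-7f for single bytes and 81-fe for lead bytes, the second byte as either 30-39 (continuing a four-byte sequence) or 40-7e and 80-fe (ending a two-byte sequence), the third byte as 81-fe, and the fourth as 30-39. It checks structure only and says nothing about whether a sequence is assigned.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the length (1, 2 or 4) of the GB 18030
 * sequence at the start of s[0..n-1], or 0 if the bytes are structurally
 * invalid or the sequence is cut off. Note that the length of a multi-byte
 * sequence is known only after looking at the second byte.
 */
static int gb18030_seq_length(const unsigned char *s, size_t n) {
    assert(s != NULL || n == 0);
    if (n == 0) {
        return 0;
    }
    if (s[0] <= 0x7f) {
        return 1;                       /* single byte */
    }
    if (s[0] < 0x81 || s[0] > 0xfe || n < 2) {
        return 0;                       /* 0x80, 0xff, or missing second byte */
    }
    if ((s[1] >= 0x40 && s[1] <= 0x7e) || (s[1] >= 0x80 && s[1] <= 0xfe)) {
        return 2;                       /* two-byte sequence */
    }
    if (s[1] < 0x30 || s[1] > 0x39 || n < 4) {
        return 0;                       /* bad second byte or truncated */
    }
    if (s[2] < 0x81 || s[2] > 0xfe || s[3] < 0x30 || s[3] > 0x39) {
        return 0;                       /* bad third or fourth byte */
    }
    return 4;                           /* four-byte sequence */
}
```

A converter built on the real state table would additionally distinguish assigned from unassigned sequences, which this sketch deliberately leaves out.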
|
||||
|
@ -1,13 +1,13 @@
|
||||
ranges
|
||||
00405a U+9fa6-dfff GB+82358f34-83389837
|
||||
001bbe U+0452-200f GB+8130d239-8136a530
|
||||
0010c7 U+e865-f92b GB+83389838-8431cc32
|
||||
00083e U+2643-2e80 GB+8137a838-8138fd37
|
||||
000406 U+fa2a-fe2f GB+8431e336-8432cc35
|
||||
000375 U+3ce1-4055 GB+8231d439-8232af33
|
||||
0002fd U+361b-3917 GB+8230a634-8230f238
|
||||
0002bf U+49b8-4c76 GB+8234a132-8234e734
|
||||
0001d7 U+4160-4336 GB+8232c938-8232f838
|
||||
0001b9 U+478e-4946 GB+8233e839-82349639
|
||||
000175 U+44d7-464b GB+8233a430-8233c932
|
||||
00001a U+ffe6-ffff GB+8432e932-8432eb37
|
||||
00385A U+9FA6-D7FF GB+82358F33-8336C738
|
||||
001BBE U+0452-200F GB+8130D330-8136A531
|
||||
0010C7 U+E865-F92B GB+8336D030-84308534
|
||||
00083E U+2643-2E80 GB+8137A839-8138FD38
|
||||
000406 U+FA2A-FE2F GB+84309C38-84318537
|
||||
000375 U+3CE1-4055 GB+8231D438-8232AF32
|
||||
0002FD U+361B-3917 GB+8230A633-8230F237
|
||||
0002BF U+49B8-4C76 GB+8234A131-8234E733
|
||||
0001D7 U+4160-4336 GB+8232C937-8232F837
|
||||
0001B9 U+478E-4946 GB+8233E838-82349638
|
||||
000175 U+44D7-464B GB+8233A339-8233C931
|
||||
00001A U+FFE6-FFFF GB+8431A234-8431A439
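Handling such a range algorithmically amounts to adding the offset within the range to the position of the range's first four-byte sequence. The sketch below assumes the same four-byte structure as the sample state table (first and third bytes 0x81-0xFE, second and fourth bytes 0x30-0x39); `struct gb_range` and `gb_range_from_unicode` are hypothetical names and not part of ICU or gbmake4.c.

```c
#include <assert.h>

/* Hypothetical description of one line of ranges.txt. */
struct gb_range {
    unsigned long first, last;   /* Unicode code points, inclusive */
    unsigned char start[4];      /* GB 18030 bytes for 'first' */
};

/*
 * Map code point c to four GB 18030 bytes using range r.
 * Returns 1 on success, 0 if c lies outside the range.
 */
static int gb_range_from_unicode(const struct gb_range *r,
                                 unsigned long c, unsigned char out[4]) {
    unsigned long linear;

    assert(r != 0 && r->first <= r->last);
    if (c < r->first || c > r->last) {
        return 0;
    }
    /* linear position of the start sequence, counting 81 30 81 30 as 0 */
    linear = (((r->start[0] - 0x81) * 10UL + (r->start[1] - 0x30)) * 126UL
              + (r->start[2] - 0x81)) * 10UL + (r->start[3] - 0x30);
    linear += c - r->first;

    /* convert the position back to bytes, least significant digit first */
    out[3] = (unsigned char)(0x30 + linear % 10);
    linear /= 10;
    out[2] = (unsigned char)(0x81 + linear % 126);
    linear /= 126;
    out[1] = (unsigned char)(0x30 + linear % 10);
    linear /= 10;
    out[0] = (unsigned char)(0x81 + linear);
    return 1;
}
```

Filled in with the new range U+FFE6-FFFF starting at GB+8431A234, the function maps U+FFFF to GB+8431A439, the end value listed for that range.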
|
||||
|