d2ec8987a7
ICU-8966: Indic_Positional_Category & Indic_Syllabic_Category ICU-12850: Vertical_Orientation
3830 lines
178 KiB
Plaintext
* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
* Copyright (C) 2004-2016, International Business Machines
* Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer
*
* change log for Unicode updates
*
* For each new Unicode version, during the beta period,
* I copy the change log for the previous version to the top of this file.
* I adjust the versions, tickets, URLs, and paths.
* I work my way through the steps listed in the log, top to bottom,
* adjusting the log as necessary.
* I report problems to the UTC and/or CLDR and/or ICU.
* Before the data is final, I "turn the crank" several more times,
* using appropriate subsets of the steps.

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Starting with ICU 55, we do not add UScriptCode constants for new scripts any more
until they are encoded in Unicode,
or can be assumed to be encoded in the next Unicode version.
Script enum constant names should follow the Unicode script property value aliases,
which are assigned only when the scripts are encoded.
When we add script codes early and guess wrong, we end up with confusing enum constants
and have sometimes had to add aliases.

Variant script codes like Latf and Aran that are not subject to separate encoding
can be added at any time.
(For example, Aran could be added as USCRIPT_ARABIC_NASTALIQ.)

We add script codes used in CLDR or in the spoof checker.
This includes combination/alias codes like Hanb and Jamo.
See http://unicode.org/reports/tr35/#unicode_script_subtag_validity
and look for "alias" on http://unicode.org/iso15924/iso15924-codes.html

We add special Z* script codes like Zsye.

For new script codes see http://www.unicode.org/iso15924/codechanges.html

---------------------------------------------------------------------------- ***

ICU 63: adding ICU support for the text layout properties InPC, InSC, vo

* Command-line environment setup

UNICODE_DATA=~/unidata/uni11/20180609
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/mine
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt62b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Links

https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
https://unicode-org.atlassian.net/browse/ICU-12850 vo

*** data files & enums & parser code

* API additions
- for each of the three new enumerated properties
  + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
  + uchar.h: update UCHAR_INT_LIMIT
  + uchar.h: add the enum U<long prop name>
    with constants U_<short prop name>_<long value name>
  + UProperty.java: add the constant <long prop name>
  + UProperty.java: update INT_LIMIT
  + UCharacter.java: add the interface <long prop name>
    with constants <long value name>

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + It also writes tools/unicode/c/genprops/pnames_data.h with property and value
    names and aliases.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

* preparseucd.py changes
- add new property short names (uppercase) to _prop_and_value_re
  so that ParseUCharHeader() parses the new enum constants
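
The effect of extending the pattern can be sketched in Python, the language of preparseucd.py. This is a minimal illustration, not the tool's code: the regular expression is a hypothetical stand-in for _prop_and_value_re, and parse_enum_constants() and SAMPLE_HEADER are invented for the example. Only the short names INPC, INSC, and VO and the U_<short prop name>_<long value name> naming scheme come from this log.

```python
import re

# Hypothetical stand-in for _prop_and_value_re: it recognizes only the three
# short property names added in this update.
PROP_AND_VALUE_RE = re.compile(r"\s*U_(INPC|INSC|VO)_(\w+)")


def parse_enum_constants(header_text):
    """Returns {short property name: [value names]} in order of appearance."""
    found = {}
    for line in header_text.splitlines():
        match = PROP_AND_VALUE_RE.match(line)
        if match:
            found.setdefault(match.group(1), []).append(match.group(2))
    return found


# Made-up header excerpt; the last constant uses a short name that the
# pattern does not list, so it is skipped.
SAMPLE_HEADER = """
    U_INPC_NA,
    U_INPC_BOTTOM,
    U_VO_ROTATED,
    U_XX_NOT_LISTED,
"""

print(parse_enum_constants(SAMPLE_HEADER))
```

In this sketch a constant whose short property name is missing from the pattern is silently skipped, which is the kind of gap the step above is meant to close.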

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

$ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
    # Location of the ICU4C source tree.
    set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)

$ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

$ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
$ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* write data for runtime, hardcoded for now
- add genprops/layoutpropsbuilder.cpp with pieces from sibling files
- generate new icu4c/source/common/ulayout_props_data.h
- for each of the three new enumerated properties
  + int property max value
  + small, 8-bit UCPTrie
    (A small 16-bit trie with bit fields for these three properties
    is very nearly the same size as the sum of the three.)
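
For comparison, the rejected alternative (one 16-bit value with bit fields for all three properties) could look roughly like the following Python sketch. The bit widths and the pack()/unpack() helpers are assumptions made for illustration; they are not taken from ICU, which stores one 8-bit trie per property as described above.

```python
# Assumed bit widths, chosen only for this illustration (not taken from ICU).
INPC_BITS, INSC_BITS, VO_BITS = 5, 6, 2


def pack(inpc, insc, vo):
    """Packs three small property values into one 16-bit word."""
    for value, bits in ((inpc, INPC_BITS), (insc, INSC_BITS), (vo, VO_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("value %d does not fit in %d bits" % (value, bits))
    return inpc | (insc << INPC_BITS) | (vo << (INPC_BITS + INSC_BITS))


def unpack(word):
    """Returns (inpc, insc, vo) from a word built by pack()."""
    inpc = word & ((1 << INPC_BITS) - 1)
    insc = (word >> INPC_BITS) & ((1 << INSC_BITS) - 1)
    vo = (word >> (INPC_BITS + INSC_BITS)) & ((1 << VO_BITS) - 1)
    return inpc, insc, vo


word = pack(3, 17, 2)
print(hex(word), unpack(word))
```

With separate tries each lookup returns the property value directly, without the shift and mask shown in unpack().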

* wire into C++
- uprops.cpp: #include ulayout_props_data.h
- uprops.cpp: add getInPC() etc. functions
- uprops.cpp: add lines to intProps[], include max values
- uprops.h: add UPropertySource constants
- uprops.cpp: add uprops_addPropertyStarts(src)
- uniset_props.cpp: add to UnicodeSet_initInclusion()
- intltest/ucdtest.cpp: write unit tests

* update Java data files
- refresh just the pnames.icu file with the new property [value] names, just to be safe
- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* wire into Java
- UCharacterProperty.java: add new SRC_INPC etc. constants as in C++
- UCharacterProperty.java: for each new property
  + create a nested class to hold its CodePointTrie
  + initialize it from a string literal
  + paste in the initializer printed by genprops
  + add a new IntProperty object to the intProps[] array
  + use the correct max int value for each property, also printed by genprops
- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set)
- UnicodeSet.java: add to getInclusions()
- UCharacterTest.java: write unit tests

---------------------------------------------------------------------------- ***

Unicode 11.0 update for ICU 62

http://www.unicode.org/versions/Unicode11.0.0/
http://unicode.org/versions/beta-11.0.0.html
https://www.unicode.org/review/pri372/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-21.html

* Command-line environment setup

UNICODE_DATA=~/unidata/uni11/20180521
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/svn.icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt61b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** ICU Trac

- ticket:13630: Unicode 11
- ^/branches/markus/uni11

*** CLDR Trac

- cldrbug 10978: Unicode 11
- ^/branches/markus/uni11

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)
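
What "removing version suffixes" means for a single file name can be sketched as follows. This is not the desuffixucd.py script itself: the suffix pattern (a version like "-11.0.0", optionally followed by a draft marker like "d13", right before the extension) is an assumption about how beta file names look, and desuffixed_name() is a hypothetical helper.

```python
import re

# Assumed suffix shape: "-<major>.<minor>.<update>" plus an optional draft
# marker such as "d13", directly before the file extension.
_VERSION_SUFFIX_RE = re.compile(r"^(.+?)-\d+\.\d+\.\d+(?:d\d+)?(\.\w+)$")


def desuffixed_name(file_name):
    """Returns file_name without its version suffix, or unchanged if none."""
    match = _VERSION_SUFFIX_RE.match(file_name)
    if match:
        return match.group(1) + match.group(2)
    return file_name


for name in ("UnicodeData-11.0.0d13.txt", "Blocks-11.0.0.txt", "ReadMe.txt"):
    print(name, "->", desuffixed_name(name))
```

With the suffixes gone, files from two Unicode versions have identical names and can be compared folder against folder.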

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

$ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* preparseucd.py changes
- fix other errors
    NameError: unknown property Extended_Pictographic
  -> add Extended_Pictographic binary property
  -> add new short names for all Emoji properties

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
                   u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
                   u'Indic_Siyaq_Numbers'])),
     (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
     (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
     (u'GCB', set([u'LinkC', u'Virama'])),
     (u'WB', set([u'WSegSpace']))]
= PropertyValueAliases.txt new property values (diff old & new .txt files)
blk; Chess_Symbols ; Chess_Symbols
blk; Dogra ; Dogra
blk; Georgian_Ext ; Georgian_Extended
blk; Gunjala_Gondi ; Gunjala_Gondi
blk; Hanifi_Rohingya ; Hanifi_Rohingya
blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers
blk; Makasar ; Makasar
blk; Mayan_Numerals ; Mayan_Numerals
blk; Medefaidrin ; Medefaidrin
blk; Old_Sogdian ; Old_Sogdian
blk; Sogdian ; Sogdian
  -> add to uchar.h
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
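
The two Eclipse find/replace steps for the block constants can also be reproduced outside the IDE. The following Python sketch applies the same two patterns with re.sub(); the helper names and the sample line (a made-up constant with a placeholder number and code point) are illustrative only.

```python
import re

# The find/replace pairs from the log, written as Python raw strings.
ID_FIND = r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)"
ID_REPLACE = r"public static final int \1_ID = \2; \3"
OBJECT_FIND = r"UBLOCK_([^ ]+) = [0-9]+, (/.+)"
OBJECT_REPLACE = r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'


def to_java_id(c_line):
    """Turns a uchar.h UBLOCK_ line into a UCharacter.UnicodeBlock ID line."""
    return re.sub(ID_FIND, ID_REPLACE, c_line)


def to_java_object(c_line):
    """Turns a uchar.h UBLOCK_ line into a UnicodeBlock object declaration."""
    return re.sub(OBJECT_FIND, OBJECT_REPLACE, c_line)


# Placeholder line in the shape of a uchar.h block constant (not a real one).
SAMPLE = "    UBLOCK_SAMPLE_BLOCK = 999, /*[12340]*/"
print(to_java_id(SAMPLE))
print(to_java_object(SAMPLE))
```

Text before the match, such as the leading indentation, is left in place by re.sub().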

GCB; LinkC ; LinkingConsonant
GCB; Virama ; Virama
  -> uchar.h & UCharacter.GraphemeClusterBreak
  -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76

InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed
  -> ignore: ICU does not yet support this property

jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya
jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa
  -> uchar.h & UCharacter.JoiningGroup

sc ; Dogr ; Dogra
sc ; Gong ; Gunjala_Gondi
sc ; Maka ; Makasar
sc ; Medf ; Medefaidrin
sc ; Rohg ; Hanifi_Rohingya
sc ; Sogd ; Sogdian
sc ; Sogo ; Old_Sogdian
  -> uscript.h & com.ibm.icu.lang.UScript
  -> Nushu had been added already
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

WB ; WSegSpace ; WSegSpace
  -> uchar.h & UCharacter.WordBreak

* New short names for emoji properties
- see UTS #51
- short names set in preparseucd.py

* New properties
- boolean emoji property Extended_Pictographic
  -> added in preparseucd.py
  -> uchar.h & UProperty.java
- misc. property Equivalent_Unified_Ideograph (EqUIdeo)
  as shown in PropertyValueAliases.txt
  -> ignore for now

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

$ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
    # Location of the ICU4C source tree.
    set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)

$ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

$ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
$ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* Fix case props
    genprops error: casepropsbuilder: too many exceptions words
    genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
- With the addition of Georgian Mtavruli capital letters,
  there are now too many simple case mappings with big mapping deltas
  that yield uncompressible exceptions.
- Changed the data structure (now formatVersion 4):
  added one bit for no-simple-case-folding (for Cherokee), and
  one optional slot for a big delta (for most faraway mappings),
  together with another bit for whether that delta is negative.
  This makes most Cherokee & Georgian etc. case mappings compressible,
  reducing the number of exceptions words.
- Made further changes to gain one more bit for the exceptions index,
  for future growth. For details see casepropsbuilder.cpp.
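
To see why these mappings have "big mapping deltas", it helps to compute the distance between an uppercase letter and its lowercase counterpart. The sketch below assumes, purely for illustration, that a delta can be stored inline only if it fits a signed 9-bit field; the actual layout is defined in casepropsbuilder.cpp and may differ. needs_exception() is a hypothetical helper.

```python
# Assumption for illustration: a delta is stored inline only if it fits a
# signed 9-bit field. The real layout is defined in casepropsbuilder.cpp.
INLINE_DELTA_MIN, INLINE_DELTA_MAX = -256, 255


def needs_exception(upper, lower):
    """True if the upper-to-lower delta does not fit the assumed inline field."""
    delta = lower - upper
    return not (INLINE_DELTA_MIN <= delta <= INLINE_DELTA_MAX)


# (label, uppercase code point, lowercase code point)
PAIRS = [
    ("Latin A", 0x0041, 0x0061),
    ("Georgian Mtavruli AN", 0x1C90, 0x10D0),
    ("Cherokee A", 0x13A0, 0xAB70),
]
for label, upper, lower in PAIRS:
    print(label, lower - upper, needs_exception(upper, lower))
```

Under this assumption the Latin pair fits inline, while the Georgian and Cherokee pairs are too far apart, which matches the motivation given above for the optional big-delta slot.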

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..11.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update
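
The grep step above can be approximated with a few lines of Python. This sketch assumes the usual UCD-style layout of IdnaMappingTable.txt (semicolon-separated fields, "#" comments, code points or ".." ranges in the first field); the sample lines are illustrative, built around the three code points named above, and non_ascii_std3_valid() is a hypothetical helper.

```python
def non_ascii_std3_valid(lines):
    """Returns (start, end) ranges with status disallowed_STD3_valid above U+007F."""
    hits = []
    for line in lines:
        data = line.split("#", 1)[0]              # drop the trailing comment
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if start > 0x7F:
            hits.append((start, end))
    return hits


# Illustrative lines in the assumed file format.
SAMPLE_LINES = [
    "003B..0040    ; disallowed_STD3_valid                  # ASCII punctuation",
    "00C0          ; mapped                 ; 00E0          # LATIN CAPITAL LETTER A WITH GRAVE",
    "2260          ; disallowed_STD3_valid                  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid                  # NOT LESS-THAN..NOT GREATER-THAN",
]
print(["%04X..%04X" % hit for hit in non_ascii_std3_valid(SAMPLE_LINES)])
```

Any range reported beyond the three known code points would indicate new characters that need test updates.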

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

- Errors in char.txt, word.txt, word_POSIX.txt like
    createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16
  because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
  -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
     not empty, just to get ICU building.
  -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
     and properties together with the rules that used them (GB 10, WB 14).
  -> Andy adjusts the rule sets further to sync with
     Unicode 11 grapheme, word, and line break spec changes.

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
    ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
    FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
  (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=11.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt61l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
  or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692
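
The search for sets of decimal digits can be sketched with Python's unicodedata module. This is an illustration rather than the procedure actually used for this log: it relies on whatever Unicode version the running Python ships (see unicodedata.unidata_version), and it assumes that each set occupies ten consecutive code points for 0..9, so that the character with value 4 identifies the whole range. decimal_digit_ranges() is a hypothetical helper.

```python
import sys
import unicodedata


def decimal_digit_ranges():
    """Returns (start, end) for each digit set, found via gc=Nd and value 4."""
    ranges = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch, -1) == 4:
            # Assumes the digits 0..9 of a set are contiguous.
            ranges.append((cp - 4, cp + 5))
    return ranges


print("Unicode data version:", unicodedata.unidata_version)
for start, end in decimal_digit_ranges():
    print("%04X..%04X %s" % (start, end, unicodedata.name(chr(start), "?")))
```

Comparing the output for two Unicode versions (for example, from two Python releases) shows which digit sets are new and may need CLDR numbering systems.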

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 10.0 update for ICU 60

http://www.unicode.org/versions/Unicode10.0.0/
http://www.unicode.org/versions/beta-10.0.0.html
http://blog.unicode.org/2017/03/unicode-100-beta-review.html
http://www.unicode.org/review/pri350/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-19.html

* Command-line environment setup

UNICODE_DATA=~/unidata/uni10/20170605
CLDR_SRC=~/svn.cldr/uni10
ICU_ROOT=~/svn.icu/uni10
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt60b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** ICU Trac

- ticket:12985: Unicode 10
- ticket:13061: undo hacks from emoji 5.0 update
- ticket:13062: add Emoji_Component property
- ^/branches/markus/uni10

*** CLDR Trac

- cldrbug 10055: Unicode 10
- cldrbug 9882: Unicode 10 script metadata
- cldrbug 10219: numbering systems for Unicode 10

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode 10.0 files into $UNICODE_DATA
  + subfolders: ucd, uca, idna, security
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- download emoji 5.0 files into $UNICODE_DATA/emoji

* for manual diffs: remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

$ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
  -> adjust _scripts_only_in_iso15924 as indicated
- fix other errors
    Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
  -> add vo=Vertical_Orientation to _ignored_properties
  -> later removed again, parsing the file, even though we do not yet store data for runtime use

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
                   u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
     (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
                  u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
                  u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
     (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
= PropertyValueAliases.txt new property values (diff old & new .txt files)
blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F
blk; Kana_Ext_A ; Kana_Extended_A
blk; Masaram_Gondi ; Masaram_Gondi
blk; Nushu ; Nushu
blk; Soyombo ; Soyombo
blk; Syriac_Sup ; Syriac_Supplement
blk; Zanabazar_Square ; Zanabazar_Square
  -> add to uchar.h
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
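
The "missing uchar.h enum constants" report shown above is, in effect, a per-property set difference between the values known to the UCD and the values found in the header. The following sketch shows that idea only; missing_constants() and the two sample dictionaries are hypothetical and do not reproduce how preparseucd.py collects its data.

```python
def missing_constants(ucd_values, header_values):
    """Returns [(property, missing value names)] for values absent from the header."""
    missing = []
    for prop in sorted(ucd_values):
        absent = ucd_values[prop] - header_values.get(prop, set())
        if absent:
            missing.append((prop, absent))
    return missing


# Sample data: the three new script codes from the report above, plus one
# value (Latn) that the header is assumed to define already.
UCD_VALUES = {"sc": {"Latn", "Soyo", "Gonm", "Zanb"}, "vo": {"R", "U"}}
HEADER_VALUES = {"sc": {"Latn"}, "vo": {"R", "U"}}

print(missing_constants(UCD_VALUES, HEADER_VALUES))
```

An empty result means that every property value has a matching enum constant.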

jg ; Malayalam_Bha ; Malayalam_Bha
jg ; Malayalam_Ja ; Malayalam_Ja
jg ; Malayalam_Lla ; Malayalam_Lla
jg ; Malayalam_Llla ; Malayalam_Llla
jg ; Malayalam_Nga ; Malayalam_Nga
jg ; Malayalam_Nna ; Malayalam_Nna
jg ; Malayalam_Nnna ; Malayalam_Nnna
jg ; Malayalam_Nya ; Malayalam_Nya
jg ; Malayalam_Ra ; Malayalam_Ra
jg ; Malayalam_Ssa ; Malayalam_Ssa
jg ; Malayalam_Tta ; Malayalam_Tta
  -> uchar.h & UCharacter.JoiningGroup

sc ; Gonm ; Masaram_Gondi
sc ; Nshu ; Nushu
sc ; Soyo ; Soyombo
sc ; Zanb ; Zanabazar_Square
  -> uscript.h & com.ibm.icu.lang.UScript
  -> Nushu had been added already
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* New properties as shown in PropertyValueAliases.txt changes
- boolean Emoji_Component from emoji 5
  -> uchar.h & UProperty.java
- boolean
    # Regional_Indicator (RI)

    RI ; N ; No ; F ; False
    RI ; Y ; Yes ; T ; True
  -> uchar.h & UProperty.java
  -> single immutable range, to be hardcoded
- boolean
    # Prepended_Concatenation_Mark (PCM)

    PCM; N ; No ; F ; False
    PCM; Y ; Yes ; T ; True
  -> was new in Unicode 9
  -> uchar.h & UProperty.java
- enumerated
    # Vertical_Orientation (vo)

    vo ; R ; Rotated
    vo ; Tr ; Transformed_Rotated
    vo ; Tu ; Transformed_Upright
    vo ; U ; Upright
  -> only pre-parsed for now, but not yet stored for runtime use
|
|
|
|
* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
|
|
(not strictly necessary for NOT_ENCODED scripts)
|
|
$ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
|
|
|
|
* generate normalization data files
|
|
cd $ICU_ROOT/dbg/icu4c
|
|
bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
|
|
bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
|
|
bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
|
|
bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
|
|
bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
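The five gennorm2 invocations above follow one pattern: every output is built from nfc.txt plus zero or more additional source files. A minimal Python sketch of that pattern; it only builds the argument lists and does not run anything. The three parameters stand in for $ICU_SRC, $ICU4C_DATA_IN and $ICU4C_UNIDATA, and the function name is hypothetical.

```python
def gennorm2_commands(icu_src, data_in, unidata):
    """Return the gennorm2 argument lists in the order used above."""
    jobs = [
        # (output file, input files layered left to right, extra flags)
        (f"{icu_src}/icu4c/source/common/norm2_nfc_data.h", ["nfc.txt"], ["--csource"]),
        (f"{data_in}/nfc.nrm", ["nfc.txt"], []),
        (f"{data_in}/nfkc.nrm", ["nfc.txt", "nfkc.txt"], []),
        (f"{data_in}/nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"], []),
        (f"{data_in}/uts46.nrm", ["nfc.txt", "uts46.txt"], []),
    ]
    return [
        ["bin/gennorm2", "-o", out, "-s", f"{unidata}/norm2", *inputs, *flags]
        for out, inputs, flags in jobs
    ]


# Placeholder directory names, for illustration only.
for argv in gennorm2_commands("SRC", "DATA_IN", "UNIDATA"):
    print(" ".join(argv))
```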
|
|
|
|
* build ICU (make install)
|
|
so that the tools build can pick up the new definitions from the installed header files.
|
|
|
|
$ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
|
|
|
|
* build Unicode tools using CMake+make
|
|
|
|
$ICU_SRC/tools/unicode/c/icudefs.txt:
|
|
|
|
# Location (--prefix) of where ICU was installed.
|
|
set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
|
|
# Location of the ICU4C source tree.
|
|
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)
|
|
|
|
$ICU_ROOT/dbg/tools/unicode/c$
|
|
cmake ../../../../src/tools/unicode/c
|
|
make
|
|
|
|
* generate core properties data files
|
|
$ICU_ROOT/dbg/tools/unicode/c$
|
|
genprops/genprops $ICU_SRC/icu4c
|
|
genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
|
|
genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
|
|
- rebuild ICU (make install) & tools
|
|
|
|
* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
|
|
sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
|
|
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
|
|
- Unicode 6.0..10.0: U+2260, U+226E, U+226F
|
|
- nothing new in this Unicode version, no test file to update
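The grep step above can be sketched in Python. This is a minimal sketch that assumes semicolon-separated fields with "#" comments, in the style of IdnaMappingTable.txt; the sample lines are made up for illustration (not copied from the file), and the function name is hypothetical.

```python
def non_ascii_std3_valid(lines):
    """Return code points above U+007F whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        # The first field is either a single code point or a start..end range.
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        found.extend(cp for cp in range(first, last + 1) if cp > 0x7F)
    return found


# Illustrative lines only.
sample = [
    "0000..002C    ; disallowed_STD3_valid   # illustrative ASCII range",
    "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
    "00C0          ; mapped                 ; 00E0   # illustrative mapped entry",
]
print(" ".join(f"U+{cp:04X}" for cp in non_ascii_std3_valid(sample)))
```

Any code point in the result that is not already covered by uts46test.cpp and UTS46Test.java would be a candidate for a new test case.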
|
|
|
|
* run & fix ICU4C tests
|
|
- Andy handles RBBI & spoof check test failures
|
|
|
|
* collation: CLDR collation root, UCA DUCET
|
|
|
|
- UCA DUCET goes into Mark's Unicode tools, see
|
|
https://sites.google.com/site/unicodetools/home#TOC-UCA
|
|
- CLDR root data files are checked into $CLDR_SRC/common/uca/
|
|
cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
|
|
|
|
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
|
|
cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
|
|
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
|
|
cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
|
|
(note: the next cp drops the underscore before "Rules" in the file name)
|
|
cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
|
|
- restore TODO diffs in UCARules.txt
|
|
meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
|
|
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
|
|
and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
|
|
from the CLDR root files (..._CLDR_..._SHORT.txt)
|
|
cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
|
|
cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
|
|
cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
|
|
- if CLDR common/uca/unihan-index.txt changes, then update
|
|
CLDR common/collation/root.xml <collation type="private-unihan">
|
|
and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
|
|
|
|
- run genuca, see command line above;
|
|
deal with
|
|
Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
|
|
FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)
|
|
(add the character to genuca.cpp sampleCharsToScripts[])
|
|
+ look up the USCRIPT_ code for the new sample characters
|
|
(should be obvious from the comment in the error output)
|
|
+ *add* mappings to sampleCharsToScripts[], do not replace them
|
|
(in case the script sample characters flip-flop)
|
|
+ insert new scripts in DUCET script order, see the top_byte table
|
|
at the beginning of FractionalUCA.txt
|
|
- rebuild ICU4C
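When genuca reports an unknown script for a first-primary sample character, the FractionalUCA.txt line quoted in the error already carries what the new sampleCharsToScripts[] entry needs: the sample code point, and the script name in the trailing comment. A minimal Python sketch that pulls both out of such a line; the regular expression is an assumption based only on the lines quoted in this log, and the helper name is hypothetical.

```python
import re

# Matches lines like the one quoted in the genuca error output above.
FIRST_PRIMARY = re.compile(r"^FDD1 ([0-9A-F]{4,6});.*# (\S+) first primary")


def parse_first_primary(line):
    """Return (sample code point, script name), or None if the line does not match."""
    match = FIRST_PRIMARY.match(line)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)


sample = "FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)"
code_point, script_name = parse_first_primary(sample)
print(f"U+{code_point:04X} {script_name}")
```

The script name still has to be matched to its USCRIPT_ constant by hand, as described above.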
|
|
|
|
* Unihan collators
|
|
https://sites.google.com/site/unicodetools/unihan
|
|
- run Unicode Tools
|
|
org.unicode.draft.GenerateUnihanCollators
|
|
with VM arguments
|
|
-ea
|
|
-DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
|
|
-DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
|
|
-DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
|
|
-DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
|
|
-DUVERSION=10.0.0
|
|
- run Unicode Tools
|
|
org.unicode.draft.GenerateUnihanCollatorFiles
|
|
with the same arguments
|
|
- check CLDR diffs
|
|
cd $CLDR_SRC
|
|
meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
|
|
meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
|
|
- copy to CLDR
|
|
cd $CLDR_SRC
|
|
cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
|
|
cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
|
|
- run CLDR unit tests, commit to CLDR
|
|
- generate ICU zh collation data: run CLDR
|
|
org.unicode.cldr.icu.NewLdml2IcuConverter
|
|
with program arguments
|
|
-t collation
|
|
-s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
|
|
-m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
|
|
-d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
|
|
-p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
|
|
zh
|
|
and VM arguments
|
|
-ea
|
|
-DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
|
|
- rebuild ICU4C
|
|
|
|
* run & fix ICU4C tests, now with new CLDR collation root data
|
|
- run all tests with the collation test data *_SHORT.txt or the full files
|
|
(the full ones have comments, useful for debugging)
|
|
- note on intltest: if collate/UCAConformanceTest fails, then
|
|
utility/MultithreadTest/TestCollators will fail as well;
|
|
fix the conformance test before looking into the multi-thread test
|
|
|
|
* update Java data files
|
|
- refresh just the UCD/UCA-related/derived files, just to be safe
|
|
- see (ICU4C)/source/data/icu4j-readme.txt
|
|
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
output:
|
|
...
|
|
Unicode .icu files built to ./out/build/icudt60l
|
|
echo timestamp > uni-core-data
|
|
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
|
|
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
|
|
echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
|
|
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
|
|
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
|
|
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
|
|
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
|
|
make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
|
|
- copy the big-endian Unicode data files to another location,
|
|
separate from the other data files,
|
|
and then refresh ICU4J
|
|
cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
|
|
cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
|
|
cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
|
|
cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
|
|
jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
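The selective copy above (everything except the final jar update) can be sketched in Python. This is a minimal sketch, assuming src and dst are the two .../com/ibm/icu/impl/data/$ICUDT directories; the function name and the file names in the demo tree are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_unicode_data(src, dst):
    """Copy only the Unicode-derived data files from src to dst."""
    src, dst = Path(src), Path(dst)
    candidates = [src / "confusables.cfu"]
    candidates += sorted(src.glob("*.icu")) + sorted(src.glob("*.nrm"))
    for sub in ("coll", "brkitr"):
        candidates += sorted((src / sub).glob("*"))
    copied = []
    for path in candidates:
        # Skipping cnvalias.icu has the same effect as the cp followed by rm above.
        if path.name == "cnvalias.icu" or not path.is_file():
            continue
        rel = path.relative_to(src)
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        copied.append(rel.as_posix())
    return copied


# Demo on a throwaway tree with made-up, empty files.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "src", Path(tmp) / "dst"
    for rel in ("uprops.icu", "cnvalias.icu", "nfc.nrm", "confusables.cfu",
                "coll/root.res", "brkitr/root.res", "zoneinfo64.res"):
        path = src / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"")
    copied = copy_unicode_data(src, dst)
    print(copied)
```

Files outside the listed patterns (zoneinfo64.res in the demo) are left alone, which is the point of refreshing only the Unicode-related data.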
|
|
|
|
* When refreshing all of ICU4J data from ICU4C
|
|
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
|
|
or
|
|
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
|
|
|
|
* update CollationFCD.java
|
|
+ copy & paste the initializers of lcccIndex[] etc. from
|
|
ICU4C/source/i18n/collationfcd.cpp to
|
|
ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
|
|
|
|
* refresh Java test .txt files
|
|
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cd $ICU_SRC/icu4c/source/data/unidata
|
|
cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cd ../../test/testdata
|
|
cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
|
|
* run & fix ICU4J tests
|
|
|
|
*** API additions
|
|
- send notice to icu-design about new born-@stable API (enum constants etc.)
|
|
|
|
*** CLDR numbering systems
|
|
- look for new sets of decimal digits (gc=Nd & nv=4) and submit a CLDR ticket
|
|
Unicode 10: http://unicode.org/cldr/trac/ticket/10219
|
|
Unicode 9: http://unicode.org/cldr/trac/ticket/9692
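The gc=Nd & nv=4 query finds one character per set of decimal digits (its digit four). A minimal Python sketch of the same query using the standard unicodedata module. It reflects the Unicode version bundled with the Python in use (unicodedata.unidata_version), which may differ from the version being integrated, so it is only a cross-check; the function name is hypothetical.

```python
import sys
import unicodedata


def digit_fours():
    """Return code points with General_Category Nd and decimal value 4."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Nd"
        and unicodedata.decimal(chr(cp), None) == 4
    ]


fours = digit_fours()
print("Unicode", unicodedata.unidata_version, "-", len(fours), "sets of decimal digits")
for cp in fours:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

Comparing the list against the numbering systems already in CLDR shows which sets still need a ticket.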
|
|
|
|
*** merge the Unicode update branches back onto the trunk
|
|
- do not merge the icudata.jar and testdata.jar,
|
|
instead rebuild them from merged & tested ICU4C
|
|
- make sure that changes to Unicode tools are checked in:
|
|
http://www.unicode.org/utility/trac/log/trunk/unicodetools
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Emoji 5.0 update for ICU 59
|
|
- ICU 59 mostly remains on Unicode 9.0
|
|
- except that it updates bidi and segmentation data to Unicode 10 beta
|
|
|
|
First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.
|
|
|
|
* Command-line environment setup
|
|
|
|
ICU_ROOT=~/svn.icu/trunk
|
|
ICU_SRC_DIR=$ICU_ROOT/src
|
|
ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
|
|
ICUDT=icudt59b
|
|
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
|
|
SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
|
|
UNIDATA=$ICU4C_SRC_DIR/source/data/unidata
|
|
|
|
*** ICU Trac
|
|
|
|
- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
|
|
- changes directly on trunk
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* download files
|
|
|
|
- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
|
|
- download emoji 5.0 beta files into the same uni90e50 folder
|
|
- download Unicode 10.0 beta files: ucd
|
|
+ copy Unicode 10 bidi files to the uni90e50/ucd folder:
|
|
BidiBrackets.txt
|
|
BidiCharacterTest.txt
|
|
BidiMirroring.txt
|
|
BidiTest.txt
|
|
extracted/DerivedBidiClass.txt
|
|
+ copy Unicode 10 segmentation files to the uni90e50/ucd folder:
|
|
LineBreak.txt
|
|
auxiliary/*
|
|
|
|
* preparseucd.py changes
|
|
- adjust for combined trunks
|
|
- write new copyright lines
|
|
- ignore new Emoji_Component property for now
|
|
|
|
* process and/or copy files
|
|
- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
|
|
+ This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
|
|
|
|
- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA
|
|
|
|
* build ICU (make install)
|
|
so that the tools build can pick up the new definitions from the installed header files.
|
|
|
|
$ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
|
|
|
|
* build Unicode tools using CMake+make
|
|
|
|
~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:
|
|
|
|
# Location (--prefix) of where ICU was installed.
|
|
set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
|
|
# Location of the ICU4C source tree.
|
|
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)
|
|
|
|
~/svn.icu/trunk/dbg/tools/unicode/c$
|
|
cmake ../../../../src/tools/unicode/c
|
|
make
|
|
|
|
* generate core properties data files
|
|
~/svn.icu/trunk/dbg/tools/unicode/c$
|
|
genprops/genprops $ICU4C_SRC_DIR
|
|
- rebuild ICU (make install) & tools
|
|
|
|
* run & fix ICU4C tests
|
|
- Andy handles RBBI & spoof check test failures
|
|
|
|
* update Java data files
|
|
- refresh just the UCD/UCA-related/derived files, just to be safe
|
|
- see (ICU4C)/source/data/icu4j-readme.txt
|
|
- mkdir /tmp/icu4j
|
|
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
output:
|
|
...
|
|
Unicode .icu files built to ./out/build/icudt59l
|
|
echo timestamp > uni-core-data
|
|
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
|
|
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
|
|
echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
|
|
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
|
|
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
|
|
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
|
|
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
|
|
make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
|
|
- copy the big-endian Unicode data files to another location,
|
|
separate from the other data files,
|
|
and then refresh ICU4J
|
|
cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
|
|
cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
|
|
cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
|
|
jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
|
|
|
|
* When refreshing all of ICU4J data from ICU4C
|
|
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
|
|
or
|
|
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install
|
|
|
|
* refresh Java test .txt files
|
|
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cd $ICU4C_SRC_DIR/source/data/unidata
|
|
cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cd ../../test/testdata
|
|
cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
|
|
* run & fix ICU4J tests
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 9.0 update for ICU 58
|
|
|
|
* Command-line environment setup
|
|
|
|
ICU_ROOT=~/svn.icu/trunk
|
|
ICU_SRC_DIR=$ICU_ROOT/src
|
|
ICUDT=icudt58b
|
|
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
|
|
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
|
|
UNIDATA=$ICU_SRC_DIR/source/data/unidata
|
|
|
|
http://www.unicode.org/review/pri323/ -- beta review
|
|
http://www.unicode.org/reports/uax-proposed-updates.html
|
|
http://www.unicode.org/versions/beta-9.0.0.html
|
|
http://www.unicode.org/versions/Unicode9.0.0/
|
|
http://www.unicode.org/reports/tr44/tr44-17.html
|
|
|
|
*** ICU Trac
|
|
|
|
- ticket:12526: integrate Unicode 9
|
|
- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
|
|
- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b
|
|
|
|
*** CLDR Trac
|
|
|
|
- cldrbug 9414: UCA 9
|
|
- ^/branches/markus/uni90 at r11518 from trunk at r11517
|
|
|
|
- cldrbug 8745: Unicode 9.0 script metadata
|
|
|
|
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
- com.ibm.icu.util.VersionInfo
|
|
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
|
|
|
|
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
|
|
so that the makefiles see the new version number.
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
|
|
- download UCD & IDNA files
|
|
- make sure that the Unicode data folder passed into preparseucd.py
|
|
includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
|
|
- only for manual diffs: remove version suffixes from the file names
|
|
~/unidata/uni70/20140403$ ../../desuffixucd.py .
|
|
(see https://sites.google.com/site/unicodetools/inputdata)
|
|
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
|
|
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
|
|
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
|
|
|
|
- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
|
|
and copy to $UNIDATA
|
|
cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA
|
|
|
|
* preparseucd.py changes
|
|
- remove or add new Unicode scripts from/to the
|
|
only-in-ISO-15924 list according to the error messages:
|
|
ValueError: remove ['Tang'] from _scripts_only_in_iso15924
|
|
ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
|
|
ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
|
|
ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
|
|
-> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
|
|
and in com.ibm.icu.dev.test.lang.TestUScript.java
|
|
- DerivedNumericValues.txt new numeric values
|
|
0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
|
|
0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
|
|
0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
|
|
0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
|
|
0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
|
|
-> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
|
|
uchar.c, UCharacterProperty.java
|
|
to support a new series of values
|
|
- adjust preparseucd.py for Tangut algorithmic names
|
|
in ppucd.txt:
|
|
algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
|
|
->
|
|
algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
|
|
- avoid block-compressing most String/Miscellaneous property values,
|
|
triggered by genprops not coping with a multi-code point Case_Folding on
|
|
block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
|
|
keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors
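The new Malayalam fraction values from DerivedNumericValues.txt listed above share a shape: each is an odd numerator over 20 times a power of two. A minimal Python sketch that checks the decimal and fraction columns against each other and shows that decomposition. This is only an observation about the data, not a description of how uprops.h or corepropsbuilder.cpp encode the values; the function name is hypothetical.

```python
from fractions import Fraction

# (code point, decimal column, fraction column) as listed above.
NEW_VALUES = [
    (0x0D58, "0.00625", "1/160"),
    (0x0D59, "0.025", "1/40"),
    (0x0D5A, "0.0375", "3/80"),
    (0x0D5B, "0.05", "1/20"),
    (0x0D5D, "0.15", "3/20"),
]


def split_over_20(value, max_shift=7):
    """Return (n, k) with n odd and value == n / (20 * 2**k), or None."""
    for shift in range(max_shift + 1):
        scaled = value * 20 * 2 ** shift
        if scaled.denominator == 1 and scaled.numerator % 2 == 1:
            return scaled.numerator, shift
    return None


# Both columns must describe the same number.
consistent = all(Fraction(dec) == Fraction(frac) for _, dec, frac in NEW_VALUES)
print("columns consistent:", consistent)
for cp, _, frac in NEW_VALUES:
    n, k = split_over_20(Fraction(frac))
    print(f"U+{cp:04X} {frac} = {n}/(20*2^{k})")
```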
|
|
|
|
* PropertyAliases.txt changes
|
|
- 1 new property PCM=Prepended_Concatenation_Mark
|
|
Ignore: Only useful for layout engines.
|
|
Ok to list in ppucd.txt.
|
|
|
|
* PropertyValueAliases.txt new property values
|
|
blk; Adlam ; Adlam
|
|
blk; Bhaiksuki ; Bhaiksuki
|
|
blk; Cyrillic_Ext_C ; Cyrillic_Extended_C
|
|
blk; Glagolitic_Sup ; Glagolitic_Supplement
|
|
blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
|
|
blk; Marchen ; Marchen
|
|
blk; Mongolian_Sup ; Mongolian_Supplement
|
|
blk; Newa ; Newa
|
|
blk; Osage ; Osage
|
|
blk; Tangut ; Tangut
|
|
blk; Tangut_Components ; Tangut_Components
|
|
-> add to uchar.h
|
|
use long property names for enum constants
|
|
-> add to UCharacter.UnicodeBlock IDs
|
|
Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
|
|
replace public static final int \1_ID = \2; \3
|
|
-> add to UCharacter.UnicodeBlock objects
|
|
Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
|
|
replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
|
|
|
|
GCB; EB ; E_Base
|
|
GCB; EBG ; E_Base_GAZ
|
|
GCB; EM ; E_Modifier
|
|
GCB; GAZ ; Glue_After_Zwj
|
|
GCB; ZWJ ; ZWJ
|
|
-> uchar.h & UCharacter.GraphemeClusterBreak
|
|
|
|
jg ; African_Feh ; African_Feh
|
|
jg ; African_Noon ; African_Noon
|
|
jg ; African_Qaf ; African_Qaf
|
|
-> uchar.h & UCharacter.JoiningGroup
|
|
|
|
lb ; EB ; E_Base
|
|
lb ; EM ; E_Modifier
|
|
lb ; ZWJ ; ZWJ
|
|
-> uchar.h & UCharacter.LineBreak
|
|
|
|
sc ; Adlm ; Adlam
|
|
sc ; Bhks ; Bhaiksuki
|
|
sc ; Marc ; Marchen
|
|
sc ; Newa ; Newa
|
|
sc ; Osge ; Osage
|
|
sc ; Tang ; Tangut
|
|
-> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
|
|
|
|
WB ; EB ; E_Base
|
|
WB ; EBG ; E_Base_GAZ
|
|
WB ; EM ; E_Modifier
|
|
WB ; GAZ ; Glue_After_Zwj
|
|
WB ; ZWJ ; ZWJ
|
|
-> uchar.h & UCharacter.WordBreak
|
|
|
|
* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
|
|
(not strictly necessary for NOT_ENCODED scripts)
|
|
~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
|
|
|
|
* generate normalization data files
|
|
cd $ICU_ROOT/dbg
|
|
bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
|
|
bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
|
|
bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
|
|
bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
|
|
bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
|
|
|
|
* build ICU (make install)
|
|
so that the tools build can pick up the new definitions from the installed header files.
|
|
|
|
$ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt
|
|
|
|
* build Unicode tools using CMake+make
|
|
|
|
~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
|
|
|
|
# Location (--prefix) of where ICU was installed.
|
|
set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
|
|
# Location of the ICU source tree.
|
|
set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
|
|
|
|
~/svn.icutools/trunk/dbg/unicode/c$
|
|
cmake ../../../src/unicode/c
|
|
make
|
|
|
|
* generate core properties data files
|
|
~/svn.icutools/trunk/dbg/unicode/c$
|
|
genprops/genprops $ICU_SRC_DIR
|
|
genuca/genuca --hanOrder implicit $ICU_SRC_DIR
|
|
genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
|
|
- rebuild ICU (make install) & tools
|
|
|
|
* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
|
|
sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
|
|
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
|
|
- Unicode 6.0..9.0: U+2260, U+226E, U+226F
|
|
- nothing new in 9.0, no test file to update
|
|
|
|
* run & fix ICU4C tests
|
|
- Andy handles RBBI & spoof check test failures
|
|
|
|
* collation: CLDR collation root, UCA DUCET
|
|
|
|
- UCA DUCET goes into Mark's Unicode tools, see
|
|
https://sites.google.com/site/unicodetools/home#TOC-UCA
|
|
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
|
|
cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
|
|
|
|
- cd (CLDR UCA branch)/common/uca/
|
|
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
|
|
cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
|
|
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
|
|
cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
|
|
(note: the next cp drops the underscore before "Rules" in the file name)
|
|
cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
|
|
- restore TODO diffs in UCARules.txt
|
|
meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
|
|
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
|
|
and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
|
|
from the CLDR root files (..._CLDR_..._SHORT.txt)
|
|
cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
|
|
cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
|
|
cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
|
|
- if CLDR common/uca/unihan-index.txt changes, then update
|
|
CLDR common/collation/root.xml <collation type="private-unihan">
|
|
and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
|
|
|
|
- run genuca, see command line above;
|
|
deal with
|
|
Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
|
|
FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
|
|
(add the character to genuca.cpp sampleCharsToScripts[])
|
|
+ look up the USCRIPT_ code for the new sample characters
|
|
(should be obvious from the comment in the error output)
|
|
+ *add* mappings to sampleCharsToScripts[], do not replace them
|
|
(in case the script sample characters flip-flop)
|
|
+ insert new scripts in DUCET script order, see the top_byte table
|
|
at the beginning of FractionalUCA.txt
|
|
- rebuild ICU4C
|
|
|
|
* Unihan collators
|
|
- run Unicode Tools
|
|
org.unicode.draft.GenerateUnihanCollators
|
|
with VM arguments
|
|
-DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
|
|
-DOTHER_WORKSPACE=/home/mscherer/svn.unitools
|
|
-DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
|
|
-DCLDR_DIR=/home/mscherer/svn.cldr/trunk
|
|
-DUVERSION=9.0.0
|
|
-ea
|
|
- run Unicode Tools
|
|
org.unicode.draft.GenerateUnihanCollatorFiles
|
|
with the same arguments
|
|
- check CLDR diffs
|
|
cd ~/svn.cldr/trunk
|
|
meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
|
|
meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
|
|
- copy to CLDR
|
|
cd ~/svn.cldr/trunk
|
|
cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
|
|
cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
|
|
- commit to CLDR
|
|
- generate ICU zh collation data: run CLDR
|
|
org.unicode.cldr.icu.NewLdml2IcuConverter
|
|
with program arguments
|
|
-t collation
|
|
-s /home/mscherer/svn.cldr/trunk/common/collation
|
|
-m /home/mscherer/svn.cldr/trunk/common/supplemental
|
|
-d /home/mscherer/svn.icu/trunk/src/source/data/coll
|
|
-p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
|
|
zh
|
|
and VM arguments
|
|
-DCLDR_DIR=/home/mscherer/svn.cldr/trunk
|
|
- rebuild ICU4C
|
|
|
|
* run & fix ICU4C tests, now with new CLDR collation root data
|
|
- run all tests with the collation test data *_SHORT.txt or the full files
|
|
(the full ones have comments, useful for debugging)
|
|
- note on intltest: if collate/UCAConformanceTest fails, then
|
|
utility/MultithreadTest/TestCollators will fail as well;
|
|
fix the conformance test before looking into the multi-thread test
|
|
|
|
* update Java data files
|
|
- refresh just the UCD/UCA-related/derived files, just to be safe
|
|
- see (ICU4C)/source/data/icu4j-readme.txt
|
|
- mkdir /tmp/icu4j
|
|
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
output:
|
|
...
|
|
Unicode .icu files built to ./out/build/icudt58l
|
|
echo timestamp > uni-core-data
|
|
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
|
|
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
|
|
echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
|
|
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
|
|
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
|
|
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
|
|
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
|
|
make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
|
|
- copy the big-endian Unicode data files to another location,
|
|
separate from the other data files,
|
|
and then refresh ICU4J
|
|
cd ~/svn.icu/trunk/dbg/data/out/icu4j
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
|
|
cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
|
|
cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
|
|
cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
|
|
jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
|
|
|
|
* When refreshing all of ICU4J data from ICU4C
|
|
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
|
|
or
|
|
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
|
|
|
|
* update CollationFCD.java
|
|
+ copy & paste the initializers of lcccIndex[] etc. from
|
|
ICU4C/source/i18n/collationfcd.cpp to
|
|
ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
|
|
|
|
* refresh Java test .txt files
|
|
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cd $ICU_SRC_DIR/source/data/unidata
|
|
cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cd ../../test/testdata
|
|
cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
|
|
* run & fix ICU4J tests
|
|
|
|
*** LayoutEngine script information
|
|
|
|
* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
|
|
This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
|
|
in the working directory.
|
|
|
|
(It also generates ScriptRunData.cpp, which is no longer needed.)
|
|
|
|
It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
|
|
(a plain text file)
|
|
which maps ICU versions to the numbers of script/language constants
|
|
that were added then.
|
|
(This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
|
|
|
|
The generated files have a current copyright date and "@deprecated" statement.
|
|
|
|
* Review changes, fix Java tool if necessary, and copy to ICU4C
|
|
cd ~/svn.icu4j/trunk/src
|
|
meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
|
|
cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
|
|
cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
|
|
|
|
*** API additions
|
|
- send notice to icu-design about new born-@stable API (enum constants etc.)
|
|
|
|
*** merge the Unicode update branches back onto the trunk
|
|
- do not merge the icudata.jar and testdata.jar,
|
|
instead rebuild them from merged & tested ICU4C
|
|
- make sure that changes to Unicode tools & ICU tools are checked in
|
|
http://www.unicode.org/utility/trac/log/trunk/unicodetools
|
|
http://bugs.icu-project.org/trac/log/tools/trunk
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
New script codes early in ICU 58: http://bugs.icu-project.org/trac/ticket/11764
|
|
|
|
Adding
|
|
- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
|
|
- new combination/alias codes: Hanb, Jamo
|
|
- used in CLDR 29 and in spoof checker
|
|
- new Z* code: Zsye
|
|
|
|
Add new codes to uscript.h & UScript.java, see Unicode update logs.
|
|
-> com.ibm.icu.lang.UScript
|
|
find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
|
|
replace public static final int \1 = \2; \3
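The find/replace pair above can also be applied outside an editor. A minimal sketch in Python, assuming the enum lines are pasted in as plain strings; the function name and the sample line (including its numeric value and comment) are hypothetical, only the pattern and the replacement come from the notes above.

```python
import re

# Pattern and replacement as listed in the find/replace lines above.
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
JAVA_CONSTANT = r"public static final int \g<1> = \g<2>; \g<3>"


def c_enum_to_java(line: str) -> str:
    """Rewrite one uscript.h enum line as a UScript.java int constant.

    Lines that do not match the pattern are returned unchanged.
    """
    return USCRIPT_ENUM.sub(JAVA_CONSTANT, line)


if __name__ == "__main__":
    # Hypothetical input line; the value 999 and the comment are made up.
    sample = "USCRIPT_EXAMPLE_SCRIPT = 999,/* Exmp */"
    print(c_enum_to_java(sample))
    # prints: public static final int EXAMPLE_SCRIPT = 999; /* Exmp */
```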
|
|
|
|
Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
|
|
add new script codes.
|
|
"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.
|
|
|
|
Note: If we have to run preparseucd.py again before the Unicode 9 update,
|
|
then we need to manually keep/restore the new script codes.
|
|
|
|
ICU_ROOT=~/svn.icu/trunk
|
|
ICU_SRC_DIR=$ICU_ROOT/src
|
|
ICUDT=icudt57b
|
|
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
|
|
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
|
|
UNIDATA=$ICU_SRC_DIR/source/data/unidata
|
|
|
|
Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
|
|
see http://bugs.icu-project.org/trac/ticket/12141
|
|
|
|
make install, then icutools cmake & make, then
|
|
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
|
|
|
|
Generate Java data as usual, only update pnames.icu & uprops.icu.
|
|
|
|
*** LayoutEngine script information
|
|
|
|
* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
|
|
This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
|
|
in the working directory.
|
|
|
|
(It also generates ScriptRunData.cpp, which is no longer needed.)
|
|
|
|
It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
|
|
(a plain text file)
|
|
which maps ICU versions to the numbers of script/language constants
|
|
that were added then.
|
|
(This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
|
|
|
|
The generated files have a current copyright date and "@deprecated" statement.
|
|
|
|
* Review changes, fix Java tool if necessary, and copy to ICU4C
|
|
cd ~/svn.icu4j/trunk/src
|
|
meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
|
|
cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
|
|
cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802
|
|
|
|
Edit preparseucd.py to add & parse new properties.
|
|
They share the UCD property namespace but are not listed in PropertyAliases.txt.
|
|
|
|
Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
|
|
Initial data from emoji/2.0/
|
|
|
|
ICU_ROOT=~/svn.icu/trunk
|
|
ICU_SRC_DIR=$ICU_ROOT/src
|
|
ICUDT=icudt56b
|
|
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
|
|
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
|
|
UNIDATA=$ICU_SRC_DIR/source/data/unidata
|
|
|
|
Add binary-property constants to uchar.h enum UProperty & UProperty.java.
|
|
|
|
~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
|
|
(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
|
|
|
|
Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
|
|
|
|
make install, then icutools cmake & make, then
|
|
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
|
|
|
|
Generate Java data as usual, only update pnames.icu & uprops.icu.
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 8.0 update for ICU 56
|
|
|
|
* Command-line environment setup
|
|
|
|
ICU_ROOT=~/svn.icu/trunk
|
|
ICU_SRC_DIR=$ICU_ROOT/src
|
|
ICUDT=icudt56b
|
|
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
|
|
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
|
|
UNIDATA=$ICU_SRC_DIR/source/data/unidata
|
|
|
|
http://www.unicode.org/review/pri297/ -- beta review
|
|
http://www.unicode.org/reports/uax-proposed-updates.html
|
|
http://unicode.org/versions/beta-8.0.0.html
|
|
http://www.unicode.org/versions/Unicode8.0.0/
|
|
http://www.unicode.org/reports/tr44/tr44-15.html
|
|
|
|
*** ICU Trac
|
|
|
|
- ticket:11574: Unicode 8
|
|
- C++ branches/markus/uni80 at r37351 from trunk at r37343
|
|
- Java branches/markus/uni80 at r37352 from trunk at r37338
|
|
|
|
*** CLDR Trac
|
|
|
|
- cldrbug 8311: UCA 8
|
|
- branches/markus/uni80 at r11518 from trunk at r11517
|
|
|
|
- cldrbug 8109: Unicode 8.0 script metadata
|
|
- cldrbug 8418: Updated segmentation for Unicode 8.0
|
|
|
|
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
- com.ibm.icu.util.VersionInfo
|
|
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
|
|
|
|
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
|
|
so that the makefiles see the new version number.
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
|
|
- download UCD & IDNA files
|
|
- make sure that the Unicode data folder passed into preparseucd.py
|
|
includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
|
|
- only for manual diffs: remove version suffixes from the file names
|
|
~/unidata/uni70/20140403$ ../../desuffixucd.py .
|
|
(see https://sites.google.com/site/unicodetools/inputdata)
|
|
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
|
|
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
|
|
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
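For the "only for manual diffs" step above, removing version suffixes from file names can be sketched as follows. This is not the code of desuffixucd.py; it assumes suffixes of the form -8.0.0 or -8.0.0d5 directly before the file extension, and the function name is hypothetical.

```python
import re

# Assumed suffix shape: "-<digits>(.<digits>)*" with an optional draft marker
# such as "d5", immediately before the file extension.
_VERSION_SUFFIX = re.compile(r"-\d+(\.\d+)*(d\d+)?(?=\.[A-Za-z]+$)")


def strip_version_suffix(file_name: str) -> str:
    """Return file_name without a trailing version suffix, if it has one."""
    return _VERSION_SUFFIX.sub("", file_name)


if __name__ == "__main__":
    for name in ("UnicodeData-8.0.0d5.txt", "NamesList.txt", "emoji-data.txt"):
        print(name, "->", strip_version_suffix(name))
```

Names without a numeric suffix (including ones with other hyphens, like emoji-data.txt) are left alone.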
|
|
|
|
- also: from http://unicode.org/Public/security/8.0.0/ download new
|
|
confusables.txt & confusablesWholeScript.txt
|
|
and copy to $UNIDATA
|
|
~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
|
|
~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
|
|
|
|
* initial preparseucd.py changes
|
|
- remove new Unicode scripts from the
|
|
only-in-ISO-15924 list according to the error message:
|
|
ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
|
|
from _scripts_only_in_iso15924
|
|
-> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
|
|
and in com.ibm.icu.dev.test.lang.TestUScript.java
|
|
- property and file name change:
|
|
IndicMatraCategory -> IndicPositionalCategory
|
|
- UnicodeData.txt has unusual numeric values (fractions not in lowest terms, e.g., 2/12)
|
|
109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
|
|
109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
|
|
109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
|
|
109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
|
|
109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
|
|
109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
|
|
109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
|
|
109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
|
|
109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
|
|
109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
|
|
-> change preparseucd.py to map them to fractions in lowest terms (e.g., 2/12 becomes 1/6)
|
|
which are listed in DerivedNumericValues.txt;
|
|
keeps storage in data file simple
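The mapping of the Meroitic fraction values can be sketched as follows. This is not the preparseucd.py code, just a minimal illustration assuming the numeric value is field 8 of a UnicodeData.txt line; the function names are hypothetical.

```python
from fractions import Fraction


def reduce_numeric_value(nv: str) -> str:
    """Return a numeric value string with any fraction in lowest terms."""
    if "/" not in nv:
        return nv
    numerator, denominator = nv.split("/")
    value = Fraction(int(numerator), int(denominator))  # normalizes itself
    if value.denominator == 1:
        return str(value.numerator)
    return f"{value.numerator}/{value.denominator}"


def reduce_unicode_data_line(line: str) -> str:
    """Reduce the numeric value (field 8) of one UnicodeData.txt line."""
    fields = line.split(";")
    if len(fields) > 8 and fields[8]:
        fields[8] = reduce_numeric_value(fields[8])
    return ";".join(fields)


if __name__ == "__main__":
    line = "109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;"
    print(reduce_unicode_data_line(line))
```

Values such as 1/12, 5/12 and 7/12 are already in lowest terms and pass through unchanged.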
|
|
|
|
* PropertyValueAliases.txt changes
|
|
- 10 new Block (blk) values:
|
|
blk; Ahom ; Ahom
|
|
blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
|
|
blk; Cherokee_Sup ; Cherokee_Supplement
|
|
blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
|
|
blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
|
|
blk; Hatran ; Hatran
|
|
blk; Multani ; Multani
|
|
blk; Old_Hungarian ; Old_Hungarian
|
|
blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
|
|
blk; Sutton_SignWriting ; Sutton_SignWriting
|
|
-> add to uchar.h
|
|
use long property names for enum constants
|
|
-> add to UCharacter.UnicodeBlock IDs
|
|
Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
|
|
replace public static final int \1_ID = \2; \3
|
|
-> add to UCharacter.UnicodeBlock objects
|
|
Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
|
|
replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
|
|
- 6 new Script (sc) values:
|
|
sc ; Ahom ; Ahom
|
|
sc ; Hatr ; Hatran
|
|
sc ; Hluw ; Anatolian_Hieroglyphs
|
|
sc ; Hung ; Old_Hungarian
|
|
sc ; Mult ; Multani
|
|
sc ; Sgnw ; SignWriting
|
|
-> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
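The two Eclipse find/replace pairs listed for the new Block values above produce two Java lines per uchar.h enum line: the _ID constant and the UnicodeBlock object. A minimal sketch in Python; the patterns and replacements are the ones from the notes, while the function name and the sample line (value and comment) are hypothetical.

```python
import re

# First pair: UBLOCK enum constant -> UnicodeBlock *_ID int constant.
BLOCK_ID = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
BLOCK_ID_JAVA = r"public static final int \g<1>_ID = \g<2>; \g<3>"

# Second pair: UBLOCK enum constant -> UnicodeBlock object.
BLOCK_OBJ = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
BLOCK_OBJ_JAVA = (
    r'public static final UnicodeBlock \g<1> = '
    r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>'
)


def block_enum_to_java(line: str) -> tuple[str, str]:
    """Return (ID constant line, UnicodeBlock object line) for one enum line."""
    return BLOCK_ID.sub(BLOCK_ID_JAVA, line), BLOCK_OBJ.sub(BLOCK_OBJ_JAVA, line)


if __name__ == "__main__":
    # Hypothetical enum line; 999 and the comment are made up.
    id_line, object_line = block_enum_to_java("UBLOCK_EXAMPLE_BLOCK = 999, /*[1F000]*/")
    print(id_line)
    print(object_line)
```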
|
|
|
|
* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
|
|
(not strictly necessary for NOT_ENCODED scripts)
|
|
~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
|
|
|
|
* generate normalization data files
|
|
cd $ICU_ROOT/dbg
|
|
bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
|
|
bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
|
|
bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
|
|
bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
|
|
bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
|
|
|
|
* build ICU (make install)
|
|
so that the tools build can pick up the new definitions from the installed header files.
|
|
|
|
$ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
|
|
|
|
* build Unicode tools using CMake+make
|
|
|
|
~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
|
|
|
|
# Location (--prefix) of where ICU was installed.
|
|
set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
|
|
# Location of the ICU source tree.
|
|
set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
|
|
|
|
~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
|
|
~/svn.icutools/trunk/dbg/unicode/c$ make
|
|
|
|
* generate core properties data files
|
|
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
|
|
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
|
|
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
|
|
- rebuild ICU (make install) & tools
|
|
- run genuca again (see step above) so that it picks up the new nfc.nrm
|
|
- rebuild ICU (make install) & tools
|
|
|
|
* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
|
|
sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
|
|
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
|
|
- Unicode 6.0..8.0: U+2260, U+226E, U+226F
|
|
- nothing new in 8.0, no test file to update
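The grep for "disallowed_STD3_valid" on non-ASCII characters can be sketched as a small filter. It assumes the IdnaMappingTable.txt layout of semicolon-separated fields (code point or range, then status) with "#" comments; the function name and the sample lines are made up for illustration.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) code point ranges with status disallowed_STD3_valid
    that reach beyond ASCII."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop trailing comment
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        start_cp = int(start, 16)
        end_cp = int(end, 16) if end else start_cp
        if end_cp > 0x7F:
            found.append((start_cp, end_cp))
    return found


if __name__ == "__main__":
    # Sample lines in the assumed format, not copied from the real file.
    sample = [
        "0021..002C    ; disallowed_STD3_valid   # ASCII punctuation",
        "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
        "00C0          ; mapped ; 00E0           # some mapped character",
    ]
    for start_cp, end_cp in non_ascii_std3_valid(sample):
        print(f"U+{start_cp:04X}..U+{end_cp:04X}")
```

Any range printed that is not already covered (U+2260, U+226E, U+226F) would call for a test update.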
|
|
|
|
* run & fix ICU4C tests
|
|
- bad Cherokee case folding due to difference in fallbacks:
|
|
UCD case folding falls back to no mapping,
|
|
ICU runtime case folding falls back to lowercasing;
|
|
fixed casepropsbuilder.cpp to generate scf mappings to self
|
|
when there is an slc mapping but no scf
|
|
- Andy handles RBBI & spoof check test failures
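The Cherokee case folding fix described above can be illustrated with a small model. This is not the casepropsbuilder.cpp code; the dictionaries stand in for the slc/scf mappings, the function names are hypothetical, and the single Cherokee pair (U+13A0 lowercases to U+AB70, U+AB70 folds to U+13A0) is used only as an example.

```python
def runtime_fold(c, slc, scf):
    """Model of a runtime that falls back to lowercasing without an scf mapping."""
    if c in scf:
        return scf[c]
    return slc.get(c, c)


def add_self_folds(slc, scf):
    """Add scf mappings to self for characters with an slc mapping but no scf."""
    fixed = dict(scf)
    for c in slc:
        if c not in fixed:
            fixed[c] = c
    return fixed


if __name__ == "__main__":
    slc = {0x13A0: 0xAB70}  # simple lowercase mapping
    scf = {0xAB70: 0x13A0}  # simple case folding (folds to the uppercase letter)
    print(hex(runtime_fold(0x13A0, slc, scf)))                       # falls back to slc
    print(hex(runtime_fold(0x13A0, slc, add_self_folds(slc, scf))))  # stays itself
```

With the explicit self-mapping, the runtime fallback to lowercasing is never reached for these characters, which matches the UCD fallback of "no mapping".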
|
|
|
|
* collation: CLDR collation root, UCA DUCET
|
|
|
|
- UCA DUCET goes into Mark's Unicode tools, see
|
|
https://sites.google.com/site/unicodetools/home#TOC-UCA
|
|
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
|
|
- cd (CLDR UCA branch)/common/uca/
|
|
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
|
|
cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
|
|
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
|
|
cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
|
|
(note: the target file name drops the underscore before "Rules")
|
|
cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
|
|
- restore TODO diffs in UCARules.txt
|
|
meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
|
|
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
|
|
and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
|
|
from the CLDR root files (..._CLDR_..._SHORT.txt)
|
|
cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
|
|
cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
|
|
cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
|
|
- if CLDR common/uca/unihan-index.txt changes, then update
|
|
CLDR common/collation/root.xml <collation type="private-unihan">
|
|
and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
|
|
- run genuca, see command line above;
|
|
deal with
|
|
Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
|
|
(add the character to genuca.cpp sampleCharsToScripts[])
|
|
+ look up the script for the new sample characters
|
|
(e.g., in FractionalUCA.txt)
|
|
+ *add* mappings to sampleCharsToScripts[], do not replace them
|
|
(in case the script sample characters flip-flop)
|
|
+ insert new scripts in DUCET script order, see the top_byte table
|
|
at the beginning of FractionalUCA.txt
|
|
- rebuild ICU4C
|
|
|
|
* run & fix ICU4C tests, now with new CLDR collation root data
|
|
- run all tests with the collation test data *_SHORT.txt or the full files
|
|
(the full ones have comments, useful for debugging)
|
|
- note on intltest: if collate/UCAConformanceTest fails, then
|
|
utility/MultithreadTest/TestCollators will fail as well;
|
|
fix the conformance test before looking into the multi-thread test
|
|
- fixed bug in CollationWeights::getWeightRanges()
|
|
exposed by new data and CollationTest::TestRootElements
|
|
|
|
* update Java data files
|
|
- refresh just the UCD/UCA-related/derived files, just to be safe
|
|
- see (ICU4C)/source/data/icu4j-readme.txt
|
|
- mkdir /tmp/icu4j
|
|
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
output:
|
|
...
|
|
Unicode .icu files built to ./out/build/icudt56l
|
|
echo timestamp > uni-core-data
|
|
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
|
|
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
|
|
echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
|
|
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
|
|
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
|
|
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
|
|
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
|
|
make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
|
|
- copy the big-endian Unicode data files to another location,
|
|
separate from the other data files,
|
|
and then refresh ICU4J
|
|
cd ~/svn.icu/trunk/dbg/data/out/icu4j
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
|
|
cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
|
|
cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
|
|
cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
|
|
jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
|
|
|
|
* When refreshing all of ICU4J data from ICU4C
|
|
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
|
|
or
|
|
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
|
|
|
|
* update CollationFCD.java
|
|
+ copy & paste the initializers of lcccIndex[] etc. from
|
|
ICU4C/source/i18n/collationfcd.cpp to
|
|
ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
|
|
|
|
* refresh Java test .txt files
|
|
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cd $ICU_SRC_DIR/source/data/unidata
|
|
cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cd ../../test/testdata
|
|
cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
|
|
* run & fix ICU4J tests
|
|
|
|
*** LayoutEngine script information
|
|
|
|
* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
|
|
because the layout engine was deprecated in ICU 54.
|
|
Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
|
|
to write lines that we used to add manually.
|
|
|
|
* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
|
|
This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
|
|
in the working directory.
|
|
|
|
(It also generates ScriptRunData.cpp, which is no longer needed.)
|
|
|
|
It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
|
|
(a plain text file)
|
|
which maps ICU versions to the numbers of script/language constants
|
|
that were added then.
|
|
(This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
|
|
|
|
The generated files have a current copyright date and "@deprecated" statement.
|
|
|
|
* Review changes, fix Java tool if necessary, and copy to ICU4C
|
|
cd ~/svn.icu4j/trunk/src
|
|
meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
|
|
cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
|
|
cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
|
|
|
|
*** API additions
|
|
- send notice to icu-design about new born-@stable API (enum constants etc.)
|
|
|
|
*** merge the Unicode update branches back onto the trunk
|
|
- do not merge the icudata.jar and testdata.jar,
|
|
instead rebuild them from merged & tested ICU4C
|
|
- make sure that changes to Unicode tools & ICU tools are checked in
|
|
http://www.unicode.org/utility/trac/log/trunk/unicodetools
|
|
http://bugs.icu-project.org/trac/log/tools/trunk
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 7.0 update for ICU 54
|
|
|
|
http://www.unicode.org/review/pri271/ -- beta review
|
|
http://www.unicode.org/reports/uax-proposed-updates.html
|
|
http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
|
|
http://www.unicode.org/reports/tr44/tr44-13.html
|
|
|
|
*** ICU Trac
|
|
|
|
- ticket 10821: Unicode 7.0, UCA 7.0
|
|
- C++ branches/markus/uni70 at r35584 from trunk at r35580
|
|
- Java branches/markus/uni70 at r35587 from trunk at r35545
|
|
|
|
*** CLDR Trac
|
|
|
|
- ticket 7195: UCA 7.0 CLDR root collation
|
|
- branches/markus/uni70 at r10062 from trunk at r10061
|
|
|
|
- ticket 6762: script metadata for Unicode 7.0 new scripts
|
|
|
|
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
- com.ibm.icu.util.VersionInfo
|
|
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
|
|
|
|
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
|
|
so that the makefiles see the new version number.
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
|
|
- download UCD & IDNA files
|
|
- make sure that the Unicode data folder passed into preparseucd.py
|
|
includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
|
|
- only for manual diffs: remove version suffixes from the file names
|
|
~/unidata/uni70/20140403$ ../../desuffixucd.py .
|
|
(see https://sites.google.com/site/unicodetools/inputdata)
|
|
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
|
|
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
|
|
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
|
|
- Restore TODO diffs in source/data/unidata/UCARules.txt
|
|
cd $ICU_SRC_DIR
|
|
meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
|
|
- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt
|
|
|
|
- also: from http://unicode.org/Public/security/7.0.0/ download new
|
|
confusables.txt & confusablesWholeScript.txt
|
|
and copy to $ICU_ROOT/src/source/data/unidata/
|
|
|
|
* initial preparseucd.py changes
|
|
- remove new Unicode scripts from the
|
|
only-in-ISO-15924 list according to the error message:
|
|
ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
|
|
'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
|
|
'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
|
|
from _scripts_only_in_iso15924
|
|
-> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
|
|
and in com.ibm.icu.dev.test.lang.TestUScript.java
|
|
- NamesList.txt now has a heading with a non-ASCII character
|
|
+ keep ppucd.txt in platform charset, rather than changing tool/test parsers
|
|
+ escape non-ASCII characters in heading comments
|
|
- preparseucd.py gets the Unicode copyright line from PropertyAliases.txt, which is currently still at 2013
|
|
+ get the copyright from the first file whose copyright line contains the current year
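Picking the copyright line from the first file that carries the current year can be sketched like this. It is a minimal illustration, not the preparseucd.py code; the function name, the (name, lines) input shape, and the sample copyright lines are assumptions.

```python
def pick_copyright(files, year):
    """Return the copyright line of the first file whose copyright line
    contains the given year, or None if no file qualifies.

    files: iterable of (file_name, list_of_lines) pairs, in input order.
    """
    year_text = str(year)
    for _name, lines in files:
        for line in lines:
            if "Copyright" in line:
                if year_text in line:
                    return line.strip()
                break  # this file's copyright line is stale; try the next file
    return None


if __name__ == "__main__":
    # Made-up header lines for illustration.
    files = [
        ("PropertyAliases.txt", ["# Copyright (c) 1991-2013 Unicode, Inc."]),
        ("UnicodeData-derived.txt", ["# Copyright (c) 1991-2014 Unicode, Inc."]),
    ]
    print(pick_copyright(files, 2014))
```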
|
|
|
|
* PropertyValueAliases.txt changes
|
|
- 32 new Block (blk) values:
|
|
blk; Bassa_Vah ; Bassa_Vah
|
|
blk; Caucasian_Albanian ; Caucasian_Albanian
|
|
blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
|
|
blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
|
|
blk; Duployan ; Duployan
|
|
blk; Elbasan ; Elbasan
|
|
blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
|
|
blk; Grantha ; Grantha
|
|
blk; Khojki ; Khojki
|
|
blk; Khudawadi ; Khudawadi
|
|
blk; Latin_Ext_E ; Latin_Extended_E
|
|
blk; Linear_A ; Linear_A
|
|
blk; Mahajani ; Mahajani
|
|
blk; Manichaean ; Manichaean
|
|
blk; Mende_Kikakui ; Mende_Kikakui
|
|
blk; Modi ; Modi
|
|
blk; Mro ; Mro
|
|
blk; Myanmar_Ext_B ; Myanmar_Extended_B
|
|
blk; Nabataean ; Nabataean
|
|
blk; Old_North_Arabian ; Old_North_Arabian
|
|
blk; Old_Permic ; Old_Permic
|
|
blk; Ornamental_Dingbats ; Ornamental_Dingbats
|
|
blk; Pahawh_Hmong ; Pahawh_Hmong
|
|
blk; Palmyrene ; Palmyrene
|
|
blk; Pau_Cin_Hau ; Pau_Cin_Hau
|
|
blk; Psalter_Pahlavi ; Psalter_Pahlavi
|
|
blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
|
|
blk; Siddham ; Siddham
|
|
blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
|
|
blk; Sup_Arrows_C ; Supplemental_Arrows_C
|
|
blk; Tirhuta ; Tirhuta
|
|
blk; Warang_Citi ; Warang_Citi
|
|
-> add to uchar.h
|
|
use long property names for enum constants
|
|
-> add to UCharacter.UnicodeBlock IDs
|
|
Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
|
|
replace public static final int \1_ID = \2; \3
|
|
-> add to UCharacter.UnicodeBlock objects
|
|
Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
|
|
replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
|
|
- 28 new Joining_Group (jg) values:
|
|
jg ; Manichaean_Aleph ; Manichaean_Aleph
|
|
jg ; Manichaean_Ayin ; Manichaean_Ayin
|
|
jg ; Manichaean_Beth ; Manichaean_Beth
|
|
jg ; Manichaean_Daleth ; Manichaean_Daleth
|
|
jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
|
|
jg ; Manichaean_Five ; Manichaean_Five
|
|
jg ; Manichaean_Gimel ; Manichaean_Gimel
|
|
jg ; Manichaean_Heth ; Manichaean_Heth
|
|
jg ; Manichaean_Hundred ; Manichaean_Hundred
|
|
jg ; Manichaean_Kaph ; Manichaean_Kaph
|
|
jg ; Manichaean_Lamedh ; Manichaean_Lamedh
|
|
jg ; Manichaean_Mem ; Manichaean_Mem
|
|
jg ; Manichaean_Nun ; Manichaean_Nun
|
|
jg ; Manichaean_One ; Manichaean_One
|
|
jg ; Manichaean_Pe ; Manichaean_Pe
|
|
jg ; Manichaean_Qoph ; Manichaean_Qoph
|
|
jg ; Manichaean_Resh ; Manichaean_Resh
|
|
jg ; Manichaean_Sadhe ; Manichaean_Sadhe
|
|
jg ; Manichaean_Samekh ; Manichaean_Samekh
|
|
jg ; Manichaean_Taw ; Manichaean_Taw
|
|
jg ; Manichaean_Ten ; Manichaean_Ten
|
|
jg ; Manichaean_Teth ; Manichaean_Teth
|
|
jg ; Manichaean_Thamedh ; Manichaean_Thamedh
|
|
jg ; Manichaean_Twenty ; Manichaean_Twenty
|
|
jg ; Manichaean_Waw ; Manichaean_Waw
|
|
jg ; Manichaean_Yodh ; Manichaean_Yodh
|
|
jg ; Manichaean_Zayin ; Manichaean_Zayin
|
|
jg ; Straight_Waw ; Straight_Waw
|
|
-> uchar.h & UCharacter.JoiningGroup
|
|
- 23 new Script (sc) values:
|
|
sc ; Aghb ; Caucasian_Albanian
|
|
sc ; Bass ; Bassa_Vah
|
|
sc ; Dupl ; Duployan
|
|
sc ; Elba ; Elbasan
|
|
sc ; Gran ; Grantha
|
|
sc ; Hmng ; Pahawh_Hmong
|
|
sc ; Khoj ; Khojki
|
|
sc ; Lina ; Linear_A
|
|
sc ; Mahj ; Mahajani
|
|
sc ; Mani ; Manichaean
|
|
sc ; Mend ; Mende_Kikakui
|
|
sc ; Modi ; Modi
|
|
sc ; Mroo ; Mro
|
|
sc ; Narb ; Old_North_Arabian
|
|
sc ; Nbat ; Nabataean
|
|
sc ; Palm ; Palmyrene
|
|
sc ; Pauc ; Pau_Cin_Hau
|
|
sc ; Perm ; Old_Permic
|
|
sc ; Phlp ; Psalter_Pahlavi
|
|
sc ; Sidd ; Siddham
|
|
sc ; Sind ; Khudawadi
|
|
sc ; Tirh ; Tirhuta
|
|
sc ; Wara ; Warang_Citi
|
|
-> uscript.h (many were added before)
|
|
comment "Mende Kikakui" for USCRIPT_MENDE
|
|
add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
|
|
-> com.ibm.icu.lang.UScript
|
|
find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
|
|
replace public static final int \1 = \2; \3
|
|
- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
|
|
(added 2012-11-01)
|
|
Ahom 338 Ahom
|
|
Hatr 127 Hatran
|
|
Mult 323 Multani
|
|
(added 2013-10-12)
|
|
Modi 324 Modi
|
|
Pauc 263 Pau Cin Hau
|
|
Sidd 302 Siddham
|
|
-> uscript.h (some overlap with additions from Unicode)
|
|
-> com.ibm.icu.lang.UScript
|
|
find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
|
|
replace public static final int \1 = \2; \3
|
|
-> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
|
|
-> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
|
|
and in com.ibm.icu.dev.test.lang.TestUScript.java
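The consistency check behind the "ValueError: remove [...] from _scripts_only_in_iso15924" message can be sketched as a set intersection. This is an assumption about the shape of the check, not the preparseucd.py code; the function name is hypothetical and the sorted order of the reported codes is a choice made here.

```python
def check_scripts_only_in_iso15924(only_in_iso15924, unicode_scripts):
    """Raise ValueError if a script listed as ISO-15924-only is now in Unicode."""
    stale = sorted(set(only_in_iso15924) & set(unicode_scripts))
    if stale:
        raise ValueError(f"remove {stale} from _scripts_only_in_iso15924")


if __name__ == "__main__":
    only_in_iso = {"Ahom", "Hatr", "Mult", "Qaaa"}  # Qaaa is a placeholder code
    in_unicode = {"Latn", "Ahom", "Hatr", "Mult"}
    try:
        check_scripts_only_in_iso15924(only_in_iso, in_unicode)
    except ValueError as error:
        print("ValueError:", error)
```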
|
|
|
|
* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
|
|
(not strictly necessary for NOT_ENCODED scripts)
|
|
~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
|
|
|
|
* generate normalization data files
|
|
- cd $ICU_ROOT/dbg
|
|
- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
|
|
- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
|
|
- UNIDATA=$ICU_SRC_DIR/source/data/unidata
|
|
- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
|
|
- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
|
|
- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
|
|
- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
|
|
- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
|
|
|
|
* build ICU (make install)
|
|
so that the tools build can pick up the new definitions from the installed header files.
|
|
|
|
~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
|
|
|
|
* build Unicode tools using CMake+make
|
|
|
|
~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
|
|
|
|
# Location (--prefix) of where ICU was installed.
|
|
set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
|
|
# Location of the ICU source tree.
|
|
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)
|
|
|
|
~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
|
|
~/svn.icutools/trunk/dbg/unicode/c$ make
|
|
|
|
* genprops work
|
|
- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
|
|
+ add second array of Joining_Group values for at most 10800..10FFF
|
|
icutools: unicode/c/genprops/bidipropsbuilder.cpp
|
|
icu: source/common/ubidi_props.h/.c/_data.h
|
|
icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
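The idea of a second Joining_Group array can be sketched as a two-range lookup. This is a simplified model, not the ubidi_props or UBiDiProps.java code; the class name, the first array's start, and all stored values are made up for illustration. Only the Manichaean range start 10AC0 comes from the notes above.

```python
NO_JOINING_GROUP = 0  # value returned outside both arrays


class JoiningGroupTable:
    """Joining_Group lookup backed by two dense arrays over separate ranges."""

    def __init__(self, start, values, start2, values2):
        self.start, self.values = start, values      # original range
        self.start2, self.values2 = start2, values2  # added second range

    def joining_group(self, c):
        index = c - self.start
        if 0 <= index < len(self.values):
            return self.values[index]
        index = c - self.start2
        if 0 <= index < len(self.values2):
            return self.values2[index]
        return NO_JOINING_GROUP


if __name__ == "__main__":
    # Made-up values; real data would come from the generated properties file.
    table = JoiningGroupTable(0x620, [2, 0, 1], 0x10AC0, [7, 8])
    print(table.joining_group(0x620), table.joining_group(0x10AC1), table.joining_group(0x41))
```

Two short arrays avoid one sparse array spanning the gap between the two code point ranges.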
|
|
|
|
* generate core properties data files
|
|
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
|
|
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
|
|
- rebuild ICU (make install) & tools
|
|
- run genuca again (see step above) so that it picks up the new nfc.nrm
|
|
- rebuild ICU (make install) & tools
|
|
|
|
* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
|
|
sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
|
|
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
|
|
- Unicode 6.0..7.0: U+2260, U+226E, U+226F
|
|
- nothing new in 7.0, no test file to update
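
As an alternative to a plain grep, the check above could be scripted. The following is a minimal sketch that assumes each data line has the layout "code point or range ; status ; ... # comment"; the function name and the sample lines are made up for illustration.

```python
def non_ascii_std3_valid(lines):
    """Yield (start, end) code point ranges whose status is
    disallowed_STD3_valid and which start above the ASCII range."""
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if start > 0x7F:
            yield (start, end)


# Illustrative sample lines in the assumed layout, not copied from a real file.
sample = [
    "# comment line",
    "0000..002C    ; disallowed_STD3_valid                  # <control-0000>..COMMA",
    "00C0          ; mapped                 ; 00E0          # LATIN CAPITAL LETTER A WITH GRAVE",
    "2260          ; disallowed_STD3_valid                  # NOT EQUAL TO",
]

for start, end in non_ascii_std3_valid(sample):
    print("U+%04X..U+%04X" % (start, end))
```

For a real run, pass an open file object for IdnaMappingTable.txt instead of the sample list, and compare the result with the characters already covered by the tests.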
|
|
|
|
* run & fix ICU4C tests
|
|
|
|
* update Java data files
|
|
- refresh just the UCD-related files, just to be safe
|
|
- see (ICU4C)/source/data/icu4j-readme.txt
|
|
- mkdir /tmp/icu4j
|
|
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
output:
|
|
...
|
|
Unicode .icu files built to ./out/build/icudt53l
|
|
echo timestamp > uni-core-data
|
|
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
|
|
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
|
|
echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
|
|
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
|
|
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
|
|
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
|
|
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
|
|
make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
|
|
- copy the big-endian Unicode data files to another location,
|
|
separate from the other data files
|
|
ICUDT=icudt54b
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
|
|
cd ~/svn.icu/uni70/dbg/data/out/icu4j
|
|
cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
|
|
cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
|
|
cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
|
|
cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
|
|
- refresh ICU4J
|
|
~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
|
|
|
|
* update CollationFCD.java
|
|
+ copy & paste the initializers of lcccIndex[] etc. from
|
|
ICU4C/source/i18n/collationfcd.cpp to
|
|
ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
|
|
|
|
* refresh Java test .txt files
|
|
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cd $ICU_SRC_DIR/source/data/unidata
|
|
cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cd ../../test/testdata
|
|
cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
|
|
* UCA
|
|
|
|
- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
|
|
- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
|
|
- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
|
|
- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
|
|
- output files are in ~/svn.unitools/Generated/uca/7.0.0/
|
|
- review data; compare files, use blankweights.sed or similar
|
|
~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
|
|
- cd ~/svn.unitools/Generated/uca/7.0.0/
|
|
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
|
|
cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
|
|
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
|
|
(note removing the underscore before "Rules")
|
|
cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
|
|
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
|
|
and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
|
|
with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
|
|
cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
|
|
cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
|
|
cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
|
|
- run genuca, see command line above
|
|
- rebuild ICU4C
|
|
- refresh ICU4J collation data:
|
|
(subset of the instructions above for the properties data refresh, except that this copies all coll/*)
|
|
ICUDT=icudt54b
|
|
~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
|
|
~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
|
|
~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
|
|
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
|
|
- note on intltest: if collate/UCAConformanceTest fails, then
|
|
utility/MultithreadTest/TestCollators will fail as well;
|
|
fix the conformance test before looking into the multi-thread test
|
|
- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
|
|
- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
|
|
~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
|
|
|
|
* When refreshing all of ICU4J data from ICU4C
|
|
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
|
|
or
|
|
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
|
|
|
|
* run & fix ICU4J tests
|
|
|
|
*** LayoutEngine script information
|
|
|
|
(For details see the Unicode 5.2 change log below.)
|
|
|
|
* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
|
|
This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
|
|
in the working directory.
|
|
(It also generates ScriptRunData.cpp, which is no longer needed.)
|
|
|
|
The generated files have a current copyright date and "@stable" statement.
|
|
ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
|
|
for "born stable" Unicode API constants, and to stop parsing ICU version numbers
|
|
which may not contain dots any more.
|
|
|
|
- diff current <icu>/source/layout files vs. generated ones
|
|
~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
|
|
review and manually merge desired changes;
|
|
fix gratuitous changes, incorrect @draft/@stable and missing aliases;
|
|
Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
|
|
- if you just copy the above files, then
|
|
fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
|
|
manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
|
|
|
|
*** API additions
|
|
- send notice to icu-design about new born-@stable API (enum constants etc.)
|
|
|
|
*** merge the Unicode update branches back onto the trunk
|
|
- do not merge the icudata.jar and testdata.jar,
|
|
instead rebuild them from merged & tested ICU4C
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 6.3 update
|
|
|
|
http://www.unicode.org/review/pri249/ -- beta review
|
|
http://www.unicode.org/reports/uax-proposed-updates.html
|
|
http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
|
|
http://www.unicode.org/reports/tr44/tr44-11.html
|
|
|
|
*** ICU Trac
|
|
|
|
- ticket 10128: update ICU to Unicode 6.3 beta
|
|
- ticket 10168: update ICU to Unicode 6.3 final
|
|
- C++ branches/markus/uni63 at r33552 from trunk at r33551
|
|
- Java branches/markus/uni63 at r33550 from trunk at r33553
|
|
|
|
- ticket 10142: implement Unicode 6.3 bidi algorithm additions
|
|
|
|
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
(configure.in & configure: have been modified to extract the version from uchar.h)
|
|
- com.ibm.icu.util.VersionInfo
|
|
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
|
|
|
|
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
|
|
so that the makefiles see the new version number.
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
|
|
- download UCD, UCA & IDNA files
|
|
- make sure that the Unicode data folder passed into preparseucd.py
|
|
includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
|
|
- modify preparseucd.py:
|
|
parse new file BidiBrackets.txt
|
|
with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
|
|
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
|
|
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
|
|
- Check test file diffs for previously commented-out, known-failing data lines;
|
|
probably need to keep those commented out.
|
|
|
|
* PropertyAliases.txt changes
|
|
- 1 new Enumerated Property
|
|
bpt ; Bidi_Paired_Bracket_Type
|
|
-> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
|
|
-> ubidi_props.h & .c & UBiDiProps.java
|
|
-> remember to write the max value at UBIDI_MAX_VALUES_INDEX
|
|
-> uprops.cpp
|
|
-> change ubidi.icu format version from 2.0 to 2.1
|
|
- 1 new Miscellaneous Property
|
|
bpb ; Bidi_Paired_Bracket
|
|
-> uchar.h & UProperty.java
|
|
-> ppucd.h & .cpp
|
|
|
|
* PropertyValueAliases.txt changes
|
|
- 3 Bidi_Paired_Bracket_Type (bpt) values:
|
|
bpt; c ; Close
|
|
bpt; n ; None
|
|
bpt; o ; Open
|
|
-> uchar.h & UCharacter.BidiPairedBracketType
|
|
-> ubidi_props.h & .c & UBiDiProps.java
|
|
-> change ubidi.icu format version from 2.0 to 2.1
|
|
- 4 new Bidi_Class (bc) values:
|
|
bc ; FSI ; First_Strong_Isolate
|
|
bc ; LRI ; Left_To_Right_Isolate
|
|
bc ; RLI ; Right_To_Left_Isolate
|
|
bc ; PDI ; Pop_Directional_Isolate
|
|
-> uchar.h & UCharacterEnums.ECharacterDirection
|
|
-> until the bidi code gets updated,
|
|
Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
|
|
- 3 new Word_Break (WB) values:
|
|
WB ; HL ; Hebrew_Letter
|
|
WB ; SQ ; Single_Quote
|
|
WB ; DQ ; Double_Quote
|
|
-> uchar.h & UCharacter.WordBreak
|
|
-> first time Word_Break numeric constants exceed 4 bits (now 17 values)
|
|
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
|
|
(added 2012-10-16)
|
|
Aghb 239 Caucasian Albanian
|
|
Mahj 314 Mahajani
|
|
-> uscript.h
|
|
-> com.ibm.icu.lang.UScript
|
|
find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
|
|
replace public static final int \1 = \2;\3
|
|
-> preparseucd.py _scripts_only_in_iso15924
|
|
-> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
|
|
and in com.ibm.icu.dev.test.lang.TestUScript.java
|
|
-> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
|
|
(not strictly necessary for NOT_ENCODED scripts)
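
The find/replace recipe for UScript above can also be applied outside an editor. A minimal Python sketch, using the same pattern and replacement as written above; the function name and the example line (name, number, comment) are hypothetical.

```python
import re

# Same pattern and replacement as the find/replace recipe above.
PATTERN = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
REPLACEMENT = r"public static final int \1 = \2;\3"


def c_enum_to_java(line):
    """Rewrite one uscript.h enum line into a Java int constant line.
    Lines that do not match the pattern are returned unchanged."""
    return PATTERN.sub(REPLACEMENT, line)


# Made-up input line; the constant name and value are placeholders.
example = "    USCRIPT_EXAMPLE_SCRIPT = 999,/* Xmpl */"
print(c_enum_to_java(example))
```

Note that the pattern requires at least one character after the comma, so an enum line with nothing following the comma is left as is and needs manual editing.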
|
|
|
|
* generate normalization data files
|
|
- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
|
|
- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
|
|
- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
|
|
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
|
|
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
|
|
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
|
|
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
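
The four gennorm2 calls above differ only in the output file and the list of input files. A small sketch that builds the same command lines from a table; the function and variable names are hypothetical, and nothing is executed here.

```python
# Output file -> input files, as in the four gennorm2 calls above.
NORM2_TARGETS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(src_data_in, unidata, tool="bin/gennorm2"):
    """Return one argv list per .nrm file, mirroring the commands above."""
    return [
        [tool, "-o", f"{src_data_in}/{out}", "-s", f"{unidata}/norm2"] + inputs
        for out, inputs in NORM2_TARGETS
    ]


for argv in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
    print(" ".join(argv))
```

To actually run the commands, each list could be passed to subprocess.run(argv, check=True) from the build directory, with real paths and with LD_LIBRARY_PATH set as shown above.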
|
|
|
|
* build ICU (make install)
|
|
so that the tools build can pick up the new definitions from the installed header files.
|
|
|
|
~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
|
|
|
|
* build Unicode tools using CMake+make
|
|
|
|
~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
|
|
|
|
# Location (--prefix) of where ICU was installed.
|
|
set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
|
|
# Location of the ICU source tree.
|
|
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)
|
|
|
|
~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
|
|
~/svn.icutools/trunk/dbg/unicode/c$ make
|
|
|
|
* generate core properties data files
|
|
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
|
|
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
|
|
- rebuild ICU (make install) & tools
|
|
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
|
|
- rebuild ICU (make install) & tools
|
|
|
|
* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
|
|
sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
|
|
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
|
|
- Unicode 6.0..6.3: U+2260, U+226E, U+226F
|
|
- nothing new in 6.3, no test file to update
|
|
|
|
* update Java data files
|
|
- refresh just the UCD-related files, just to be safe
|
|
- see (ICU4C)/source/data/icu4j-readme.txt
|
|
- mkdir /tmp/icu4j
|
|
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
output:
|
|
...
|
|
Unicode .icu files built to ./out/build/icudt52l
|
|
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
|
|
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
|
|
echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
|
|
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
|
|
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
|
|
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
|
|
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
|
|
make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
|
|
- copy the big-endian Unicode data files to another location,
|
|
separate from the other data files
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
|
|
~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
|
|
~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
|
|
~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
|
|
~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
|
|
~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
|
|
- refresh ICU4J
|
|
~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
|
|
|
|
* refresh Java test .txt files
|
|
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
|
|
* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files
|
|
|
|
- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
|
|
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
|
|
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
|
|
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
|
|
(note removing the underscore before "Rules")
|
|
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
|
|
and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
|
|
with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
|
|
- check test file diffs for previously commented-out, known-failing data lines;
|
|
probably need to keep those commented out
|
|
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
|
|
- run genuca, see command line above
|
|
- rebuild ICU4C
|
|
- refresh ICU4J collation data:
|
|
(subset of the instructions above for the properties data refresh, except that this copies all coll/*)
|
|
~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
|
|
~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
|
|
~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
|
|
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
|
|
- note on intltest: if collate/UCAConformanceTest fails, then
|
|
utility/MultithreadTest/TestCollators will fail as well;
|
|
fix the conformance test before looking into the multi-thread test
|
|
|
|
* test ICU, fix test code where necessary
|
|
|
|
* When refreshing all of ICU4J data from ICU4C
|
|
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
|
|
or
|
|
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
|
|
|
|
*** LayoutEngine script information
|
|
- skipped for Unicode 6.3: no new scripts
|
|
|
|
*** merge the Unicode update branches back onto the trunk
|
|
- do not merge the icudata.jar and testdata.jar,
|
|
instead rebuild them from merged & tested ICU4C
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 6.2 update
|
|
|
|
http://www.unicode.org/review/pri230/
|
|
http://www.unicode.org/versions/beta-6.2.0.html
|
|
http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
|
|
http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
|
|
http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
|
|
http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
|
|
http://www.unicode.org/reports/tr46/tr46-8.html IDNA
|
|
http://unicode.org/Public/idna/6.2.0/
|
|
|
|
*** ICU Trac
|
|
|
|
- ticket 9515: Unicode 6.2: final ICU update
|
|
|
|
- ticket 9514: UCA 6.2: fix UCARules.txt
|
|
|
|
- ticket 9437: update ICU to Unicode 6.2
|
|
- C++ branches/markus/uni62 at r32050 from trunk at r32041
|
|
- Java branches/markus/uni62 at r32068 from trunk at r32066
|
|
|
|
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
(configure.in & configure: have been modified to extract the version from uchar.h)
|
|
- com.ibm.icu.util.VersionInfo
|
|
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
|
|
- download UCD, UCA & IDNA files
|
|
- make sure that the Unicode data folder passed into preparseucd.py
|
|
includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
|
|
- modify preparseucd.py: NamesList.txt is now in UTF-8
|
|
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
|
|
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
|
|
- Check test file diffs for previously commented-out, known-failing data lines;
|
|
probably need to keep those commented out.
|
|
|
|
* PropertyValueAliases.txt changes
|
|
- 1 new Line_Break (lb) value:
|
|
lb ; RI ; Regional_Indicator
|
|
-> uchar.h & UCharacter.LineBreak
|
|
- 1 new Word_Break (WB) value:
|
|
WB ; RI ; Regional_Indicator
|
|
-> uchar.h & UCharacter.WordBreak
|
|
- 1 new Grapheme_Cluster_Break (GCB) value:
|
|
GCB; RI ; Regional_Indicator
|
|
-> uchar.h & UCharacter.GraphemeClusterBreak
|
|
|
|
* 3 new numeric values
|
|
The new value -1 was really meant to be NaN, but that would have required new UnicodeData.txt syntax.
It can already be represented as a "fraction" of -1/1,
but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
|
|
cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
|
|
cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
|
|
The two new values 216000 and 432000 require an addition to the encoding of numeric values.
|
|
cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
|
|
cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
|
|
-> uprops.h, uchar.c & UCharacterProperty.java
|
|
-> cucdtst.c & UCharacterTest.java
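
To illustrate the point about -1 fitting a numerator/denominator model, here is a minimal sketch that reads the nv= field from the ppucd-style lines quoted above. It is not how ICU encodes numeric values; parse_nv is a hypothetical helper, and it handles only single code points as in those lines.

```python
from fractions import Fraction


def parse_nv(line):
    """Return (code point, numeric value) from a ppucd-style 'cp;...' line,
    or None if the line is not a cp line or has no nv= field."""
    fields = line.strip().split(";")
    if len(fields) < 2 or fields[0] != "cp":
        return None
    for field in fields[2:]:
        if field.startswith("nv="):
            return int(fields[1], 16), Fraction(field[3:])
    return None


# Lines copied from the notes above.
lines = [
    "cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1",
    "cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000",
    "cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000",
]
values = dict(parse_nv(line) for line in lines)
```

Here values[0x12456] is the fraction -1/1, while the other two are plain integers (216000 and 2 * 216000) that are simply larger than what the previous encoding covered.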
|
|
|
|
* generate normalization data files
|
|
- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
|
|
- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
|
|
- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
|
|
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
|
|
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
|
|
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
|
|
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
|
|
|
|
* build ICU (make install)
|
|
so that the tools build can pick up the new definitions from the installed header files.
|
|
* build Unicode tools using CMake+make
|
|
|
|
* generate core properties data files
|
|
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
|
|
- in initial bootstrapping, change the UCA version
|
|
in source/data/unidata/FractionalUCA.txt to match the new Unicode version
|
|
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
|
|
- rebuild ICU (make install) & tools
|
|
+ if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
|
|
check if the UCA version in FractionalUCA.txt matches the new Unicode version
|
|
(see step above)
|
|
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
|
|
- rebuild ICU (make install) & tools
|
|
|
|
* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
|
|
sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
|
|
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
|
|
- Unicode 6.0..6.2: U+2260, U+226E, U+226F
|
|
- nothing new in 6.2, no test file to update
|
|
|
|
* update Java data files
|
|
- refresh just the UCD-related files, just to be safe
|
|
- see (ICU4C)/source/data/icu4j-readme.txt
|
|
- mkdir /tmp/icu4j
|
|
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
output:
|
|
...
|
|
Unicode .icu files built to ./out/build/icudt50l
|
|
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
|
|
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
|
|
echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
|
|
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
|
|
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
|
|
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
|
|
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
|
|
make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
|
|
- copy the big-endian Unicode data files to another location,
|
|
separate from the other data files
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
|
|
~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
|
|
~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
|
|
~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
|
|
~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
|
|
~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
|
|
- refresh ICU4J
|
|
~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
|
|
|
|
* refresh Java test .txt files
|
|
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
|
|
* UCA
|
|
|
|
- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
|
|
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
|
|
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
|
|
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
|
|
(note removing the underscore before "Rules")
|
|
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
|
|
and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
|
|
with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
|
|
- check test file diffs for previously commented-out, known-failing data lines;
|
|
probably need to keep those commented out
|
|
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
|
|
- run genuca, see command line above
|
|
- rebuild ICU4C
|
|
- refresh ICU4J collation data:
|
|
(subset of the instructions above for the properties data refresh, except that this copies all coll/*)
|
|
~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
|
|
~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
|
|
~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
|
|
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
|
|
- note on intltest: if collate/UCAConformanceTest fails, then
|
|
utility/MultithreadTest/TestCollators will fail as well;
|
|
fix the conformance test before looking into the multi-thread test
|
|
|
|
* test ICU, fix test code where necessary
|
|
|
|
* When refreshing all of ICU4J data from ICU4C
|
|
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
|
|
or
|
|
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
|
|
|
|
*** LayoutEngine script information
|
|
- skipped for Unicode 6.2: no new scripts
|
|
|
|
*** merge the Unicode update branches back onto the trunk
|
|
- do not merge the icudata.jar and testdata.jar,
|
|
instead rebuild them from merged & tested ICU4C
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Future Unicode update
|
|
|
|
Tools have been simplified since the Unicode 6.1 update. See
|
|
- http://site.icu-project.org/design/props/ppucd
|
|
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972
|
|
|
|
* Unicode version numbers
|
|
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates
|
|
|
|
* file preparation
|
|
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
|
|
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
|
|
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
|
|
- Check test file diffs for previously commented-out, known-failing data lines;
|
|
probably need to keep those commented out.
|
|
|
|
* PropertyValueAliases.txt changes
|
|
- Script codes that are in ISO 15924 but not in Unicode are now listed in
|
|
preparseucd.py, in the _scripts_only_in_iso15924 variable.
|
|
If there are new ISO codes, then add them.
|
|
If Unicode adds some of them, then remove them from the .py variable.
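
The bookkeeping described above amounts to a set update. A rough sketch, assuming the variable can be treated as a set of four-letter codes; the real variable in preparseucd.py may have a different type and contents, and the codes and scenario below are only placeholders.

```python
def update_iso_only(iso_only, new_iso_codes, unicode_scripts):
    """Add newly registered ISO 15924 codes, then drop every code
    that Unicode now lists as a script property value."""
    return (set(iso_only) | set(new_iso_codes)) - set(unicode_scripts)


# Hypothetical scenario: one new ISO code is registered ("Xxx1"),
# and one previously ISO-only code ("Aghb") is assumed to be in Unicode now.
iso_only = {"Aghb", "Mahj"}
updated = update_iso_only(iso_only, {"Xxx1"}, {"Latn", "Aghb"})
print(sorted(updated))
```

Doing the removal after the addition keeps a code out of the ISO-only list if it is registered and encoded in the same update.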
|
|
|
|
* UnicodeData.txt changes
|
|
- No more manual changes for CJK ranges for algorithmic names;
|
|
those are now written to ppucd.txt and genprops reads them from there.
|
|
|
|
* generate core properties data files (makeprops.sh was deleted)
|
|
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
|
|
|
|
* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
|
|
- it is now generated by preparseucd.py
|
|
|
|
* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
|
|
- it is now generated by preparseucd.py
|
|
- make sure that the Unicode data folder passed into preparseucd.py
|
|
includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
|
|
(can be in some subfolder)
|
|
|
|
* generate normalization data files
|
|
- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
|
|
- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
|
|
- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
|
|
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
|
|
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
|
|
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
|
|
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
|
|
|
|
* build ICU (make install)
|
|
* build Unicode tools using CMake+make
|
|
|
|
* new way to call genuca (makeuca.sh was deleted)
|
|
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 6.1 update
|
|
|
|
*** ICU Trac
|
|
|
|
- ticket 8995 final update to Unicode 6.1
|
|
- ticket 8994 regenerate source/layout/CanonData.cpp
|
|
|
|
- ticket 8961 support Unicode "Age" value *names*
|
|
- ticket 8963 support multiple character name aliases & types
|
|
|
|
- ticket 8827 "update ICU to Unicode 6.1"
|
|
- C++ branches/markus/uni61 at r30864 from trunk at r30843
|
|
- Java branches/markus/uni61 at r30865 from trunk at r30863
|
|
|
|
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
(configure.in & configure have been modified to extract the version from uchar.h)
|
|
- com.ibm.icu.util.VersionInfo
|
|
- icutools/unicode/makedefs.sh
|
|
+ also review & update other definitions in that file,
|
|
e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
|
|
~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
|
|
- This prepares both unidata and testdata files in respective output subfolders.
|
|
- Check test file diffs for previously commented-out, known-failing data lines;
|
|
probably need to keep those commented out.
|
|
|
|
* PropertyValueAliases.txt changes
|
|
- 11 new block names:
|
|
Arabic_Extended_A
|
|
Arabic_Mathematical_Alphabetic_Symbols
|
|
Chakma
|
|
Meetei_Mayek_Extensions
|
|
Meroitic_Cursive
|
|
Meroitic_Hieroglyphs
|
|
Miao
|
|
Sharada
|
|
Sora_Sompeng
|
|
Sundanese_Supplement
|
|
Takri
|
|
-> add to uchar.h
|
|
-> add to UCharacter.UnicodeBlock IDs
|
|
Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
|
|
replace public static final int \1_ID = \2; \3
|
|
-> add to UCharacter.UnicodeBlock objects
|
|
Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
|
|
replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
|
|
- 1 new Joining_Group (jg) value:
|
|
Rohingya_Yeh
|
|
-> uchar.h & UCharacter.JoiningGroup
|
|
- 2 new Line_Break (lb) values:
|
|
CJ=Conditional_Japanese_Starter
|
|
HL=Hebrew_Letter
|
|
-> uchar.h & UCharacter.LineBreak
|
|
- 7 new scripts:
|
|
sc ; Cakm ; Chakma
|
|
sc ; Merc ; Meroitic_Cursive
|
|
sc ; Mero ; Meroitic_Hieroglyphs
|
|
sc ; Plrd ; Miao
|
|
sc ; Shrd ; Sharada
|
|
sc ; Sora ; Sora_Sompeng
|
|
sc ; Takr ; Takri
|
|
-> remove these from SyntheticPropertyValueAliases.txt
|
|
-> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
|
|
and in com.ibm.icu.dev.test.lang.TestUScript.java
|
|
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
|
|
(added 2011-06-21)
|
|
Khoj 322 Khojki
|
|
Tirh 326 Tirhuta
|
|
and another one added 2011-12-09
|
|
Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
|
|
-> uscript.h
|
|
-> com.ibm.icu.lang.UScript
|
|
find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
|
|
replace public static final int \1 = \2;\3
|
|
-> SyntheticPropertyValueAliases.txt
|
|
-> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
|
|
and in com.ibm.icu.dev.test.lang.TestUScript.java
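
The Eclipse find/replace patterns in this section can also be applied outside the IDE.
The following is a minimal Python sketch, not part of the ICU tools. The function names
are hypothetical, and the enum values in the usage note are illustrative placeholders,
not values checked against uchar.h or uscript.h.

```python
import re

# Find/replace patterns copied from the notes above. Eclipse's \1, \2, \3
# back-references are written the same way in Python replacement strings.
UBLOCK_ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
UBLOCK_ID_REPLACE = r"public static final int \1_ID = \2; \3"

UBLOCK_OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
UBLOCK_OBJ_REPLACE = (
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'
)

USCRIPT_FIND = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
USCRIPT_REPLACE = r"public static final int \1 = \2;\3"


def to_block_id(line):
    """Turns a uchar.h-style UBLOCK_ enum line into a Java ..._ID constant."""
    return UBLOCK_ID_FIND.sub(UBLOCK_ID_REPLACE, line)


def to_block_object(line):
    """Turns a UBLOCK_ enum line into a Java UnicodeBlock object declaration."""
    return UBLOCK_OBJ_FIND.sub(UBLOCK_OBJ_REPLACE, line)


def to_script_constant(line):
    """Turns a uscript.h-style USCRIPT_ enum line into a Java int constant."""
    return USCRIPT_FIND.sub(USCRIPT_REPLACE, line)
```

For example, to_block_id("UBLOCK_CHAKMA = 212, /*[11100]*/") yields a Java line
declaring CHAKMA_ID. Lines that do not match a pattern are returned unchanged.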
|
|
|
|
* UnicodeData.txt changes
|
|
- the last Unihan code point changes from U+9FCB to U+9FCC
|
|
search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
|
|
+ do change gennames.c
|
|
+ do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
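
The case-insensitive search for 9FC[BC] can be scripted. This is a small Python sketch;
the function name and the sample lines are hypothetical and do not come from gennames.c
or ucol.cpp.

```python
import re

# Same regex as in the note above: old range end 9FCB or old limit 9FCC.
UNIHAN_END_OR_LIMIT = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_unihan_constants(lines):
    """Returns (line number, text) pairs that mention the old end or limit."""
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if UNIHAN_END_OR_LIMIT.search(text)
    ]


# Hypothetical source lines, for illustration only.
SAMPLE = [
    "#define HYPOTHETICAL_CJK_END 0x9fcb",
    "#define HYPOTHETICAL_CJK_LIMIT 0x9FCC",
    "#define HYPOTHETICAL_OTHER 0x9FC3",
]
```

Calling find_unihan_constants(SAMPLE) reports the first two lines and skips the third.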
|
|
|
|
* DerivedBidiClass.txt changes
|
|
- 2 new default-AL blocks:
|
|
# Arabic Extended-A: U+08A0 - U+08FF (was default-R)
|
|
# Arabic Mathematical Alphabetic Symbols:
|
|
# U+1EE00 - U+1EEFF (was default-R)
|
|
- 2 new default-R blocks:
|
|
# Meroitic Hieroglyphs:
|
|
# U+10980 - U+1099F
|
|
# Meroitic Cursive: U+109A0 - U+109FF
|
|
-> should be picked up by the explicit data in the file
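
As an illustration of what these default ranges mean, here is a minimal Python sketch
of a range lookup. It models only the four blocks listed above. The "L" fallback and the
function name are assumptions made for this sketch and do not reflect the full set of
defaults in DerivedBidiClass.txt.

```python
# Default Bidi_Class ranges quoted from the DerivedBidiClass.txt notes above.
DEFAULT_BIDI_RANGES = [
    (0x08A0, 0x08FF, "AL"),    # Arabic Extended-A
    (0x1EE00, 0x1EEFF, "AL"),  # Arabic Mathematical Alphabetic Symbols
    (0x10980, 0x1099F, "R"),   # Meroitic Hieroglyphs
    (0x109A0, 0x109FF, "R"),   # Meroitic Cursive
]


def default_bidi_class(code_point):
    """Returns the default class for an unassigned code point in these blocks."""
    for start, end, bidi_class in DEFAULT_BIDI_RANGES:
        if start <= code_point <= end:
            return bidi_class
    # Simplification for this sketch: everything else is treated as "L".
    return "L"
```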
|
|
|
|
* NameAliases.txt changes
|
|
- from
|
|
# Each line has two fields
|
|
# First field: Code point
|
|
# Second field: Alias
|
|
- to
|
|
# Each line has three fields, as described here:
|
|
#
|
|
# First field: Code point
|
|
# Second field: Alias
|
|
# Third field: Type
|
|
- Also, the file format previously allowed multiple aliases per code point,
|
|
but only now does the file actually provide them, even several of the same type. For example,
|
|
FEFF;BYTE ORDER MARK;alternate
|
|
FEFF;BOM;abbreviation
|
|
FEFF;ZWNBSP;abbreviation
|
|
- This breaks our gennames parser, unames.icu data structure, and API.
|
|
Fix gennames to only pick up "correction" aliases.
|
|
New ticket #8963 for further changes.
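
The new three-field format and the "correction"-only filter can be sketched as follows.
This is a Python illustration, not the gennames parser (which is C code). The function
names are hypothetical, and the sample adds one "correction" line for U+01A2 that is not
quoted in these notes.

```python
from collections import defaultdict


def parse_name_aliases(lines):
    """Returns {code point: [(alias, type), ...]} for three-field lines."""
    aliases = defaultdict(list)
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, alias_type = [field.strip() for field in line.split(";")]
        aliases[int(code, 16)].append((alias, alias_type))
    return dict(aliases)


def corrections_only(aliases):
    """Keeps only aliases of type "correction", as in the interim gennames fix."""
    result = {}
    for code_point, entries in aliases.items():
        kept = [alias for alias, alias_type in entries if alias_type == "correction"]
        if kept:
            result[code_point] = kept
    return result


SAMPLE = [
    "# Each line has three fields, as described here:",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]
```

With this sample, U+FEFF has three aliases after parsing, and corrections_only()
drops all of them because none is of type "correction".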
|
|
|
|
* run genpname/preparse.pl (on Linux)
|
|
+ cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
|
|
+ make sure that data.h is writable
|
|
+ perl preparse.pl ~/svn.icu/trunk/src > out.txt
|
|
+ preparse.pl shows no errors, out.txt Info and Warning lines look ok
|
|
|
|
* build ICU (make install)
|
|
so that the tools build can pick up the new definitions from the installed header files.
|
|
* build Unicode tools (at least genpname) using CMake+make
|
|
|
|
* run genpname
|
|
(builds both pnames.icu and propname_data.h)
|
|
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
|
|
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
|
|
|
|
* build ICU (make install)
|
|
* build Unicode tools using CMake+make
|
|
|
|
* update source/data/unidata/norm2/nfkc_cf.txt
|
|
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
|
|
|
|
* update source/data/unidata/norm2/uts46.txt
|
|
- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
|
|
to ~/svn.icu/tools/trunk/src/unicode/py
|
|
- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
|
|
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
|
|
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
|
|
|
|
* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
|
|
sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
|
|
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
|
|
- Unicode 6.0..6.1: U+2260, U+226E, U+226F
|
|
- nothing new in 6.1, no test file to update
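
The grep step can be approximated with a short Python sketch that keeps only
"disallowed_STD3_valid" entries outside ASCII. The function name is hypothetical, and
the sample lines only assume the general "code point(s) ; status ; mapping # comment"
layout of IdnaMappingTable.txt.

```python
def non_ascii_std3_valid(lines):
    """Returns (start, end) ranges with status disallowed_STD3_valid above U+007F."""
    hits = []
    for line in lines:
        data = line.split("#", 1)[0]  # strip trailing comment
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if start > 0x7F:
            hits.append((start, end))
    return hits


# Sample lines in the assumed layout, for illustration only.
SAMPLE = [
    "0000..002C    ; disallowed_STD3_valid                  # 1.1  <control-0000>..COMMA",
    "00A0          ; disallowed_STD3_mapped ; 0020          # 1.1  NO-BREAK SPACE",
    "2260          ; disallowed_STD3_valid                  # 1.1  NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid                  # 1.1  NOT LESS-THAN..NOT GREATER-THAN",
]
```

On this sample the function reports U+2260 and the range U+226E..U+226F, matching the
characters listed above.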
|
|
|
|
* generate core properties data files
|
|
- in initial bootstrapping, change the UCA version
|
|
in source/data/unidata/FractionalUCA.txt to match the new Unicode version
|
|
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
|
|
- rebuild ICU & tools
|
|
+ if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
|
|
check if the UCA version in FractionalUCA.txt matches the new Unicode version
|
|
(see step above)
|
|
- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
|
|
~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
|
|
- rebuild ICU & tools
|
|
|
|
* update Java data files
|
|
- refresh just the UCD-related files, just to be safe
|
|
- see (ICU4C)/source/data/icu4j-readme.txt
|
|
- mkdir /tmp/icu4j
|
|
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
output:
|
|
...
|
|
Unicode .icu files built to ./out/build/icudt49l
|
|
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
|
|
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
|
|
echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
|
|
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
|
|
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
|
|
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
|
|
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
|
|
make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
|
|
- copy the big-endian Unicode data files to another location,
|
|
separate from the other data files
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
|
|
- refresh ICU4J
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
|
|
|
|
* refresh Java test .txt files
|
|
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
|
|
* test ICU so far, fix test code where necessary
|
|
- temporarily ignore collation issues that look like UCA/UCD mismatches,
|
|
until UCA data is updated
|
|
|
|
* UCA
|
|
|
|
- get output from Mark's tools; look in
|
|
http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
|
|
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
|
|
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
|
|
(note: remove the underscore before "Rules")
|
|
- update (ICU)/source/test/testdata/CollationTest_*.txt
|
|
and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
|
|
with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
|
|
- check test file diffs for previously commented-out, known-failing data lines;
|
|
probably need to keep those commented out
|
|
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
|
|
- run makeuca.sh:
|
|
~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
|
|
- rebuild ICU4C
|
|
- refresh ICU4J collation data:
|
|
(subset of instructions above for properties data refresh, except copies all coll/*)
|
|
~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
|
|
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
|
|
- note on intltest: if collate/UCAConformanceTest fails, then
|
|
utility/MultithreadTest/TestCollators will fail as well;
|
|
fix the conformance test before looking into the multi-thread test
|
|
|
|
* When refreshing all of ICU4J data from ICU4C
|
|
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
|
|
or
|
|
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
|
|
|
|
*** LayoutEngine script information
|
|
|
|
(For details see the Unicode 5.2 change log below.)
|
|
|
|
* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
|
|
This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
|
|
in the working directory.
|
|
(It also generates ScriptRunData.cpp, which is no longer needed.)
|
|
|
|
The generated files have a current copyright date and "@draft" statement.
|
|
|
|
- diff current <icu>/source/layout files vs. generated ones
|
|
~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
|
|
review and manually merge desired changes;
|
|
fix gratuitous changes, incorrect @draft and missing aliases;
|
|
Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
|
|
- if you just copy the above files, then
|
|
fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
|
|
manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
|
|
|
|
*** merge the Unicode update branches back onto the trunk
|
|
- do not merge the icudata.jar and testdata.jar,
|
|
instead rebuild them from merged & tested ICU4C
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
ICU 4.8 (no Unicode update, just new script codes)
|
|
|
|
* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
|
|
(added 2010-12-21)
|
|
Afak 439 Afaka
|
|
Jurc 510 Jurchen
|
|
Mroo 199 Mro, Mru
|
|
Nshu 499 Nüshu
|
|
Shrd 319 Sharada, Śāradā
|
|
Sora 398 Sora Sompeng
|
|
Takr 321 Takri, Ṭākrī, Ṭāṅkrī
|
|
Tang 520 Tangut
|
|
Wole 480 Woleai
|
|
-> uscript.h
|
|
-> com.ibm.icu.lang.UScript
|
|
find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
|
|
replace public static final int \1 = \2;\3
|
|
-> genpname/SyntheticPropertyValueAliases.txt
|
|
-> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
|
|
and in com.ibm.icu.dev.test.lang.TestUScript.java
|
|
|
|
* run genpname/preparse.pl (on Linux)
|
|
+ cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
|
|
+ make sure that data.h is writable
|
|
+ perl preparse.pl ~/svn.icu/trunk/src > out.txt
|
|
+ preparse.pl shows no errors, out.txt Info and Warning lines look ok
|
|
|
|
* rebuild Unicode tools (at least genpname) using make
|
|
- You might first need to "make install" ICU so that the tools build can pick
|
|
up the new definitions from the installed header files.
|
|
|
|
* run genpname
|
|
(builds both pnames.icu and propname_data.h)
|
|
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
|
|
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
|
|
- rebuild ICU & tools
|
|
|
|
* run genprops
|
|
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
|
|
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
|
|
- rebuild ICU & tools
|
|
|
|
* update Java data files
|
|
- refresh just the UCD-related files, just to be safe
|
|
- see (ICU4C)/source/data/icu4j-readme.txt
|
|
- mkdir /tmp/icu4j
|
|
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
- copy the big-endian Unicode data files to another location,
|
|
separate from the other data files
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
|
|
~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
|
|
~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
|
|
- refresh ICU4J
|
|
~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
|
|
|
|
* should have updated the layout engine script codes but forgot
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 6.0 update
|
|
|
|
*** related ICU Trac tickets
|
|
|
|
7264 Unicode 6.0 Update
|
|
|
|
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
(configure.in & configure have been modified to extract the version from uchar.h)
|
|
- com.ibm.icu.util.VersionInfo
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
|
|
~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
|
|
- This now prepares both unidata and testdata files in respective output subfolders.
|
|
|
|
* PropertyAliases.txt changes
|
|
- new Script_Extensions property defined in the new ScriptExtensions.txt file
|
|
but not listed in PropertyAliases.txt; reported to unicode.org;
|
|
-> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
|
|
scx; Script_Extensions
|
|
-> uchar.h with new UProperty section
|
|
-> com.ibm.icu.lang.UProperty, parallel with uchar.h
|
|
|
|
* PropertyValueAliases.txt changes
|
|
- 12 new block names:
|
|
Alchemical_Symbols
|
|
Bamum_Supplement
|
|
Batak
|
|
Brahmi
|
|
CJK_Unified_Ideographs_Extension_D
|
|
Emoticons
|
|
Ethiopic_Extended_A
|
|
Kana_Supplement
|
|
Mandaic
|
|
Miscellaneous_Symbols_And_Pictographs
|
|
Playing_Cards
|
|
Transport_And_Map_Symbols
|
|
-> add to uchar.h
|
|
-> add to UCharacter.UnicodeBlock
|
|
Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
|
|
replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
|
|
- Joining_Group (jg) values:
|
|
Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
|
|
-> uchar.h & UCharacter.JoiningGroup
|
|
- 3 new scripts:
|
|
sc ; Batk ; Batak
|
|
sc ; Brah ; Brahmi
|
|
sc ; Mand ; Mandaic
|
|
-> remove these from SyntheticPropertyValueAliases.txt
|
|
-> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
|
|
-> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
|
|
and in com.ibm.icu.dev.test.lang.TestUScript.java
|
|
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
|
|
(added 2009-11-11..2010-07-18)
|
|
Bass 259 Bassa Vah
|
|
Dupl 755 Duployan shorthand
|
|
Elba 226 Elbasan
|
|
Gran 343 Grantha
|
|
Kpel 436 Kpelle
|
|
Loma 437 Loma
|
|
Mend 438 Mende
|
|
Merc 101 Meroitic Cursive
|
|
Narb 106 Old North Arabian
|
|
Nbat 159 Nabataean
|
|
Palm 126 Palmyrene
|
|
Sind 318 Sindhi
|
|
Wara 262 Warang Citi
|
|
-> uscript.h
|
|
-> com.ibm.icu.lang.UScript
|
|
find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
|
|
replace public static final int \1 = \2;\3
|
|
-> SyntheticPropertyValueAliases.txt
|
|
-> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
|
|
and in com.ibm.icu.dev.test.lang.TestUScript.java
|
|
- ISO 15924 name change
|
|
Mero 100 Meroitic Hieroglyphs (was Meroitic)
|
|
-> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
|
|
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
|
|
|
|
* UnicodeData.txt changes
|
|
- new CJK block:
|
|
2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
|
|
2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
|
|
-> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
|
|
|
|
* build Unicode tools using CMake+make
|
|
|
|
* run genpname/preparse.pl (on Linux)
|
|
+ cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
|
|
+ make sure that data.h is writable
|
|
+ perl preparse.pl ~/svn.icu/trunk/src > out.txt
|
|
+ preparse.pl shows no errors, out.txt Info and Warning lines look ok
|
|
|
|
* rebuild Unicode tools (at least genpname) using make
|
|
- You might first need to "make install" ICU so that the tools build can pick
|
|
up the new definitions from the installed header files.
|
|
|
|
* run genpname
|
|
- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
|
|
- rebuild ICU & tools
|
|
|
|
* update source/data/unidata/norm2/nfkc_cf.txt
|
|
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
|
|
|
|
* update source/data/unidata/norm2/uts46.txt
|
|
- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
|
|
to ~/svn.icu/tools/trunk/src/unicode/py
|
|
- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
|
|
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
|
|
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
|
|
|
|
* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
|
|
sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
|
|
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
|
|
- Unicode 6.0: U+2260, U+226E, U+226F
|
|
|
|
* generate core properties data files
|
|
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
|
|
- rebuild ICU & tools
|
|
- run makeuca.sh so that genuca picks up the new nfc.nrm:
|
|
~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
|
|
- rebuild ICU & tools
|
|
|
|
* implement new Script_Extensions property (provisional)
|
|
- parser & generator: genprops & uprops.icu
|
|
- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
|
|
- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
|
|
|
|
* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
|
|
- (one-time change)
|
|
- genbidi/gencase/genprops tools changes
|
|
- re-run makeprops.sh (see above)
|
|
- UCharacterProperty.java, UCharacterTypeIterator.java,
|
|
UBiDiProps.java, UCaseProps.java, and several others with minor changes;
|
|
UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
|
|
|
|
* update Java data files
|
|
- refresh just the UCD-related files, just to be safe
|
|
- see (ICU4C)/source/data/icu4j-readme.txt
|
|
- mkdir /tmp/icu4j
|
|
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
output:
|
|
...
|
|
Unicode .icu files built to ./out/build/icudt45l
|
|
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
|
|
echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
|
|
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
|
|
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
|
|
mkdir -p /tmp/icu4j/main/shared/data
|
|
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
|
|
- copy the big-endian Unicode data files to another location,
|
|
separate from the other data files
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
|
|
- refresh ICU4J
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
|
|
|
|
* refresh Java test .txt files
|
|
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
|
|
|
|
* un-hardcode normalization skippable (NF*_Inert) test data
|
|
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
|
|
|
|
* copy updated break iterator test files
|
|
- now handled by early ucdcopy.py and
|
|
copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
|
|
(old instructions:
|
|
copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
|
|
to ~/svn.icu/trunk/src/source/test/testdata)
|
|
- they are not used in ICU4J
|
|
|
|
* UCA
|
|
|
|
- get output from Mark's tools; look in
|
|
http://www.unicode.org/~book/incoming/mark/uca6.0.0/
|
|
http://www.macchiato.com/unicode/utc/additional-uca-files
|
|
http://www.unicode.org/Public/UCA/6.0.0/
|
|
http://www.unicode.org/~mdavis/uca/
|
|
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
|
|
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
|
|
- update Han-implicit ranges for new CJK extensions:
|
|
swapCJK() in ucol.cpp & ImplicitCEGenerator.java
|
|
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
|
|
do not add it into invuca so that tailoring primary-after an ignorable works
|
|
- genuca: permit space between [variable top] bytes
|
|
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
|
|
- run makeuca.sh:
|
|
~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
|
|
- rebuild ICU4C
|
|
- refresh ICU4J collation data:
|
|
(subset of instructions above for properties data refresh, except copies all coll/*)
|
|
~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
|
|
~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
|
|
- update (ICU)/source/test/testdata/CollationTest_*.txt
|
|
and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
|
|
with output from Mark's Unicode tools
|
|
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
|
|
- note on intltest: if collate/UCAConformanceTest fails, then
|
|
utility/MultithreadTest/TestCollators will fail as well;
|
|
fix the conformance test before looking into the multi-thread test
|
|
|
|
* When refreshing all of ICU4J data from ICU4C
|
|
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
|
|
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
|
|
or
|
|
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
|
|
|
|
*** LayoutEngine script information
|
|
|
|
(For details see the Unicode 5.2 change log below.)
|
|
|
|
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
|
|
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
|
|
ScriptRunData.cpp, which is no longer needed.)
|
|
|
|
The generated files have a current copyright date and "@draft" statement.
|
|
|
|
* copy the above files into <icu>/source/layout, replacing the old files.
|
|
* fix mixed line endings
|
|
* review the diffs and fix incorrect @draft and missing aliases;
|
|
Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
|
|
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 5.2 update
|
|
|
|
*** related ICU Trac tickets
|
|
|
|
7084 Unicode 5.2
|
|
|
|
7167 verify collation bytes
|
|
7235 Java test NAME_ALIAS
|
|
7236 Java DerivedCoreProperties.txt test
|
|
7237 Java BidiTest.txt
|
|
7238 UTrie2 in core unidata
|
|
7239 test for tailoring gaps
|
|
7240 Java fix CollationMiscTest
|
|
7243 update layout engine for Unicode 5.2
|
|
|
|
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
- configure.in & configure
|
|
- update ucdVersion in gennames.c if an algorithmic range changes
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
|
|
python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
|
|
- includes finding files regardless of version numbers,
|
|
copying them, and performing the equivalent processing of the
|
|
ucdstrip and ucdmerge tools on the desired set of files
|
|
|
|
* notes on changes
|
|
- PropertyAliases.txt
|
|
moved from numeric to enumerated:
|
|
ccc ; Canonical_Combining_Class
|
|
new string properties:
|
|
NFKC_CF ; NFKC_Casefold
|
|
Name_Alias; Name_Alias
|
|
new binary properties:
|
|
Cased ; Cased
|
|
CI ; Case_Ignorable
|
|
CWCF ; Changes_When_Casefolded
|
|
CWCM ; Changes_When_Casemapped
|
|
CWKCF ; Changes_When_NFKC_Casefolded
|
|
CWL ; Changes_When_Lowercased
|
|
CWT ; Changes_When_Titlecased
|
|
CWU ; Changes_When_Uppercased
|
|
new CJK Unihan properties (not supported by ICU)
|
|
- PropertyValueAliases.txt
|
|
new block names
|
|
new scripts
|
|
one script code change:
|
|
sc ; Qaai ; Inherited
|
|
->
|
|
sc ; Zinh ; Inherited ; Qaai
|
|
new Line_Break (lb) value:
|
|
lb ; CP ; Close_Parenthesis
|
|
new Joining_Group (jg) values: Farsi_Yeh, Nya
|
|
other new values:
|
|
ccc; 214; ATA ; Attached_Above
|
|
- DerivedBidiClass.txt
|
|
new default-R range: U+1E800 - U+1EFFF
|
|
- UnicodeData.txt
|
|
all of the ISO comments are gone
|
|
new CJK block end:
|
|
9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
|
|
new CJK block:
|
|
2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
|
|
2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
|
|
|
|
* genpname
|
|
- run preparse.pl
|
|
+ cd \svn\icuproj\icu\trunk\source\tools\genpname
|
|
+ make sure that data.h is writable
|
|
+ perl preparse.pl \svn\icuproj\icu\trunk > out.txt
|
|
+ preparse.pl complains with errors like the following:
|
|
Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
|
|
This is because ICU 4.0 had scripts from ISO 15924 which are now
|
|
added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
|
|
and PropertyValueAliases.txt.
|
|
-> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
|
|
Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
|
|
+ preparse.pl complains with errors about block names missing from uchar.h; add them
|
|
|
|
* uchar.h & uscript.h & uprops.h & uprops.c & genprops
|
|
- new block & script values
|
|
+ 26 new blocks
|
|
copy new blocks from Blocks.txt
|
|
MS VC++ 2008 regular expression:
|
|
find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
|
|
replace with " UBLOCK_\3 = 172, /*[\1]*/"
|
|
+ several new script values already added in ICU 4.0 for ISO 15924 coverage
|
|
(removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
|
|
+ 3 new script values added for ISO 15924 and Unicode 5.2 coverage
|
|
+ 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
|
|
(added to SyntheticPropertyValueAliases.txt)
|
|
- new Joining_Group (jg) values: Farsi_Yeh, Nya
|
|
- new Line_Break (lb) value:
|
|
lb ; CP ; Close_Parenthesis
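
The MS VC++ find/replace for new blocks can be approximated in Python. This sketch goes
slightly beyond the regex in the notes: it also uppercases the block name, replaces
spaces and hyphens with underscores, and counts up from a starting value. Those
conveniences, the function name, and the sample values are assumptions for illustration.

```python
import re

# Python equivalent of the VC++ pattern "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$".
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def ublock_lines(blocks_txt_lines, first_value):
    """Turns Blocks.txt lines into uchar.h-style UBLOCK_ enum lines."""
    out = []
    value = first_value
    for line in blocks_txt_lines:
        match = BLOCK_LINE.match(line)
        if not match:
            continue  # comments, blank lines, anything else
        # "Hangul Jamo Extended-A" -> "HANGUL_JAMO_EXTENDED_A"
        name = re.sub(r"[^A-Za-z0-9]+", "_", match.group(3)).upper()
        out.append("    UBLOCK_%s = %d, /*[%s]*/" % (name, value, match.group(1)))
        value += 1
    return out


SAMPLE = [
    "# Blocks.txt excerpt (illustrative)",
    "0800..083F; Samaritan",
    "A960..A97F; Hangul Jamo Extended-A",
]
```

The generated lines still need manual review, for example for the actual enum values
and for updating the block count constant.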
|
|
|
|
* hardcoded Unihan range end/limit
|
|
- Unihan range end moves from 9FC3 to 9FCB
|
|
search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
|
|
+ do change gennames.c
|
|
|
|
* Compare definitions of new binary properties with what we used to use
|
|
in algorithms, to see if the definitions changed.
|
|
- Verified that definitions for Cased and Case_Ignorable are unchanged.
|
|
The gencase tool now parses the newly public Case_Ignorable values
|
|
in case the definition changes in the future.
|
|
|
|
* uchar.c & uprops.h & uprops.c & genprops
|
|
- new numeric values that didn't exist in Unicode data before:
|
|
1/7, 1/9, 1/10, 3/10, 1/16, 3/16
|
|
the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
|
|
therefore redesign the encoding of numeric types and values for formatVersion 6;
|
|
design for simple numbers up to at least 144 ("one gross"),
|
|
large values up to at least 10^20,
|
|
and fractions with numerators -1..17 and denominators 1..16
|
|
to cover current and expected future values
|
|
(e.g., more Han numeric values, Meroitic twelfths)
|
|
|
|
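The fraction range named above (numerators -1..17, denominators 1..16) fits into a small integer code. The packing below is a hypothetical sketch, not the actual uprops.icu formatVersion 6 bit layout; the names pack_fraction and unpack_fraction are made up for illustration.

```python
NUM_MIN, NUM_MAX = -1, 17   # numerator range from the design note
DEN_MIN, DEN_MAX = 1, 16    # denominator range from the design note


def pack_fraction(numerator, denominator):
    """Pack a fraction into one integer: numerator offset above a 4-bit denominator."""
    if not (NUM_MIN <= numerator <= NUM_MAX):
        raise ValueError("numerator out of range")
    if not (DEN_MIN <= denominator <= DEN_MAX):
        raise ValueError("denominator out of range")
    return ((numerator - NUM_MIN) << 4) | (denominator - DEN_MIN)


def unpack_fraction(code):
    """Inverse of pack_fraction."""
    return (code >> 4) + NUM_MIN, (code & 0xF) + DEN_MIN


# The new Unicode 5.2 values listed above all round-trip.
new_values = [(1, 7), (1, 9), (1, 10), (3, 10), (1, 16), (3, 16)]
for num, den in new_values:
    assert unpack_fraction(pack_fraction(num, den)) == (num, den)
```

Storing denominator-1 in four bits is what makes 1/10 and 1/16 representable; the numerator offset leaves room for -1.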
* reimplement Hangul_Syllable_Type for new Jamo characters
|
|
- the old code assumed that all Jamo characters are in the 11xx block
|
|
- Unicode 5.2 fills holes there and adds new Jamo characters in
|
|
A960..A97F; Hangul Jamo Extended-A
|
|
and in
|
|
D7B0..D7FF; Hangul Jamo Extended-B
|
|
- Hangul_Syllable_Type can be trivially derived from a subset of
|
|
Grapheme_Cluster_Break values
|
|
|
|
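The derivation mentioned above can be sketched as a lookup on property value short names. This is a sketch of the idea, not the ICU implementation; the function name is made up, and it assumes the Grapheme_Cluster_Break value is already available as a short-name string.

```python
# Grapheme_Cluster_Break values that have a Hangul_Syllable_Type counterpart.
GCB_TO_HST = {
    "L": "L",      # Leading_Jamo
    "V": "V",      # Vowel_Jamo
    "T": "T",      # Trailing_Jamo
    "LV": "LV",    # LV_Syllable
    "LVT": "LVT",  # LVT_Syllable
}


def hangul_syllable_type(gcb_short_name):
    """Map a GCB short name to an HST short name; everything else is NA."""
    return GCB_TO_HST.get(gcb_short_name, "NA")


print(hangul_syllable_type("LV"))  # a Hangul LV syllable
print(hangul_syllable_type("EX"))  # Extend has no Hangul syllable type
```

With this approach no block ranges need to be hardcoded, so Jamo added outside the 11xx block are covered as soon as the Grapheme_Cluster_Break data is updated.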
* build Unicode data source code for hardcoding core data
|
|
C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data

ICU data make path is \svn\icuproj\icu\trunk\source\data\
ICU root path is \svn\icuproj\icu\trunk
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
Creating data file for Unicode Property Names
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
and rebuild the common library
|
|
|
|
*** UCA
|
|
|
|
- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
|
|
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
|
|
- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
|
|
[ Begin obsolete instructions:
|
|
Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files, not the *_STUB.txt files.
|
|
- generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
|
|
on Windows:
|
|
python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
|
|
python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
|
|
End obsolete instructions]
|
|
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
|
|
not just the *_STUB.txt files
|
|
- note on intltest: if collate/UCAConformanceTest fails, then
|
|
utility/MultithreadTest/TestCollators will fail as well;
|
|
fix the conformance test before looking into the multi-thread test
|
|
|
|
*** Implement Cased & Case_Ignorable properties
|
|
- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
|
|
- Problem: These properties should be disjoint, but aren't
|
|
- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
|
|
- change ucase.icu to be able to store any combination of Cased and Case_Ignorable
|
|
|
|
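The UTC decision above can be illustrated with a backward context scan. This is a sketch under assumptions: it assumes the decision applies to scans that look for a neighboring cased letter, the function name is made up, and the two character sets are toy stand-ins rather than real property data (U+02B0 merely plays the role of a character that has both properties).

```python
def preceded_by_cased(text, index, is_cased, is_case_ignorable):
    """True if a cased character precedes text[index], skipping case-ignorable ones."""
    for ch in reversed(text[:index]):
        if is_case_ignorable(ch):
            continue  # skipped even if ch is also Cased
        return is_cased(ch)
    return False


# Toy data, for illustration only.
TOY_CASED = set("abcdefghijklmnopqrstuvwxyz") | {"\u02B0"}
TOY_CASE_IGNORABLE = {"'", ".", "\u02B0"}


def toy_is_cased(ch):
    return ch in TOY_CASED


def toy_is_case_ignorable(ch):
    return ch in TOY_CASE_IGNORABLE


# The scan skips U+02B0 and finds the cased "a".
print(preceded_by_cased("a\u02B0x", 2, toy_is_cased, toy_is_case_ignorable))
# The scan skips U+02B0 and then stops at the space, which is not cased.
print(preceded_by_cased(" \u02B0x", 2, toy_is_cased, toy_is_case_ignorable))
```

The second call shows the effect of the decision: had the overlapping character been treated as Cased rather than skipped, the result would differ.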
*** Implement Changes_When_Xyz properties
|
|
- without stored data
|
|
|
|
*** Implement Name_Alias property
|
|
- add it as another name field in unames.icu
|
|
- make it available via u_charName() and UCharNameChoice
|
|
- consider it in u_charFromName()
|
|
|
|
*** Break iterators
|
|
|
|
* Update break iterator rules to new UAX versions and new property values
|
|
* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
|
|
|
|
*** new BidiTest file
|
|
- review format and data
|
|
- copy BidiTest.txt to source/test/testdata
|
|
- write test code using this data
|
|
- fix ICU code where it fails the conformance test
|
|
|
|
*** Java
|
|
- generally, find and update code corresponding to C/C++
|
|
- UCharacter.UnicodeBlock constants:
|
|
a) add an _ID integer per new block, update COUNT
|
|
b) add a class instance per new block
|
|
Visual Studio regex:
|
|
find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
|
|
replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
|
|
- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
|
|
|
|
- port test changes to Java
|
|
|
|
*** LayoutEngine script information
|
|
|
|
(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
|
|
|
|
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
|
|
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
|
|
ScriptRunData.cpp, which is no longer needed.)
|
|
|
|
The generated files have a current copyright date and "@draft" statement.
|
|
|
|
-> Eric Mader wrote in email on 20090930:
|
|
"I think the tool has been modified to update @draft to @stable for
|
|
older scripts and to add @draft for new scripts.
|
|
(I worked with an intern on this last year.)
|
|
You should check the output after you run it."
|
|
|
|
* copy the above files into <icu>/source/layout, replacing the old files.
|
|
* fix mixed line endings
|
|
* review the diffs and fix incorrect @draft and missing aliases
|
|
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
|
|
|
|
Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
|
|
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
|
|
|
|
-> Eric Mader wrote in email on 20090930:
|
|
"This is just a matter of making sure that all the per-script tables have
|
|
entries for any new scripts that were added.
|
|
If any new Indic characters were added, then the class tables in
|
|
IndicClassTables.cpp should be updated to reflect this.
|
|
John Emmons should know how to do this if it's required."
|
|
|
|
* rebuild the layout and layoutex libraries.
|
|
|
|
*** Documentation
|
|
- Update User Guide
|
|
+ Jamo_Short_Name, sfc->scf, binary property value aliases
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 5.1 update
|
|
|
|
*** related ICU Trac tickets
|
|
|
|
5696 Update to Unicode 5.1
|
|
|
|
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
- configure.in & configure
|
|
- update ucdVersion in gennames.c if an algorithmic range changes
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
- ucdstrip:
|
|
DerivedCoreProperties.txt
|
|
DerivedNormalizationProps.txt
|
|
NormalizationTest.txt
|
|
PropList.txt
|
|
Scripts.txt
|
|
GraphemeBreakProperty.txt
|
|
SentenceBreakProperty.txt
|
|
WordBreakProperty.txt
|
|
- ucdstrip and ucdmerge:
|
|
EastAsianWidth.txt
|
|
LineBreak.txt
|
|
|
|
* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.1.0\ucd\Blocks.txt ..\unidata\
copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
|
|
|
|
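Because ucd2unidata.bat has to be edited for every UCD version, its lines could be generated from the version number. The following is a sketch only: the function name ucd2unidata_lines is made up, and the file lists are copied from the batch file above.

```python
# File lists taken from the batch file above; paths are relative to <version>\ucd.
COPY_FILES = [
    r"BidiMirroring.txt", r"Blocks.txt", r"CaseFolding.txt", r"DerivedAge.txt",
    r"extracted\DerivedBidiClass.txt", r"extracted\DerivedJoiningGroup.txt",
    r"extracted\DerivedJoiningType.txt", r"extracted\DerivedNumericValues.txt",
    r"NormalizationCorrections.txt", r"PropertyAliases.txt",
    r"PropertyValueAliases.txt", r"SpecialCasing.txt", r"UnicodeData.txt",
]
STRIP_FILES = [
    r"DerivedCoreProperties.txt", r"DerivedNormalizationProps.txt",
    r"NormalizationTest.txt", r"PropList.txt", r"Scripts.txt",
    r"auxiliary\GraphemeBreakProperty.txt", r"auxiliary\SentenceBreakProperty.txt",
    r"auxiliary\WordBreakProperty.txt",
]
STRIP_MERGE_FILES = [r"EastAsianWidth.txt", r"LineBreak.txt"]


def ucd2unidata_lines(version):
    """Return the batch file lines for one UCD version, e.g. "5.1.0"."""
    lines = []
    for rel in COPY_FILES:
        lines.append(f"copy {version}\\ucd\\{rel} ..\\unidata\\")
    lines.append("")
    for rel in STRIP_FILES:
        name = rel.rsplit("\\", 1)[-1]  # drop the auxiliary\ subdirectory
        lines.append(f"ucdstrip < {version}\\ucd\\{rel} > ..\\unidata\\{name}")
    for rel in STRIP_MERGE_FILES:
        lines.append(f"ucdstrip < {version}\\ucd\\{rel} | ucdmerge > ..\\unidata\\{rel}")
    return lines


print("\n".join(ucd2unidata_lines("5.1.0")))
```

Calling it with "5.0.0" instead reproduces the Unicode 5.0 variant of the same batch file.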
* genpname
|
|
- run preparse.pl
|
|
+ cd \svn\icuproj\icu\uni51\source\tools\genpname
|
|
+ make sure that data.h is writable
|
|
+ perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
|
|
+ preparse.pl complains with errors like the following:
|
|
Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
|
|
This is because ICU 3.8 had scripts from ISO 15924 which are now
|
|
added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
|
|
and PropertyValueAliases.txt.
|
|
-> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
|
|
Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
|
|
+ PropertyValueAliases.txt now explicitly contains values for boolean properties:
|
|
N/Y, No/Yes, F/T, False/True
|
|
-> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
|
|
It will use further values from the file if present.
|
|
|
|
* uchar.h & uscript.h & uprops.h & uprops.c & genprops
|
|
- new block & script values
|
|
+ 17 new blocks
|
|
+ 11 new script values already added in ICU 3.8 for ISO 15924 coverage
|
|
(removed from SyntheticPropertyValueAliases.txt)
|
|
+ 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
|
|
(added to SyntheticPropertyValueAliases.txt)
|
|
- uprops.icu (uprops.h) only provides 7 bits for script codes.
In ICU 4.0 there are now USCRIPT_CODE_LIMIT=130 script codes.
No code above 127 is yet the script code of an assigned Unicode character,
so ICU 4.0 uprops.icu does not store any script code values greater than 127.
However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
in a parallel bit field, and that overflows now.
Also, future values >=128 would be incompatible anyway.
uprops.h is modified to move around several of the bit fields
in the properties vector words, and now uses 8 bits for the script code.
Two other bit fields also grow to accommodate future growth:
Block (current count: 172) grows from 8 to 9 bits,
and Word_Break grows from 4 to 5 bits.
|
|
- renamed property Simple_Case_Folding (sfc->scf)
|
|
+ nothing to be done: handled as normal alias
|
|
- new property JSN Jamo_Short_Name
|
|
+ no new API: only contributes to the Name property
|
|
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
|
|
- new Joining Group (JG) value: Burushashki_Yeh_Barree
|
|
- new Sentence_Break (SB) values:
|
|
SB ; CR ; CR
|
|
SB ; EX ; Extend
|
|
SB ; LF ; LF
|
|
SB ; SC ; SContinue
|
|
- new Word_Break (WB) values:
|
|
WB ; CR ; CR
|
|
WB ; Extend ; Extend
|
|
WB ; LF ; LF
|
|
WB ; MB ; MidNumLet
|
|
|
|
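The script code overflow described in the uprops.icu note above is plain bit-width arithmetic. A minimal sketch, assuming enum values run from 0 to count-1; the helper name bits_needed is made up here.

```python
def bits_needed(max_value):
    """Smallest bit-field width that can hold the values 0..max_value."""
    return max(1, max_value.bit_length())


USCRIPT_CODE_LIMIT = 130  # from the note above
BLOCK_COUNT = 172         # from the note above

print(bits_needed(127))                     # 7: the largest value a 7-bit field holds
print(bits_needed(USCRIPT_CODE_LIMIT - 1))  # 8: the maximum script value 129 overflows 7 bits
print(bits_needed(BLOCK_COUNT - 1))         # 8: blocks still fit, 9 bits leave headroom
```

This is why only the maximum-value field overflowed at first: no stored per-character script code exceeded 127, but the maximum of 129 did.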
* Further changes in the 2008-02-29 update:
|
|
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
|
|
because they should not normally be invisible.
|
|
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
|
|
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
|
|
- new Word_Break (WB) value: NL=Newline
|
|
|
|
* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
|
|
- Unihan range end moves from 9FBB to 9FC3
|
|
search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
|
|
+ do change gennames.c
|
|
|
|
* build Unicode data source code for hardcoding core data
|
|
C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

ICU data make path is \svn\icuproj\icu\uni51\source\data\
ICU root path is \svn\icuproj\icu\uni51
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
and rebuild the common library
|
|
|
|
*** Break iterators
|
|
|
|
* Update break iterator rules to new UAX versions and new property values
|
|
|
|
*** UCA
|
|
|
|
* update FractionalUCA.txt and UCARules.txt with new canonical closure
|
|
|
|
*** Test suites
|
|
- Test that APIs using Unicode property value aliases (like UnicodeSet)
|
|
support all of the boolean values N/Y, No/Yes, F/T, False/True
|
|
-> TestBinaryValues() tests in both cintltst and intltest
|
|
|
|
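The eight boolean aliases that the tests above must cover can be sketched as a lookup table. This is an illustration only, not the UnicodeSet or preparse.pl code; the function name is made up, and case-insensitive matching is an assumption of the sketch.

```python
# The aliases listed for boolean properties: N/Y, No/Yes, F/T, False/True.
BINARY_VALUE_ALIASES = {
    "n": False, "no": False, "f": False, "false": False,
    "y": True, "yes": True, "t": True, "true": True,
}


def parse_binary_value(alias):
    """Map a boolean property value alias to a Python bool."""
    key = alias.strip().lower()
    if key not in BINARY_VALUE_ALIASES:
        raise ValueError(f"not a binary property value alias: {alias!r}")
    return BINARY_VALUE_ALIASES[key]


for alias in ["N", "Y", "No", "Yes", "F", "T", "False", "True"]:
    print(alias, parse_binary_value(alias))
```

A test in this spirit would loop over all eight aliases for each binary property and check that the True and False forms select complementary sets.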
*** LayoutEngine script information
|
|
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)
|
|
|
|
The generated files have a current copyright date and "@draft" statement.
|
|
|
|
* copy the above files into <icu>/source/layout, replacing the old files.
|
|
|
|
Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
|
|
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
|
|
|
|
* rebuild the layout and layoutex libraries.
|
|
|
|
*** Documentation
|
|
- Update User Guide
|
|
+ Jamo_Short_Name, sfc->scf, binary property value aliases
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 5.0 update
|
|
|
|
*** related Jitterbugs
|
|
|
|
5084 RFE: Update to Unicode 5.0
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
- ucdstrip:
|
|
DerivedCoreProperties.txt
|
|
DerivedNormalizationProps.txt
|
|
NormalizationTest.txt
|
|
PropList.txt
|
|
Scripts.txt
|
|
GraphemeBreakProperty.txt
|
|
SentenceBreakProperty.txt
|
|
WordBreakProperty.txt
|
|
- ucdstrip and ucdmerge:
|
|
EastAsianWidth.txt
|
|
LineBreak.txt
|
|
|
|
* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
|
|
|
|
* update FractionalUCA.txt and UCARules.txt with new canonical closure
|
|
|
|
* genpname
|
|
- run preparse.pl
|
|
+ make sure that data.h is writable
|
|
+ perl preparse.pl \cvs\oss\icu > out.txt
|
|
|
|
* uchar.h & uscript.h & uprops.h & uprops.c & genprops
|
|
- new block & script values
|
|
+ script values already added in ICU 3.6 because all of ISO 15924 is now covered
|
|
|
|
* build Unicode data source code for hardcoding core data
|
|
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
and rebuild the common library
|
|
|
|
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
- configure.in
|
|
|
|
*** LayoutEngine script information
|
|
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)
|
|
|
|
The generated files have a current copyright date and "@draft" statement.
|
|
|
|
* copy the above files into <icu>/source/layout, replacing the old files.
|
|
|
|
Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
|
|
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
|
|
|
|
* rebuild the layout and layoutex libraries.
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 4.1 update
|
|
|
|
*** related Jitterbugs
|
|
|
|
4332 RFE: Update to Unicode 4.1
|
|
4157 RBBI, TR29 4.1 updates
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
- ucdstrip:
|
|
DerivedCoreProperties.txt
|
|
DerivedNormalizationProps.txt
|
|
NormalizationTest.txt
|
|
GraphemeBreakProperty.txt
|
|
SentenceBreakProperty.txt
|
|
WordBreakProperty.txt
|
|
- ucdstrip and ucdmerge:
|
|
EastAsianWidth.txt
|
|
LineBreak.txt
|
|
|
|
* add new files to the repository
|
|
GraphemeBreakProperty.txt
|
|
SentenceBreakProperty.txt
|
|
WordBreakProperty.txt
|
|
|
|
* update FractionalUCA.txt and UCARules.txt with new canonical closure
|
|
|
|
* genpname
|
|
- handle new enumerated properties in sub read_uchar
|
|
- run preparse.pl
|
|
|
|
* uchar.h & uscript.h & uprops.h & uprops.c & genprops
|
|
- new binary properties
|
|
+ Pattern_Syntax
|
|
+ Pattern_White_Space
|
|
- new enumerated properties
|
|
+ Grapheme_Cluster_Break
|
|
+ Sentence_Break
|
|
+ Word_Break
|
|
- new block & script & line break values
|
|
|
|
* gencase
|
|
- case-ignorable changes
|
|
see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
|
|
now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
|
|
|
|
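The D47a definition quoted above can be sketched with Python's unicodedata module. Two caveats: unicodedata reflects whatever Unicode version the Python build ships, not 4.1, and it has no Word_Break data, so MID_LETTER_SAMPLE below is a small hypothetical stand-in for the real Word_Break=MidLetter set. The function name is made up as well.

```python
import unicodedata

# Hypothetical stand-in for the Word_Break=MidLetter set (not the real data).
MID_LETTER_SAMPLE = {"\u0027", "\u00B7", "\u2019"}

# General categories named in D47a.
IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(ch):
    """D47a sketch: Word_Break=MidLetter, or general category Mn, Me, Cf, Lm, Sk."""
    return ch in MID_LETTER_SAMPLE or unicodedata.category(ch) in IGNORABLE_CATEGORIES


print(is_case_ignorable("\u0301"))  # combining acute accent, category Mn
print(is_case_ignorable("^"))       # circumflex accent, category Sk
print(is_case_ignorable("a"))       # an ordinary letter is not case-ignorable
```

gencase would take both inputs from the UCD files rather than from a runtime library; the sketch only shows how the two conditions combine.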
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
- configure.in
|
|
|
|
*** tests
|
|
- verify that u_charMirror() round-trips
|
|
- test all new properties and some new values of old properties
|
|
|
|
*** other code
|
|
|
|
* hardcoded Unihan range end/limit
|
|
- Unihan range end moves from 9FA5 to 9FBB
|
|
search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
|
|
+ do not modify BOCU/BOCSU code because that would change the encoding
|
|
and break binary compatibility!
|
|
+ similarly, do not change the GB 18030 range data (ucnvmbcs.c),
|
|
NamePrepProfile.txt
|
|
+ ignore trietest.c: test data is arbitrary
|
|
+ ignore tstnorm.cpp: test optimization, not important
|
|
+ ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
|
|
+ do change line_th.txt and word_th.txt
|
|
by replacing hardcoded ranges with the new property values
|
|
+ do change gennames.c
|
|
|
|
source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
|
|
source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
|
|
source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,
|
|
|
|
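The search for the old range end and limit (regex 9FA[56], case-insensitive) can be scripted so that the same check works for each later update. A minimal sketch; the function names are made up, and the sample lines are shortened from the hits listed above.

```python
import re


def unihan_end_pattern(old_end):
    """Regex matching the old range end and the old limit (end + 1), in any case."""
    end_hex = f"{old_end:04X}"
    limit_hex = f"{old_end + 1:04X}"
    return re.compile(f"{end_hex}|{limit_hex}", re.IGNORECASE)


def find_hardcoded(lines, old_end):
    """Return (line number, line) pairs that mention the old end or limit."""
    pattern = unihan_end_pattern(old_end)
    return [(n, line) for n, line in enumerate(lines, start=1) if pattern.search(line)]


sample = [
    "0x4e00, 0x9fa5,",
    r"\u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C",
    "0x3400, 0x4db5,",
]
for number, line in find_hardcoded(sample, 0x9FA5):
    print(number, line)
```

Each hit still needs a manual decision, since some occurrences (BOCU/BOCSU, GB 18030 range data) must not change.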
* case mappings
|
|
- compare new special casing context conditions with previous ones
|
|
see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
|
|
|
|
* genpname
|
|
- consider storing only the short name if it is the same as the long name
|
|
|
|
*** other reviews
|
|
- UAX #29 changes (grapheme/word/sentence breaks)
|
|
- UAX #14 changes (line breaks)
|
|
- Pattern_Syntax & Pattern_White_Space
|
|
|
|
---------------------------------------------------------------------------- ***
|
|
|
|
Unicode 4.0.1 update
|
|
|
|
*** related Jitterbugs
|
|
|
|
3170 RFE: Update to Unicode 4.0.1
|
|
3171 Add new Unicode 4.0.1 properties
|
|
3520 use Unicode 4.0.1 updates for break iteration
|
|
|
|
*** data files & enums & parser code
|
|
|
|
* file preparation
|
|
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
|
|
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
|
|
|
|
* file fixes
|
|
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
|
|
according to PRI #26
|
|
http://www.unicode.org/review/resolved-pri.html#pri26
|
|
- undone again because no corrigendum in sight;
|
|
instead modified tests to not check consistency on this for Unicode 4.0.1
|
|
|
|
* ucdterms.txt
|
|
- update from http://www.unicode.org/copyright.html
|
|
formatted for plain text
|
|
|
|
* uchar.h & uprops.h & uprops.c & genprops
|
|
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
|
|
- add U_LB_INSEPARABLE due to a spelling fix
|
|
+ put short name comment only on line with new constant
|
|
for genpname perl script parser
|
|
- new binary properties
|
|
+ STerm
|
|
+ Variation_Selector
|
|
|
|
* genpname
|
|
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
|
|
- perl script: correctly calculate the maximum number of fields per row
|
|
|
|
* uscript.h
|
|
- new script code Hrkt=Katakana_Or_Hiragana
|
|
|
|
* gennorm.c track changes in DerivedNormalizationProps.txt
|
|
- "FNC" -> "FC_NFKC"
|
|
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
|
|
|
|
* genprops/props2.c track changes in DerivedNumericValues.txt
|
|
- changed from 3 columns to 2, dropping the numeric type
|
|
+ assume that the type is always numeric for Han characters,
|
|
and that only those are added in addition to what UnicodeData.txt lists
|
|
|
|
*** Unicode version numbers
|
|
- makedata.mak
|
|
- uchar.h
|
|
- configure.in
|
|
|
|
*** tests
|
|
- update test of default bidi classes according to PRI #28
|
|
/tsutil/cucdtst/TestUnicodeData
|
|
http://www.unicode.org/review/resolved-pri.html#pri28
|
|
- bidi tests: change exemplar character for ES depending on Unicode version
|
|
- change hardcoded expected property values where they change
|
|
|
|
*** other code
|
|
|
|
* name matching
|
|
- read UCD.html
|
|
|
|
* scripts
|
|
- use new Hrkt=Katakana_Or_Hiragana
|
|
|
|
* ZWJ & ZWNJ
|
|
- are now part of combining character sequences
|
|
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ
|