*   Copyright (C) 2004-2006, International Business Machines
*   Corporation and others.  All Rights Reserved.
*
*   file name:  changes.txt
*   encoding:   US-ASCII
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2004may06
*   created by: Markus W. Scherer
*
*   change log for Unicode updates

----------------------------------------------------------------------------

*** Unicode 5.0 update

*** related Jitterbugs

5084    RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
    - ucdstrip:
        DerivedCoreProperties.txt
        DerivedNormalizationProps.txt
        NormalizationTest.txt
        PropList.txt
        Scripts.txt
        GraphemeBreakProperty.txt
        SentenceBreakProperty.txt
        WordBreakProperty.txt
    - ucdstrip and ucdmerge:
        EastAsianWidth.txt
        LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
    - run preparse.pl
        + make sure that data.h is writable
        + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
    - new block & script values
        + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
    C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

    ICU data make path is \cvs\oss\icu\source\data\
    ICU root path is \cvs\oss\icu
    Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
    [etc.]
    Creating data file for Unicode Character Properties
    Creating data file for Unicode Case Mapping Properties
    Creating data file for Unicode BiDi/Shaping Properties
    Creating data file for Unicode Normalization
    Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
    Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

    - copy the .c source files to C:\cvs\oss\icu\source\common
      and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

----------------------------------------------------------------------------

*** Unicode 4.1 update

*** related Jitterbugs

4332    RFE: Update to Unicode 4.1
4157    RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
    - ucdstrip:
        DerivedCoreProperties.txt
        DerivedNormalizationProps.txt
        NormalizationTest.txt
        GraphemeBreakProperty.txt
        SentenceBreakProperty.txt
        WordBreakProperty.txt
    - ucdstrip and ucdmerge:
        EastAsianWidth.txt
        LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
    - handle new enumerated properties in sub read_uchar
    - run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
    - new binary properties
        + Pattern_Syntax
        + Pattern_White_Space
    - new enumerated properties
        + Grapheme_Cluster_Break
        + Sentence_Break
        + Word_Break
    - new block & script & line break values

* gencase
    - case-ignorable changes
      see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
      now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties

*** other code

* hardcoded Unihan range end/limit
    - Unihan range end moves from 9FA5 to 9FBB
      search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
        + do not modify
          BOCU/BOCSU code because that would change the encoding
          and break binary compatibility!
        + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
          NamePrepProfile.txt
        + ignore trietest.c: test data is arbitrary
        + ignore tstnorm.cpp: test optimization, not important
        + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
        + do change line_th.txt and word_th.txt
          by replacing hardcoded ranges with the new property values
        + do change gennames.c

        source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
        source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
        source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,

* case mappings
    - compare new special casing context conditions with previous ones
      see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
    - consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

----------------------------------------------------------------------------

*** Unicode 4.0.1 update

*** related Jitterbugs

3170    RFE: Update to Unicode 4.0.1
3171    Add new Unicode 4.0.1 properties
3520    use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
    - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
    - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
    - fix UnicodeData.txt general categories of Ethiopic digits Nd->No
      according to PRI #26
      http://www.unicode.org/review/resolved-pri.html#pri26
    - undone again because no corrigendum in sight;
      instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
    - update from http://www.unicode.org/copyright.html
      formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
    - add
      UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
    - add U_LB_INSEPARABLE due to a spelling fix
        + put short name comment only on line with new constant
          for genpname perl script parser
    - new binary properties
        + STerm
        + Variation_Selector

* genpname
    - fix genpname perl script so that it doesn't choke on more than 2 names per property value
    - perl script: correctly calculate the maximum number of fields per row

* uscript.h
    - new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
    - "FNC" -> "FC_NFKC"
    - single field "NFD_NO" -> two fields "NFD_QC; N" etc.

* genprops/props2.c track changes in DerivedNumericValues.txt
    - changed from 3 columns to 2, dropping the numeric type
        + assume that the type is always numeric for Han characters,
          and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
    - read UCD.html

* scripts
    - use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
    - are now part of combining character sequences
    - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ
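
The Unicode 4.1 notes above call for a case-insensitive search for the hardcoded Unihan range end and limit (regex 9FA[56]). A minimal sketch of such a search follows. It is not part of the ICU tools; the function name find_hardcoded_unihan and the sample text are hypothetical. Every hit still needs manual review, because the notes list several places that must stay unchanged (BOCU/BOCSU, GB 18030 range data, NamePrepProfile.txt).

```python
import re

# Pattern from the notes: 9FA5 is the old range end, 9FA6 the old limit.
UNIHAN_END_OR_LIMIT = re.compile(r"9FA[56]", re.IGNORECASE)


def find_hardcoded_unihan(text):
    """Return (line_number, stripped_line) pairs for lines mentioning 9FA5 or 9FA6.

    Hypothetical helper for illustration; it only reports candidates and
    does not decide whether a given occurrence should be changed.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if UNIHAN_END_OR_LIMIT.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Sample text modeled on the occurrences quoted in the notes.
    sample = (
        "0x4e00, 0x9fa5,\n"
        "\\u3400-\\u4DB5 \\u4E00-\\u9FA5\n"
        "unrelated line\n"
    )
    for number, line in find_hardcoded_unihan(sample):
        print(number, line)
```

To scan a source tree, read each file's text and pass it to the function; the pattern is deliberately loose, so it also reports matches inside comments and longer hex strings.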