scuffed-code/icu4c/source/data/unidata/changes.txt

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
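
A minimal C sketch of the D47a condition quoted above. The enums and the
function name are hypothetical stand-ins for the real general category and
Word_Break values; this is not the gencase implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for general category values (not ICU's enums). */
typedef enum { SK_GC_OTHER, SK_GC_MN, SK_GC_ME, SK_GC_CF, SK_GC_LM, SK_GC_SK } SketchGC;

/* Hypothetical stand-ins for Word_Break values (not ICU's enums). */
typedef enum { SK_WB_OTHER, SK_WB_MIDLETTER } SketchWB;

/* D47a as quoted in the notes:
   Word_Break=MidLetter, or general category in {Mn, Me, Cf, Lm, Sk}. */
bool sketch_is_case_ignorable(SketchGC gc, SketchWB wb) {
    if (wb == SK_WB_MIDLETTER) {
        return true;
    }
    switch (gc) {
    case SK_GC_MN:
    case SK_GC_ME:
    case SK_GC_CF:
    case SK_GC_LM:
    case SK_GC_SK:
        return true;
    default:
        return false;
    }
}
```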

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
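
A sketch of what a round-trip check could look like. The checker takes the
mirroring function as a pointer so that it stays self-contained; the toy
mapping below covers only the ASCII bracket pairs and is a stand-in, not
u_charMirror() itself. All names are hypothetical and this is not the actual
test code.

```c
#include <stdint.h>

/* Same shape as a code point to code point mapping function. */
typedef int32_t (*SketchMirrorFn)(int32_t);

/* Toy mapping for the ASCII bracket pairs only; everything else maps to itself. */
int32_t sketch_toy_mirror(int32_t c) {
    switch (c) {
    case '(': return ')';
    case ')': return '(';
    case '<': return '>';
    case '>': return '<';
    case '[': return ']';
    case ']': return '[';
    case '{': return '}';
    case '}': return '{';
    default:  return c;
    }
}

/* Returns the first code point in [start, limit) for which
   mirror(mirror(c)) != c, or -1 if every code point round-trips. */
int32_t sketch_first_round_trip_failure(SketchMirrorFn mirror,
                                        int32_t start, int32_t limit) {
    for (int32_t c = start; c < limit; ++c) {
        if (mirror(mirror(c)) != c) {
            return c;
        }
    }
    return -1;
}
```

With ICU available, the same checker could presumably be called with
u_charMirror over 0..0x110000, since UChar32 is a 32-bit signed integer.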

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
  + do not modify BOCU/BOCSU code because that would change the encoding
    and break binary compatibility!
  + similarly, do not change the GB 18030 range data (ucnvmbcs.c)
    or NamePrepProfile.txt
  + ignore trietest.c: test data is arbitrary
  + ignore tstnorm.cpp: test optimization, not important
  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
  + do change line_th.txt and word_th.txt
    by replacing hardcoded ranges with the new property values
  + do change gennames.c

source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,
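
The end/limit distinction above, as a small C sketch: "end" is the last code
point in the range (inclusive), "limit" is one past it (exclusive). The start
and end values come from the notes above; the macro and function names are
illustrative only and do not appear in the ICU sources.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CJK_START 0x4E00
#define SKETCH_CJK_END   0x9FBB                 /* inclusive end (was 0x9FA5) */
#define SKETCH_CJK_LIMIT (SKETCH_CJK_END + 1)   /* exclusive limit, 0x9FBC (was 0x9FA6) */

/* Range test written against the inclusive end. */
bool sketch_in_range_by_end(uint32_t c) {
    return SKETCH_CJK_START <= c && c <= SKETCH_CJK_END;
}

/* Equivalent range test written against the exclusive limit. */
bool sketch_in_range_by_limit(uint32_t c) {
    return SKETCH_CJK_START <= c && c < SKETCH_CJK_LIMIT;
}
```

Code that hardcodes the end and code that hardcodes the limit use different
constants for the same range, which is why the search has to cover both.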

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum is in sight;
  instead modified the tests so that they do not check consistency on this
  for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
  + put short name comment only on line with new constant
    for genpname perl script parser
- new binary properties
  + STerm
  + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c: track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
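
A sketch of a parser that accepts both forms of the NFD quick-check entry
described above: the old single field "NFD_NO" and the new pair "NFD_QC; N".
It assumes the input is the part of a line after the code point field, with
fields separated by ';' and comments starting at '#'. The function names are
hypothetical and this is not the gennorm.c code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Skip spaces and tabs. */
static const char *sketch_skip_ws(const char *s) {
    while (*s == ' ' || *s == '\t') {
        ++s;
    }
    return s;
}

/* Copy one field (up to ';', '#', or end of line) into out, trimming
   surrounding whitespace. Returns a pointer to the character that ended
   the field. cap must be at least 1. */
static const char *sketch_read_field(const char *s, char *out, size_t cap) {
    size_t n = 0;
    s = sketch_skip_ws(s);
    while (*s != '\0' && *s != ';' && *s != '#' && *s != '\n' && *s != '\r') {
        if (n + 1 < cap) {
            out[n++] = *s;
        }
        ++s;
    }
    while (n > 0 && (out[n - 1] == ' ' || out[n - 1] == '\t')) {
        --n;
    }
    out[n] = '\0';
    return s;
}

/* True if the property fields say "NFD quick check = No" in either format. */
bool sketch_is_nfd_no(const char *props) {
    char name[32];
    char value[32];
    const char *p = sketch_read_field(props, name, sizeof name);
    if (strcmp(name, "NFD_NO") == 0) {
        return true;                      /* old single-field form */
    }
    if (strcmp(name, "NFD_QC") == 0 && *p == ';') {
        sketch_read_field(p + 1, value, sizeof value);
        return strcmp(value, "N") == 0;   /* new two-field form */
    }
    return false;
}
```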

* genprops/props2.c: track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
  + assume that the type is always numeric for Han characters,
    and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ