ICU-7963 Break Iterator data files update for Unicode 6.0

X-SVN-Rev: 28646
Andy Heninger 2010-09-18 01:22:35 +00:00
parent 02a21226d9
commit efa8bfba9e
8 changed files with 27 additions and 20 deletions

@@ -1,12 +1,12 @@
# #
# Copyright (C) 2002-2009, International Business Machines Corporation and others. # Copyright (C) 2002-2010, International Business Machines Corporation and others.
# All Rights Reserved. # All Rights Reserved.
# #
# file: char.txt # file: char.txt
# #
# ICU Character Break Rules, also known as Grapheme Cluster Boundaries # ICU Character Break Rules, also known as Grapheme Cluster Boundaries
# See Unicode Standard Annex #29. # See Unicode Standard Annex #29.
# These rules are based on TR29 Revision 13, for Unicode Version 5.1 # These rules are based on TR29 Revision 16, for Unicode Version 6.0
# #
# #

@@ -1,12 +1,12 @@
# #
# Copyright (C) 2002-2009, International Business Machines Corporation and others. # Copyright (C) 2002-2010, International Business Machines Corporation and others.
# All Rights Reserved. # All Rights Reserved.
# #
# file: char_th.txt # file: char_th.txt
# #
# ICU Character Break Rules, also known as Grapheme Cluster Boundaries # ICU Character Break Rules, also known as Grapheme Cluster Boundaries
# See Unicode Standard Annex #29. # See Unicode Standard Annex #29.
# These rules are based on TR29 Revision 13, for Unicode Version 5.1 # These rules are based on TR29 Revision 16, for Unicode Version 6.0
# #
# #

@@ -1,14 +1,16 @@
# Copyright (c) 2002-2009 International Business Machines Corporation and # Copyright (c) 2002-2010 International Business Machines Corporation and
# others. All Rights Reserved. # others. All Rights Reserved.
# #
# file: line.txt # file: line.txt
# #
# Line Breaking Rules # Line Breaking Rules
# Implement default line breaking as defined by # Implement default line breaking as defined by
# Unicode Standard Annex #14 Revision 24 for Unicode 5.2 # Unicode Standard Annex #14 Revision 24 for Unicode 6.0
# http://www.unicode.org/reports/tr14/ # http://www.unicode.org/reports/tr14/
#
# TODO: Rule LB 8 remains as it was in Unicode 5.2
# This is only because of a limitation of ICU break engine implementation,
# not because the older behavior is desirable.
# #
# Character Classes defined by TR 14. # Character Classes defined by TR 14.
@@ -214,6 +216,9 @@ $CM+ [$SP $ZW];
# #
# LB 8 Break after zero width space # LB 8 Break after zero width space
# TODO: ZW SP* <break>
# An engine change is required to write the reverse rule for this.
# For now, leave the Unicode 5.2 rule, ZW <break>
# #
$LB8Breaks = [$LB4Breaks $ZW]; $LB8Breaks = [$LB4Breaks $ZW];
$LB8NonBreaks = [[$LB4NonBreaks] - [$ZW]]; $LB8NonBreaks = [[$LB4NonBreaks] - [$ZW]];
@@ -452,8 +457,10 @@ $LF $CR;
[$SP $ZW] [$LB4NonBreaks-$CM]; [$SP $ZW] [$LB4NonBreaks-$CM];
[$SP $ZW] $CM+ $CAN_CM; [$SP $ZW] $CM+ $CAN_CM;
# LB 8 Break after zero width space # LB 8 ZW SP* <break>
# TODO: to implement this, we need more than one look-ahead hard break in play at a time.
# Requires an engine enhancement.
# / $SP* $ZW
# LB 9,10 Combining marks. # LB 9,10 Combining marks.
# X $CM needs to behave like X, where X is not $SP or controls. # X $CM needs to behave like X, where X is not $SP or controls.
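
Note on LB 8: the TODO comments above contrast the Unicode 5.2 form of the rule (ZW <break>) with the Unicode 6.0 form (ZW SP* <break>). The following is a minimal Python sketch of that difference, not ICU code and not the rule engine's method. It assumes a toy model in which only LB 7 (no break before SP or ZW) and LB 8 are applied and every other UAX #14 rule is ignored; the function names are hypothetical.

```python
ZW = "\u200b"  # ZERO WIDTH SPACE (line break class ZW)
SP = " "       # SPACE (line break class SP)


def lb8_breaks_5_2(text):
    """Break offsets produced by the Unicode 5.2 form of LB 8: ZW <break>.

    LB 7 is applied first, so no break is reported when the character
    after ZW is SP or ZW. All other rules are ignored (assumption).
    """
    breaks = []
    for i, ch in enumerate(text[:-1]):
        if ch == ZW and text[i + 1] not in (SP, ZW):
            breaks.append(i + 1)
    return breaks


def lb8_breaks_6_0(text):
    """Break offsets produced by the Unicode 6.0 form of LB 8: ZW SP* <break>.

    After a ZW, any run of spaces is skipped and the break is reported
    before the next character, unless that character is another ZW (LB 7).
    All other rules are ignored (assumption).
    """
    breaks = []
    i, n = 0, len(text)
    while i < n:
        if text[i] != ZW:
            i += 1
            continue
        j = i + 1
        while j < n and text[j] == SP:
            j += 1
        if j < n and text[j] != ZW:
            breaks.append(j)
        i = j  # continue scanning from the character after the run
    return breaks


sample = "a" + ZW + " )"
print(lb8_breaks_5_2(sample), lb8_breaks_6_0(sample))
```

For the sample text, the 5.2 form reports no LB 8 break because a space follows the ZW, while the 6.0 form reports a break before the closing parenthesis. That case is the one the rule files above still handle in the 5.2 way.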

@@ -1,12 +1,12 @@
# #
# Copyright (C) 2002-2009, International Business Machines Corporation and others. # Copyright (C) 2002-2010, International Business Machines Corporation and others.
# All Rights Reserved. # All Rights Reserved.
# #
# file: sent.txt # file: sent.txt
# #
# ICU Sentence Break Rules # ICU Sentence Break Rules
# See Unicode Standard Annex #29. # See Unicode Standard Annex #29.
# These rules are based on UAX 29 Revision 13 for Unicode Version 5.1.0 # These rules are based on UAX 29 Revision 16 for Unicode Version 6.0
# #

@@ -1,12 +1,12 @@
# #
# Copyright (C) 2002-2009, International Business Machines Corporation and others. # Copyright (C) 2002-2010, International Business Machines Corporation and others.
# All Rights Reserved. # All Rights Reserved.
# #
# file: sent_el.txt # file: sent_el.txt
# #
# ICU Sentence Break Rules # ICU Sentence Break Rules
# See Unicode Standard Annex #29. # See Unicode Standard Annex #29.
# These rules are based on UAX 29 Revision 13 for Unicode Version 5.1.0 # These rules are based on UAX 29 Revision 16 for Unicode Version 6.0
# #

@@ -1,12 +1,12 @@
# #
# Copyright (C) 2002-2009, International Business Machines Corporation # Copyright (C) 2002-2010, International Business Machines Corporation
# and others. All Rights Reserved. # and others. All Rights Reserved.
# #
# file: word.txt # file: word.txt
# #
# ICU Word Break Rules # ICU Word Break Rules
# See Unicode Standard Annex #29. # See Unicode Standard Annex #29.
# These rules are based on UAX-29 Revision 13 for Unicode 5.1 # These rules are based on UAX-29 Revision 16 for Unicode 6.0
# #
# Note: Updates to word.txt will usually need to be merged into # Note: Updates to word.txt will usually need to be merged into
# word_POSIX.txt and word_ja.txt also. # word_POSIX.txt and word_ja.txt also.

@@ -1,12 +1,12 @@
# #
# Copyright (C) 2002-2009, International Business Machines Corporation # Copyright (C) 2002-2010, International Business Machines Corporation
# and others. All Rights Reserved. # and others. All Rights Reserved.
# #
# file: word_POSIX.txt # file: word_POSIX.txt
# #
# ICU Word Break Rules, POSIX locale. # ICU Word Break Rules, POSIX locale.
# See Unicode Standard Annex #29. # See Unicode Standard Annex #29.
# These rules are based on UAX-29 Revision 13 for Unicode 5.1 # These rules are based on UAX-29 Revision 16 for Unicode 6.0
# #
# Note: Updates to word.txt will usually need to be merged into # Note: Updates to word.txt will usually need to be merged into
# word_POSIX.txt and word_ja.txt also. # word_POSIX.txt and word_ja.txt also.

@@ -1,12 +1,12 @@
# #
# Copyright (C) 2002-2009, International Business Machines Corporation # Copyright (C) 2002-2010, International Business Machines Corporation
# and others. All Rights Reserved. # and others. All Rights Reserved.
# #
# file: word_ja.txt # file: word_ja.txt
# #
# ICU Word Break Rules # ICU Word Break Rules
# See Unicode Standard Annex #29. # See Unicode Standard Annex #29.
# These rules are based on UAX-29 Revision 13 for Unicode 5.1 # These rules are based on UAX-29 Revision 16 for Unicode 6.0
# #
# Note: Updates to word.txt will usually need to be merged into # Note: Updates to word.txt will usually need to be merged into
# word_POSIX.txt and word_ja.txt also. # word_POSIX.txt and word_ja.txt also.