ICU-1080 Used tidy on the Transliterator changes, and removed a Unix build step that's no longer needed.
X-SVN-Rev: 6857
This commit is contained in:
parent
fb01ab107d
commit
60d3f76090
@ -47,14 +47,12 @@
|
|||||||
<ul class="TOC">
|
<ul class="TOC">
|
||||||
<li><a href="#NewsUnicodeVer">Support for Unicode 3.1</a></li>
|
<li><a href="#NewsUnicodeVer">Support for Unicode 3.1</a></li>
|
||||||
|
|
||||||
<li><a href="#NewsLicense">License Change from IPL to the X
|
|
||||||
license</a></li>
|
|
||||||
|
|
||||||
<li><a href="#NewsCollation">Collation Improvements</a></li>
|
|
||||||
|
|
||||||
<li><a href="#NewsTranslit">Transliterator Improvements</a></li>
|
<li><a href="#NewsTranslit">Transliterator Improvements</a></li>
|
||||||
|
|
||||||
<li><a href="#NewsUnicodeSet">UnicodeSet Improvements</a></li>
|
<li><a href="#NewsUnicodeSet">UnicodeSet Improvements</a></li>
|
||||||
|
|
||||||
|
<li><a href="#NewsLicense">License Change from IPL to the X
|
||||||
|
license</a></li>
|
||||||
</ul>
|
</ul>
|
||||||
</li>
|
</li>
|
||||||
|
|
||||||
@ -167,6 +165,200 @@
|
|||||||
versions of ICU already supported Unicode 3.0 data with UTF-16 surrogate
|
versions of ICU already supported Unicode 3.0 data with UTF-16 surrogate
|
||||||
pairs.</p>
|
pairs.</p>
|
||||||
|
|
||||||
|
<h3><a name="NewsTranslit">Transliterator Improvements</a></h3>
|
||||||
|
|
||||||
|
<p>The transliterator service has undergone an extensive overhaul, in
|
||||||
|
both the rule-based engine and the built-in system rules.</p>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li><b>New or rewritten rules:</b> <tt>Any-Accents</tt>,
|
||||||
|
<tt>Any-Publishing</tt>, <tt>Cyrillic-Latin</tt>*,
|
||||||
|
<tt>Greek-Latin</tt>*, <tt>Greek-Latin/UNGEGN</tt> (aka
|
||||||
|
<tt>el-Latin</tt>), <tt>Hiragana-Latin</tt>*, and
|
||||||
|
<tt>Latin-Katakana</tt>*. New algorithmic rules include
|
||||||
|
<tt>Any-Name</tt>*, the normalization rules <tt>Any-NFC</tt>,
|
||||||
|
<tt>Any-NFKC</tt>, <tt>Any-NFD</tt>, and <tt>Any-NFKD</tt>, casing
|
||||||
|
rules <tt>Any-Upper</tt>, <tt>Any-Lower</tt>, and <tt>Any-Title</tt>.
|
||||||
|
<tt>Unicode-Hex</tt>* has been renamed <tt>Any-Hex</tt>*.
|
||||||
|
<tt>Any-Remove</tt> deletes its input. [*<em>applies to reverse rule as
|
||||||
|
well</em>]</li>
|
||||||
|
|
||||||
|
<li><b>Indic script rules:</b> Transliterators between Indic scripts
|
||||||
|
and from each script to and from Latin have been completely revised.
|
||||||
|
Scripts included are Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
|
||||||
|
Malayalam, Oriya, Tamil, and Telugu. Taking Bengali as an example,
|
||||||
|
transliterators <tt>Bengali-X</tt> and <tt>X-Bengali</tt> exist, where
|
||||||
|
X is any of the other listed Indic scripts, or Latin.</li>
|
||||||
|
|
||||||
|
<li><b>Deleted rules:</b> <tt>UnicodeName-UnicodeChar</tt> has been
|
||||||
|
replaced by <tt>Any-Name</tt>*. <tt>Latin-Arabic</tt>* and
|
||||||
|
<tt>Latin-Hebrew</tt>* have been removed until they can be rewritten.
|
||||||
|
<tt>KeyboardEscape-Latin1</tt> has been replaced by
|
||||||
|
<tt>Any-Accents</tt> and <tt>Any-Publishing</tt>. <tt>Latin-Kana</tt>*
|
||||||
|
has been replaced by <tt>Latin-Katakana</tt>* and
|
||||||
|
<tt>Latin-Hiragana</tt>*. [*<em>applies to reverse rule as
|
||||||
|
well</em>]</li>
|
||||||
|
|
||||||
|
<li><b>ID syntax changes:</b> Transliterator IDs now ignore case and
|
||||||
|
whitespace. They have the standard form
|
||||||
|
<em>[filter]source-target/variant</em>. The "<em>[filter]</em>" element
|
||||||
|
is optional; if present, it limits the characters that the
|
||||||
|
transliterator operates on. The "<em>source-</em>" element is optional;
|
||||||
|
if omitted, it is taken to be <tt>Any</tt>. The "<em>/variant</em>"
|
||||||
|
element is also optional; if present, it selects between different
|
||||||
|
flavors of a related set of transliterators, for example,
|
||||||
|
<tt>Greek-Latin</tt> and <tt>Greek-Latin/UNGEGN</tt>. The source,
|
||||||
|
target, and variant specifiers are case-insensitive strings of the form
|
||||||
|
<tt>/[_[:L:]][_[:L:][:N:]]*/</tt>.</li>
|
||||||
|
|
||||||
|
<li>
|
||||||
|
<b>Locale support:</b> The source, target, or both may be locales. In
|
||||||
|
this case the transliterator rules will be looked up in the system
|
||||||
|
locale resource bundles. Rules are sought under three tags, listed
|
||||||
|
below. The text after the underscore in each tag is always
|
||||||
|
canonicalized to uppercase before lookup. <em>Note: The underscore is
|
||||||
|
currently omitted from ICU4C tags, but will be restored when
|
||||||
|
possible.</em>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li><tt>TransliterateTo_<em>SCRIPT</em></tt>: Unidirectional rules
|
||||||
|
from the enclosing locale to another script or specifier.</li>
|
||||||
|
|
||||||
|
<li><tt>TransliterateFrom_<em>SCRIPT</em></tt>: Unidirectional
|
||||||
|
rules from another script or specifier to the enclosing
|
||||||
|
locale.</li>
|
||||||
|
|
||||||
|
<li><tt>Transliterate_<em>SCRIPT</em></tt>: Bidirectional rules,
|
||||||
|
with the forward direction being To and the reverse direction being
|
||||||
|
From.</li>
|
||||||
|
</ul>
|
||||||
|
Lookup proceeds in the following order:
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>In the dynamic registry: <em>source-target</em></li>
|
||||||
|
|
||||||
|
<li>In the <em>source</em> locale:
|
||||||
|
<tt>TransliterateTo_<em>TARGET</em></tt> then
|
||||||
|
<tt>Transliterate_<em>TARGET</em></tt> (forward direction)</li>
|
||||||
|
|
||||||
|
<li>In the <em>target</em> locale:
|
||||||
|
<tt>TransliterateFrom_<em>SOURCE</em></tt> then
|
||||||
|
<tt>Transliterate_<em>SOURCE</em></tt> (reverse direction)</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
If either the source or target specifier is not a locale then the
|
||||||
|
corresponding locale lookup is skipped. If either is a locale, then
|
||||||
|
locale fallback from <tt>aa_BB_CCC</tt> to <tt>aa_BB</tt> to
|
||||||
|
<tt>aa</tt> is performed (where <tt>aa</tt>, <tt>BB</tt>, and
|
||||||
|
<tt>CCC</tt> are the locale language, country, and variant). The
|
||||||
|
final fallback is from the specifier, whether it is a locale or not
|
||||||
|
(e.g., script abbreviation), to the long script name associated with
|
||||||
|
that specifier. If a tag lookup succeeds, the attached element should
|
||||||
|
be a string array of <i>2n</i> items where <i>n</i> >= 1. Each
|
||||||
|
pair of strings is a variant name and rule string. The variants are
|
||||||
|
matched against the requested variant. If no variant is specified
|
||||||
|
then the first variant is considered to match.
|
||||||
|
</li>
|
||||||
|
|
||||||
|
<li><b>Filters on compound IDs:</b> A filter on a compound
|
||||||
|
transliterator can now be specified by giving a leading entry that
|
||||||
|
contains a filter and no transliterator ID. For example, "<tt>[abc];
|
||||||
|
Latin-Katakana; Katakana-Hiragana</tt>" submits only the characters
|
||||||
|
contained in the UnicodeSet <tt>[abc]</tt> to the compound
|
||||||
|
transliterator <tt>Latin-Katakana; Katakana-Hiragana</tt>.</li>
|
||||||
|
|
||||||
|
<li><b>Explicit reverse IDs:</b> Typically if a transliterator
|
||||||
|
<tt>A-B</tt> is formed, and its inverse is requested, the system tries
|
||||||
|
to create <tt>B-A</tt>. That is, the source and target are exchanged.
|
||||||
|
In some cases, the user may wish a different transliterator to be
|
||||||
|
considered the reverse. In order to do this, the reverse ID is
|
||||||
|
specified in parentheses immediately following the ID. For example,
|
||||||
|
"<tt>A-B (B-C)</tt>" is a transliterator <tt>A-B</tt> whose inverse is
|
||||||
|
<tt>B-C</tt>. If the ID of the inverse is requested, "<tt>B-C
|
||||||
|
(A-B)</tt>" is returned. The forward or reverse component may be empty,
|
||||||
|
so "<tt>(B-C)</tt>" and "<tt>A-B()</tt>" are legal IDs with
|
||||||
|
<tt>Null</tt> transliterator for the forward and reverse direction,
|
||||||
|
respectively. This is most useful in compounds where one element has no
|
||||||
|
inverse or where a different inverse from the standard inverse is
|
||||||
|
desired. For example, "<tt>Any-Lower(); Latin-Cyrillic</tt>".</li>
|
||||||
|
|
||||||
|
<li><b>Quantifiers:</b> Transliterator rules may now contain
|
||||||
|
quantifiers '<tt>*</tt>', '<tt>+</tt>', and '<tt>?</tt>'. These
|
||||||
|
indicate zero or more, one or more, and zero or one matches,
|
||||||
|
respectively. Quantifiers apply to the last element, be it a single
|
||||||
|
character, a UnicodeSet, a segment definition, or a quote; the entire
|
||||||
|
preceding element is repeated. Quantifiers are implemented as greedy,
|
||||||
|
non-backtracking matchers, unlike their typical implementation in
|
||||||
|
regular expressions. As a result, some expressions that match in a
|
||||||
|
traditional regular expression engine (e.g., Perl) will not match in a
|
||||||
|
transliterator. E.g., "[a-z]+ q > x;" will <em>not</em> match
|
||||||
|
"abcq", since the '<tt>+</tt>' quantifier consumes all four
|
||||||
|
characters.</li>
|
||||||
|
|
||||||
|
<li><b>Dot character:</b> A new special character is recognized in
|
||||||
|
rules, '<tt>.</tt>' (U+002E). This character matches any character in
|
||||||
|
the set <tt>[^[:Zp:][:Zl:]\r\n$]</tt>. Note the trailing '<tt>$</tt>'
|
||||||
|
in the set pattern, which indicates that the ETHER character is
|
||||||
|
<em>not</em> matched by '<tt>.</tt>'.</li>
|
||||||
|
|
||||||
|
<li><b>::ID blocks in rules:</b> Transliterator IDs may now be included
|
||||||
|
in rule sets. These may occur in two locations: as one contiguous block
|
||||||
|
before any other rules, and as one contiguous block after all rules.
|
||||||
|
The effect of placing <tt>::ID</tt>s into a rule set is to enclose the
|
||||||
|
rule-based transliterator within a compound transliterator containing
|
||||||
|
the indicated IDs. The <tt>::ID</tt> syntax is exactly the same as the
|
||||||
|
standard ID syntax, with the difference that each ID element is
|
||||||
|
preceded by the special token "<tt>::</tt>".</li>
|
||||||
|
|
||||||
|
<li><b>Segment definitions more flexible:</b> Segment definitions may
|
||||||
|
be nested and are now unlimited in number. Prior to 2.0, segments could
|
||||||
|
not be nested and were limited to nine ($1 to $9).</li>
|
||||||
|
|
||||||
|
<li><b>Variable range pragma:</b> A new pragma is supported. This
|
||||||
|
follows the syntax: <code>use variable range 0xE800 0xEFFF;</code> (Any
|
||||||
|
two code points may be specified.) The code points are specified as
|
||||||
|
decimal constants, octal constants with a leading '0', or hexadecimal
|
||||||
|
constants with a leading "0x". The given range is used internally for
|
||||||
|
stand-in characters during processing. The default range is
|
||||||
|
<b>0xF000..0xF8FF</b>. If a rule set explicitly uses characters in the
|
||||||
|
default variable range, a new range, not containing any characters in
|
||||||
|
use in the rule set, must be specified. <em>Note:</em> This is the
|
||||||
|
first of several planned pragmas.</li>
|
||||||
|
|
||||||
|
<li><b>Factory method registration:</b> Factory methods (function
|
||||||
|
pointers in ICU4C; functor objects in ICU4J) may be registered against
|
||||||
|
transliterator IDs. This is generally more efficient than the
|
||||||
|
registration of singleton prototypes, since no actual transliterator
|
||||||
|
object need be created until the user requires one. See the
|
||||||
|
<tt>registerFactory()</tt> method in <tt>Transliterator</tt>.</li>
|
||||||
|
|
||||||
|
<li><b>Filtering semantics changed for subclasses:</b> Subclasses now
|
||||||
|
need not concern themselves with filters. Instead, they may assume that
|
||||||
|
all characters received by <tt>handleTransliterate()</tt> have already
|
||||||
|
passed through the filter. This simplifies subclass code greatly.</li>
|
||||||
|
</ul>
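To illustrate the quantifier item in the list above, the following Python sketch models a greedy, non-backtracking '+' and contrasts it with Python's backtracking `re` engine. It is a toy model and not ICU code: the helper name `greedy_no_backtrack` and its parameters are hypothetical, and it handles only the pattern shape "character class, '+', literal" anchored at the start of the text.

```python
import re


def greedy_no_backtrack(char_class, literal, text):
    """Model `char_class`+ followed by `literal` without ever giving characters back.

    Hypothetical helper for illustration only; not part of ICU.
    """
    pos = 0
    # Greedy phase: consume every leading character that belongs to the class.
    while pos < len(text) and re.fullmatch(char_class, text[pos]):
        pos += 1
    if pos == 0:
        return False  # '+' requires at least one match
    # No backtracking: the literal must start exactly where consumption stopped.
    return text.startswith(literal, pos)


# A backtracking engine lets [a-z]+ give back the final 'q', so this succeeds.
backtracking_result = re.fullmatch(r"[a-z]+q", "abcq") is not None

# In the non-backtracking model, [a-z]+ swallows all of "abcq", leaving nothing for 'q'.
non_backtracking_result = greedy_no_backtrack(r"[a-z]", "q", "abcq")

# A following literal outside the class still matches, since consumption stops before it.
outside_class_result = greedy_no_backtrack(r"[a-z]", "1", "abc1")
```

Under this model, a rule author would typically exclude the following literal from the quantified set (for example, a set that omits 'q') so that the quantifier stops before it.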
|
||||||
|
|
||||||
|
<h3><a name="NewsUnicodeSet">UnicodeSet Improvements</a></h3>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li><b><tt>[:Any:]</tt> set:</b> The set <tt>[:Any:]</tt> matches all
|
||||||
|
Unicode code points, that is, U+0000..U+10FFFF.</li>
|
||||||
|
|
||||||
|
<li><b><tt>\p{}</tt> syntax:</b> UnicodeSet now recognizes a Perlish
|
||||||
|
syntax for character properties. Any property designated as
|
||||||
|
<tt>[:Foo:]</tt> may equivalently be designated <tt>\p{Foo}</tt>.</li>
|
||||||
|
|
||||||
|
<li><b>Short, medium, and long property names:</b> In addition to the
|
||||||
|
short property names, such as <tt>[:Ll:]</tt>, equivalent medium (e.g.,
|
||||||
|
<tt>[:gc=Ll:]</tt>) and long (e.g.,
|
||||||
|
<tt>[:GeneralCategory=LowercaseLetter:]</tt>) forms are recognized. See
|
||||||
|
the <a href=
|
||||||
|
"http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/unicodeset_properties.html">
|
||||||
|
UnicodeSet Properties design document</a> for details. As of this
|
||||||
|
release, general categories, numeric value, and script are
|
||||||
|
supported.</li>
|
||||||
|
</ul>
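As a rough illustration of the equivalent spellings described in the list above, the following Python sketch reduces the short, medium, long, and <tt>\p{}</tt> forms to one general category code and tests a character against it with the standard `unicodedata` module. It is not the UnicodeSet parser: the alias tables are hypothetical and cover only the names used here, and the case-insensitive matching of names is an assumption.

```python
import re
import unicodedata

# Hypothetical alias tables; they cover only the names used in this example.
PROPERTY_ALIASES = {"gc": "gc", "generalcategory": "gc"}
VALUE_ALIASES = {
    "ll": "Ll", "lowercaseletter": "Ll",
    "lu": "Lu", "uppercaseletter": "Lu",
}


def general_category(pattern):
    """Return the general category code named by a [:...:] or \\p{...} pattern."""
    m = re.fullmatch(r"\[:(.+):\]|\\p\{(.+)\}", pattern)
    if m is None:
        raise ValueError("not a property pattern: " + pattern)
    body = m.group(1) or m.group(2)
    name, sep, value = body.partition("=")
    if not sep:
        # Short form: a bare value is taken to be a general category.
        name, value = "gc", name
    if PROPERTY_ALIASES.get(name.lower()) != "gc":
        raise ValueError("only general category is modelled here: " + name)
    return VALUE_ALIASES[value.lower()]


def matches(pattern, ch):
    """True if the character's general category is the one the pattern names."""
    return unicodedata.category(ch) == general_category(pattern)


forms = [
    "[:Ll:]",
    "[:gc=Ll:]",
    "[:GeneralCategory=LowercaseLetter:]",
    r"\p{Ll}",
]
categories = {general_category(p) for p in forms}
```

All four spellings in `forms` resolve to the same category code in this sketch, which is the equivalence the release note describes.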
|
||||||
|
|
||||||
<h3><a name="NewsLicense">License Change</a></h3>
|
<h3><a name="NewsLicense">License Change</a></h3>
|
||||||
|
|
||||||
<p>The ICU projects (ICU4C and ICU4J) have changed their licenses from
|
<p>The ICU projects (ICU4C and ICU4J) have changed their licenses from
|
||||||
@ -193,218 +385,6 @@
|
|||||||
"http://oss.software.ibm.com/icu/project_faq.html#license">Project
|
"http://oss.software.ibm.com/icu/project_faq.html#license">Project
|
||||||
FAQ</a>.</p>
|
FAQ</a>.</p>
|
||||||
|
|
||||||
<h3><a name="NewsCollation">Collation Improvements</a></h3>
|
|
||||||
|
|
||||||
<p>The collation framework has been reimplemented to make it faster,
|
|
||||||
Unicode Collation Algorithm compliant, and to make the locale-specific
|
|
||||||
collation data smaller (by separating it from the shared UCA data).<br>
|
|
||||||
<em>Sort keys and even some collation results have changed from ICU 1.6
|
|
||||||
and ICU 1.7.</em><br>
|
|
||||||
For details, see our <a href=
|
|
||||||
"http://oss.software.ibm.com/icu/develop/collation/">collation design
|
|
||||||
document</a>.</p>
|
|
||||||
|
|
||||||
<h3><a name="NewsTranslit">Transliterator Improvements</a></h3>
|
|
||||||
|
|
||||||
<p>The transliterator service has undergone an extensive overhaul,
|
|
||||||
in both the rule-based engine and the built-in system rules.
|
|
||||||
|
|
||||||
<ul>
|
|
||||||
|
|
||||||
<li><b>New or rewritten rules:</b> <tt>Any-Accents</tt>,
|
|
||||||
<tt>Any-Publishing</tt>, <tt>Cyrillic-Latin</tt>*,
|
|
||||||
<tt>Greek-Latin</tt>*, <tt>Greek-Latin/UNGEGN</tt> (aka
|
|
||||||
<tt>el-Latin</tt>), <tt>Hiragana-Latin</tt>*, and
|
|
||||||
<tt>Latin-Katakana</tt>*. New algorithmic rules include
|
|
||||||
<tt>Any-Name</tt>*, the normalization rules <tt>Any-NFC</tt>,
|
|
||||||
<tt>Any-NFKC</tt>, <tt>Any-NFD</tt>, and <tt>Any-NFKD</tt>, casing
|
|
||||||
rules <tt>Any-Upper</tt>, <tt>Any-Lower</tt>, and
|
|
||||||
<tt>Any-Title</tt>. <tt>Unicode-Hex</tt>* has been renamed
|
|
||||||
<tt>Any-Hex</tt>*. <tt>Any-Remove</tt> deletes its input.
|
|
||||||
[*<em>applies to reverse rule as well</em>]
|
|
||||||
|
|
||||||
<li><b>Indic script rules:</b> Transliterators between Indic
|
|
||||||
scripts and from each script to and from Latin have been
|
|
||||||
completely revised. Scripts included are Bengali, Devanagari,
|
|
||||||
Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and Telugu.
|
|
||||||
Taking Bengali as an example, transliterators <tt>Bengali-X</tt>
|
|
||||||
and <tt>X-Bengali</tt> exist, where X is any of the other listed
|
|
||||||
Indic scripts, or Latin.
|
|
||||||
|
|
||||||
<li><b>Deleted rules:</b> <tt>UnicodeName-UnicodeChar</tt> has
|
|
||||||
been replaced by <tt>Any-Name</tt>*. <tt>Latin-Arabic</tt>* and
|
|
||||||
<tt>Latin-Hebrew</tt>* have been removed until they can be
|
|
||||||
rewritten. <tt>KeyboardEscape-Latin1</tt> has been replaced by
|
|
||||||
<tt>Any-Accents</tt> and <tt>Any-Publishing</tt>.
|
|
||||||
<tt>Latin-Kana</tt>* has been replaced by <tt>Latin-Katakana</tt>*
|
|
||||||
and <tt>Latin-Hiragana</tt>*.
|
|
||||||
[*<em>applies to reverse rule as well</em>]
|
|
||||||
|
|
||||||
<li><b>ID syntax changes:</b> Transliterator IDs ignore case and
|
|
||||||
whitespace now. They now have the standard form
|
|
||||||
<em>[filter]source-target/variant</em>. The "<em>[filter]</em>"
|
|
||||||
element is optional; if present, it limits the characters that the
|
|
||||||
transliterator operates on. The "<em>source-</em>" element is
|
|
||||||
optional; if omitted, it is taken to be <tt>Any</tt>. The
|
|
||||||
"<em>/variant</em>" element is also optional; if present, it
|
|
||||||
selects between different flavors of a related set of
|
|
||||||
transliterators, for example, <tt>Greek-Latin</tt> and
|
|
||||||
<tt>Greek-Latin/UNGEGN</tt>. The source, target, and variant
|
|
||||||
specifiers are case-insensitive strings of the form
|
|
||||||
<tt>/[_[:L:]][_[:L:][:N:]]*/</tt>.
|
|
||||||
|
|
||||||
<li><b>Locale support:</b> The source, target, or both may be
|
|
||||||
locales. In this case the transliterator rules will be looked up
|
|
||||||
in the system locale resource bundles. Rules are sought under
|
|
||||||
three tags, listed below. The text after the underscore in each
|
|
||||||
tag is always canonicalized to uppercase before lookup. <em>Note:
|
|
||||||
The underscore is currently omitted from ICU4C tags, but will be
|
|
||||||
restored when possible.</em>
|
|
||||||
|
|
||||||
<ul><li><tt>TransliterateTo_<em>SCRIPT</em></tt>:
|
|
||||||
Unidirectional rules from the enclosing locale to another script
|
|
||||||
or specifier.
|
|
||||||
<li><tt>TransliterateFrom_<em>SCRIPT</em></tt>:
|
|
||||||
Unidirectional rules from another script
|
|
||||||
or specifier to the enclosing locale.
|
|
||||||
<li><tt>Transliterate_<em>SCRIPT</em></tt>:
|
|
||||||
Bidirectional rules, with the forward direction being To and
|
|
||||||
the reverse direction being From.
|
|
||||||
</ul>
|
|
||||||
|
|
||||||
Lookup proceeds in the following order:
|
|
||||||
|
|
||||||
<ul><li>In the dynamic registry: <em>source-target</em>
|
|
||||||
<li>In the <em>source</em> locale:
|
|
||||||
<tt>TransliterateTo_<em>TARGET</em></tt> then
|
|
||||||
<tt>Transliterate_<em>TARGET</em></tt> (forward direction)
|
|
||||||
<li>In the <em>target</em> locale:
|
|
||||||
<tt>TransliterateFrom_<em>SOURCE</em></tt> then
|
|
||||||
<tt>Transliterate_<em>SOURCE</em></tt> (reverse direction)
|
|
||||||
</ul>
|
|
||||||
|
|
||||||
If either the source or target specifier is not a locale then the
|
|
||||||
corresponding locale lookup is skipped. If either is a locale,
|
|
||||||
then locale fallback from <tt>aa_BB_CCC</tt> to <tt>aa_BB</tt> to
|
|
||||||
<tt>aa</tt> is performed (where <tt>aa</tt>, <tt>BB</tt>, and
|
|
||||||
<tt>CCC</tt> are the locale language, country, and variant). The
|
|
||||||
final fallback is from the specifier, whether it is a locale or
|
|
||||||
not (e.g., script abbreviation), to the long script name
|
|
||||||
associated with that specifier. If a tag lookup succeeds, the
|
|
||||||
attached element should be a string array of <i>2n</i> items where
|
|
||||||
<i>n</i> >= 1. Each pair of strings is a variant name and rule
|
|
||||||
string. The variants are matched against the requested variant.
|
|
||||||
If no variant is specified then the first variant is considered to
|
|
||||||
match.
|
|
||||||
|
|
||||||
<li><b>Filters on compound IDs:</b> A filter on a compound
|
|
||||||
transliterator can now be specified by giving a leading entry that
|
|
||||||
contains a filter and no transliterator ID. For example,
|
|
||||||
"<tt>[abc]; Latin-Katakana; Katakana-Hiragana</tt>" submits only
|
|
||||||
the characters contained in the UnicodeSet <tt>[abc]</tt> to the
|
|
||||||
compound transliterator <tt>Latin-Katakana;
|
|
||||||
Katakana-Hiragana</tt>.
|
|
||||||
|
|
||||||
<li><b>Explicit reverse IDs:</b> Typically if a transliterator
|
|
||||||
<tt>A-B</tt> is formed, and its inverse is requested, the system
|
|
||||||
tries to create <tt>B-A</tt>. That is, the source and target are
|
|
||||||
exchanged. In some cases, the user may wish a different
|
|
||||||
transliterator to be considered the reverse. In order to do this,
|
|
||||||
the reverse ID is specified in parentheses immediately following
|
|
||||||
the ID. For example, "<tt>A-B (B-C)</tt>" is a transliterator
|
|
||||||
<tt>A-B</tt> whose inverse is <tt>B-C</tt>. If the ID of the
|
|
||||||
inverse is requested, "<tt>B-C (A-B)</tt>" is returned. The
|
|
||||||
forward or reverse component may be empty, so "<tt>(B-C)</tt>" and
|
|
||||||
"<tt>A-B()</tt>" are legal IDs with <tt>Null</tt> transliterator
|
|
||||||
for the forward and reverse direction, respectively. This is most
|
|
||||||
useful in compounds where one element has no inverse or where a
|
|
||||||
different inverse from the standard inverse is desired. For
|
|
||||||
example, "<tt>Any-Lower(); Latin-Cyrillic</tt>".
|
|
||||||
|
|
||||||
<li><b>Quantifiers:</b> Transliterator rules may now contain
|
|
||||||
quantifiers '<tt>*</tt>', '<tt>+</tt>', and '<tt>?</tt>'. These
|
|
||||||
indicate zero or more, one or more, and zero or one matches,
|
|
||||||
respectively. Quantifiers apply to the last element, be it a
|
|
||||||
single character, a UnicodeSet, a segment definition, or a quote;
|
|
||||||
the entire preceding element is repeated. Quantifiers are
|
|
||||||
implemented as greedy, non-backtracking matchers, unlike their
|
|
||||||
typical implementation in regular expressions. As a result,
|
|
||||||
expressions that match in a traditional regular expression engine
|
|
||||||
(e.g., Perl) will not match in transliterator. E.g., "[a-z]+ q >
|
|
||||||
x;" will <em>not</em> match "abcq", since the '<tt>+</tt>'
|
|
||||||
quantifier consumes all four characters.
|
|
||||||
|
|
||||||
<li><b>Dot character:</b> A new special character is recognized in
|
|
||||||
rules, '<tt>.</tt>' (U+002E). This character matches any
|
|
||||||
characters in the set <tt>[^[:Zp:][:Zl:]\r\n$]</tt>. Note the
|
|
||||||
trailing '<tt>$</tt>' in the set pattern, which indicates that the
|
|
||||||
ETHER character is <em>not</em> matched by '<tt>.</tt>'.
|
|
||||||
|
|
||||||
<li><b>::ID blocks in rules:</b> Transliterator IDs may now be
|
|
||||||
included in rule sets. These may occur in two locations: as one
|
|
||||||
contiguous block before any other rules, and as one contiguous
|
|
||||||
block after all rules. The effect of placing <tt>::ID</tt>s into
|
|
||||||
a rule set is to enclose the rule-based transliterator within a
|
|
||||||
compound transliterator containing the indicated IDs. The
|
|
||||||
<tt>::ID</tt> syntax is exactly the same as the standard ID
|
|
||||||
syntax, with the difference that each ID element is preceded by
|
|
||||||
the special token "<tt>::</tt>".
|
|
||||||
|
|
||||||
<li><b>Segment definitions more flexible:</b> Segment definitions
|
|
||||||
may be nested and are now unlimited in number. Prior to 2.0,
|
|
||||||
segments could not be nested and were limited to nine ($1 to $9).
|
|
||||||
|
|
||||||
<li><b>Variable range pragma:</b> A new pragma is supported. This
|
|
||||||
follows the syntax:<code>use variable range 0xE800 0xEFFF;</code>
|
|
||||||
(Any two code points may be specified.) The code points are
|
|
||||||
specified as decimal constants, octal constants with a leading
|
|
||||||
'0', or hexadecimal constants with a leading "0x". The given
|
|
||||||
range is used internally for stand-in characters during
|
|
||||||
processing. The default range is <b>0xF000..0xF8FF</b>. If a
|
|
||||||
rule set explicitly uses characters in the default variable range,
|
|
||||||
a new range, not containing any characters in use in the rule set,
|
|
||||||
must be specified. <em>Note:</em> This is the first of several
|
|
||||||
planned pragmas.
|
|
||||||
|
|
||||||
<li><b>Factory method registration:</b> Factory methods (function
|
|
||||||
pointers in ICU4C; functor objects in ICU4J) may be registered
|
|
||||||
against transliterator IDs. This is generally more efficient than
|
|
||||||
the registration of singleton prototypes, since no actual
|
|
||||||
transliterator object need be created until the user requires one.
|
|
||||||
See the <tt>registerFactory()</tt> method in
|
|
||||||
<tt>Transliterator</tt>.
|
|
||||||
|
|
||||||
<li><b>Filtering semantics changed for subclasses:</b> Subclasses
|
|
||||||
now need not concern themselves with filters. Instead, they may
|
|
||||||
assume that all characters received by
|
|
||||||
<tt>handleTransliterate()</tt> have already passed through the
|
|
||||||
filter. This simplifies subclass code greatly.
|
|
||||||
|
|
||||||
</ul>
|
|
||||||
|
|
||||||
<h3><a name="NewsUnicodeSet">UnicodeSet Improvements</a></h3>
|
|
||||||
|
|
||||||
<ul>
|
|
||||||
|
|
||||||
<li><b><tt>[:Any:]</tt> set:</b> The set <tt>[:Any:]</tt> matches
|
|
||||||
all Unicode code points, that is, U+0000..U+10FFFF.
|
|
||||||
|
|
||||||
<li><b><tt>\p{}</tt> syntax:</b> UnicodeSet now recognizes a
|
|
||||||
Perlish syntax for character properties. Any property designated
|
|
||||||
as <tt>[:Foo:]</tt> may equivalently be designated
|
|
||||||
<tt>\p{Foo}</tt>.
|
|
||||||
|
|
||||||
<li><b>Short, medium, and long property names:</b> In addition to
|
|
||||||
the short property names, such as <tt>[:Ll:]</tt>, equivalent
|
|
||||||
medium (e.g., <tt>[:gc=Ll:]</tt>) and long (e.g.,
|
|
||||||
<tt>[:GeneralCategory=LowercaseLetter:]</tt>) forms are
|
|
||||||
recognized. See the <a
|
|
||||||
href="http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/unicodeset_properties.html">UnicodeSet
|
|
||||||
Properties design document</a> for details. As of this release,
|
|
||||||
general categories, numeric value, and script are supported.
|
|
||||||
|
|
||||||
</ul>
|
|
||||||
|
|
||||||
<h2><a name="WhatContain">What the International Components for Unicode
|
<h2><a name="WhatContain">What the International Components for Unicode
|
||||||
Contain</a></h2>
|
Contain</a></h2>
|
||||||
|
|
||||||
@ -1094,10 +1074,6 @@
|
|||||||
|
|
||||||
<li>Change directory to the "icu/source".</li>
|
<li>Change directory to the "icu/source".</li>
|
||||||
|
|
||||||
<li>If it is not already set, please set the executable flag for the
|
|
||||||
following files (by executing 'chmod +x' command): runConfigureICU,
|
|
||||||
configure, install-sh and config.*,</li>
|
|
||||||
|
|
||||||
<li>Run the <a href="source/runConfigureICU">runConfigureICU</a> script
|
<li>Run the <a href="source/runConfigureICU">runConfigureICU</a> script
|
||||||
for your platform. If you are not using the runConfigureICU script or
|
for your platform. If you are not using the runConfigureICU script or
|
||||||
your platform is not supported by the script, you need to set your CC,
|
your platform is not supported by the script, you need to set your CC,
|
||||||
|