ICU-20088 fix code, tables, typos, broken links, add code styling

Robert Melo 2020-08-16 19:51:05 -03:00
parent 9c965ff757
commit 5a714ae8f7
49 changed files with 860 additions and 864 deletions

@ -8,7 +8,7 @@ Updating ICU's built-in Break Iterator rules
Here are instructions for updating ICU's built-in break iterator rules, for Grapheme, Word, Line and Sentence breaks.
The ICU rules implement the boundary behavior from Unicode [UAX-14](https://unicode.org/reports/tr14/) and [UAX-29](https://unicode.org/reports/tr29/), with tailorings from CLDR and some ICU-specific enhancements. ICU rules updates are needed in response to changes from Unicode or CLDR, or for bug fixes. Often ideas for CLDR or UAX updates are prototyped in ICU first, before becoming official.
The ICU rules implement the boundary behavior from Unicode [UAX-14](https://www.unicode.org/reports/tr14/) and [UAX-29](https://www.unicode.org/reports/tr29/), with tailorings from CLDR and some ICU-specific enhancements. ICU rules updates are needed in response to changes from Unicode or CLDR, or for bug fixes. Often ideas for CLDR or UAX updates are prototyped in ICU first, before becoming official.
This is not a cookbook process. Familiarity with ICU break iterator behavior and rules is needed. Sets of break rules often interact in subtle and difficult-to-understand ways. Expect some bumps.
@ -50,7 +50,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
Add tests to spot check the basics of the changes, to verify that some simple, straightforward cases work as expected. There is no need to thoroughly check corner cases; the goal at this step is a quick sanity check that will fail before the rule update and pass afterwards.
The [Unicode Utilities](http://unicode.org/cldr/utility/) can be very helpful at this point, for showing what characters
The [Unicode Utilities](http://www.unicode.org/cldr/utility/) can be very helpful at this point, for showing what characters
match a UnicodeSet expression, and for listing the properties of a particular character.
Tests added for the above example:
@ -146,7 +146,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
cd icu4c/source/test/intltest
LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH ./intltest rbbi/RBBITest/TestUnicodeFiles
The test files are from the Unicode Consortium. The official, released versions are at https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/ . The files are copied, unmodified, into the ICU source tree to make them accessible to the ICU tests.
The test files are from the Unicode Consortium. The official, released versions are at https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/. The files are copied, unmodified, into the ICU source tree to make them accessible to the ICU tests.
If the update is for a new Unicode version, or for a new CLDR tailoring of the root Unicode rules, it should include updated test data files. If they're missing, ask whoever is requesting or providing the updated rules for help. The test data is generated by CLDR tooling.
@ -209,7 +209,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
by this test have not yet been.
Make the updates to the test rules and re-run. The rule syntax is described in
icu4c/source/test/testdata/break_rules/README.md.
[icu4c/source/test/testdata/break_rules/README.md](https://github.com/unicode-org/icu/blob/master/icu4c/source/test/testdata/break_rules/README.md)
The test reference rules are in this same directory.
Again, after everything appears to be working, let the test run for an extended length of time. Long runs are especially important with the more complex break rule sets, such as line break.

@ -25,8 +25,8 @@ License & terms of use: http://www.unicode.org/copyright.html
ICU locates boundary positions within text by means of rules, which are a form
of regular expressions. The form of the rules is similar, but not identical,
to the boundary rules from the Unicode specifications
[ [UAX-14](https://unicode.org/reports/tr14/),
[UAX-29](https://unicode.org/reports/tr29/) ], and there is a reasonably close
[[UAX-14](https://www.unicode.org/reports/tr14/),
[UAX-29](https://www.unicode.org/reports/tr29/)], and there is a reasonably close
correspondence between the two.
Taken as a set, the ICU rules describe how to move forward to the next boundary,

@ -66,8 +66,8 @@ ICU `BreakIterator`s can be used to locate the following kinds of text boundarie
Each type of boundary is found in accordance with the rules specified by Unicode
Standard Annex #29, *Unicode Text Segmentation*
(<https://unicode.org/reports/tr29/> ) or Unicode Standard Annex #14, *Unicode
Line Breaking Algorithm* (<https://unicode.org/reports/tr14/>)
(<https://www.unicode.org/reports/tr29/>) or Unicode Standard Annex #14, *Unicode
Line Breaking Algorithm* (<https://www.unicode.org/reports/tr14/>).
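A minimal sketch of walking such boundaries, using the JDK's `java.text.BreakIterator` as a stand-in (an assumption made so the example runs without extra libraries). ICU4J's `com.ibm.icu.text.BreakIterator` offers the same `first()`/`next()` iteration style, but its rules track newer Unicode versions, so exact boundaries may differ. `WordBoundarySketch` and `words` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundarySketch {
    // Returns the segments between word boundaries that begin with a letter or digit,
    // skipping segments made of punctuation or whitespace.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world"));
    }
}
```

The same loop shape applies to character, line, and sentence iteration; only the factory method changes.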
### Character Boundary
@ -237,7 +237,7 @@ French, Italian and Portuguese.
ICU's break iterators are based on the default boundary rules described in the
Unicode Standard Annexes [14](https://www.unicode.org/reports/tr14/) and
[29](https://www.unicode.org/unicode/reports/tr29/) . These are relatively
[29](https://www.unicode.org/reports/tr29/). These are relatively
simple boundary rules that can be implemented efficiently, and are sufficient
for many purposes and languages. However, some languages and applications will
require a more sophisticated linguistic analysis of the text in order to find

@ -112,7 +112,7 @@ try {
If the ICU predefined collators are not appropriate for your intended usage, you
can define your own set of rules and instantiate a collator that uses them. For more
details, please see [the section on collation customization](customization/index.md).
details, please see [the section on collation customization](customization/index).
This example demonstrates the instantiation of a collator.
@ -272,7 +272,7 @@ The Java API provides only the `Collator.getCollationKey` method, which produces
Sort keys are generally only useful in databases or other circumstances where
function calls are extremely expensive. See [Sortkeys vs
Comparison](concepts.md#sortkeys-vs-comparison).
Comparison](concepts#sortkeys-vs-comparison).
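As a rough illustration of the two approaches, the sketch below sorts the same list once through precomputed sort keys and once through direct comparison. It uses the JDK's `java.text.Collator` and `CollationKey` as stand-ins, which is an assumption made for runnability; the ICU4J classes of the same names follow the same pattern. `SortKeySketch` and its methods are hypothetical names.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class SortKeySketch {
    // Computes one sort key per string up front, then sorts the keys.
    static List<String> sortWithKeys(List<String> input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<CollationKey> keys = new ArrayList<>();
        for (String s : input) {
            keys.add(collator.getCollationKey(s));
        }
        Collections.sort(keys); // CollationKey implements Comparable
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    // Sorts by calling the collator's compare() for every pair the sort examines.
    static List<String> sortWithCompare(List<String> input, Locale locale) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Collator.getInstance(locale)); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        List<String> input = List.of("pear", "Apple", "banana", "apple");
        System.out.println(sortWithKeys(input, Locale.ENGLISH));
        System.out.println(sortWithCompare(input, Locale.ENGLISH));
    }
}
```

Both methods produce the same order; the key-based version pays the key generation cost once per string, which only helps when each string is compared many times or the keys are stored.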
### Sort Key Features
@ -354,7 +354,7 @@ of memory needed to store the resulting sort key. This mode is automatically
activated if the output buffer size passed is set to zero. Should the sort key
become longer than the buffer provided, the function again slips into preflighting
mode. The overall performance is poorer than if the function is called with a
zero output buffer . If the size of the sort key returned is greater than the
zero output buffer. If the size of the sort key returned is greater than the
size of the buffer provided, the content of the result buffer is undefined. In
that case, the result buffer could be reallocated to its proper size and the
sort key generator function can be used again.
@ -693,7 +693,7 @@ API is provided. Each attribute has its own setter API of the form
## References
1. Ken Whistler, Markus Scherer: "Unicode Technical Standard #10, Unicode Collation
Algorithm" (<http://www.unicode.org/unicode/reports/tr10/>)
Algorithm" (<http://www.unicode.org/reports/tr10/>)
2. ICU Design doc: "Collation v2" (<http://site.icu-project.org/design/collation/v2>)
@ -701,11 +701,11 @@ API is provided. Each attribute has its own setter API of the form
(<https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/collation/ICU_collation_design.htm>)
3. The Unicode Standard, chapter 5, "Implementation guidelines"
(<http://www.unicode.org/unicode/uni2book/ch05.pdf>)
(<http://www.unicode.org/uni2book/ch05.pdf>)
4. Laura Werner: "Efficient text searching in Java: Finding the right string in
any language"
(<http://icu-project.org/docs/papers/efficient_text_searching_in_java.html>)
5. Mark Davis, Martin Dürst: "Unicode Standard Annex #15: Unicode Normalization
Forms" (<http://www.unicode.org/unicode/reports/tr15/>).
Forms" (<http://www.unicode.org/reports/tr15/>).

@ -223,7 +223,7 @@ generated sort keys are longer than supplied buffers too often
A good strategy
for this problem would be to manually manage a large buffer for storing sortkeys
and keep a list of indices to sort keys in this buffer (see the "large buffers"
[Collation Example](examples.md#using-large-buffers-to-manage-sort-keys)
[Collation Example](examples#using-large-buffers-to-manage-sort-keys)
for more details).
Here are some rules of thumb; please do not rely on them. If you are looking

@ -131,7 +131,7 @@ Following is a list of the names for each level and an example usage:
5. Identical Level: When all other levels are equal, the identical level is
used as a tiebreaker. The Unicode code point values of the NFD form of each
string are compared at this level, just in case there is no difference at
levels 1-4 . For example, Hebrew cantillation marks are only distinguished
levels 1-4. For example, Hebrew cantillation marks are only distinguished
at this level. This level should be used sparingly, because it is extremely
rare for two strings to differ only in their code point values.
Using this level substantially decreases the performance for

@ -48,7 +48,7 @@ See also
* [UCA: Variable
Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting)
* [LDML: Setting
Options](http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-collation.html#Setting_Options)
Options](https://htmlpreview.github.io/?https://github.com/unicode-org/cldr/blob/master/docs/ldml/tr35-collation.html#Setting_Options)
Here is an overview of the sorting results with these options.

@ -820,10 +820,10 @@ These do not cause **c** to sort as if it were **ae**, nor should they.
The Unicode Collation Algorithm specifies that all text sort as if it were first
normalized into NFD. For performance reasons, ICU collation data is
pre-processed so that there is no need to perform normalization on strings that
are in [FCD](http://unicode.org/notes/tn5/#FCD) and do not contain any composite
are in [FCD](http://www.unicode.org/notes/tn5/#FCD) and do not contain any composite
combining marks. Composite combining marks are: { U+0344, U+0F73, U+0F75, U+0F81
}
[`[[:^lccc=0:]&[:toNFD=/../:]]`](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3A%5Elccc%3D0%3A%5D%26%5B%3AtoNFD%3D%2F..%2F%3A%5D&abb=on&g=)
[`[[:^lccc=0:]&[:toNFD=/../:]]`](http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3A%5Elccc%3D0%3A%5D%26%5B%3AtoNFD%3D%2F..%2F%3A%5D&abb=on&g=)
(These characters must be decomposed for discontiguous contractions to work
properly. Use of these characters is discouraged by the Unicode Standard.) The
vast majority of strings are in this form.

@ -27,7 +27,7 @@ are also provided the facilities to customize orderings.
Starting in release 1.8, the ICU Collation Service is compliant with the Unicode
Collation Algorithm (UCA) ([Unicode Technical Standard
#10](http://www.unicode.org/unicode/reports/tr10/)) and based on the Default
#10](http://www.unicode.org/reports/tr10/)) and based on the Default
Unicode Collation Element Table (DUCET) which defines the same sort order as ISO
14651.
@ -132,15 +132,15 @@ The basic ICU Collation Service is provided by two main categories of APIs:
use index tables of sort keys to index strings quickly. Note, however, that
this only improves performance for large numbers of strings because sorting
via the comparison functions is very fast. For more information, see
[Sortkeys vs Comparison](concepts.md#sortkeys-vs-comparison).
[Sortkeys vs Comparison](concepts#sortkeys-vs-comparison).
ICU provides an AlphabeticIndex API for generating language-appropriate
sorted-section labels like in dictionaries and phone books.
ICU also provides a higher-level [string search](string-search.md)
ICU also provides a higher-level [string search](string-search)
API which can be used, for example, for case-insensitive or accent-insensitive
search in an editor or in a web page. ICU string search is based on the
low-level [collation element iteration](architecture.md).
low-level [collation element iteration](architecture).
## Programming Examples

@ -155,7 +155,7 @@ match is returned. However, the behavior can be overwritten by supplying a word
The minimum unit of match is aligned to an extended grapheme cluster in the ICU
string search service implementation defined by [UAX #29 Unicode Text
Segmentation](http://unicode.org/reports/tr29/). Therefore, all matches will
Segmentation](http://www.unicode.org/reports/tr29/). Therefore, all matches will
begin and end on extended grapheme cluster boundaries. If the given input search
pattern starts with a non-base character, no matches will be returned.
When there are contractions in the collation sequence and the contraction
@ -264,7 +264,7 @@ service. Therefore, all the performance implications that apply to a collator
are also applicable to the string search service. To obtain the best
performance, use the default collator attributes described in the Performance
and Storage Implications on Attributes section in the [Collation Service
Architecture](architecture.md#-performance-and-storage-implications-on-attributes)
Architecture](architecture#performance-and-storage-implications-of-attributes)
chapter. In addition, users need to be aware of
the following `StringSearch` specific considerations:

@ -38,7 +38,7 @@ Unicode (usually in the Basic Latin range).
For additional detail about the compression algorithm, which has been approved
by the Unicode Consortium, please refer to [Unicode Technical Report #6 (A
Standard Compression Scheme for
Unicode)](https://www.unicode.org/unicode/reports/tr6/).
Unicode)](https://www.unicode.org/reports/tr6/).
The Standard Compression Scheme for Unicode (SCSU) is used to:

View File

@ -607,7 +607,7 @@ character preceding a Variation Selector.
Unicode has a character property to identify such characters, as well as
currently-unassigned code points that are intended to be used for similar
purposes: Default_Ignorable_Code_Point, or "DI" for short:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:DI:]
http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=[:DI:]
Most charsets do not have most or any of these characters.

@ -118,10 +118,10 @@ used in ICU. For related material, please see:
1. [ICU character set collection](http://icu-project.org/charts/charset/)
2. [Unicode Technical Report 22](http://www.unicode.org/unicode/reports/tr22/)
2. [Unicode Technical Report 22](http://www.unicode.org/reports/tr22/)
3. "Cross Mapping Tables" in [Unicode Online
Data](http://www.unicode.org/unicode/onlinedat/online.html)
Data](http://www.unicode.org/onlinedat/online.html)
## ICU Mapping Table Data Files

@ -91,7 +91,7 @@ Unicode.
of such problems, see for example ["Character Conversions and Mapping
Tables"](http://icu-project.org/docs/papers/conversions_and_mappings_iuc19.ppt)
on <http://icu-project.org/docs/> and the [XML Japanese
Profile](http://www.w3.org/TR/japanese-xml/) .)
Profile](http://www.w3.org/TR/japanese-xml/)).
1. Use UTF-8 and UTF-16.
@ -141,7 +141,7 @@ Unicode.
references](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html).
7. For data exchange (rather than pure display), turn off fallback
mappings: ucnv_setFallback(cnv, FALSE);
mappings: `ucnv_setFallback(cnv, FALSE);`
8. For some text formats, especially XML and HTML, it is possible to set an
"escape callback" function that turns unmappable Unicode code points

@ -25,15 +25,17 @@ License & terms of use: http://www.unicode.org/copyright.html
ICU has two main calendar classes used for parsing and formatting Calendar
information correctly:
1. Calendar
1. `Calendar`
An abstract base class that defines the calendar API. This API supports
UDate to fields conversion and field arithmetic.
2. GregorianCalendar
A concrete subclass of Calendar that implements the standard calendar used
2. `GregorianCalendar`
A concrete subclass of `Calendar` that implements the standard calendar used
today internationally.
In addition to these, ICU has other Calendar sub classes to support
In addition to these, ICU has other `Calendar` subclasses to support
non-Gregorian calendars including:
* Japanese
@ -56,55 +58,51 @@ non-gregorian calendars including:
* Ethiopic
The Calendar class is designed to support additional calendar systems in the
future.
The `Calendar` class is designed to support additional calendar systems in the future.
> :point_right: **Note**: *Calendar classes are related to UDate, the TimeZone classes, and the DateFormat
> :point_right: **Note**: *`Calendar` classes are related to `UDate`, the `TimeZone` classes, and the `DateFormat`
classes.*
### Calendar locale and keyword handling
When a calendar object is created, via either Calendar::create(), or
ucal_open(), or indirectly within a date formatter, ICU looks up the 'default'
When a calendar object is created, via either `Calendar::createInstance()`, or
`ucal_open()`, or indirectly within a date formatter, ICU looks up the 'default'
calendar type for that locale. At present, all locales default to a Gregorian
calendar, except for the compatibility locales th_TH_TRADITIONAL and
ja_JP_TRADITIONAL. If the "calendar" keyword is supplied, this value will
override the default for that locale.
For instance, Calendar::createInstance("fr_FR", status) will create a Gregorian
calendar, but Calendar::createInstance("fr_FR@calendar=buddhist") will create a
Buddhist calendar.
For instance, `Calendar::createInstance("fr_FR", status)` will create a Gregorian calendar,
but `Calendar::createInstance("fr_FR@calendar=buddhist", status)` will create a Buddhist calendar.
It is an error to use an invalid calendar type. It will produce a missing
resource error.
It is an error to use an invalid calendar type. It will produce a missing resource error.
> :point_right: **Note**: *As of ICU 2.8, the above description applies to ICU4C only. ICU4J will have
this behavior in 3.0*
## Usage
This section discusses how to use the Calendar class and the GregorianCalendar
subclass.
This section discusses how to use the `Calendar` class and the `GregorianCalendar` subclass.
### Calendar
Calendar is an abstract base class. It defines common protocols for a hierarchy
of classes. Concrete subclasses of Calendar, for example the GregorianCalendar
`Calendar` is an abstract base class. It defines common protocols for a hierarchy
of classes. Concrete subclasses of `Calendar`, for example the `GregorianCalendar`
class, define specific operations that correspond to a real-world calendar
system. Calendar objects (instantiations of concrete subclasses of Calendar),
system. `Calendar` objects (instantiations of concrete subclasses of `Calendar`),
embody state that represents a specific context. They correspond to a real-world
locale. They also contain state that specifies a moment in time.
The API defined by Calendar encompasses multiple functions:
The API defined by `Calendar` encompasses multiple functions:
1. Representation of a specific time as a UDate
1. Representation of a specific time as a `UDate`
2. Representation of a specific time as a set of integer fields, such as YEAR,
MONTH, HOUR, etc.
2. Representation of a specific time as a set of integer fields, such as `YEAR`,
`MONTH`, `HOUR`, etc.
3. Conversion from UDate to fields
3. Conversion from `UDate` to fields
4. Conversion from fields to UDate
4. Conversion from fields to `UDate`
5. Field arithmetic, including adding, rolling, and field difference
@ -116,49 +114,49 @@ The API defined by Calendar encompasses multiple functions:
#### Representation and Conversion
The basic function of the Calendar class is to convert between a UDate value and
a set of integer fields. A UDate value is stored as UTC time in milliseconds,
which means it is calendar and time zone independent. UDate is the most compact
and portable way to store and transmit a date and time. Integer field values, on
The basic function of the `Calendar` class is to convert between a `UDate` value and
a set of integer fields. A `UDate` value is stored as UTC time in milliseconds,
which means it is calendar and time zone independent. `UDate` is the most compact
and portable way to store and transmit a date and time. Integer field values, on
the other hand, depend on the calendar system (that is, the concrete subclass of
Calendar) and the calendar object's context state.
`Calendar`) and the calendar object's context state.
> :point_right: **Note**: *Integer field values are needed when implementing a human interface that must
> :point_right: **Note**: *Integer field values are needed when implementing a human interface that must
display or input a date and/or time.*
At any given time, a calendar object uses (when DateFormat is not sufficient)
either its internal UDate or its integer fields (depending on which has been set
most recently via setTime() or set()), to represent a specific date and time.
Whatever the current internal representation, when the caller requests a UDate
At any given time, a calendar object uses (when `DateFormat` is not sufficient)
either its internal `UDate` or its integer fields (depending on which has been set
most recently via `setTime()` or `set()`), to represent a specific date and time.
Whatever the current internal representation, when the caller requests a `UDate`
or an integer field, it is computed if necessary. The caller never needs to trigger
the conversion explicitly; it only needs to set either
the UDate or the integer fields, and then retrieve the desired data. This also
the `UDate` or the integer fields, and then retrieve the desired data. This also
applies in situations where the caller has some integer fields and wants to
obtain others.
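The conversion in both directions can be sketched as follows. This uses the JDK's `java.util.Calendar` as a stand-in for the ICU class (an assumption made so the example is self-contained), with a `long` millisecond count playing the role of `UDate`. `ConversionSketch` and its methods are hypothetical names.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class ConversionSketch {
    private static Calendar utcCalendar() {
        return new GregorianCalendar(TimeZone.getTimeZone("UTC"));
    }

    // Fields to time: set the integer fields, then ask for the millisecond value.
    static long fieldsToMillis(int year, int month, int day) {
        Calendar cal = utcCalendar();
        cal.clear();               // drop the "now" the calendar was created with
        cal.set(year, month, day);
        return cal.getTimeInMillis();
    }

    // Time to fields: set the millisecond value, then ask for individual fields.
    static int[] millisToFields(long millis) {
        Calendar cal = utcCalendar();
        cal.setTimeInMillis(millis);
        return new int[] {
            cal.get(Calendar.YEAR), cal.get(Calendar.MONTH), cal.get(Calendar.DAY_OF_MONTH)
        };
    }

    // Some fields to other fields: the calendar derives DAY_OF_WEEK on request.
    static int dayOfWeek(int year, int month, int day) {
        Calendar cal = utcCalendar();
        cal.clear();
        cal.set(year, month, day);
        return cal.get(Calendar.DAY_OF_WEEK);
    }

    public static void main(String[] args) {
        System.out.println(fieldsToMillis(1970, Calendar.JANUARY, 2));
        System.out.println(dayOfWeek(1999, Calendar.JUNE, 4) == Calendar.FRIDAY);
    }
}
```

No call in the sketch requests a conversion explicitly; each `get` or `getTimeInMillis()` computes whatever is missing from the most recently set representation.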
#### Field Arithmetic
Arithmetic with UDate values is straightforward. Since the values are
Arithmetic with `UDate` values is straightforward. Since the values are
millisecond scalar values, direct addition and subtraction is all that is
required. Arithmetic with integer fields is more complicated. For example, what
is the date June 4, 1999 plus 300 days? Calendar defines three basic methods (in
several variants) that perform field arithmetic: add(), roll(), and
fieldDifference().
is the date June 4, 1999 plus 300 days? `Calendar` defines three basic methods (in
several variants) that perform field arithmetic: `add()`, `roll()`, and
`fieldDifference()`.
The add() method adds positive or negative values to a specified field. For
example, calling add(Calendar::MONTH, 2) on a GregorianCalendar object set to
March 15, 1999 sets the calendar to May 15, 1999. The roll() method is similar,
The `add()` method adds positive or negative values to a specified field. For
example, calling `add(Calendar::MONTH, 2)` on a `GregorianCalendar` object set to
March 15, 1999 sets the calendar to May 15, 1999. The `roll()` method is similar,
but does not modify fields that are larger. For example, calling
roll(Calendar::HOUR, n) changes the hour that a calendar is set to without
changing the day. Calling roll(Calendar::MONTH, n) changes the month without
`roll(Calendar::HOUR, n)` changes the hour that a calendar is set to without
changing the day. Calling `roll(Calendar::MONTH, n)` changes the month without
changing the year.
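The difference between `add()` and `roll()` can be sketched with the JDK's `java.util.Calendar`, used here as a stand-in for the ICU class on the assumption that both follow the same convention for these two methods. `FieldArithmeticSketch` and `shift` are hypothetical names.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class FieldArithmeticSketch {
    // Starts from the given date, applies add() or roll(), and returns {year, month, day}.
    static int[] shift(int year, int month, int day, int field, int amount, boolean roll) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(year, month, day);
        if (roll) {
            cal.roll(field, amount); // wraps within the field; larger fields stay untouched
        } else {
            cal.add(field, amount);  // overflow carries into larger fields
        }
        return new int[] {
            cal.get(Calendar.YEAR), cal.get(Calendar.MONTH), cal.get(Calendar.DAY_OF_MONTH)
        };
    }

    public static void main(String[] args) {
        // March 15, 1999 plus 11 months: add() moves into 2000, roll() stays in 1999.
        int[] added = shift(1999, Calendar.MARCH, 15, Calendar.MONTH, 11, false);
        int[] rolled = shift(1999, Calendar.MARCH, 15, Calendar.MONTH, 11, true);
        System.out.println(added[0] + " vs " + rolled[0]);

        // The question from the text: June 4, 1999 plus 300 days.
        int[] later = shift(1999, Calendar.JUNE, 4, Calendar.DAY_OF_MONTH, 300, false);
        System.out.println(later[0] + "-" + (later[1] + 1) + "-" + later[2]);
    }
}
```

Adding 300 days to June 4, 1999 lands on March 30, 2000, because the interval crosses February 29, 2000.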
The fieldDifference() method is the inverse of the add() method. It computes the
difference between a calendar's currently set time and a specified UDate in
terms of a specified field. Repeated calls to fieldDifference() compute the
difference between two UDates in terms of whatever fields the caller specifies
(for example, years, months, days, and hours). If the add() method is called
with the results of fieldDifference(when, n) , then the calendar is moved toward
The `fieldDifference()` method is the inverse of the `add()` method. It computes the
difference between a calendar's currently set time and a specified `UDate` in
terms of a specified field. Repeated calls to `fieldDifference()` compute the
difference between two `UDate` objects in terms of whatever fields the caller specifies
(for example, years, months, days, and hours). If the `add()` method is called
with the results of `fieldDifference(when, n)`, then the calendar is moved toward
`when`, field by field.
This is demonstrated in the following example:
@ -176,9 +174,9 @@ int dayDiff = cal.fieldDifference(date, Calendar.DAY_OF_MONTH); // dayDiff ;<=19
#### Context Management
A calendar object performs its computations within a specific context. The
A `Calendar` object performs its computations within a specific context. The
context affects the results of conversions and arithmetic computations. When a
calendar object is created, it establishes its context using either default
`Calendar` object is created, it establishes its context using either default
values or values specified by the caller:
1. Locale-specific week data, including the first day of the week and the
@ -186,29 +184,28 @@ values or values specified by the caller:
resource data for the specified locale, or if none is specified, for the
default locale.
2. A TimeZone object. Initially, this is set to the specified zone object, or
if none is specified, the default TimeZone.
2. A `TimeZone` object. Initially, this is set to the specified zone object, or
if none is specified, the default `TimeZone`.
The context of a calendar object can be queried after the calendar is created
using calls such as getMinimalDaysInFirstWeek(), getFirstDayOfWeek(), and
getTimeZone(). The context can be changed using calls such as
setMinimalDaysInFirstWeek(), setFirstDayOfWeek(), and setTimeZone().
The context of a `Calendar` object can be queried after the calendar is created
using calls such as `getMinimalDaysInFirstWeek()`, `getFirstDayOfWeek()`, and
`getTimeZone()`. The context can be changed using calls such as
`setMinimalDaysInFirstWeek()`, `setFirstDayOfWeek()`, and `setTimeZone()`.
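A small sketch of how the week context changes a computed field, again using the JDK's `java.util.Calendar` as a stand-in for the ICU class (an assumption; ICU takes its defaults from its own locale data). `WeekContextSketch` and `weekOfYear` are hypothetical names.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class WeekContextSketch {
    // Computes WEEK_OF_YEAR for a date under an explicitly chosen week context.
    static int weekOfYear(int firstDayOfWeek, int minimalDays, int year, int month, int day) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setFirstDayOfWeek(firstDayOfWeek);        // override the locale default
        cal.setMinimalDaysInFirstWeek(minimalDays);   // override the locale default
        cal.clear();                                  // clears fields, not the context
        cal.set(year, month, day);
        return cal.get(Calendar.WEEK_OF_YEAR);
    }

    public static void main(String[] args) {
        // January 1, 1998 was a Thursday.
        // Monday start, 4 minimal days: the week holding January 1 is week 1.
        System.out.println(weekOfYear(Calendar.MONDAY, 4, 1998, Calendar.JANUARY, 1));
        // Sunday start, 4 minimal days: that week has only 3 days in 1998,
        // so January 1 still belongs to the last week of 1997.
        System.out.println(weekOfYear(Calendar.SUNDAY, 4, 1998, Calendar.JANUARY, 1));
    }
}
```

The same date yields different week numbers purely because of the context, which is why code that depends on week fields should set or at least inspect these values rather than assume them.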
#### Factory Methods
Like other format classes, the best way to create a calendar object is by using
one of the factory methods. These are static methods on the Calendar class that
one of the factory methods. These are static methods on the `Calendar` class that
create and return an instance of a concrete subclass. Factory methods should be
used to enable the code to obtain the correct calendar for a locale without
having to know specific details. The factory methods on Calendar are named
createInstance().
having to know specific details. The factory methods on `Calendar` are named
`createInstance()`.
***MONTH field***
> :point_right: **Note**: *Calendar numbers months starting from zero, so calling cal.set(1998, 3, 5)
***`MONTH` field***
> :point_right: **Note**: *Calendar numbers months starting from zero, so calling `cal.set(1998, 3, 15)`
sets cal to April 15, 1998, not March 15, 1998. This follows the Java
convention. To avoid mistakes, use the constants defined in the Calendar class
for the months and days of the week. For example, cal.set(1998, Calendar::APRIL,
15).*
convention. To avoid mistakes, use the constants defined in the `Calendar` class
for the months and days of the week. For example, `cal.set(1998, Calendar::APRIL, 15)`.*
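
The zero-based month numbering can be checked with a short sketch. It uses the JDK's `java.util.Calendar`, whose convention the note above says ICU follows; `MonthIndexSketch` and `monthAfterSet` are hypothetical names.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class MonthIndexSketch {
    // Sets year/month/day from raw integers and reports which month the calendar holds.
    static int monthAfterSet(int year, int rawMonth, int day) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(year, rawMonth, day);
        return cal.get(Calendar.MONTH);
    }

    public static void main(String[] args) {
        // The literal 3 means April, not March.
        System.out.println(monthAfterSet(1998, 3, 15) == Calendar.APRIL);
        // Using the named constant avoids the off-by-one trap entirely.
        System.out.println(monthAfterSet(1998, Calendar.APRIL, 15) == Calendar.APRIL);
    }
}
```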
#### Ambiguous Wall Clock Time Resolution
@ -227,34 +224,34 @@ this case, the ICU Calendar resolves the time using the UTC offset after the
transition by default. For example, 1:30 AM on the date is resolved as 1:30 AM
standard time.
Ambiguous wall clock time resolution behaviors can be customized by Calendar
APIs setRepeatedWallTimeOption() and setSkippedWallTimeOption(). These APIs are
APIs `setRepeatedWallTimeOption()` and `setSkippedWallTimeOption()`. These APIs are
available in ICU 49 or later versions.
### Gregorian Calendar
### `GregorianCalendar`
The GregorianCalendar class implements two calendar systems, the Gregorian
The `GregorianCalendar` class implements two calendar systems, the Gregorian
calendar and the Julian calendar. These calendar systems are closely related,
differing mainly in their definition of the leap year. The Julian calendar has
leap years every four years; the Gregorian calendar refines this by excluding
century years that are not divisible by 400. GregorianCalendar defines two eras,
century years that are not divisible by 400. `GregorianCalendar` defines two eras,
BC (B.C.E.) and AD (C.E.).
Historically, most western countries used the Julian calendar until the 16th to
20th century, depending on the country. They then switched to the Gregorian
calendar. The GregorianCalendar class mirrors this behavior by defining a
calendar. The `GregorianCalendar` class mirrors this behavior by defining a
cut-over date. Before this date, the Julian calendar algorithms are used. After
it, the Gregorian calendar algorithms are used. By default, the cut-over date is
set to October 4, 1582 C.E., which reflects the time when countries first began
adopting the Gregorian calendar. The GregorianCalendar class does not attempt
adopting the Gregorian calendar. The `GregorianCalendar` class does not attempt
historical accuracy beyond this behavior, and does not vary its cut-over date by
locale. However, users can modify the cut-over date by using the
setGregorianChange() method.
`setGregorianChange()` method.
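The effect of the cut-over can be sketched with the JDK's `java.util.GregorianCalendar`, used here as a stand-in on the assumption that its default cut-over and `setGregorianChange()` behave like the ICU class described above. `CutoverSketch` and `dayAfterOct4_1582` are hypothetical names.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CutoverSketch {
    // Returns {year, month, day} for the day after October 4, 1582.
    static int[] dayAfterOct4_1582(boolean pureGregorian) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        if (pureGregorian) {
            // Move the cut-over to the earliest possible date so that the
            // Gregorian rules apply to every date (proleptic Gregorian calendar).
            cal.setGregorianChange(new Date(Long.MIN_VALUE));
        }
        cal.clear();
        cal.set(1582, Calendar.OCTOBER, 4);
        cal.add(Calendar.DAY_OF_MONTH, 1);
        return new int[] {
            cal.get(Calendar.YEAR), cal.get(Calendar.MONTH), cal.get(Calendar.DAY_OF_MONTH)
        };
    }

    public static void main(String[] args) {
        // Default cut-over: the calendar jumps from October 4 to October 15.
        System.out.println(dayAfterOct4_1582(false)[2]);
        // Pure Gregorian: October 4 is simply followed by October 5.
        System.out.println(dayAfterOct4_1582(true)[2]);
    }
}
```

The cut-over is changed before the fields are set, so the fields are interpreted under the calendar system that is actually in effect.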
Code that is written correctly instantiates calendar objects using the Calendar
factory methods, and therefore holds a Calendar* pointer, Such code can not
directly access the GregorianCalendar-specific methods not present in Calendar.
factory methods, and therefore holds a `Calendar*` pointer. Such code cannot
directly access the `GregorianCalendar`-specific methods not present in `Calendar`.
The correct way to handle this is to perform a dynamic cast, after testing the
type of the object using getDynamicClassID(). For example:
type of the object using `getDynamicClassID()`. For example:
```c++
void setCutover(Calendar *cal, UDate myCutover) {
@ -270,8 +267,8 @@ with the factory methods.*
### Disambiguation
When computing a UDate from fields, some special circumstances can arise. There
might be insufficient information to compute the UDate (such as only year and
When computing a `UDate` from fields, some special circumstances can arise. There
might be insufficient information to compute the `UDate` (such as only year and
month but no day in the month), there might be inconsistent information (such as
"Tuesday, July 15, 1996"; July 15, 1996, is actually a Monday), or the input
time might be ambiguous because of a time zone transition.
@ -279,21 +276,21 @@ time might be ambiguous because of time zone transition.
1. **Insufficient Information**
ICU Calendar uses the default field values to specify missing fields. The
default for a field is the same as that of the start of the epoch (that is,
YEAR = 1970, MONTH = JANUARY, DAY_OF_MONTH = 1).
`YEAR = 1970`, `MONTH = JANUARY`, `DAY_OF_MONTH = 1`).
2. **Inconsistent Information**
If fields conflict, the calendar gives preference to fields set more
recently. For example, when determining the day, the calendar looks for one
of the following combinations of fields:
MONTH + DAY_OF_MONTH
MONTH + WEEK_OF_MONTH + DAY_OF_WEEK
MONTH + DAY_OF_WEEK_IN_MONTH + DAY_OF_WEEK
DAY_OF_YEAR
DAY_OF_WEEK + WEEK_OF_YEAR
`MONTH + DAY_OF_MONTH`
`MONTH + WEEK_OF_MONTH + DAY_OF_WEEK`
`MONTH + DAY_OF_WEEK_IN_MONTH + DAY_OF_WEEK`
`DAY_OF_YEAR`
`DAY_OF_WEEK + WEEK_OF_YEAR`
For the time of day, the calendar looks for one of the following
combinations of fields:
HOUR_OF_DAY
AM_PM + HOUR
`HOUR_OF_DAY`
`AM_PM + HOUR`
3. **Ambiguous Wall Clock Time**
When the time offset from UTC changes, it produces an ambiguous time slot
@ -312,14 +309,14 @@ time might be ambiguous because of time zone transition.
1:30 AM standard time.
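The first two cases in the list above can be sketched as follows, using the JDK's `java.util.Calendar` as a stand-in for the ICU class (an assumption; the JDK documents the same "most recently set field wins" rule and the same epoch defaults). `DisambiguationSketch` and its methods are hypothetical names.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class DisambiguationSketch {
    private static int[] yearMonthDay(Calendar cal) {
        return new int[] {
            cal.get(Calendar.YEAR), cal.get(Calendar.MONTH), cal.get(Calendar.DAY_OF_MONTH)
        };
    }

    // Insufficient information: only year and month are set,
    // so the day falls back to its default value of 1.
    static int[] insufficient() {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(Calendar.YEAR, 1996);
        cal.set(Calendar.MONTH, Calendar.JULY);
        return yearMonthDay(cal);
    }

    // Inconsistent information: MONTH + DAY_OF_MONTH say July 15,
    // but DAY_OF_YEAR is set afterwards and therefore takes precedence.
    static int[] conflicting() {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(Calendar.YEAR, 1996);
        cal.set(Calendar.MONTH, Calendar.JULY);
        cal.set(Calendar.DAY_OF_MONTH, 15);
        cal.set(Calendar.DAY_OF_YEAR, 1);
        return yearMonthDay(cal);
    }

    public static void main(String[] args) {
        int[] a = insufficient();
        int[] b = conflicting();
        System.out.println(a[0] + "-" + (a[1] + 1) + "-" + a[2]);
        System.out.println(b[0] + "-" + (b[1] + 1) + "-" + b[2]);
    }
}
```

In the second method the calendar resolves to January 1, 1996, because `DAY_OF_YEAR` was the last field set.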
***Options for Ambiguous Time Resolution***
> :point_right: **Note**: *Ambiguous wall clock time resolution behaviors can be customized by Calendar APIs setRepeatedTimeOption() and setSkippedTimeOption(). These methods are available in ICU 49 or later versions.*
> :point_right: **Note**: *Ambiguous wall clock time resolution behaviors can be customized by Calendar APIs `setRepeatedTimeOption()` and `setSkippedTimeOption()`. These methods are available in ICU 49 or later versions.*
***WEEK_OF_YEAR field***
> :point_right: **Note**: *Values calculated for the WEEK_OF_YEAR field range from 1 to 53. Week 1 for a year is the first week that contains at least getMinimalDaysInFirstWeek() days from that year. It depends on the values of getMinimalDaysInFirstWeek(), getFirstDayOfWeek(), and the day of the week of January 1. Weeks between week 1 of one year and week 1 of the following year are numbered sequentially from 2 to 52 or 53 (if needed).
For example, January 1, 1998 was a Thursday. If getFirstDayOfWeek() is MONDAY
and getMinimalDaysInFirstWeek() is 4 (these are the values reflecting ISO 8601
***`WEEK_OF_YEAR` field***
> :point_right: **Note**: *Values calculated for the `WEEK_OF_YEAR` field range from 1 to 53. Week 1 for a year is the first week that contains at least `getMinimalDaysInFirstWeek()` days from that year. It depends on the values of `getMinimalDaysInFirstWeek()`, `getFirstDayOfWeek()`, and the day of the week of January 1. Weeks between week 1 of one year and week 1 of the following year are numbered sequentially from 2 to 52 or 53 (if needed).
For example, January 1, 1998 was a Thursday. If `getFirstDayOfWeek()` is `MONDAY`
and `getMinimalDaysInFirstWeek()` is `4` (these are the values reflecting ISO 8601
and many national standards), then week 1 of 1998 starts on December 29, 1997,
and ends on January 4, 1998. However, if getFirstDayOfWeek() is SUNDAY, then
and ends on January 4, 1998. However, if `getFirstDayOfWeek()` is `SUNDAY`, then
week 1 of 1998 starts on January 4, 1998, and ends on January 10, 1998. The
first three days of 1998 are then part of week 53 of 1997.*
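
The week-numbering rule in the note above can be checked with a small standalone sketch. This is not ICU's implementation: the helper names (`daysFromCivil`, `startOfWeekOne`, `weekOfYear`) and the 0-based day-of-week numbering (0 = Sunday) are assumptions made for illustration, and the code uses only the proleptic Gregorian calendar.

```c++
#include <cstdint>

// Days since 1970-01-01 for a proleptic Gregorian date (month 1..12, day 1..31).
inline int64_t daysFromCivil(int64_t year, int64_t month, int64_t day) {
    if (month <= 2) {
        year -= 1;  // count January and February as the last months of the previous year
    }
    const int64_t era = (year >= 0 ? year : year - 399) / 400;
    const int64_t yearOfEra = year - era * 400;
    const int64_t dayOfYear = (153 * (month > 2 ? month - 3 : month + 9) + 2) / 5 + day - 1;
    const int64_t dayOfEra = yearOfEra * 365 + yearOfEra / 4 - yearOfEra / 100 + dayOfYear;
    return era * 146097 + dayOfEra - 719468;
}

// Day of week with 0 = Sunday ... 6 = Saturday (1970-01-01 was a Thursday).
inline int dayOfWeek(int64_t days) {
    return static_cast<int>(((days + 4) % 7 + 7) % 7);
}

// Day number on which week 1 of `year` starts.
inline int64_t startOfWeekOne(int64_t year, int firstDayOfWeek, int minimalDaysInFirstWeek) {
    const int64_t jan1 = daysFromCivil(year, 1, 1);
    // Number of days between the start of the week containing January 1 and January 1 itself.
    const int offset = (dayOfWeek(jan1) - firstDayOfWeek + 7) % 7;
    const int64_t weekStart = jan1 - offset;
    const int daysInNewYear = 7 - offset;
    return daysInNewYear >= minimalDaysInFirstWeek ? weekStart : weekStart + 7;
}

struct WeekOfYear {
    int64_t weekYear;  // the year that the week belongs to
    int week;          // 1-based week number within weekYear
};

inline WeekOfYear weekOfYear(int64_t year, int month, int day,
                             int firstDayOfWeek, int minimalDaysInFirstWeek) {
    const int64_t date = daysFromCivil(year, month, day);
    // The date belongs to the latest week-year whose week 1 has already started.
    int64_t weekYear = year + 1;
    while (date < startOfWeekOne(weekYear, firstDayOfWeek, minimalDaysInFirstWeek)) {
        --weekYear;
    }
    const int64_t start = startOfWeekOne(weekYear, firstDayOfWeek, minimalDaysInFirstWeek);
    WeekOfYear result = { weekYear, static_cast<int>((date - start) / 7) + 1 };
    return result;
}
```

With a first day of week of 1 (Monday) and 4 minimal days, `weekOfYear(1998, 1, 1, 1, 4)` lands in week 1 of 1998; with a first day of week of 0 (Sunday), the same date lands in week 53 of 1997, matching the example in the note.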

View File

@ -30,7 +30,7 @@ point at which the offset changes and the amount by which it changes. Thus, a
time zone is described by the following information:
* An identifying string, or ID. This consists only of invariant characters
(see the file utypes.h). It typically has the format continent / city. The
(see the file `utypes.h`). It typically has the format continent / city. The
city chosen is not the only city in which the zone applies, but rather a
representative city for the region. Some IDs consist of three or four
uppercase letters; these are legacy zone names that are aliases to standard
@ -54,7 +54,7 @@ If DST is observed, then three additional pieces of information are needed:
### System and User Time Zones
ICU supports local time zones through the classes TimeZone and SimpleTimeZone in
ICU supports local time zones through the classes `TimeZone` and `SimpleTimeZone` in
the C++ API. In the C API, time zones are designated by their ID strings.
Users can construct their own time zone objects by specifying the above
@ -65,12 +65,12 @@ GMT offset and in alphabetical order of ID.
Since this list changes one or more times a year, *this document only represents
a snapshot*. For the most current list of ICU system zones, use the method
TimeZone::getAvailableIDs().
`TimeZone::getAvailableIDs()`.
*The zones are listed in binary sort order (that is, 'A' through 'Z' come before
'a' through 'z'). This is the same order in which the zones are stored
internally, and the same order in which they are returned by
TimeZone::getAvailableIDs(). The reason for this is that ICU locates zones using
`TimeZone::getAvailableIDs()`. The reason for this is that ICU locates zones using
a binary search, and the binary search relies on this sort order.*
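
To illustrate why the sort order matters, here is a minimal sketch that uses only the C++ standard library. It is not ICU's lookup code; the function names `sortedZoneIds` and `containsZoneId` are hypothetical, and the IDs are only sample inputs.

```c++
#include <algorithm>
#include <string>
#include <vector>

// Sorts IDs in binary order. std::string's operator< compares character
// values, so for ASCII IDs 'A'..'Z' sort before 'a'..'z'.
inline std::vector<std::string> sortedZoneIds(std::vector<std::string> ids) {
    std::sort(ids.begin(), ids.end());
    return ids;
}

// Binary search is only valid when the list was sorted with the same ordering.
inline bool containsZoneId(const std::vector<std::string> &sortedIds, const std::string &id) {
    return std::binary_search(sortedIds.begin(), sortedIds.end(), id);
}
```

In this ordering `EST` comes before `Etc/GMT+1`, because `'S'` has a smaller character value than `'t'`. A case-insensitive sort would place them the other way around and break the search.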
*You might notice that zones such as Etc/GMT+1 appear to have the wrong sign for
their GMT offset. In fact, their sign is inverted since the Etc zones follow
@ -81,72 +81,76 @@ details.*
### References
The ICU system time zones are derived from the tz database (also known as the
“Olson” database) at [ftp://elsie.nci.nih.gov/pub](ftp://elsie.nci.nih.gov/pub)
. This is the data used across much of the industry, including by UNIX systems,
“Olson” database) at [ftp://elsie.nci.nih.gov/pub](ftp://elsie.nci.nih.gov/pub).
This is the data used across much of the industry, including by UNIX systems,
and is usually updated several times each year. ICU (since version 2.8) and base
Java (since Java 1.4) contain code and tz data supporting both current and
historic time zone usage.
## How ICU Represents Dates/Times
ICU represents dates and times using UDates. A UDate is a scalar value that
ICU represents dates and times using `UDate`s. A `UDate` is a scalar value that
indicates a specific point in time, independent of calendar system and local
time zone. It is stored as the number of milliseconds from a reference point
known as the epoch. The epoch is midnight Universal Time Coordinated (UTC)
January 1, 1970 A.D. Negative UDate values indicate times before the epoch.
January 1, 1970 A.D. Negative `UDate` values indicate times before the epoch.
*These classes have the same architecture as the Java classes.*
Most people only need to use the DateFormat classes for parsing and formatting
Most people only need to use the `DateFormat` classes for parsing and formatting
dates and times. However, for those who need to convert dates and times or
perform numeric calculations, the services described in this section can be very
useful.
To translate a UDate to a useful form, a calendar system and local time zone
must be specified. These are specified in the form of objects of the Calendar
and TimeZone classes. Once these two objects are specified, they can be used to
convert the UDate to and from its corresponding calendar fields. The different
fields are defined in the Calendar class and include the year, month, day, hour,
To translate a `UDate` to a useful form, a calendar system and local time zone
must be specified. These are specified in the form of objects of the `Calendar`
and `TimeZone` classes. Once these two objects are specified, they can be used to
convert the `UDate` to and from its corresponding calendar fields. The different
fields are defined in the `Calendar` class and include the year, month, day, hour,
minute, second, and so on.
Specific Calendar objects correspond to calendar systems (such as Gregorian) and
Specific `Calendar` objects correspond to calendar systems (such as Gregorian) and
conventions (such as the first day of the week) in use in different parts of the
world. To obtain a Calendar object for France, for example, call
Calendar::createInstance(Locale::getFrance(), status).
world. To obtain a `Calendar` object for France, for example, call
`Calendar::createInstance(Locale::getFrance(), status)`.
The TimeZone class defines the conversion between universal coordinated time
(UTC),, and local time, according to real-world rules. Different TimeZone
The `TimeZone` class defines the conversion between universal coordinated time
(UTC), and local time, according to real-world rules. Different `TimeZone`
objects correspond to different real-world time zones. For example, call
TimeZone::createTimeZone("America/Los_Angeles") to obtain an object that
`TimeZone::createTimeZone("America/Los_Angeles")` to obtain an object that
implements the U.S. Pacific time zone, both Pacific Standard Time (PST) and
Pacific Daylight Time (PDT).
As previously mentioned, the Calendar and TimeZone objects must be specified
As previously mentioned, the `Calendar` and `TimeZone` objects must be specified
correctly together. One way of doing so is to create each independently, then
use the Calendar::setTimeZone() method to associate the time zone with the
calendar. Another is to use the Calendar::createInstance() method that takes a
TimeZone object. For example, call Calendar::createInstance(
TimeZone::createInstance( "America/Los_Angeles"), Locale:getUS(), status) to
obtain a Calendar appropriate for use in the U.S. Pacific time zone.
use the `Calendar::setTimeZone()` method to associate the time zone with the
calendar. Another is to use the `Calendar::createInstance()` method that takes a
`TimeZone` object. For example, call
`Calendar::createInstance(TimeZone::createTimeZone("America/Los_Angeles"), Locale::getUS(), status)` to
obtain a `Calendar` appropriate for use in the U.S. Pacific time zone.
ICU has four classes pertaining to calendars and timezones:
* [Calendar](calendar/index.md)
Calendar is an abstract base class that represents a calendar system.
Calendar objects map UDate values to and from the individual fields used in
a particular calendar system. Calendar also performs field computations such
* [`Calendar`](calendar/index.md)
`Calendar` is an abstract base class that represents a calendar system.
`Calendar` objects map `UDate` values to and from the individual fields used in
a particular calendar system. `Calendar` also performs field computations such
as advancing a date by two months.
* [Gregorian Calendar](calendar/index.md) (§)
GregorianCalendar is a concrete subclass of Calendar that implements the
* [`GregorianCalendar`](calendar/index.md)
`GregorianCalendar` is a concrete subclass of `Calendar` that implements the
rules of the Julian calendar and the Gregorian calendar, which is the common
calendar in use internationally today.
* [TimeZone](timezone/index.md)
TimeZone is an abstract base class that represents a time zone. TimeZone
* [`TimeZone`](timezone/index.md)
`TimeZone` is an abstract base class that represents a time zone. `TimeZone`
objects map between universal coordinated time (UTC) and local time.
* [SimpleTimeZone](timezone/index.md) (§)
SimpleTimeZone is a concrete subclass of TimeZone that implements standard
* [`SimpleTimeZone`](timezone/index.md)
`SimpleTimeZone` is a concrete subclass of `TimeZone` that implements standard
time and daylight saving time according to real-world rules. Individual
SimpleTimeZone objects correspond to real-world time zones.
`SimpleTimeZone` objects correspond to real-world time zones.

View File

@ -36,23 +36,23 @@ Binary datetimes differ in a number of ways: the data type, the unit, and the
epoch (origin). We'll refer to these as time scales. For example: (Sorted by
epoch and unit, descending. In Java, `int64_t`=`long` and `int32_t`=`int`.)
| Source | Data Type | Epoch | Unit |
| ------------------------------------------ | -------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------- |
| MacOS X (CFDate/NSDate) | double (1.0=1s but fractional seconds are used as well; imprecise for 0.1s etc.) | 2001-Jan-01 | seconds (and fractions thereof) |
| Unix time_t | int32_t or int64_t (signed int32_t limited to 1970..2038) | 1970-Jan-01 | seconds |
| Java Date | int64_t | 1970-Jan-01 | milliseconds |
| Joda DateTime | int64_t | 1970-Jan-01 | milliseconds |
| ICU4C UDate | double (does not use fractional milliseconds) | 1970-Jan-01 | milliseconds |
| JavaScript Date | double (does not use fractional milliseconds; JavaScript Number stores a double) | 1970-Jan-01 | milliseconds |
| Unix struct timeval (as in gettimeofday) | struct: time_t (seconds); suseconds_t (microseconds) | 1970-Jan-01 | microseconds |
| Gnome g_get_real_time() | gint64 | 1970-Jan-01 | microseconds |
| Unix struct timespec (as in clock_gettime) | struct: time_t (seconds); long (nanoseconds) | 1970-Jan-01 | nanoseconds |
| MacOS (old) | uint32_t (1904..2040) | 1904-Jan-01 | seconds |
| Excel | ? | 1899-Dec-31 | days |
| DB2 | ? | 1899-Dec-31 | days |
| Windows FILETIME | int64_t | 1601-Jan-01 | ticks (100 nanoseconds; finest granularity in industry) |
| .NET DateTime | uint62 (only 0001-9999; only 62 bits; also 2-bit field for UTC/local) | 0001-Jan-01 | ticks (100 nanoseconds; finest granularity in industry) |
| ICU Universal Time Scale | int64_t | 0001-Jan-01 | same as .Net but allows 29000BC..29000AD |
| Source | Data Type | Epoch | Unit |
| ---------------------------------------------- | ---------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------- |
| MacOS X (`CFDate/NSDate`) | `double` (1.0=1s but fractional seconds are used as well; imprecise for 0.1s etc.) | 2001-Jan-01 | seconds (and fractions thereof) |
| Unix `time_t` | `int32_t` or `int64_t` (`signed int32_t` limited to 1970..2038) | 1970-Jan-01 | seconds |
| Java `Date` | `int64_t` | 1970-Jan-01 | milliseconds |
| Joda `DateTime` | `int64_t` | 1970-Jan-01 | milliseconds |
| ICU4C `UDate` | `double` (does not use fractional milliseconds) | 1970-Jan-01 | milliseconds |
| JavaScript `Date` | `double` (does not use fractional milliseconds; JavaScript Number stores a double) | 1970-Jan-01 | milliseconds |
| Unix `struct timeval` (as in `gettimeofday`)   | struct: `time_t` (seconds); `suseconds_t` (microseconds)                           | 1970-Jan-01 | microseconds                                            |
| Gnome `g_get_real_time()` | `gint64` | 1970-Jan-01 | microseconds |
| Unix `struct timespec` (as in `clock_gettime`) | struct: `time_t` (seconds); `long` (nanoseconds)                                   | 1970-Jan-01 | nanoseconds                                             |
| MacOS (old) | `uint32_t` (1904..2040) | 1904-Jan-01 | seconds |
| Excel | ? | 1899-Dec-31 | days |
| DB2 | ? | 1899-Dec-31 | days |
| Windows `FILETIME` | `int64_t` | 1601-Jan-01 | ticks (100 nanoseconds; finest granularity in industry) |
| .NET `DateTime` | `uint62` (only 0001-9999; only 62 bits; also 2-bit field for UTC/local) | 0001-Jan-01 | ticks (100 nanoseconds; finest granularity in industry) |
| ICU Universal Time Scale | `int64_t` | 0001-Jan-01 | same as .Net but allows 29000BC..29000AD |
All of the epochs start at 00:00 am (the earliest possible time on the day in
question), and are usually assumed to be UTC.
@ -81,7 +81,7 @@ Gregorian calendar. That is, the Gregorian calendar's leap year rules are used
for all times, even before 1582 when it was introduced. (This is different from
the default ICU calendar which switches from the Julian to the Gregorian
calendar in 1582. See `GregorianCalendar::setGregorianChange()` and
`ucal_setGregorianChange()`.)
`ucal_setGregorianChange()`).
ICU provides conversion functions to and from all other major time scales,
allowing datetimes in any time scale to be converted to the universal time
@ -91,12 +91,12 @@ scale, safely manipulated, and converted back to any other datetime time scale.
So how did we decide what to use for the universal time scale? Java time has
plenty of range, but cannot represent a .NET `System.DateTime` value without
severe loss of precision. ICU4C time addresses this by using a double that is
severe loss of precision. ICU4C time addresses this by using a `double` that is
otherwise equivalent to the Java time. However, there are disadvantages with
doubles. They provide for much more graceful degradation in arithmetic
operations. But they only have 53 bits of accuracy, which means that they will
lose precision when converting back and forth to ticks. What would really be
nice would be a long double (80 bits -- 64 bit mantissa), but that is not
nice would be a `long double` (80 bits -- 64 bit mantissa), but that is not
supported on most systems.
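
The precision argument can be made concrete with a short sketch. It does not use ICU's conversion functions (those are declared in `utmscale.h`); the names `kUnixEpochInTicks`, `javaMillisToTicks`, `ticksToJavaMillis` and `survivesDouble` are assumptions for illustration, and overflow near the ends of the `int64_t` range is not handled.

```c++
#include <cstdint>

// Ticks (100 ns units) between 0001-01-01 and 1970-01-01 in the proleptic
// Gregorian calendar: 719162 days * 86400 s * 10^7 ticks per second.
const int64_t kUnixEpochInTicks = 621355968000000000LL;
const int64_t kTicksPerMillisecond = 10000;

// Java-style milliseconds since 1970 to ticks since 0001 (no overflow check).
inline int64_t javaMillisToTicks(int64_t millis) {
    return millis * kTicksPerMillisecond + kUnixEpochInTicks;
}

// Ticks since 0001 to Java-style milliseconds; sub-millisecond ticks are truncated.
inline int64_t ticksToJavaMillis(int64_t ticks) {
    return (ticks - kUnixEpochInTicks) / kTicksPerMillisecond;
}

// True if a tick count can pass through a double without changing.
// A double has a 53-bit significand, so not every 64-bit tick count fits.
inline bool survivesDouble(int64_t ticks) {
    const double asDouble = static_cast<double>(ticks);
    return static_cast<int64_t>(asDouble) == ticks;
}
```

For tick counts around the Unix epoch, adjacent `double` values are 128 ticks apart, so `survivesDouble(kUnixEpochInTicks + 1)` is false while the 64-bit integer representation keeps every tick.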
The Unix extended time uses a structure with two components: time in seconds and
@ -109,7 +109,7 @@ have a fixed size.
Because of these issues, we concluded that the .NET `System.DateTime` is the best
timescale to use. However, we use the full range allowed by the data type,
allowing for datetimes back to 29,000 BC and up to 29,000 AD. (`System.DateTime`
uses only 62 bits and only supports dates from 0001 AD to 9999 AD.) This time
uses only 62 bits and only supports dates from 0001 AD to 9999 AD). This time
scale is very fine grained, does not lose precision, and covers a range that
will meet almost all requirements. It will not handle the range that Java times
do, but frankly, being able to handle dates before 29,000 BC or after 29,000 AD
@ -123,31 +123,31 @@ information about a particular timescale. In all of these routines, the
timescales are referenced using an integer constant, according to the following
table:
| Source | ICU4C | ICU4J |
| ---------------------- | --------------------------- | ---------------------- |
| Java | UDTS_JAVA_TIME | JAVA_TIME |
| Unix | UDTS_UNIX_TIME | UNIX_TIME |
| ICU4C | UDTS_ICU4C_TIME | ICU4C_TIME |
| Windows FILETIME | UDTS_WINDOWS_FILE_TIME | WINDOWS_FILE_TIME |
| .NET DateTime | UDTS_DOTNET_DATE_TIME | DOTNET_DATE_TIME |
| Macintosh (old) | UDTS_MAC_OLD_TIME | MAC_OLD_TIME |
| Macintosh | UDTS_MAC_TIME | MAC_TIME |
| Excel | UDTS_EXCEL_TIME | EXCEL_TIME |
| DB2 | UDTS_DB2_TIME | DB2_TIME |
| Unix with microseconds | UDTS_UNIX_MICROSECONDS_TIME | UNIX_MICROSECONDS_TIME |
| Source | ICU4C | ICU4J |
| ---------------------- | ----------------------------- | ------------------------ |
| Java | `UDTS_JAVA_TIME` | `JAVA_TIME` |
| Unix | `UDTS_UNIX_TIME` | `UNIX_TIME` |
| ICU4C | `UDTS_ICU4C_TIME` | `ICU4C_TIME` |
| Windows FILETIME | `UDTS_WINDOWS_FILE_TIME` | `WINDOWS_FILE_TIME` |
| .NET DateTime | `UDTS_DOTNET_DATE_TIME` | `DOTNET_DATE_TIME` |
| Macintosh (old) | `UDTS_MAC_OLD_TIME` | `MAC_OLD_TIME` |
| Macintosh | `UDTS_MAC_TIME` | `MAC_TIME` |
| Excel | `UDTS_EXCEL_TIME` | `EXCEL_TIME` |
| DB2 | `UDTS_DB2_TIME` | `DB2_TIME` |
| Unix with microseconds | `UDTS_UNIX_MICROSECONDS_TIME` | `UNIX_MICROSECONDS_TIME` |
The routine that gets a particular piece of information about a timescale takes
an integer constant that identifies the particular piece of information,
according to the following table:
| Value | ICU4C | ICU4J |
| -------------------- | ----------------------- | ------------------ |
| Precision | UTSV_UNITS_VALUE | UNITS_VALUE |
| Epoch offet | UTSV_EPOCH_OFFSET_VALUE | EPOCH_OFFSET_VALUE |
| Minimum "from" value | UTSV_FROM_MIN_VALUE | FROM_MIN_VALUE |
| Maximum "from" value | UTSV_FROM_MAX_VALUE | FROM_MAX_VALUE |
| Minimum "to" value | UTSV_TO_MIN_VALUE | TO_MIN_VALUE |
| Maximum "to" value | UTSV_TO_MAX_VALUE | TO_MAX_VALUE |
| Value | ICU4C | ICU4J |
| -------------------- | -------------------------- | -------------------- |
| Precision | `UTSV_UNITS_VALUE` | `UNITS_VALUE` |
| Epoch offset | `UTSV_EPOCH_OFFSET_VALUE` | `EPOCH_OFFSET_VALUE` |
| Minimum "from" value | `UTSV_FROM_MIN_VALUE` | `FROM_MIN_VALUE` |
| Maximum "from" value | `UTSV_FROM_MAX_VALUE` | `FROM_MAX_VALUE` |
| Minimum "to" value | `UTSV_TO_MIN_VALUE` | `TO_MIN_VALUE` |
| Maximum "to" value | `UTSV_TO_MAX_VALUE` | `TO_MAX_VALUE` |
Here is what the values mean:

View File

@ -30,13 +30,13 @@ the introduction of namespace in C++.
The JDK internationalization components and ICU components both share the same
common basic architectures with regard to the following:
1. locales
2. data-driven services
3. ICU threading models and the open and close model
4. cloning customization
5. error handling
6. extensibility
7. resource bundle inheritance model
1. [Locales](#locales)
2. [Data-driven services](#data-driven-services)
3. [ICU threading models and the open and close model](#icu-threading-model-and-open-and-close-model)
4. [Cloning customization](#cloning-customization)
5. [Error handling](#error-handling)
6. [Extensibility](#extensibility)
7. [Resource bundle inheritance model](#resource-bundle-inheritance-model)
There are design features in ICU4C that are not in the Java Development Kit
(JDK) due
@ -46,10 +46,10 @@ to programming language restrictions. These features include the following:
Locale IDs are composed of language, country, and variant information. The
following links provide additional useful information regarding ISO standards:
[ISO-639](http://lcweb.loc.gov/standards/iso639-2/englangn.html) , and an ISO
[ISO-639](http://lcweb.loc.gov/standards/iso639-2/englangn.html), and an ISO
Country Code,
[ISO-3166](http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html)
. For example, Italian, Italy, and Euro are designated as: it_IT_EURO.
[ISO-3166](http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html).
For example, Italian, Italy, and Euro are designated as: it_IT_EURO.
### Data-driven Services
@ -150,7 +150,7 @@ Thus, the normal mode of operation is to:
> :point_right: **Note**: These service instances may be closed in any sequence.
The preceding steps are given as an example.
#### Cloning Customization
### Cloning Customization
Typically, the services supplied with ICU cover the vast majority of usages.
However, there are circumstances where the service needs to be customized for a
@ -206,14 +206,14 @@ service objects, organized according to locale. Then, if a particular locale's
formatter is in high demand, that formatter can be used, and then returned to
the pool.
### ICU Memory Usage
#### ICU Memory Usage
ICU4C APIs are designed to allow separate heaps for its libraries vs. the
application. This is achieved by providing functions to allocate and release
objects owned by ICU4C using only ICU4C library functions. For more details see
the Memory Usage section in the [Coding Guidelines](dev/codingguidelines.md).
### ICU Initialization and Termination
#### ICU Initialization and Termination
The ICU library does not normally require any explicit initialization prior to
use. An application begins use simply by calling any ICU API in the usual way.
@ -389,21 +389,21 @@ methods available:
**Lookup chain** : Searching for a resource bundle.
1. `en_US_<some-variant>`
2. `en_US`
3. `en`
4. `<defaultLang>_<defaultCountry>`
5. `<defaultLang>`
6. `root`
en_US_<some-variant>
en_US
en
<defaultLang>_<defaultCountry>
<defaultLang>
root
**Lookup chain** : Searching for a \<key, value> pair after
`en_US_<some-variant>` has been loaded. ICU does not use the default locale in
this case.
1. `en_US_<some-variant>`
2. `en_US`
3. `en`
4. `root`
en_US_<some-variant>
en_US
en
root
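
The two lookup chains above can be sketched as follows. This is a simplified illustration, not ICU's resource bundle code: it only truncates at underscores, ignores scripts and keywords, and the function names (`appendWithParents`, `bundleLookupChain`, `keyLookupChain`) are hypothetical.

```c++
#include <string>
#include <vector>

// Appends localeId and each of its parents, dropping the last
// "_"-separated subtag each time. "root" itself is never appended here.
inline void appendWithParents(std::string localeId, std::vector<std::string> &chain) {
    while (!localeId.empty() && localeId != "root") {
        chain.push_back(localeId);
        const std::string::size_type pos = localeId.rfind('_');
        if (pos == std::string::npos) {
            break;
        }
        localeId.erase(pos);
    }
}

// Chain used when searching for a bundle: requested locale, then default locale, then root.
inline std::vector<std::string> bundleLookupChain(const std::string &requested,
                                                  const std::string &defaultLocale) {
    std::vector<std::string> chain;
    appendWithParents(requested, chain);
    appendWithParents(defaultLocale, chain);
    chain.push_back("root");
    return chain;
}

// Chain used when searching for a key after a bundle is loaded: the default locale is skipped.
inline std::vector<std::string> keyLookupChain(const std::string &loaded) {
    std::vector<std::string> chain;
    appendWithParents(loaded, chain);
    chain.push_back("root");
    return chain;
}
```

For a requested locale `en_US_POSIX` and an assumed default locale `fr_FR`, the bundle chain is `en_US_POSIX`, `en_US`, `en`, `fr_FR`, `fr`, `root`, while the key chain for the loaded bundle is `en_US_POSIX`, `en_US`, `en`, `root`.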
## Other ICU Design Principles
@ -413,7 +413,7 @@ usage.
### Version Numbers in ICU
Version changes show clients when parts of ICU change. ICU; its components (such
as Collator); each resource bundle, including all the locale data resource
as `Collator`); each resource bundle, including all the locale data resource
bundles; and individual tagged items within a resource bundle, have their own
version numbers. Version numbers numerically and lexically increase as changes
are made.
@ -443,7 +443,7 @@ The interpretation of version numbers depends on what is being described.
The first version number field contains the ICU release version number, for
example 49. Each new version might contain new features, new locale data, and
modified behavior. (See below for more information on
[ICU Binary Compatibility](###icu-binary-compatibility).)
[ICU Binary Compatibility](#icu-binary-compatibility)).
The second field is 1 for the initial release (e.g., 49.1). The second and
sometimes third fields are incremented for binary compatible maintenance
@ -475,7 +475,7 @@ published semi-formal “enhancement” releases with odd second-field numbers
Library filenames and some other internal uses already used a concatenation of
the first two fields ("48" for 4.8).
Resource Bundles and Elements
#### Resource Bundles and Elements
The data stored in resource bundles is tagged with version numbers. A resource
bundle can contain a tagged string named "Version" that declares the version
@ -483,8 +483,8 @@ number in dotted-integer format. For example,
```text
en {
Version { "1.0.3.5" }
...
Version { "1.0.3.5" }
...
}
```
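
As a rough illustration of how a dotted-integer version such as `"1.0.3.5"` can be split into fields and compared, here is a standard-library sketch. ICU has its own version type and parsing helpers in `uversion.h`; the `Version` typedef and the functions `parseVersion` and `isOlder` below are assumptions for this example only, and missing fields are treated as 0.

```c++
#include <array>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>

typedef std::array<int, 4> Version;

// Parses up to four dot-separated integer fields; missing fields stay 0.
inline Version parseVersion(const std::string &text) {
    Version version = {{0, 0, 0, 0}};
    std::istringstream in(text);
    std::string field;
    for (std::size_t i = 0; i < version.size() && std::getline(in, field, '.'); ++i) {
        version[i] = std::atoi(field.c_str());
    }
    return version;
}

// Field-by-field (lexicographic) comparison, most significant field first.
inline bool isOlder(const Version &a, const Version &b) {
    return a < b;
}
```

Comparing field by field, rather than comparing the raw strings, keeps `"49.10"` newer than `"49.2"`.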
@ -652,7 +652,7 @@ list members a chance to review upcoming changes, and to discuss them. A
proposal often changes significantly as a result of discussion. Most proposals
will eventually find consensus among list members; otherwise, the ICU-TC decides
what to do. If the addition or change of APIs would affect you, please subscribe
to the main [icu-design mailing list](http://icu-project.org/contacts.html) .
to the main [icu-design mailing list](http://icu-project.org/contacts.html).
When a **new API** is added to ICU, it **is marked as draft with a `@draft ICU x.y`
label in the API documentation**, where x.y is the ICU version when the

View File

@ -113,7 +113,7 @@ If the API function is non-const, then it should have a `UErrorCode` parameter.
Default C++ assignment operators and copy constructors should not be used (they
should be declared private and not implemented). Instead, define an `assign(Class
&other, UErrorCode &errorCode)` function. Normal constructors are fine, and
should have a UErrorCode parameter.
should have a `UErrorCode` parameter.
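
A minimal sketch of this pattern, under stated assumptions: `SketchErrorCode` is a stand-in for `UErrorCode` so that the example compiles without ICU headers, and `SmallIntList` is a hypothetical class, not an ICU API.

```c++
#include <cstdint>

// Stand-in for UErrorCode in this sketch; not an ICU type.
enum SketchErrorCode {
    SKETCH_ZERO_ERROR = 0,
    SKETCH_BUFFER_OVERFLOW_ERROR = 1
};

class SmallIntList {
public:
    SmallIntList() : fLength(0) {}

    void add(int32_t value, SketchErrorCode &errorCode) {
        if (errorCode != SKETCH_ZERO_ERROR) {
            return;  // an earlier failure is already pending: do nothing
        }
        if (fLength == kCapacity) {
            errorCode = SKETCH_BUFFER_OVERFLOW_ERROR;
            return;
        }
        fValues[fLength++] = value;
    }

    // Explicit replacement for operator=; it can report failure through errorCode.
    SmallIntList &assign(const SmallIntList &other, SketchErrorCode &errorCode) {
        if (errorCode != SKETCH_ZERO_ERROR) {
            return *this;
        }
        if (this != &other) {
            for (int32_t i = 0; i < other.fLength; ++i) {
                fValues[i] = other.fValues[i];
            }
            fLength = other.fLength;
        }
        return *this;
    }

    int32_t length() const { return fLength; }
    int32_t at(int32_t index) const { return fValues[index]; }

private:
    // Declared private and not implemented, so accidental copies fail to build.
    SmallIntList(const SmallIntList &);
    SmallIntList &operator=(const SmallIntList &);

    enum { kCapacity = 4 };
    int32_t fValues[kCapacity];
    int32_t fLength;
};
```

Callers write `target.assign(source, errorCode)` and then test the error code, instead of relying on `target = source`, which has no way to signal failure.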
### Warning Codes
@ -241,7 +241,7 @@ Example in `ucnv.h`:
ucnv_open( ... ) ...
```
This cites code in icu4c/source/samples/ucnv/convsamp.cpp as follows:
This cites code in `icu4c/source/samples/ucnv/convsamp.cpp` as follows:
```c++
//! [ucnv_open]
@ -261,12 +261,12 @@ ICU C++ classes and methods as well as the ICU C methods.
In C/C++, we enclose `@draft` and such APIs with `#ifndef U_HIDE_DRAFT_API` or
similar as appropriate. When a draft API becomes stable, we need to remove the
surrounding #ifndef.
surrounding `#ifndef`.
Note: The `@system` tag is *in addition to* the
`@draft`/`@stable`/`@deprecated`/`@obsolete` status tag.
Copy/paste the appropriate #ifndef..#endif pair from the following:
Copy/paste the appropriate `#ifndef..#endif` pair from the following:
```c++
#ifndef U_HIDE_DRAFT_API
@ -289,7 +289,7 @@ We `#ifndef` `@draft`/`@deprecated`/... APIs as much as possible, including C
functions, many C++ class methods (see exceptions below), enum constants (see
exceptions below), whole enums, whole classes, etc.
We do not #ifndef APIs where that would be problematic:
We do not `#ifndef` APIs where that would be problematic:
* struct/class members where that would modify the object layout (non-static
struct/class fields, virtual methods)
@ -478,19 +478,19 @@ Use Java doc-style in-file documentation created with
#### Multiple Statements
Place multiple statements in multiple lines. if() or loop heads must not be
Place multiple statements in multiple lines. `if()` or loop heads must not be
followed by their bodies on the same line.
#### Placements of {} Curly Braces
#### Placements of `{}` Curly Braces
Place curly braces {} in reasonable and consistent locations. Each of us
Place curly braces `{}` in reasonable and consistent locations. Each of us
subscribes to different philosophies. It is recommended to use the style of a
file, instead of mixing different styles. It is requested, however, to not have
if() and loop bodies without curly braces.
`if()` and loop bodies without curly braces.
#### if() {...} and Loop Bodies
#### `if() {...}` and Loop Bodies
Use curly braces for if() and else as well as loop bodies, etc., even if there
Use curly braces for `if()` and `else` as well as loop bodies, etc., even if there
is only one statement.
#### Function Declarations
@ -563,7 +563,7 @@ U_CDECL_END
Determine if two headers are needed. If the same functionality is provided with
both a C and a C++ API, then there can be two headers, one for each language,
even if one uses the other. For example, there can be umsg.h for C and `msgfmt.h`
even if one uses the other. For example, there can be `umsg.h` for C and `msgfmt.h`
for C++.
Not all functionality has or needs both kinds of API. More and more
@ -638,19 +638,19 @@ happened!)
When casting an enum value to an integer type, make sure that the enum value's
numeric value is within range of the integer type.
#### Do not check for this!=NULL, do not check for NULL references
#### Do not check for `this!=NULL`, do not check for `NULL` references
In public APIs, assume this!=0 and assume that references are not 0. In C code,
"this" is the "service object" pointer, such as `set` in
`uset_add(USet* set, UChar32 c)` — don't check for set!=NULL.
In public APIs, assume `this!=0` and assume that references are not 0. In C code,
`"this"` is the "service object" pointer, such as `set` in
`uset_add(USet* set, UChar32 c)` — don't check for `set!=NULL`.
We do usually check all other (non-this) pointers for NULL, in those cases when
NULL is not valid. (Many functions allow a NULL string or buffer pointer if the
We do usually check all other (non-this) pointers for `NULL`, in those cases when
`NULL` is not valid. (Many functions allow a `NULL` string or buffer pointer if the
length or capacity is 0.)
Rationale: "this" is not really an argument, and checking it costs a little bit
Rationale: `"this"` is not really an argument, and checking it costs a little bit
of code size and runtime. Other libraries also commonly do not check for valid
"this", and resulting failures are fairly obvious.
`"this"`, and resulting failures are fairly obvious.
### Memory Usage
@ -666,7 +666,7 @@ details.
Exception: Most C++ API functions that return a `StringEnumeration` (by pointer
which the caller must delete) are named `getXyz()` rather than `createXyz()`
because "get" is much more natural. (These are not factory methods in the sense
because `"get"` is much more natural. (These are not factory methods in the sense
of `NumberFormat::createScientificInstance()`.) For example,
`static StringEnumeration *Collator::getKeywords(UErrorCode &)`. We should document
clearly in the API comments that the caller must delete the returned
@ -870,7 +870,7 @@ U_NAMESPACE_USE
Locale loc("fi");
```
U_NAMESPACE_USE (expands to using namespace icu_M_N; when available) is
`U_NAMESPACE_USE` (expands to `using namespace icu_M_N;` when available) is
automatically done when `utypes.h` is included, so that all ICU classes are
immediately usable. However, we recommend that you turn this off via
`CXXFLAGS="-DU_USING_ICU_NAMESPACE=0"`.
@ -954,7 +954,7 @@ implementation.
subclass as well (copy implementations from existing C++ APIs).
* If a class is a new, immediate subclass of `UObject` (e.g.,
`Normalizer2`), creating a whole new class hierarchy, then declare a
*private* `getDynamicClassID()` and define it to return NULL (to
*private* `getDynamicClassID()` and define it to return `NULL` (to
override the pure virtual version in `UObject`); copy the relevant lines
from `normalizer2.h` and `normalizer2.cpp`
(`UOBJECT_DEFINE_NO_RTTI_IMPLEMENTATION(className)`). Do not add any
@ -1012,7 +1012,7 @@ adopt-on-success):
must be set to `U_MEMORY_ALLOCATION_ERROR`.)
**Pitfall**: If you allocate/construct via "`ClassName *p = new ClassName(adoptee);`"
and the memory allocation failed (p==NULL), then the
and the memory allocation failed (`p==NULL`), then the
constructor has not been called, the adoptee has not been adopted, and you
are still responsible for deleting it!
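
The pitfall can be sketched as follows. The classes `Adoptee` and `Owner` and the function `createOwner` are hypothetical. The class-specific `operator new` is declared `noexcept` and returns `nullptr` on demand, which is an assumption made here to imitate an allocator that reports failure with a null pointer instead of throwing.

```c++
#include <cstddef>
#include <cstdlib>

struct Adoptee {
    static int liveCount;  // number of Adoptee objects currently alive
    Adoptee() { ++liveCount; }
    ~Adoptee() { --liveCount; }
};
int Adoptee::liveCount = 0;

class Owner {
public:
    // Adopts the adoptee: the Owner deletes it in its destructor.
    explicit Owner(Adoptee *adoptee) : fAdoptee(adoptee) {}
    ~Owner() { delete fAdoptee; }

    // Non-throwing allocation that can be made to fail for demonstration.
    static void *operator new(std::size_t size) noexcept {
        return failNextAllocation ? nullptr : std::malloc(size);
    }
    static void operator delete(void *p) noexcept { std::free(p); }

    static bool failNextAllocation;

private:
    Owner(const Owner &) = delete;
    Owner &operator=(const Owner &) = delete;
    Adoptee *fAdoptee;
};
bool Owner::failNextAllocation = false;

// Takes ownership of adoptee in every case, including allocation failure.
Owner *createOwner(Adoptee *adoptee) {
    Owner *p = new Owner(adoptee);
    if (p == nullptr) {
        // The constructor never ran, so nothing adopted the adoptee.
        delete adoptee;
    }
    return p;
}
```

Without the `delete adoptee` in the failure branch, the adoptee would leak whenever the allocation of the owning object fails.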
@ -1183,7 +1183,7 @@ constructor relies on loading data), then either it must use and set a
like `UnicodeString` and `UnicodeSet`, and the constructor needs to set the object
to bogus if it fails.
#### UVector, UVector32, or UVector64
#### `UVector`, `UVector32`, or `UVector64`
Use `UVector` to store arrays of `void *`; use `UVector32` to store arrays of
`int32_t`; use `UVector64` to store arrays of `int64_t`. Historically, `UVector`
@ -1438,8 +1438,8 @@ ICU Java classes and methods.
The standard order for modifier keywords on APIs is:
* public static final synchronized strictfp
* public abstract
* `public static final synchronized strictfp`
* `public abstract`
Do not use wild card import, such as "`import java.util.*`". The sort order of
import statements is `java` / `javax` / `org` / `com`. Within each top level package
@ -1493,7 +1493,7 @@ Avoid creating new interfaces unless you know you need to mix the interface into
two or more classes that have separate inheritance. Interfaces are impossible to
modify later in a backwards-compatible way. Abstract classes, on the other hand,
can add new methods with default behavior. Use interfaces only if it is required
by the arcitecture, not just for expediency.
by the architecture, not just for expediency.
Current releases of ICU4J (since ICU 63) are restricted to use Java SE 7 APIs
and language features.
@ -1842,7 +1842,7 @@ Type just `intltest -h` to see the usage:
The "Fake Time" capability allows ICU4C to be tested as if the hardware clock is
set to a specific time. This section documents how to use this facility.
Note that this facility requires the POSIX 'gettimeofday' function to be
Note that this facility requires the POSIX `gettimeofday` function to be
operable.
This facility affects all ICU 'current time' calculations, including date,
@ -2013,10 +2013,10 @@ format).
* Example: **`CINTLTST_OPTS=/tscoll`** in the cintltst directory provides
arguments to the cintltst test upon make check, to only run collation
tests.
* intltest: INTLTEST_OPTS
* cintltst: CINTLTST_OPTS
* iotest: IOTEST_OPTS
* icuinfo: ICUINFO_OPTS
* intltest: `INTLTEST_OPTS`
* cintltst: `CINTLTST_OPTS`
* iotest: `IOTEST_OPTS`
* icuinfo: `ICUINFO_OPTS`
* (letest does not have an OPTS variable as of ICU 4.6.)
### Windows/Microsoft Visual Studio

View File

@ -41,37 +41,37 @@ synchronization primitives.
**Low Level Atomics**, a set of platform or compiler dependent typedefs and
inlines. Provided in the internal header file
[umutex.h](../../../../icu4c/source/common/umutex.h).
[`umutex.h`](../../../../icu4c/source/common/umutex.h).
| Type/Function | Description |
|-------------------------------------------------------|-----------------------------------------------------------------------------|
| typedef u_atomic_int32_t | A 32 bit integer that will work with low level atomic operations. (typedef) |
| umtx_loadAcquire(u_atomic_int32_t &var) | |
| umtx_storeRelease(u_atomic_int32_t &var, int32_t val) | |
| umtx_atomic_inc(u_atomic_int32_t &var) | |
| umtx_atomic_dec(u_atomic_int32_t &var) | |
| Type/Function | Description |
|---------------------------------------------------------|-------------------------------------------------------------------------------|
| `typedef u_atomic_int32_t` | A 32 bit integer that will work with low level atomic operations. (`typedef`) |
| `umtx_loadAcquire(u_atomic_int32_t &var)` | |
| `umtx_storeRelease(u_atomic_int32_t &var, int32_t val)` | |
| `umtx_atomic_inc(u_atomic_int32_t &var)` | |
| `umtx_atomic_dec(u_atomic_int32_t &var)` | |
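As a rough sketch, the low level atomics in the table could be supplied on top of C++11 `std::atomic`. The parameter lists follow the table above; the return types, and the choice to return the updated value from the increment and decrement functions, are assumptions made for this illustration.

```cpp
#include <atomic>
#include <cstdint>

// A 32 bit integer usable with the atomic operations below.
typedef std::atomic<std::int32_t> u_atomic_int32_t;

// Load with acquire ordering: later reads cannot move before this load.
inline std::int32_t umtx_loadAcquire(u_atomic_int32_t &var) {
    return var.load(std::memory_order_acquire);
}

// Store with release ordering: earlier writes are visible before this store.
inline void umtx_storeRelease(u_atomic_int32_t &var, std::int32_t val) {
    var.store(val, std::memory_order_release);
}

// Atomically increment; returns the new value (assumed convention).
inline std::int32_t umtx_atomic_inc(u_atomic_int32_t &var) {
    return var.fetch_add(1) + 1;
}

// Atomically decrement; returns the new value (assumed convention).
inline std::int32_t umtx_atomic_dec(u_atomic_int32_t &var) {
    return var.fetch_sub(1) - 1;
}
```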
**Mutexes**. Type declarations for ICU mutex wrappers. Provided in a header file.
| Type | Description |
|---------------------|---------------------------------------------------------------------------------------------------|
| struct UMutex | An ICU mutex. All instances will be static. Typically just contains an underlying platform mutex. |
| U_MUTEX_INITIALIZER | A C style initializer for a static instance of a UMutex. |
| Type | Description |
|-----------------------|---------------------------------------------------------------------------------------------------|
| `struct UMutex` | An ICU mutex. All instances will be static. Typically just contains an underlying platform mutex. |
| `U_MUTEX_INITIALIZER` | A C style initializer for a static instance of a `UMutex`. |
**Mutex and InitOnce implementations**. Out-of-line platform-specific code.
Provided in a .cpp file.
| Function | Description |
|---------------------------------------|------------------------------------------|
| umtx_lock(UMutex *mutex) | Lock a mutex. |
| umtx_unlock(UMutex* mutex) | Unlock a mutex. |
| umtx_initImplPreInit(UInitOnce &uio) | umtx_initOnce() implementation function. |
| umtx_initImplPostInit(UInitOnce &uio) | umtx_initOnce() implementation function. |
| Function | Description |
|-----------------------------------------|--------------------------------------------|
| `umtx_lock(UMutex *mutex)` | Lock a mutex. |
| `umtx_unlock(UMutex* mutex)` | Unlock a mutex. |
| `umtx_initImplPreInit(UInitOnce &uio)` | `umtx_initOnce()` implementation function. |
| `umtx_initImplPostInit(UInitOnce &uio)` | `umtx_initOnce()` implementation function. |
`UInitOnce` and `umtx_initOnce()` are used internally by ICU for thread-safe
one-time initialization. Their implementation is split into a
platform-independent part (contained in
[umutex.h](../../../../icu4c/source/common/umutex.h)),
[`umutex.h`](../../../../icu4c/source/common/umutex.h)),
and the pair of platform-dependent implementation functions listed above.
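The general idea of thread-safe one-time initialization can be sketched with double-checked locking: an acquire load on the fast path, and a mutex on the slow path. This is an illustration only. `InitOnce`, `gInitMutex` and `initOnce` are hypothetical names, and the sketch simply holds the mutex while the initializer runs instead of splitting the work into separate pre-init and post-init steps.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical analogue of a one-time-initialization flag.
struct InitOnce {
    std::atomic<int> state{0};   // 0 = not initialized, 2 = initialization complete
};

// One global mutex shared by all InitOnce instances (an assumption of this sketch).
static std::mutex gInitMutex;

template <typename F>
void initOnce(InitOnce &once, F initFunction) {
    // Fast path: already initialized, no lock needed.
    if (once.state.load(std::memory_order_acquire) == 2) {
        return;
    }
    // Slow path: serialize competing threads and re-check under the lock.
    std::lock_guard<std::mutex> lock(gInitMutex);
    if (once.state.load(std::memory_order_relaxed) != 2) {
        initFunction();
        // Release store publishes everything initFunction() wrote.
        once.state.store(2, std::memory_order_release);
    }
}
```

Every caller goes through `initOnce()` before touching the shared object; only the first call runs the initializer, and later calls return after a single atomic load.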
**Build Setup**
@ -81,11 +81,11 @@ in the ICU build. If defined, the files are included at the top of the normal
platform `#ifdef` chains in the ICU sources, and effectively define a new
platform.
| Macro | Description |
|------------------|-------------------------------------------------------|
| U_USER_ATOMICS_H | Set to the name of the low level atomics header file. |
| U_USER_MUTEX_H | Mutexes header file. |
| U_USER_MUTEX_CPP | Mutexes and InitOnce implementation file. |
| Macro | Description |
|--------------------|---------------------------------------------------------|
| `U_USER_ATOMICS_H` | Set to the name of the low level atomics header file. |
| `U_USER_MUTEX_H` | Mutexes header file. |
| `U_USER_MUTEX_CPP` | Mutexes and `InitOnce` implementation file. |
It is possible (and reasonable) to supply only the two mutex files, while
retaining the ICU default implementation for the low level atomics.
@ -97,7 +97,7 @@ Example ICU configure with user mutexes specified:
**Stability**
This interface may change between ICU releases. The required set of functions
may be be extended, or details of the behavior required may be altered.
may be extended, or details of the behavior required may be altered.
The types and functions defined by this interface reach deeply into the ICU
implementation, and we need to retain the ability to make changes should the
@ -111,7 +111,7 @@ This implementation uses C++11 language mutexes and atomics. These make for a
convenient reference implementation because the C++11 constructs are well
defined and straightforward to use.
Similar implementations for POSIX and WIndows can be found in files
Similar implementations for POSIX and Windows can be found in files
`common/umutex.h` and `common/umutex.cpp`, in the platform `#ifdef` chains; these are
part of the standard ICU distribution.

View File

@ -59,28 +59,28 @@ For examples of use, search the ICU library code.
**Low Level Atomics**
| Type/Function | Description |
|----------------------------------------|-----------------------------------------------------------------|
| typedef u_atomic_int32_t | A 32 bit integer type for use with low level atomic operations. |
| umtx_atomic_inc(u_atomic_int32_t &var) | |
| umtx_atomic_dec(u_atomic_int32_t &var) | |
| Type/Function | Description |
|------------------------------------------|-----------------------------------------------------------------|
| `typedef u_atomic_int32_t` | A 32 bit integer type for use with low level atomic operations. |
| `umtx_atomic_inc(u_atomic_int32_t &var)` | |
| `umtx_atomic_dec(u_atomic_int32_t &var)` | |
**Mutexes**
| Type/Function | Description |
|----------------------------|--------------------------------------------------------------------|
| struct UMutex | An ICU mutex. All instances must be static. |
| U_MUTEX_INITIALIZER | A C style initializer for a UMutex. |
| umtx_lock(UMutex *mutex) | Lock a mutex. |
| umtx_unlock(UMutex* mutex) | Unlock a mutex. |
| class Mutex | C++ Mutex wrapper withautomatic lock & unlock. See header mutex.h. |
| Type/Function | Description |
|------------------------------|-----------------------------------------------------------------------|
| `struct UMutex` | An ICU mutex. All instances must be `static`. |
| `U_MUTEX_INITIALIZER` | A C style initializer for a `UMutex`. |
| `umtx_lock(UMutex *mutex)` | Lock a mutex. |
| `umtx_unlock(UMutex* mutex)` | Unlock a mutex. |
| `class Mutex` | C++ Mutex wrapper with automatic lock & unlock. See header `mutex.h`. |
**One Time Initialization**
| Type/Function | Description |
|-------------------------------|-----------------------------------------------------------------------------------------|
| struct UInitOnce | Provides an efficient facility for one-time initialization of static or global objects. |
| umtx_initOnce(UInitOnce, ...) | A family of initialization functions. |
| Type/Function | Description |
|---------------------------------|-----------------------------------------------------------------------------------------|
| `struct UInitOnce` | Provides an efficient facility for one-time initialization of static or global objects. |
| `umtx_initOnce(UInitOnce, ...)` | A family of initialization functions. |
All of these functions are for internal ICU implementation use only. They are
not exported, and not intended for external use.

View File

@ -30,7 +30,7 @@ Converting from the internal representation (milliseconds since midnight,
January 1, 1970) to text is known as "formatting," and converting from text to
milliseconds is known as "parsing." These processes involve two mappings:
* A mapping between a point in time (UDate) and a set of calendar fields,
* A mapping between a point in time (`UDate`) and a set of calendar fields,
which in turn depends on:
* The rules of a particular calendar system (e.g. Gregorian, Buddhist,
Chinese Lunar)
@ -39,17 +39,17 @@ milliseconds is known as "parsing." These processes involve two mappings:
representation, which depends on the fields selected for display, their
display style, and the conventions of a particular locale.
## DateFormat
## `DateFormat`
DateFormat helps format and parse dates for any locale. Your code can be
`DateFormat` helps format and parse dates for any locale. Your code can be
completely independent of the locale conventions for months, days of the week,
or calendar format.
### Formatting Dates
The DateFormat interface in ICU enables you to format a Date in milliseconds
The `DateFormat` interface in ICU enables you to format a `Date` in milliseconds
into a string representation of the date. It also parses the string back to the
internal Date representation in milliseconds.
internal `Date` representation in milliseconds.
```cpp
DateFormat* df = DateFormat::createDateInstance();
@ -61,7 +61,7 @@ for (int32_t i = 0; i < 3; ++i) {
}
```
To format a date for a different Locale, specify it in the call to:
To format a date for a different `Locale`, specify it in the call to:
```cpp
DateFormat* df = DateFormat::createDateInstance
@ -70,7 +70,7 @@ DateFormat* df = DateFormat::createDateInstance
### Parsing Dates
Use a DateFormat to parse also:
You can also use a `DateFormat` to parse:
```cpp
UErrorCode status = U_ZERO_ERROR;
@ -79,7 +79,7 @@ UDate myDate = df->parse(myString, status);
When numeric fields abut one another directly, with no intervening delimiter
characters, they constitute a run of abutting numeric fields. Such runs are
parsed specially. For example, the format "HHmmss" parses the input text
parsed specially. For example, the format `"HHmmss"` parses the input text
"123456" to 12:34:56, parses the input text "12345" to 1:23:45, and fails to
parse "1234". In other words, the leftmost field of the run is flexible, while
the others keep a fixed width. If the parse fails anywhere in the run, then the
@ -90,81 +90,80 @@ the run fails.
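The handling of a run of abutting numeric fields can be sketched in standard C++ as follows. This is a simplified model of the behavior described above, not the ICU parser: `parseAbuttingRun` is a hypothetical helper, fields are plain decimal digits, and no range checking of the values is done. It tries the leftmost field at its full width and shortens it by one digit on each retry, while the remaining fields keep their fixed widths.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Parse a run of abutting numeric fields. widths[i] is the pattern width of
// field i (for "HHmmss": 2, 2, 2). On success, values holds one number per
// field and the function returns true.
bool parseAbuttingRun(const std::string &text,
                      const std::vector<std::size_t> &widths,
                      std::vector<int> &values) {
    if (widths.empty()) {
        return false;
    }
    // Only the leftmost field is flexible: try it at full width, then shorter.
    for (std::size_t first = widths[0]; first >= 1; --first) {
        values.clear();
        std::size_t pos = 0;
        bool ok = true;
        for (std::size_t i = 0; ok && i < widths.size(); ++i) {
            const std::size_t width = (i == 0) ? first : widths[i];
            if (pos + width > text.size()) {
                ok = false;            // not enough digits left for this field
                break;
            }
            int value = 0;
            for (std::size_t k = 0; k < width; ++k) {
                const unsigned char c = static_cast<unsigned char>(text[pos + k]);
                if (!std::isdigit(c)) {
                    ok = false;
                    break;
                }
                value = value * 10 + (c - '0');
            }
            if (ok) {
                values.push_back(value);
                pos += width;
            }
        }
        if (ok) {
            return true;
        }
    }
    values.clear();
    return false;                      // the entire run fails
}
```

With widths 2, 2, 2 this yields 12, 34, 56 for "123456" and 1, 23, 45 for "12345", and it fails for "1234".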
### Producing Normal Date Formats for a Locale
Use createDateInstance to produce the normal date format for that country. There
are other static factory methods available. Use createTimeInstance to produce
the normal time format for that country. Use createDateTimeInstance to produce a
DateFormat that formats both date and time. You can pass different options to
these factory methods to control the length of the result; from SHORT to MEDIUM
to LONG to FULL. The exact result depends on the locale, but generally:
Use `createDateInstance` to produce the normal date format for that country. There
are other static factory methods available. Use `createTimeInstance` to produce
the normal time format for that country. Use `createDateTimeInstance` to produce a
`DateFormat` that formats both date and time. You can pass different options to
these factory methods to control the length of the result; from `SHORT` to `MEDIUM`
to `LONG` to `FULL`. The exact result depends on the locale, but generally:
1. SHORT is numeric, such as 12/13/52 or 3:30pm
1. `SHORT` is numeric, such as 12/13/52 or 3:30pm
2. MEDIUM is longer, such as Jan. 12, 1952
2. `MEDIUM` is longer, such as Jan. 12, 1952
3. LONG is longer, such as January 12, 1952 or 3:30:32pm
3. `LONG` is longer, such as January 12, 1952 or 3:30:32pm
4. FULL is completely specified, such as Tuesday, April 12, 1952 AD or
4. `FULL` is completely specified, such as Tuesday, April 12, 1952 AD or
3:30:42pm PST
For more general flexibility, the [DateTimePatternGenerator](index.md) can map a
custom selection of time and date fields, along with various display styles for
those fields, to a locale-appropriate format that can then be set as the format
to use by the DateFormat.
For more general flexibility, the [`DateTimePatternGenerator`](index#datetimepatterngenerator)
can map a custom selection of time and date fields, along with various display styles for
those fields, to a locale-appropriate format that can then be set as the format to use by the `DateFormat`.
### Producing Relative Date Formats for a Locale
ICU currently provides limited support for formatting dates using a “relative”
style, specified using RELATIVE_SHORT, RELATIVE_MEDIUM, RELATIVE_LONG. or
RELATIVE_FULL. As currently implemented, relative date formatting only affects
style, specified using `RELATIVE_SHORT`, `RELATIVE_MEDIUM`, `RELATIVE_LONG` or
`RELATIVE_FULL`. As currently implemented, relative date formatting only affects
the formatting of dates within a limited range of calendar days before or after
the current date, based on the CLDR `<field type="day">`/`<relative>` data: For
example, in English, "Yesterday", "Today", and "Tomorrow". Within this range,
the specific relative style currently makes no difference. Outside of this
range, relative dates are formatted using the corresponding non-relative style
(SHORT, MEDIUM, etc.). Relative time styles are not currently supported, and
(`SHORT`, `MEDIUM`, etc.). Relative time styles are not currently supported, and
behave just like the corresponding non-relative style.
### Setting Time Zones
You can set the time zone on the format. If you want more control over the
format or parsing, cast the DateFormat you get from the factory methods to a
SimpleDateFormat. This works for the majority of countries.
format or parsing, cast the `DateFormat` you get from the factory methods to a
`SimpleDateFormat`. This works for the majority of countries.
> :point_right: **Note**: *Remember to check getDynamicClassID() before carrying out the cast.*
> :point_right: **Note**: *Remember to check `getDynamicClassID()` before carrying out the cast.*
### Working with Positions
You can also use forms of the parse and format methods with ParsePosition and
FieldPosition to enable you to:
You can also use forms of the parse and format methods with `ParsePosition` and
`FieldPosition` to enable you to:
1. Progressively parse through pieces of a string.
2. Align any particular field, or find out where it is for selection on the
screen.
## SimpleDateFormat
## `SimpleDateFormat`
SimpleDateFormat is a concrete class used for formatting and parsing dates in a
`SimpleDateFormat` is a concrete class used for formatting and parsing dates in a
language-independent manner. It allows for formatting, parsing, and
normalization. It formats or parses a date or time, which is the standard
milliseconds since 00:00 GMT, Jan. 1, 1970.
SimpleDateFormat is the only built-in implementation of DateFormat. It provides
`SimpleDateFormat` is the only built-in implementation of `DateFormat`. It provides
a programmable interface that can be used to produce formatted dates and times
in a wide variety of formats. The formats include almost all of the most common
ones.
Create a date-time formatter using the following methods rather than
constructing an instance of SimpleDateFormat. In this way, the program is
constructing an instance of `SimpleDateFormat`. In this way, the program is
guaranteed to get an appropriate formatting pattern of the locale.
1. DateFormat::getInstance()
1. `DateFormat::createInstance()`
2. getDateInstance()
2. `DateFormat::createDateInstance()`
3. getDateTimeInstance()
3. `DateFormat::createDateTimeInstance()`
If you need a more unusual pattern, construct a SimpleDateFormat directly and
If you need a more unusual pattern, construct a `SimpleDateFormat` directly and
give it an appropriate pattern.
### Date/Time Format Syntax
@ -174,11 +173,11 @@ are replaced with date and time data from a calendar when formatting or used to
generate data for a calendar when parsing.
The Date Field Symbol Table below contains the characters used in patterns to
show the appropriate formats for a given locale, such as yyyy for the year.
Characters may be used multiple times. For example, if y is used for the year,
'yy' might produce '99', whereas 'yyyy' produces '1999'. For most numerical
fields, the number of characters specifies the field width. For example, if h is
the hour, 'h' might produce '5', but 'hh' produces '05'. For some characters,
show the appropriate formats for a given locale, such as `yyyy` for the year.
Characters may be used multiple times. For example, if `y` is used for the year,
`'yy'` might produce '99', whereas `'yyyy'` produces '1999'. For most numerical
fields, the number of characters specifies the field width. For example, if `h` is
the hour, `'h'` might produce '5', but `'hh'` produces '05'. For some characters,
the count specifies whether an abbreviated or full form should be used, but may
have other choices, as given below.
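The effect of the letter count on numeric fields can be modeled with a small sketch. `formatNumericField` and `formatYear` are hypothetical helpers written for this illustration; they assume non-negative values and cover only the zero-padding rule and the two-digit year case described above.

```cpp
#include <cstddef>
#include <string>

// Zero-pad value on the left so that it has at least `count` digits.
std::string formatNumericField(int value, std::size_t count) {
    std::string digits = std::to_string(value);
    if (digits.size() < count) {
        digits.insert(0, count - digits.size(), '0');
    }
    return digits;
}

// Year is the special case: a count of 2 keeps only the two low-order digits.
std::string formatYear(int year, std::size_t count) {
    if (count == 2) {
        return formatNumericField(year % 100, 2);
    }
    return formatNumericField(year, count);
}
```

Here `formatYear(1999, 2)` gives "99" and `formatYear(1999, 4)` gives "1999", while `formatNumericField(5, 1)` gives "5" and `formatNumericField(5, 2)` gives "05".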
@ -187,14 +186,14 @@ single quotes. Text within single quotes is not interpreted in any way (except
for two adjacent single quotes). Otherwise all ASCII letters from a to z and A to
Z are reserved as syntax characters, and require quoting if they are to
represent literal characters. In addition, certain ASCII punctuation characters
may become variable in the future (eg ":" being interpreted as the time
separator and '/' as a date separator, and replaced by respective
may become variable in the future (e.g. `":"` being interpreted as the time
separator and `'/'` as a date separator, and replaced by respective
locale-sensitive characters in display).
"Stand Alone" values refer to those designed to stand on their own, as opposed
to being with other formatted values. "2nd quarter" would use the stand alone
format (QQQQ), whereas "2nd quarter 2007" would use the regular format (qqqq
yyyy).
to being with other formatted values. "2nd quarter" would use the stand-alone
format (`qqqq`), whereas "2nd quarter 2007" would use the regular format
(`QQQQ yyyy`).
The pattern characters used in the Date Field Symbol Table are defined by CLDR;
for more information see [CLDR Date Field Symbol Table](https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table).
@ -204,52 +203,52 @@ Note that the examples may not reflect current CLDR data.
| Symbol | Meaning | Pattern | Example Output |
| --- | --- | --- | --- |
| G | era designator | G, GG, or GGG<br/>GGGG<br/>GGGGG | AD<br/>Anno Domini<br/>A |
| y | year | yy<br/>y or yyyy | 96<br/>1996 |
| Y | year of "Week of Year" | Y | 1997 |
| u | extended year | u | 4601 |
| U | cyclic year name, as in Chinese lunar calendar | U | 甲子 |
| r | related Gregorian year | r | 1996 |
| Q | quarter | Q<br/>QQ<br/>QQQ<br/>QQQQ<br/>QQQQQ | 2<br/>02<br/>Q2<br/>2nd quarter<br/>2 |
| q | Stand Alone quarter | q<br/>qq<br/>qqq<br/>qqqq<br/>qqqqq | 2<br/>02<br/>Q2<br/>2nd quarter<br/>2 |
| M | month in year | M<br/>MM<br/>MMM<br/>MMMM<br/>MMMMM | 9<br/>09<br/>Sep<br/>September<br/>S |
| L | Stand Alone month in year | L<br/>LL<br/>LLL<br/>LLLL<br/>LLLLL | 9<br/>09<br/>Sep<br/>September<br/>S |
| w | week of year | w<br/>ww | 27<br/>27 |
| W | week of month | W | 2 |
| d | day in month | d<br/>dd | 2<br/>02 |
| D | day of year | D | 189 |
| F | day of week in month | F | 2 (2nd Wed in July) |
| g | modified julian day | g | 2451334 |
| E | day of week | E, EE, or EEE<br/>EEEE<br/>EEEEE<br/>EEEEEE | Tue<br/>Tuesday<br/>T<br/>Tu |
| e | local day of week<br/>example: if Monday is 1st day, Tuesday is 2nd ) | e or ee<br/>eee<br/>eeee<br/>eeeee<br/>eeeeee | 2<br/>Tue<br/>Tuesday<br/>T<br/>Tu |
| c | Stand Alone local day of week | c or cc<br/>ccc<br/>cccc<br/>ccccc<br/>cccccc | 2<br/>Tue<br/>Tuesday<br/>T<br/>Tu |
| a | am/pm marker | a | pm |
| h | hour in am/pm (1~12) | h<br/>hh | 7<br/>07 |
| H | hour in day (0~23) | H<br/>HH | 0<br/>00 |
| k | hour in day (1~24) | k<br/>kk | 24<br/>24 |
| K | hour in am/pm (0~11) | K<br/>KK | 0<br/>00 |
| m | minute in hour | m<br/>mm | 4<br/>04 |
| s | second in minute | s<br/>ss | 5<br/>05 |
| S | fractional second - truncates (like other time fields)<br/>to the count of letters when formatting. Appends<br/>zeros if more than 3 letters specified. Truncates at<br/>three significant digits when parsing. | S<br/>SS<br/>SSS<br/>SSSS | 2<br/>23<br/>235<br/>2350 |
| A | milliseconds in day | A | 61201235 |
| z | Time Zone: specific non-location | z, zz, or zzz<br/>zzzz | PDT<br/>Pacific Daylight Time |
| Z | Time Zone: ISO8601 basic hms? / RFC 822<br/>Time Zone: long localized GMT (=OOOO)<br/>TIme Zone: ISO8601 extended hms? (=XXXXX) | Z, ZZ, or ZZZ<br/>ZZZZ<br/>ZZZZZ | -0800<br/>GMT-08:00<br/>-08:00, -07:52:58, Z |
| O | Time Zone: short localized GMT<br/>Time Zone: long localized GMT (=ZZZZ) | O<br/>OOOO | GMT-8<br/>GMT-08:00 |
| v | Time Zone: generic non-location<br/>(falls back first to VVVV) | v<br/>vvvv | PT<br/>Pacific Time or Los Angeles Time |
| V | Time Zone: short time zone ID<br/>Time Zone: long time zone ID<br/>Time Zone: time zone exemplar city<br/>Time Zone: generic location (falls back to OOOO) | V<br/>VV<br/>VVV<br/>VVVV | uslax<br/>America/Los_Angeles<br/>Los Angeles<br/>Los Angeles Time |
| X | Time Zone: ISO8601 basic hm?, with Z for 0<br/>Time Zone: ISO8601 basic hm, with Z<br/>Time Zone: ISO8601 extended hm, with Z<br/>Time Zone: ISO8601 basic hms?, with Z<br/>Time Zone: ISO8601 extended hms?, with Z | X<br/>XX<br/>XXX<br/>XXXX<br/>XXXXX | -08, +0530, Z<br/>-0800, Z<br/>-08:00, Z<br/>-0800, -075258, Z<br/>-08:00, -07:52:58, Z |
| x | Time Zone: ISO8601 basic hm?, without Z for 0<br/>Time Zone: ISO8601 basic hm, without Z<br/>Time Zone: ISO8601 extended hm, without Z<br/>Time Zone: ISO8601 basic hms?, without Z<br/>Time Zone: ISO8601 extended hms?, without Z | x<br/>xx<br/>xxx<br/>xxxx<br/>xxxxx | -08, +0530<br/>-0800<br/>-08:00<br/>-0800, -075258<br/>-08:00, -07:52:58 |
| ' | escape for text | ' | (nothing) |
| ' ' | two single quotes produce one | ' ' | ' |
| `G` | era designator | `G`, `GG`, or `GGG`<br/>`GGGG`<br/>`GGGGG` | AD<br/>Anno Domini<br/>A |
| `y` | year | `yy`<br/>`y` or `yyyy` | 96<br/>1996 |
| `Y` | year of "Week of Year" | `Y` | 1997 |
| `u` | extended year | `u` | 4601 |
| `U` | cyclic year name, as in Chinese lunar calendar | `U` | 甲子 |
| `r` | related Gregorian year | `r` | 1996 |
| `Q` | quarter | `Q`<br/>`QQ`<br/>`QQQ`<br/>`QQQQ`<br/>`QQQQQ` | 2<br/>02<br/>Q2<br/>2nd quarter<br/>2 |
| `q` | Stand Alone quarter | `q`<br/>`qq`<br/>`qqq`<br/>`qqqq`<br/>`qqqqq` | 2<br/>02<br/>Q2<br/>2nd quarter<br/>2 |
| `M` | month in year | `M`<br/>`MM`<br/>`MMM`<br/>`MMMM`<br/>`MMMMM` | 9<br/>09<br/>Sep<br/>September<br/>S |
| `L` | Stand Alone month in year | `L`<br/>`LL`<br/>`LLL`<br/>`LLLL`<br/>`LLLLL` | 9<br/>09<br/>Sep<br/>September<br/>S |
| `w` | week of year | `w`<br/>`ww` | 27<br/>27 |
| `W` | week of month | `W` | 2 |
| `d` | day in month | `d`<br/>`dd` | 2<br/>02 |
| `D` | day of year | `D` | 189 |
| `F` | day of week in month | `F` | 2 (2nd Wed in July) |
| `g` | modified julian day | `g` | 2451334 |
| `E` | day of week | `E`, `EE`, or `EEE`<br/>`EEEE`<br/>`EEEEE`<br/>`EEEEEE` | Tue<br/>Tuesday<br/>T<br/>Tu |
| `e` | local day of week<br/>(example: if Monday is 1st day, Tuesday is 2nd) | `e` or `ee`<br/>`eee`<br/>`eeee`<br/>`eeeee`<br/>`eeeeee` | 2<br/>Tue<br/>Tuesday<br/>T<br/>Tu |
| `c` | Stand Alone local day of week | `c` or `cc`<br/>`ccc`<br/>`cccc`<br/>`ccccc`<br/>`cccccc` | 2<br/>Tue<br/>Tuesday<br/>T<br/>Tu |
| `a` | am/pm marker | `a` | pm |
| `h` | hour in am/pm (1~12) | `h`<br/>`hh` | 7<br/>07 |
| `H` | hour in day (0~23) | `H`<br/>`HH` | 0<br/>00 |
| `k` | hour in day (1~24) | `k`<br/>`kk` | 24<br/>24 |
| `K` | hour in am/pm (0~11) | `K`<br/>`KK` | 0<br/>00 |
| `m` | minute in hour | `m`<br/>`mm` | 4<br/>04 |
| `s` | second in minute | `s`<br/>`ss` | 5<br/>05 |
| `S` | fractional second - truncates (like other time fields)<br/>to the count of letters when formatting. Appends<br/>zeros if more than 3 letters specified. Truncates at<br/>three significant digits when parsing. | `S`<br/>`SS`<br/>`SSS`<br/>`SSSS` | 2<br/>23<br/>235<br/>2350 |
| `A` | milliseconds in day | `A` | 61201235 |
| `z` | Time Zone: specific non-location | `z`, `zz`, or `zzz`<br/>`zzzz` | PDT<br/>Pacific Daylight Time |
| `Z` | Time Zone: ISO8601 basic hms? / RFC 822<br/>Time Zone: long localized GMT (=OOOO)<br/>Time Zone: ISO8601 extended hms? (=XXXXX) | `Z`, `ZZ`, or `ZZZ`<br/>`ZZZZ`<br/>`ZZZZZ` | -0800<br/>GMT-08:00<br/>-08:00, -07:52:58, Z |
| `O` | Time Zone: short localized GMT<br/>Time Zone: long localized GMT (=ZZZZ) | `O`<br/>`OOOO` | GMT-8<br/>GMT-08:00 |
| `v` | Time Zone: generic non-location<br/>(falls back first to VVVV) | `v`<br/>`vvvv` | PT<br/>Pacific Time or Los Angeles Time |
| `V` | Time Zone: short time zone ID<br/>Time Zone: long time zone ID<br/>Time Zone: time zone exemplar city<br/>Time Zone: generic location (falls back to OOOO) | `V`<br/>`VV`<br/>`VVV`<br/>`VVVV` | uslax<br/>America/Los_Angeles<br/>Los Angeles<br/>Los Angeles Time |
| `X` | Time Zone: ISO8601 basic hm?, with Z for 0<br/>Time Zone: ISO8601 basic hm, with Z<br/>Time Zone: ISO8601 extended hm, with Z<br/>Time Zone: ISO8601 basic hms?, with Z<br/>Time Zone: ISO8601 extended hms?, with Z | `X`<br/>`XX`<br/>`XXX`<br/>`XXXX`<br/>`XXXXX` | -08, +0530, Z<br/>-0800, Z<br/>-08:00, Z<br/>-0800, -075258, Z<br/>-08:00, -07:52:58, Z |
| `x` | Time Zone: ISO8601 basic hm?, without Z for 0<br/>Time Zone: ISO8601 basic hm, without Z<br/>Time Zone: ISO8601 extended hm, without Z<br/>Time Zone: ISO8601 basic hms?, without Z<br/>Time Zone: ISO8601 extended hms?, without Z | `x`<br/>`xx`<br/>`xxx`<br/>`xxxx`<br/>`xxxxx` | -08, +0530<br/>-0800<br/>-08:00<br/>-0800, -075258<br/>-08:00, -07:52:58 |
| `'` | escape for text | `'` | (nothing) |
| `' '` | two single quotes produce one | `' '` | ' |
> :point_right: **Note**: *Any characters in the pattern that are not in the ranges of
['a'..'z'] and ['A'..'Z'] will be treated as quoted text. For instance,
characters like ':', '.', ' ', '#' and '@' will appear in the resulting time
text even they are not enclosed within single quotes.The single quote is used to
characters like `':'`, `'.'`, `' '`, `'#'` and `'@'` will appear in the resulting time
text even if they are not enclosed within single quotes. The single quote is used to
'escape' letters. Two single quotes in a row, whether inside or outside a quoted
sequence, represent a 'real' single quote.*
> :point_right: **Note**: *A pattern containing any invalid pattern letter results in a failing UErrorCode
> :point_right: **Note**: *A pattern containing any invalid pattern letter results in a failing `UErrorCode`
result during formatting or parsing.*
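The quoting rules can be made concrete with a small tokenizer sketch. It is an illustration of the rules as stated in this section, not ICU's pattern parser: `Token` and `tokenizePattern` are hypothetical names, and the sketch does not check whether a letter is a valid pattern letter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One piece of a pattern: a run of a single pattern letter, or literal text.
struct Token {
    bool isField;
    std::string text;
};

static bool isAsciiLetter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

std::vector<Token> tokenizePattern(const std::string &pattern) {
    std::vector<Token> tokens;
    bool inQuote = false;
    // Append a literal character, merging it into a preceding literal token.
    auto addLiteral = [&tokens](char c) {
        if (tokens.empty() || tokens.back().isField) {
            tokens.push_back({false, std::string()});
        }
        tokens.back().text += c;
    };
    for (std::size_t i = 0; i < pattern.size();) {
        const char c = pattern[i];
        if (c == '\'') {
            if (i + 1 < pattern.size() && pattern[i + 1] == '\'') {
                addLiteral('\'');      // two quotes in a row: one real quote
                i += 2;
            } else {
                inQuote = !inQuote;    // a lone quote opens or closes quoted text
                ++i;
            }
        } else if (!inQuote && isAsciiLetter(c)) {
            std::size_t j = i;
            while (j < pattern.size() && pattern[j] == c) {
                ++j;                   // the repeat count selects the field style
            }
            tokens.push_back({true, pattern.substr(i, j - i)});
            i = j;
        } else {
            addLiteral(c);             // punctuation, spaces, and quoted letters
            ++i;
        }
    }
    return tokens;
}
```

For the pattern `hh 'o''clock' a`, the sketch produces the field `hh`, the literal text ` o'clock ` (with its surrounding spaces), and the field `a`.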
| Format Pattern | Result |
@ -281,10 +280,10 @@ date. So the display name used for “America/Indiana/Tell_City” before the da
> :point_right: **Note**: *Prior to ICU 3.8, a localized time zone name (except GMT format) and a time
zone ID were always in a 1-to-1 relationship. Therefore, a time zone name produced
by DateFormat can be parsed back to the original time zone. This assumption no
by `DateFormat` can be parsed back to the original time zone. This assumption no
longer applies to ICU 3.8 and later releases for all time zone format types. If
your program needs to round-trip a specific time zone ID, you must use the
generic location format (“VVVV”) explained below.*
generic location format (`VVVV`) explained below.*
There are several different display name types available in the LDML
specification.
@ -301,41 +300,41 @@ specification.
| RFC822 GMT | A constant, specific offset from GMT in a locale insensitive format. | -0800 |
Each format type in the above table is used as a primary type or a fallback in
SimpleDateFormat. The table below explains how ICU time zone format pattern work
`SimpleDateFormat`. The table below explains how ICU time zone format patterns work
and their characteristics.
#### Time Zone Pattern Usage
| Pattern | Behavior | Round-trip time at daylight transitions(\*) | Round-trip Time Zone | Suggested Usage |
| --- | --- | --- | --- | --- |
| z, zz, zzz | Short specific non-location format (e.g. “PST”).If the localized data is not available or the short abbreviation is not commonly used for the locale, localized GMT format is used (e.g. GMT-08:00). | yes | no | For displaying a time with a user friendly time zone name. |
| zzzz | Long specific non-location format (e.g. “Pacific Standard Time”).If the localized data is not available, localized GMT format is used (e.g. GMT-08:00). | yes | no | Same as “z”, but longer format. |
| v | Short generic non-location format (e.g. “PT”).If the localized data is not available or the short abbreviation is not commonly used for the locale, generic location format (e.g. “United States(Los Angeles) Time”) is used.If the localized data comes from metazone and the GMT offset at the given time in the specified time zone differs from the preferred time zone of the metazone for the locale, generic partial location format (e.g. “PT (Canada)”) is used. | no | no | For displaying a recurring wall time (e.g. events, meetings) or anywhere people do not want to be overly specific. |
| vvvv | Long generic non-location format (e.g. “Pacific Time”).If the localized data is not available, generic location format (e.g. “United States(Los Angeles) Time”) is used. | no | no | Same as “v”, but longer format. |
| V | Same as “z”, except using the short abbreviation even it is not commonly used for the locale. | yes | no | Same as “z”. |
| VVVV | Generic location format (e.g. “United States (Los Angeles) Time”). | no | yes | For populating a choice list for time zones, because it supports 1-to-1 name/zone ID mapping and is more uniform than other text formats.Also, this is only the pattern supporting time zone round-trip. If your program requires to preserve the original time zone information, use this pattern. |
| Z, ZZ, ZZZ | Localized GMT format (e.g. “GMT-08:00”). | yes | no | For displaying a time in UI in a uniformed manner. |
| ZZZZ | RFC822 GMT format (e.g. “-0800”). | yes | no | For formatting a time for non-user-facing data. |
| `z`, `zz`, `zzz` | Short specific non-location format (e.g. “PST”). If the localized data is not available or the short abbreviation is not commonly used for the locale, localized GMT format is used (e.g. GMT-08:00). | yes | no | For displaying a time with a user friendly time zone name. |
| `zzzz` | Long specific non-location format (e.g. “Pacific Standard Time”). If the localized data is not available, localized GMT format is used (e.g. GMT-08:00). | yes | no | Same as `“z”`, but longer format. |
| `v` | Short generic non-location format (e.g. “PT”). If the localized data is not available or the short abbreviation is not commonly used for the locale, generic location format (e.g. “United States(Los Angeles) Time”) is used. If the localized data comes from metazone and the GMT offset at the given time in the specified time zone differs from the preferred time zone of the metazone for the locale, generic partial location format (e.g. “PT (Canada)”) is used. | no | no | For displaying a recurring wall time (e.g. events, meetings) or anywhere people do not want to be overly specific. |
| `vvvv` | Long generic non-location format (e.g. “Pacific Time”). If the localized data is not available, generic location format (e.g. “United States(Los Angeles) Time”) is used. | no | no | Same as `“v”`, but longer format. |
| `V` | Same as `“z”`, except using the short abbreviation even if it is not commonly used for the locale. | yes | no | Same as `“z”`. |
| `VVVV` | Generic location format (e.g. “United States (Los Angeles) Time”). | no | yes | For populating a choice list for time zones, because it supports 1-to-1 name/zone ID mapping and is more uniform than other text formats. Also, this is the only pattern that supports time zone round-trip. If your program needs to preserve the original time zone information, use this pattern. |
| `Z`, `ZZ`, `ZZZ` | Localized GMT format (e.g. “GMT-08:00”). | yes | no | For displaying a time in UI in a uniform manner. |
| `ZZZZ` | RFC822 GMT format (e.g. “-0800”). | yes | no | For formatting a time for non-user-facing data. |
\* At a transition from daylight saving time to standard time, there is a wall
time interval that occurs twice.
## DateTimePatternGenerator
## `DateTimePatternGenerator`
The DateTimePatternGenerator class provides a way to map a request for a set of
The `DateTimePatternGenerator` class provides a way to map a request for a set of
date/time fields, along with their width, to a locale-appropriate format
pattern. The request is in the form of a “skeleton” which just contains pattern
letters for the desired fields using the representation for the desired width.
In a skeleton, anything other than a pattern letter is ignored, field order is
insignificant, and there are two special additional pattern letters that may be
used: 'j' requests the preferred hour-cycle type for the locale (it gets mapped
to one of 'H', 'h', 'k', or 'K'); 'J' is similar but requests no AM/PM marker
even if the locales preferred hour-cycle type is 'h' or 'K'.
used: `'j'` requests the preferred hour-cycle type for the locale (it gets mapped
to one of `'H'`, `'h'`, `'k'`, or `'K'`); `'J'` is similar but requests no AM/PM marker
even if the locale's preferred hour-cycle type is `'h'` or `'K'`.
For example, a skeleton of “MMMMdjmm” might result in the following format
For example, a skeleton of `“MMMMdjmm”` might result in the following format
patterns for different locales:
| locale | format pattern for skeleton “MMMMdjmm” | example |
| locale | format pattern for skeleton `“MMMMdjmm”` | example |
| ------ | -------------------------------------- | ------------------ |
| en_US | "MMMM d  'at'  h:mm a" | April 2 at 5:00 PM |
| es_ES | "d 'de' MMMM, H:mm" | 2 de abril, 17:00 |
@ -347,41 +346,41 @@ getBestPattern.
Note that the fields in the format pattern may be adjusted as appropriate for
the locale and may not exactly match those in the skeleton. For example:
* In Russian (locale "ru"), the skeleton "yMMMM" will produce the format
pattern "LLLL y" (or "LLLL y 'г'.") since a month name without a day number
must be in nominative form, as indicated by LLLL.
* In Russian (locale "ru"), the skeleton `"yMMMM"` will produce the format
pattern `"LLLL y"` (or "LLLL y 'г'.") since a month name without a day number
must be in nominative form, as indicated by `LLLL`.
* When using the Japanese calendar in the Japanese locale (locale
"ja@calendar=japanese"), the skeleton "yMMMd" will produce the format
"ja@calendar=japanese"), the skeleton `"yMMMd"` will produce the format
pattern "Gy年M月d日" since the era must always be shown with the year in the
Japanese calendar.
## DateFormatSymbols
## `DateFormatSymbols`
DateFormatSymbols is a public class for encapsulating localizable date-time
formatting data, including time zone data. DateFormatSymbols is used by
DateFormat and SimpleDateFormat.
`DateFormatSymbols` is a public class for encapsulating localizable date-time
formatting data, including time zone data. `DateFormatSymbols` is used by
`DateFormat` and `SimpleDateFormat`.
DateFormatSymbols specifies the exact character strings to use for various parts
`DateFormatSymbols` specifies the exact character strings to use for various parts
of a date or time. For example, the names of the months and days of the week, the
strings for AM and PM and the day of the week considered to be the first day of
the week (used in drawing calendar grids) are controlled by DateFormatSymbols.
the week (used in drawing calendar grids) are controlled by `DateFormatSymbols`.
Create a date-time formatter using the `createTimeInstance`, `createDateInstance`,
or `createDateTimeInstance` methods in DateFormat. Each of these methods can
or `createDateTimeInstance` methods in `DateFormat`. Each of these methods can
return a date/time formatter initialized with a default format pattern, along
with the date-time formatting data for a given or default locale. After a
formatter is created, modify the format pattern using `applyPattern`.
If you want to create a date-time formatter with a particular format pattern and
locale, use one of the SimpleDateFormat constructors:
locale, use one of the `SimpleDateFormat` constructors:
```cpp
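UErrorCode status = U_ZERO_ERROR;
UnicodeString aPattern("GyyyyMMddHHmmssSSZ", "");
// The formatter adopts (takes ownership of) the DateFormatSymbols object.
SimpleDateFormat* formatter = new SimpleDateFormat(aPattern, new DateFormatSymbols(Locale::getUS(), status), status);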
```
This loads the appropriate date-time formatting data from the locale.s
This loads the appropriate date-time formatting data from the locale.
## Programming Examples
See [date and time formatting examples](examples.md) .
See [date and time formatting examples](examples.md).

View File

@ -10,7 +10,7 @@ parent: Formatting Numbers
License & terms of use: http://www.unicode.org/copyright.html
-->
# FormattedValue
# `FormattedValue`
{: .no_toc }
## Contents
@ -23,22 +23,22 @@ License & terms of use: http://www.unicode.org/copyright.html
## Overview
FormattedValue is an abstraction for localized strings with attributes
returned by a number of ICU formatters. APIs for FormattedValue are available
`FormattedValue` is an abstraction for localized strings with attributes
returned by a number of ICU formatters. APIs for `FormattedValue` are available
in Java, C++, and C. For more details and a list of all implementing classes,
refer to the API docs:
- [C++ FormattedValue](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1FormattedValue.html)
- [C UFormattedValue](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/globals_u.html) -- search for "resultAsValue"
- [Java FormattedValue](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/FormattedValue.html)
- [C++ `FormattedValue`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1FormattedValue.html)
- [C `UFormattedValue`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/globals_u.html) -- search for "resultAsValue"
- [Java `FormattedValue`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/FormattedValue.html)
## Nested Span Fields
Certain ICU formatters, like FormattedList and FormattedDateInterval, use
Certain ICU formatters, like `FormattedList` and `FormattedDateInterval`, use
*span fields* to return information about which spans of a string correspond
to different input parameters. In C and C++, span fields are implemented
using a field category, with the field being set to the input index; in Java,
they are implemented by associating an Integer value with a SpanField
they are implemented by associating an `Integer` value with a `SpanField`
subclass.
For example, in C++, here is how you can determine which region in a formatted
@ -86,6 +86,6 @@ assertFalse("No more than one occurrence of the field",
```
A span may cover multiple primitive fields; in the above example, the span
contains both a month and a date. Using FormattedValue, those primitive
contains both a month and a date. Using `FormattedValue`, those primitive
fields will also be present, and you can check their start and end indices to
see if they are contained within a desired span.
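The containment check described above can be sketched with plain index arithmetic. This is a minimal illustration, not ICU code: `FieldRange` and `isContainedIn` are hypothetical names, and the indices are assumed to be the half-open start/limit pairs that you would read from a field position while iterating over a `FormattedValue`.

```cpp
#include <cstdint>

// Hypothetical value type: a half-open [start, limit) range of string indices,
// as you might copy out of a field position while iterating over fields.
struct FieldRange {
    int32_t start;
    int32_t limit;
};

// Returns true when `field` lies entirely inside `span`.
// Both ranges are assumed to be half-open and to refer to the same string.
bool isContainedIn(const FieldRange& field, const FieldRange& span) {
    return span.start <= field.start && field.limit <= span.limit;
}
```

For instance, with a span covering indices 0 to 7, a month field at 0 to 5 is contained in it, while a field at 5 to 9 is not.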

View File

@ -105,24 +105,22 @@ is displayed as 75%.
#### Customizing Number Formats
If you need to customize a number format you can use the DecimalFormat (§) and
the DecimalFormatSymbols (§) classes in the [Formatting
Numbers](formatparse/numbers/index.md) chapter. This not usually necessary and
If you need to customize a number format you can use the `DecimalFormat` and
the `DecimalFormatSymbols` classes in the [Formatting
Numbers](numbers/index#formatting-numbers) chapter. This is not usually necessary and
it makes your code much more complex, but it is available for those rare
instances where you need it. In general, you would do this by explicitly
specifying the number format pattern.
If you need to format or parse spelled-out numbers, you can use the
RuleBasedNumberFormat class (§) (see the [Formatting
Numbers](formatparse/numbers/index.md) chapter). You can instantiate a default
formatter for a locale, or by using the RuleBasedNumberFormat rule syntax,
specify your own.
`RuleBasedNumberFormat` class (see the [Formatting Numbers](numbers/index#formatting-numbers) chapter).
You can instantiate a default formatter for a locale, or by using the
`RuleBasedNumberFormat` rule syntax, specify your own.
Using NumberFormat (§) class methods (see the [Formatting
Numbers](formatparse/numbers/index.md) chapter) with a predefined locale is the
easiest and the most accurate way to format numbers, and currencies.
Using `NumberFormat` class methods (see the [Formatting Numbers](numbers/index#formatting-numbers) chapter)
with a predefined locale is the easiest and the most accurate way to format numbers and currencies.
> :point_right: **Note**: *See [Properties and ICU Rule Syntax](strings/properties.md) for
> :point_right: **Note**: *See [Properties and ICU Rule Syntax](../strings/properties) for
information regarding syntax characters.*
### Date and Times
@ -134,8 +132,8 @@ recognize 20.4.98 as a valid date, and Americans recognize 4/20/98.
> :point_right: **Note**: *The appropriate Calendar support is required for different locales. For
example, the Buddhist calendar is the official calendar in Thailand so the
typical assumption of Gregorian Calendar usage should not be used. ICU will pick
the appropriate Calendar based on the locale you supply when opening a Calendar
or DateFormat.*
the appropriate Calendar based on the locale you supply when opening a `Calendar`
or `DateFormat`.*
### Messages
@ -154,72 +152,85 @@ and messages:
### General Formatting
* `Format`:
The abstract superclass of all format classes. It provides the basic methods
for formatting and parsing numbers, dates, strings and other objects.
* `FieldPosition`:
A concrete class for holding the field constant and the begin and end
indices for number and date fields.
* `ParsePosition`:
A concrete class for holding the parse position in a string during parsing.
* `Formattable`:
Formattable objects can be passed to the Format class or its subclasses for
`Formattable` objects can be passed to the `Format` class or its subclasses for
formatting. It encapsulates a polymorphic piece of data to be formatted and
is used with MessageFormat. Formattable is used by some formatting
is used with `MessageFormat`. `Formattable` is used by some formatting
operations to provide a single "type" that encompasses all formattable
values (e.g., it can hold a number, a date, or a string, and so on).
* `UParseError`:
UParseError is used to returned detailed information about parsing errors.
`UParseError` is used to return detailed information about parsing errors.
It is used by the ICU parsing engines that parse long rules, patterns, or
programs. This is helpful when the text being parsed is long enough that
more information than a UErrorCode is needed to localize the error.
more information than a `UErrorCode` is needed to localize the error.
**Formatting Numbers**
* [NumberFormat](numbers/legacy-numberformat) (§)
* [`NumberFormat`](numbers/legacy-numberformat#numberformat)
The abstract superclass that provides the basic fields and methods for
formatting Number objects and number primitives to localized strings and
parsing localized strings to Number objects.
formatting `Number` objects and number primitives to localized strings and
parsing localized strings to `Number` objects.
* [DecimalFormat](numbers/legacy-numberformat#decimalformat) (§)
A concrete class for formatting Number objects and number primitives to
localized strings and parsing localized strings to Number objects, in base
10.
* [`DecimalFormat`](numbers/legacy-numberformat#decimalformat)
A concrete class for formatting `Number` objects and number primitives to
localized strings and parsing localized strings to `Number` objects, in base 10.
* [RuleBasedNumberFormat](numbers/rbnf) (§)
A concrete class for formatting Number objects and number primitives to
* [`RuleBasedNumberFormat`](numbers/rbnf)
A concrete class for formatting `Number` objects and number primitives to
localized text, especially spelled-out format such as found in check writing
(e.g. "two hundred and thirty-four"), and parsing text into Number objects.
(e.g. "two hundred and thirty-four"), and parsing text into `Number` objects.
* [DecimalFormatSymbols](numbers/legacy-numberformat#decimalformatsymbols) (§)
* [`DecimalFormatSymbols`](numbers/legacy-numberformat#decimalformatsymbols)
A concrete class for accessing localized number strings, such as the
grouping separators, decimal separator, and percent sign. Used by
DecimalFormat.
`DecimalFormat`.
**Formatting Dates and Times**
* [DateFormat](datetime/index#dateformat) (§)
* [`DateFormat`](datetime/index#dateformat)
The abstract superclass that provides the basic fields and methods for
formatting Date objects to localized strings and parsing date and time
strings to Date objects.
formatting `Date` objects to localized strings and parsing date and time
strings to `Date` objects.
* [SimpleDateFormat](datetime/index#simpledateformat) (§)
A concrete class for formatting Date objects to localized strings and
parsing date and time strings to Date objects, using a GregorianCalendar.
* [`SimpleDateFormat`](datetime/index#simpledateformat)
A concrete class for formatting `Date` objects to localized strings and
parsing date and time strings to `Date` objects, using a `GregorianCalendar`.
* [DateFormatSymbols](datetime/index#dateformatsymbols) (§)
* [`DateFormatSymbols`](datetime/index#dateformatsymbols)
A concrete class for accessing localized date-time formatting strings, such
as names of the months, days of the week and the time zone.
**Formatting Messages**
* [MessageFormat](messages/index#messageformat) (§)
* [`MessageFormat`](messages/index#messageformat)
A concrete class for producing a language-specific user message that
contains numbers, currency, percentages, date, time and string variables.
* [ChoiceFormat](messages/index) (§)
* [`ChoiceFormat`](messages/examples#choiceformat-class)
A concrete class for mapping strings to ranges of numbers and for handling
plurals and names series in user messages.

View File

@ -21,9 +21,9 @@ License & terms of use: http://www.unicode.org/copyright.html
---
## MessageFormat Class
## `MessageFormat` Class
ICU's MessageFormat class can be used to format messages in a locale-independent
ICU's `MessageFormat` class can be used to format messages in a locale-independent
manner to localize the user interface (UI) strings.
### C++
@ -269,17 +269,19 @@ public class TestMessageFormat{
}
```
## ChoiceFormat Class
## `ChoiceFormat` Class
**Important:** The following documentation is outdated. *ChoiceFormat is
probably not what you need. Please use MessageFormat with plural arguments for
**Important:** The following documentation is outdated. *`ChoiceFormat` is
probably not what you need. Please use `MessageFormat` with plural arguments for
proper plural selection, and select arguments for simple selection among a fixed
set of choices!*
ICU's ChoiceFormat class provides more flexibility than the printf() and scanf()
ICU's `ChoiceFormat` class provides more flexibility than the `printf()` and `scanf()`
style functions for formatting UI strings. This interface can be useful if you
would like a message to change according to the number of items you are
displaying. Note: Some Asian languages do not have plural words or phrases.
displaying.
Note: Some Asian languages do not have plural words or phrases.
### C++

View File

@ -35,13 +35,13 @@ and formatted elements, then translators would not be able to rearrange the
pieces, and they would have a hard time translating each of the string
fragments.
## MessageFormat
## `MessageFormat`
The ICU **MessageFormat** class uses message "pattern" strings with
The ICU **`MessageFormat`** class uses message `"pattern"` strings with
variable-element placeholders (called "arguments" in the API docs) enclosed in
{curly braces}. The argument syntax can include formatting details, otherwise a
default format is used. For details about the pattern syntax and the formatting
behavior see the MessageFormat API docs
behavior see the `MessageFormat` API docs
([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/MessageFormat.html),
[C++](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classMessageFormat.html#_details),
[C](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/umsg_8h.html#_details)).
@ -49,15 +49,15 @@ behavior see the MessageFormat API docs
### Complex Argument Types
Certain types of arguments select among several choices which are nested
MessageFormat pattern strings. Keeping these choices together in one message
`MessageFormat` pattern strings. Keeping these choices together in one message
pattern string facilitates translation in context, by one single translator.
(Commercial translation systems often distribute different messages to different
translators.)
* Use a "plural" argument to select sub-messages based on a numeric value,
* Use a `"plural"` argument to select sub-messages based on a numeric value,
together with the plural rules for the specified language.
* Use a "select" argument to select sub-messages via a fixed set of keywords.
* Use of the old "choice" argument type is discouraged. It cannot handle
* Use a `"select"` argument to select sub-messages via a fixed set of keywords.
* Use of the old `"choice"` argument type is discouraged. It cannot handle
plural rules for many languages, and is clumsy for simple selection.
It is tempting to cover only a minimal part of a message string with a complex
@ -115,7 +115,7 @@ language](http://cldr.unicode.org/index/cldr-spec/plural-rules).*
If syntax characters occur in the text portions, then they need to be quoted by
enclosing the syntax in pairs of ASCII apostrophes. A pair of ASCII apostrophes
always represents one ASCII apostrophe, similar to %% in printf representing one %,
always represents one ASCII apostrophe, similar to `%%` in `printf` representing one `%`,
although this rule still applies inside quoted text. ("`This '{isn''t}' obvious`" → "`This {isn't} obvious`")
* Before ICU 4.8, ASCII apostrophes always started quoted text and had
@ -197,17 +197,17 @@ should include regional variants (e.g., “fr-CA”).
#### Custom Format Objects (discouraged)
The MessageFormat class allows setting custom Format objects to format
The `MessageFormat` class allows setting custom Format objects to format
arguments, overriding the arguments' pattern specification. This is discouraged:
For custom formatting of some values it should normally suffice to format them
externally and to provide the formatted strings to the `MessageFormat.format()`
methods.
Only the top-level arguments are accessible and settable via setFormat(),
Only the top-level arguments are accessible and settable via `setFormat()`,
`getFormat()` etc. Arguments inside nested sub-messages, inside
choice/plural/select arguments, are "invisible" via these API methods.
Some of these methods (the ones corresponding to the original JDK MessageFormat
Some of these methods (the ones corresponding to the original JDK `MessageFormat`
API) address the top-level arguments in their order of appearance in the pattern
string, which is usually not useful because it varies with translations. Newer
methods address arguments by argument number ("index") or name.
@ -234,5 +234,5 @@ was a disturbance in the Force on planet 7."
err);
```
There are several more usage examples for the MessageFormat and ChoiceFormat
There are several more usage examples for the `MessageFormat` and `ChoiceFormat`
classes in [C, C++ and Java](examples.md).

View File

@ -13,8 +13,8 @@ License & terms of use: http://www.unicode.org/copyright.html
# Formatting Numbers
Since ICU 60, the recommended mechanism for formatting numbers is
[NumberFormatter](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/numberformatter_8h.html)
([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/number/NumberFormatter.html)). NumberFormatter supports the formatting of:
[`NumberFormatter`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/numberformatter_8h.html)
([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/number/NumberFormatter.html)). `NumberFormatter` supports the formatting of:
- Decimal Formatting
- Currencies
@ -23,8 +23,8 @@ Since ICU 60, the recommended mechanism for formatting numbers is
- Scientific Notation
- Compact Notation
For number ranges, including currency and measurement unit ranges, see [NumberRangeFormatter](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/numberrangeformatter_8h.html) ([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/number/NumberRangeFormatter.html)).
For number ranges, including currency and measurement unit ranges, see [`NumberRangeFormatter`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/numberrangeformatter_8h.html) ([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/number/NumberRangeFormatter.html)).
For rule-based number formatting, including spellout rules and support for traditional numbering systems not covered by base-10 decimal digits, see [rbnf.md](rbnf.md).
For the classic NumberFormat class, which also includes legacy parsing support for localized number strings, see [legacy-numberformat.md](legacy-numberformat.md).
For the classic `NumberFormat` class, which also includes legacy parsing support for localized number strings, see [legacy-numberformat.md](legacy-numberformat.md).

View File

@ -10,7 +10,7 @@ parent: Formatting Numbers
License & terms of use: http://www.unicode.org/copyright.html
-->
# Legacy NumberFormat
# Legacy `NumberFormat`
{: .no_toc }
## Contents
@ -27,12 +27,12 @@ License & terms of use: http://www.unicode.org/copyright.html
>
> This page is here for reference for the older NumberFormat hierarchy in ICU4C and ICU4J.
## NumberFormat
## `NumberFormat`
[NumberFormat](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classNumberFormat.html) is
[`NumberFormat`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classNumberFormat.html) is
the abstract base class for all number formats. It provides an interface for
formatting and parsing numbers. It also provides methods to determine which
locales have number formats, and what their names are. NumberFormat helps format
locales have number formats, and what their names are. `NumberFormat` helps format
and parse numbers for any locale. Your program can be written to be completely
independent of the locale conventions for decimal points or
thousands-separators. It can also be written to be independent of the particular
@ -48,42 +48,41 @@ $1234.50 //U.S. currency
### Usage
#### Formatting for a Locale
#### Formatting for a `Locale`
To format a number for the current Locale, use one of the static factory methods
To format a number for the current `Locale`, use one of the static factory methods
to create a format, then call a format method to format it. To format a number
for a different Locale, specify the Locale in the call to createInstance(). You
for a different `Locale`, specify the `Locale` in the call to `createInstance()`. You
can control the numbering system to be used for number formatting by creating a
Locale that uses the @numbers keyword defined. For example, by default, the Thai
`Locale` that uses the `@numbers` keyword defined. For example, by default, the Thai
locale "th" uses the western digits 0-9. To create a number format that uses the
native Thai digits instead, first create a locale with "@numbers=thai" defined.
native Thai digits instead, first create a locale with `"@numbers=thai"` defined.
See [the description on Locales](../../locale/index.md) for details.
> :point_right: **Note**: If you are formatting multiple numbers, save processing time
> by constructing the formatter once and then using it several times.
#### Instantiating a NumberFormat
#### Instantiating a `NumberFormat`
The following methods are used for instantiating NumberFormat objects:
The following methods are used for instantiating `NumberFormat` objects:
1. **createInstance()**
1. **`createInstance()`**
Returns the normal number format for the current locale or for a specified
locale.
2. **createCurrencyInstance()**
2. **`createCurrencyInstance()`**
Returns the currency format for the current locale or for a specified
locale.
3. **createPercentInstance()**
3. **`createPercentInstance()`**
Returns the percentage format for the current locale or for a specified
locale.
4. **createScientificInstance()**
4. **`createScientificInstance()`**
Returns the scientific number format for the current locale or for a
specified locale.
To create a format for spelled-out numbers, use a constructor on
RuleBasedNumberFormat (§).
To create a format for spelled-out numbers, use a constructor on `RuleBasedNumberFormat`.
#### Currency Formatting
@ -94,7 +93,7 @@ based on the country code in the locale ID. However, for all but trivial uses,
this is fragile because countries change currencies over time, and the locale
data for a particular country may not be available.
For proper currency formatting, both the number and the currency must be
For proper currency formatting, both number and currency must be
specified. Aside from achieving reliably correct results, this also makes it possible to
format monetary values in any currency with the format of any locale, like in
exchange rate lists. If the locale data does not contain display symbols or
@ -112,21 +111,20 @@ the Euro currency.
In terms of APIs, the currency code is set as an attribute on a number format
object (on a currency instance), while the number value is passed into each
format() call or returned from parse() as usual.
`format()` call or returned from `parse()` as usual.
1. ICU4C (C++) NumberFormat.setCurrency() takes a Unicode string (const UChar
\*) with the 3-letter code.
1. ICU4C (C++) `NumberFormat.setCurrency()` takes a Unicode string (`const UChar*`) with the 3-letter code.
2. ICU4C (C API) allows to set the currency code via unum_setTextAttribute()
using the UNUM_CURRENCY_CODE selector.
2. ICU4C (C API) allows setting the currency code via `unum_setTextAttribute()`
using the `UNUM_CURRENCY_CODE` selector.
3. ICU4J NumberFormat.setCurrency() takes an ICU Currency object which
3. ICU4J `NumberFormat.setCurrency()` takes an ICU Currency object which
encapsulates the 3-letter code.
4. The base JDK's NumberFormat.setCurrency() takes a JDK Currency object which
4. The base JDK's `NumberFormat.setCurrency()` takes a JDK Currency object which
encapsulates the 3-letter code.
The functionality of Currency and setCurrency() is more advanced in ICU than in
The functionality of `Currency` and `setCurrency()` is more advanced in ICU than in
the base JDK. When using ICU, setting the currency automatically adjusts the
number format object appropriately, i.e., it sets not only the currency symbol
and display name, but also the correct number of fraction digits and the correct
@ -135,20 +133,20 @@ the API references for more details.
There is ICU4C sample code at
[icu4c/source/samples/numfmt/main.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/numfmt/main.cpp)
which illustrates the use of NumberFormat.setCurrency().
which illustrates the use of `NumberFormat.setCurrency()`.
#### Displaying Numbers
You can also control the display of numbers with methods such as
getMinimumFractionDigits. If you want even more control over the format or
parsing, or want to give your users more control, cast the NumberFormat returned
from the factory methods to a DecimalNumberFormat. This works for the vast
`getMinimumFractionDigits()`. If you want even more control over the format or
parsing, or want to give your users more control, cast the `NumberFormat` returned
from the factory methods to a `DecimalFormat`. This works for the vast
majority of countries.
#### Working with Positions
You can also use forms of the parse and format methods with ParsePosition and
UFieldPosition to enable you to:
You can also use forms of the parse and format methods with `ParsePosition` and
`UFieldPosition` to enable you to:
1. progressively parse through pieces of a string.
@ -157,57 +155,57 @@ UFieldPosition to enable you to:
For example, you can align numbers in two ways:
1. If you are using a mono-spaced font with spacing for alignment, pass the
FieldPosition in your format call with field = INTEGER_FIELD. On output,
getEndIndex is set to the offset between the last character of the integer
and the decimal. Add (desiredSpaceCount - getEndIndex) spaces at the front
`FieldPosition` in your format call with `field = INTEGER_FIELD`. On output,
`getEndIndex` is set to the offset between the last character of the integer
and the decimal. Add `(desiredSpaceCount - getEndIndex)` spaces at the front
of the string. You can also use the space padding feature available in
DecimalFormat.
`DecimalFormat`.
2. If you are using proportional fonts, instead of padding with spaces, measure
the width of the string in pixels from the start to getEndIndex. Then move
the pen by (desiredPixelWidth - widthToAlignmentPoint) before drawing the
the width of the string in pixels from the start to `getEndIndex`. Then move
the pen by `(desiredPixelWidth - widthToAlignmentPoint)` before drawing the
text. It also works where there is no decimal, but additional characters at
the end (that is, with parentheses in negative numbers: "(12)" for -12).
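The mono-spaced case above can be sketched as follows. This is a minimal illustration under stated assumptions, not ICU code: `padToAlignment` and `integerEndIndex` are hypothetical helpers, the text is assumed to be plain ASCII with one column per character, and `integerEndIndex` is only a stand-in for the value a `FieldPosition` would report through `getEndIndex()` after formatting.

```cpp
#include <cstddef>
#include <string>

// Stand-in (assumption) for the end index of the integer field:
// the offset of the '.' if there is one, otherwise one past the last ASCII digit.
std::size_t integerEndIndex(const std::string& formatted) {
    const std::size_t dot = formatted.find('.');
    if (dot != std::string::npos) {
        return dot;
    }
    const std::size_t lastDigit = formatted.find_last_of("0123456789");
    return lastDigit == std::string::npos ? formatted.size() : lastDigit + 1;
}

// Adds (desiredSpaceCount - endIndex) spaces at the front of the string so that
// the alignment point lands in column `desiredSpaceCount`.
std::string padToAlignment(const std::string& formatted,
                           std::size_t endIndex,
                           std::size_t desiredSpaceCount) {
    if (endIndex >= desiredSpaceCount) {
        return formatted;  // Already at or past the target column.
    }
    return std::string(desiredSpaceCount - endIndex, ' ') + formatted;
}
```

With a target column of 4, `"1.5"` gains three leading spaces and `"123.25"` gains one, so the decimal points line up. For `"(12)"` the stand-in reports an end index of 3, just before the closing parenthesis.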
#### Emulating printf
#### Emulating `printf`
NumberFormat can produce many of the same formats as printf.
`NumberFormat` can produce many of the same formats as `printf`.
| printf | ICU |
|--------|-----|
| Width specifier, e.g., "%5d" has a width of 5. | Use DecimalFormat. Either specify the padding, with can pad with any character, or specify a minimum integer count and a minimum fraction count, which will emit a specific number of digits, with zero padded to the left and right. |
| Precision specifier for %f and %e, e.g. "%.6f" or "%.6e". This defines the number of digits to the right of the decimal point. | Use DecimalFormat. Specify the maximum fraction digits. |
| General scientific notation, %g. This format uses either %f or %e, depending on the magnitude of the number being displayed. | Use ChoiceFormat with DecimalFormat. For example, for a typical %g, which has 6 significant digits, use a ChoiceFormat with thresholds of 1e-4 and 1e6. For values between the two thresholds, use a fixed DecimalFormat with the pattern "@#####". For values outside the thresholds, use a DecimalFormat with the pattern "@#####E0". |
| Width specifier, e.g., `"%5d"` has a width of 5. | Use `DecimalFormat`. Either specify the padding, which can pad with any character, or specify a minimum integer count and a minimum fraction count, which will emit a specific number of digits, with zero padded to the left and right. |
| Precision specifier for `%f` and `%e`, e.g. `"%.6f"` or `"%.6e"`. This defines the number of digits to the right of the decimal point. | Use `DecimalFormat`. Specify the maximum fraction digits. |
| General scientific notation, `%g`. This format uses either `%f` or `%e`, depending on the magnitude of the number being displayed. | Use `ChoiceFormat` with `DecimalFormat`. For example, for a typical `%g`, which has 6 significant digits, use a `ChoiceFormat` with thresholds of 1e-4 and 1e6. For values between the two thresholds, use a fixed `DecimalFormat` with the pattern `"@#####"`. For values outside the thresholds, use a `DecimalFormat` with the pattern `"@#####E0"`. |
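The threshold selection in the `%g` row can be sketched without any formatter at all. The helper below is hypothetical and only returns which of the two pattern strings from the table would apply; treating zero as fixed notation and the lower threshold as inclusive are assumptions made to mirror typical `%g` behavior, and NaN or infinite inputs are not handled specially.

```cpp
#include <cmath>
#include <string>

// Hypothetical helper: picks the DecimalFormat pattern from the table above
// for a %g-like result with 6 significant digits.
std::string patternForGeneralNotation(double value) {
    const double magnitude = std::fabs(value);
    // Between the two thresholds (1e-4 inclusive, 1e6 exclusive): fixed notation.
    // Zero is treated as fixed notation as well (assumption).
    if (magnitude == 0.0 || (magnitude >= 1e-4 && magnitude < 1e6)) {
        return "@#####";
    }
    // Outside the thresholds: scientific notation.
    return "@#####E0";
}
```

The returned string would then be used as the pattern of a `DecimalFormat`; a `ChoiceFormat`, as the table suggests, is one way to express the same thresholds declaratively.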
## DecimalFormat
## `DecimalFormat`
DecimalFormat is a NumberFormat that converts numbers into strings using the
`DecimalFormat` is a `NumberFormat` that converts numbers into strings using the
decimal numbering system. This is the formatter that provides standard number
formatting and parsing services for most usage scenarios in most locales. In
order to access features of DecimalFormat not exposed in the NumberFormat API,
you may need to cast your NumberFormat object to a DecimalFormat. You may also
construct a DecimalFormat directly, but this is not recommended because it can
order to access features of `DecimalFormat` not exposed in the `NumberFormat` API,
you may need to cast your `NumberFormat` object to a `DecimalFormat`. You may also
construct a `DecimalFormat` directly, but this is not recommended because it can
hinder proper localization.
For a complete description of DecimalFormat, including the pattern syntax,
formatting and parsing behavior, and available API, see the [ICU4J DecimalFormat
For a complete description of `DecimalFormat`, including the pattern syntax,
formatting and parsing behavior, and available API, see the [ICU4J `DecimalFormat`
API](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/DecimalFormat.html) or
[ICU4C DecimalFormat
[ICU4C `DecimalFormat`
API](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classDecimalFormat.html) documentation.
## DecimalFormatSymbols
## `DecimalFormatSymbols`
[DecimalFormatSymbols](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classDecimalFormatSymbols.html)
specifies the exact characters a DecimalFormat uses for various parts of a
[`DecimalFormatSymbols`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classDecimalFormatSymbols.html)
specifies the exact characters a `DecimalFormat` uses for various parts of a
number (such as the characters to use for the digits, the character to use as
the decimal point, or the character to use as the minus sign).
This class represents the set of symbols needed by DecimalFormat to format
numbers. DecimalFormat creates its own instance of DecimalFormatSymbols from its
locale data. The DecimalFormatSymbols can be adopted by a DecimalFormat
instance, or it can be specified when a DecimalFormat is created. If you need to
change any of these symbols, can get the DecimalFormatSymbols object from your
DecimalFormat and then modify it.
This class represents the set of symbols needed by `DecimalFormat` to format
numbers. `DecimalFormat` creates its own instance of `DecimalFormatSymbols` from its
locale data. The `DecimalFormatSymbols` can be adopted by a `DecimalFormat`
instance, or it can be specified when a `DecimalFormat` is created. If you need to
change any of these symbols, you can get the `DecimalFormatSymbols` object from your
`DecimalFormat` and then modify it.
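A minimal sketch of that get-modify-set cycle, again assuming the JDK's `java.text` classes in place of the ICU4J ones; `SymbolSwap` and its method are hypothetical names. In the JDK, `getDecimalFormatSymbols()` returns a copy, so the modified object has to be handed back to the formatter.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.Locale;

// Sketch only: java.text stands in for com.ibm.icu.text (assumption).
final class SymbolSwap {

    // Formats with the US pattern but with ',' as decimal and '.' as grouping separator.
    static String formatWithSwappedSeparators(double value) {
        NumberFormat nf = NumberFormat.getNumberInstance(Locale.US);
        if (!(nf instanceof DecimalFormat)) {
            return nf.format(value); // cast not possible; fall back to the default output
        }
        DecimalFormat df = (DecimalFormat) nf;
        DecimalFormatSymbols symbols = df.getDecimalFormatSymbols(); // a copy
        symbols.setDecimalSeparator(',');
        symbols.setGroupingSeparator('.');
        df.setDecimalFormatSymbols(symbols); // install the modified copy
        return df.format(value);
    }

    public static void main(String[] args) {
        System.out.println(formatWithSwappedSeparators(1234.5)); // 1.234,5
    }
}
```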
## Additional Sample Code


@ -10,7 +10,7 @@ parent: Formatting Numbers
License & terms of use: http://www.unicode.org/copyright.html
-->
# RuleBasedNumberFormat Examples
# `RuleBasedNumberFormat` Examples
{: .no_toc }
## Contents
@ -21,10 +21,10 @@ License & terms of use: http://www.unicode.org/copyright.html
---
## Annotated RuleBasedNumberFormat Example
## Annotated `RuleBasedNumberFormat` Example
The following example provides a quick idea of how the rules work. The
[RuleBasedNumberFormat API
[`RuleBasedNumberFormat` API
documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html)
describes the rule syntax in more detail.
@ -103,9 +103,7 @@ Rulesets are invoked by first applying negative and fractional rules, then by
finding the rule whose range includes the current value and applying that rule,
recursing as directed by the rule. Again, a complete description of the rule
syntax can be found in the [API
Documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html)
.
Documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html).
More rule examples can be found in the RuleBasedNumberFormat [demo
source](https://github.com/unicode-org/icu/blob/master/icu4j/demos/src/com/ibm/icu/dev/demo/rbnf/RbnfSampleRuleSets.java)
.
More rule examples can be found in the `RuleBasedNumberFormat` [demo
source](https://github.com/unicode-org/icu/blob/master/icu4j/demos/src/com/ibm/icu/dev/demo/rbnf/RbnfSampleRuleSets.java).


@ -133,6 +133,6 @@ numbers, and so are best suited for formatting numbers in ways that are
inherently recursive.
A full explanation of this example can be found in the [RuleBasedNumberFormat
examples](rbnf-examples.md) . A complete description of the rule syntax can be
examples](rbnf-examples.md). A complete description of the rule syntax can be
found in the [RuleBasedNumberFormat API
Documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html).


@ -23,11 +23,11 @@ License & terms of use: http://www.unicode.org/copyright.html
## Overview
Number skeletons are a locale-agnostic way to configure a NumberFormatter in
ICU. Number skeletons work in MessageFormat.
Number skeletons are a locale-agnostic way to configure a `NumberFormatter` in
ICU. Number skeletons work in `MessageFormat`.
Number skeletons consist of case-sensitive tokens that correspond to settings
in ICU NumberFormatter. For example, to format a currency in compact notation
in ICU `NumberFormatter`. For example, to format a currency in compact notation
with the sign always shown, you could use this skeleton:
sign-always compact-short currency/GBP
@ -36,7 +36,7 @@ with the sign always shown, you could use this skeleton:
+! K currency/GBP
To use a skeleton in MessageFormat, use the "number" type and prefix the
To use a skeleton in `MessageFormat`, use the "number" type and prefix the
skeleton with `::`
{0, number, :: +! K currency/GBP}
@ -48,9 +48,9 @@ concise skeletons.
## Syntax
A token consists of a *stem* and zero or more *options*. The stem is what
occurs before the first "/" character in a token, and the options are each of
the subsequent "/"-delimited strings. For example, "compact-short" and
"currency" are stems, and "GBP" is an option.
occurs before the first `"/"` character in a token, and the options are each of
the subsequent `"/"`-delimited strings. For example, `"compact-short"` and
`"currency"` are stems, and `"GBP"` is an option.
Tokens are space-separated, with exceptions for concise skeletons listed at
the end of this document.
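The stem/option structure of a long-form skeleton can be illustrated with a small splitter. This is only a sketch of the syntax described above, not ICU's actual parser; `SkeletonTokens` and `Token` are hypothetical names, and the concise-skeleton exceptions are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative splitter for long-form skeletons; not part of ICU.
final class SkeletonTokens {

    // Hypothetical value type: one stem plus its "/"-delimited options.
    static final class Token {
        final String stem;
        final List<String> options;

        Token(String stem, List<String> options) {
            this.stem = stem;
            this.options = options;
        }
    }

    static List<Token> split(String skeleton) {
        List<Token> tokens = new ArrayList<>();
        for (String raw : skeleton.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty skeleton
            }
            // Limit -1 keeps empty trailing pieces, so parts[0] always exists.
            String[] parts = raw.split("/", -1);
            tokens.add(new Token(parts[0], Arrays.asList(parts).subList(1, parts.length)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : split("sign-always compact-short currency/GBP")) {
            System.out.println(t.stem + " " + t.options);
        }
    }
}
```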
@ -61,7 +61,7 @@ digits, you could use the following stem:
@@#
A few examples of number skeletons are shown below. The list of available
A few examples of number skeletons are shown below. The list of available
stems and options can be found below in [Skeleton Stems and
Options](#skeleton-stems-and-options).
@ -70,11 +70,11 @@ Options](#skeleton-stems-and-options).
| Long Skeleton | Concise Skeleton | Input | en-US Output | Comments |
|---|---|---|---|---|
| `percent` | `%` | 25 | 25% |
| `.00` | `.00` | 25 | 25.00 | Equivalent to Precision::fixedFraction(2) |
| `.00` | `.00` | 25 | 25.00 | Equivalent to `Precision::fixedFraction(2)` |
| `percent .00` | `% .00` | 25 | 25.00% |
| `scale/100` | `scale/100` | 0.3 | 30 | Multiply by 100 before formatting |
| `percent scale/100` | `%x100` | 0.3 | 30% |
| `measure-unit/length-meter` | `unit/meter` | 5 | 5 m | UnitWidth defaults to Short |
| `measure-unit/length-meter` | `unit/meter` | 5 | 5 m | `UnitWidth` defaults to `Short` |
| `measure-unit/length-meter` <br/> `unit-width-full-name` | `unit/meter` <br/> `unit-width-full-name` | 5 | 5 meters |
| `currency/CAD` | `currency/CAD` | 10 | CA$10.00 |
| `currency/CAD` <br/> `unit-width-narrow` | `currency/CAD` <br/> `unit-width-narrow` | 10 | $10.00 | Use the narrow symbol variant |
@ -92,8 +92,7 @@ Options](#skeleton-stems-and-options).
## Skeleton Stems and Options
The full set of features supported by number skeletons is listed by category
below.
The full set of features supported by number skeletons is listed by category below.
### Notation
@ -103,19 +102,17 @@ Use one of the following stems to select compact or simple notation:
- `compact-long` or `KK` (concise)
- `notation-simple` (or omit since this is default)
There are two ways to select scientific or engineering notation: using long-form
syntax or concise syntax.
There are two ways to select scientific or engineering notation: using long-form syntax or concise syntax.
#### Scientific and Engineering Notation: Long Form
Start with the stem `scientific` or `engineering`. Those stems take the
following optional options:
Start with the stem `scientific` or `engineering`. Those stems take the following optional options:
- `/sign-xxx` sets the sign display option for the exponent; see [Sign](#sign).
- `/*ee` sets exponent digits to "at least 2"; use `/*eee` for at least 3 digits, etc.
- ***Prior to ICU 67***, use `/+ee` instead of `/*ee`.
For example, all of the following skeletons are valid:
For example, all the following skeletons are valid:
- `scientific`
- `scientific/sign-always`
@ -171,7 +168,7 @@ As with the `measure-unit` stem, pass the unit identifier as the option:
- `per-measure-unit/aaaa-bbbb`
Note that if the `unit` stem is used, the demonimator can be placed in the same
Note that if the `unit` stem is used, the denominator can be placed in the same
token as the numerator.
### Unit Width
@ -185,17 +182,17 @@ The unit width can be specified by the following stems:
- `unit-width-hidden`
For more details, see
[UNumberUnitWidth](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
[`UNumberUnitWidth`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
### Precision
The precision category has more blueprint stems than most other categories;
they are documented in detail below. The following non-blueprint stems are
they are documented in detail below. The following non-blueprint stems are
accepted:
- `precision-integer` (round to the nearest integer) --- accepts fraction-precision options
- `precision-unlimited` (do not perform rounding; display all digits)
- `precision-increment/dddd` (round to *dddd*, a decimal number) --- see below
- `precision-increment/dddd` (round to *`dddd`*, a decimal number) --- see below
- `precision-currency-standard`
- `precision-currency-cash`
@ -288,10 +285,10 @@ integer digits):
| `integer-width/*` | - | Zero or more <br/> integer digits | `IntegerWidth::zeroFillTo(0) `
The long-form option starts with either a single `*` symbol, signaling no limit
on the number of integer digits (no *truncateAt*), or zero or more `#` symbols.
on the number of integer digits (no *`truncateAt`*), or zero or more `#` symbols.
It should then be followed by zero or more `0` symbols, indicating the minimum
integer digits (the argument to *zeroFillTo*). If there is no `*` symbol, the
maximum integer digits (the argument to *truncateAt*) is the number of `#`
integer digits (the argument to *`zeroFillTo`*). If there is no `*` symbol, the
maximum integer digits (the argument to *`truncateAt`*) is the number of `#`
symbols plus the number of `0` symbols.
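The rule for the long-form option can be written out as a short sketch. The class name `IntegerWidthOption` and the `UNLIMITED` sentinel are hypothetical; the code only mirrors the description above and is not ICU's implementation.

```java
// Illustrative reading of the long-form integer-width option; not part of ICU.
final class IntegerWidthOption {

    static final int UNLIMITED = -1; // hypothetical sentinel for "no truncateAt"

    // Returns {minimum integer digits, maximum integer digits or UNLIMITED}.
    static int[] parse(String option) {
        int i = 0;
        boolean unlimited = false;
        int hashes = 0;
        if (i < option.length() && option.charAt(i) == '*') {
            unlimited = true; // a single '*' means no upper limit
            i++;
        } else {
            while (i < option.length() && option.charAt(i) == '#') {
                hashes++;
                i++;
            }
        }
        int zeros = 0;
        while (i < option.length() && option.charAt(i) == '0') {
            zeros++;
            i++;
        }
        if (i != option.length()) {
            throw new IllegalArgumentException("Unexpected character in option: " + option);
        }
        return new int[] { zeros, unlimited ? UNLIMITED : hashes + zeros };
    }

    public static void main(String[] args) {
        int[] w = parse("##00");
        System.out.println(w[0] + " " + w[1]); // 2 4
    }
}
```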
The concise skeleton is simply one or more `0` characters. This supports
@ -305,14 +302,13 @@ To specify the scale, use the following stem and option:
- `scale/dddd`
where *dddd* is a decimal number. For example, the following are valid
skeletons:
where *`dddd`* is a decimal number. For example, the following are valid skeletons:
- `scale/100` (multiply by 100)
- `scale/1E2` (same as above)
- `scale/0.5` (multiply by 0.5)
The decimal number should conform to a standard decimal number syntax. In
The decimal number should conform to a standard decimal number syntax. In
C++, it is parsed using the decimal number library described in
[LocalizedNumberFormatter::formatDecimal](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1number_1_1LocalizedNumberFormatter.html).
In Java, it is parsed using
@ -331,7 +327,7 @@ The grouping strategy can be specified by the following stems:
- `group-thousands` (no concise equivalent)
For more details, see
[UNumberGroupingStrategy](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
[`UNumberGroupingStrategy`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
### Symbols
@ -340,7 +336,7 @@ The following stems are allowed for specifying the number symbols:
- `latin` (use Latin-script digits)
- `numbering-system/nnnn` (use the `nnnn` numbering system)
A custom NDecimalFormatSymbols instance is not supported at this time.
A custom `DecimalFormatSymbols` instance is not supported at this time.
### Sign Display
@ -355,7 +351,7 @@ The following stems specify sign display:
- `sign-accounting-except-zero` or `()?` (concise)
For more details, see
[UNumberSignDisplay](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
[`UNumberSignDisplay`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
### Decimal Separator Display
@ -365,4 +361,4 @@ The following stems specify decimal separator display:
- `decimal-always`
For more details, see
[UNumberDecimalSeparatorDisplay](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
[`UNumberDecimalSeparatorDisplay`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).


@ -220,5 +220,5 @@ Any additional arguments will be passed to the program.
* `icurun test1.cpp *args...*`
*This feature is a work in progress. Please give feedback at [Ticket
#8481](https://unicode-org.atlassian.net/browse/ICU-8481).*
*Please give feedback to the [icu-support mailing list](http://site.icu-project.org/contacts),
and refer to [Ticket #8481](https://unicode-org.atlassian.net/browse/ICU-8481).*


@ -32,8 +32,8 @@ with the kinds of problems that typically arise during critical stages of the
software life cycle.
In general, the standard process for creating globalized software includes
"internationalization," which covers generic coding and design issues, and
"localization," which involves translating and customizing a product for a
"internationalization", which covers generic coding and design issues, and
"localization", which involves translating and customizing a product for a
specific market.
Software developers must understand the intricacies of internationalization
@ -44,7 +44,7 @@ translated and customized. Therefore, software developers need to understand key
localization concepts.
From a geographic perspective, a locale is a place. From a software perspective,
a locale is an ID used to select information associated with a a language and/or
a locale is an ID used to select information associated with a language and/or
a place. ICU locale information includes the name and identifier of the spoken
language, sorting and collating requirements, currency usage, numeric display
preferences, and text direction (left-to-right or right-to-left, horizontal or
@ -112,7 +112,7 @@ program code. Also ensure that the text which the program generates on the fly
(such as numbers and dates) comes out in the right language. The text must be
formatted correctly for the targeted user community.
Make sure that the analysis and manipulation of both text and kinds of data
Make sure the analysis and manipulation of both text and other kinds of data
(such as dates) is done in a manner that can be easily adapted for different
languages and user communities. This includes tasks such as alphabetizing lists
and looking for line-break positions.
@ -162,25 +162,35 @@ work correctly in 75 languages and more than 200 locales. The key is
understanding when, where, why, or how to use the APIs effectively.
The remainder of this section provides an overview of some cultural and hidden
assumptions components. (See the Table of contents for a list of topics.)
assumptions components. See a list of topics below:
* [Numbers and Dates](#numbers-and-dates)
* [Messages](#messages)
* [Measuring Units](#measuring-units)
* [Alphabetical Order of Characters](#alphabetical-order-of-characters)
* [Characters](#characters)
* [Text Input and Layout](#text-input-and-layout)
* [Text Manipulation](#text-manipulation)
* [Date/Time Formatting](#datetime-formatting)
* [Distributed Locale Support](#distributed-locale-support)
* [LayoutEngine](#layoutengine)
#### Numbers and Dates
Numbers and dates are represented in different languages. Do not implement
routines for converting numbers into strings, and do not call low-level system
interfaces like sprintf() that do not produce language-sensitive results.
Instead, see how ICU's [NumberFormat](formatparse/numbers/index.md) and
[DateFormat](formatparse/datetime/index.md) services can be used more
interfaces like `sprintf()` that do not produce language-sensitive results.
Instead, see how ICU's [NumberFormat](format_parse/numbers/index.md) and
[DateFormat](format_parse/datetime/index.md) services can be used more
effectively.
#### Messages
Be careful when formulating assumptions about how individual pieces of text are
used together to create a complete sentence (for example, when error messages
are generated) . The elements might go together in a different order if the
are generated). The elements might go together in a different order if the
message is translated into a new language. ICU provides
[MessageFormat](formatparse/messages/index.md) (§) and
[ChoiceFormat](formatparse/messages/index.md) (§) to help with these
[MessageFormat](format_parse/messages/index.md) (§) and
[ChoiceFormat](format_parse/messages/index.md) (§) to help with these
occurrences.
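As a small illustration of why message pieces should not be concatenated by hand, the sketch below renders the same arguments through two patterns that place them in a different order. It assumes the JDK's `java.text.MessageFormat` as a stand-in for ICU's `MessageFormat`; the class name and both patterns are made up for the example.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Sketch only: java.text.MessageFormat stands in for ICU's MessageFormat (assumption).
final class ReorderedMessage {

    // The pattern, not the code, decides where each argument appears.
    static String render(String pattern, Object... args) {
        return new MessageFormat(pattern, Locale.ROOT).format(args);
    }

    public static void main(String[] args) {
        // Two hypothetical patterns, as a translator might supply them.
        System.out.println(render("{0} deleted {1}.", "Ana", "report.txt"));
        System.out.println(render("{1} was deleted by {0}.", "Ana", "report.txt"));
    }
}
```

Because the argument positions live in the pattern, a translation can reorder them without any code change.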
> :point_right: **Note**: *There also might be situations where parts of the sentence change when other
@ -195,7 +205,7 @@ the representation of $1,000 dollars. This amount can represent either U.S. or
Canadian dollar values. US dollars can be displayed as USD while Canadian
dollars can be displayed as CAD, depending on the locale. In this case, the
displayed numerical quantity might change, and the number itself might also
change. [NumberFormat](formatparse/numbers/index.md) provides some support for
change. [NumberFormat](format_parse/numbers/index.md) provides some support for
this.
#### Alphabetical Order of Characters
@ -204,14 +214,14 @@ All languages (even those using the same alphabet) do not necessarily have the
same concept of alphabetical order. Do not assume that alphabetical order is the
same as the numerical order of the character's code-point values. In practice,
'a' is distinct from 'A' and 'b' is distinct from 'B'. Each has a different code
point . This means that you can not use a bit-wise lexical comparison (such as
point. This means that you cannot use a bit-wise lexical comparison (such as
what `strcmp()` provides) to sort user-visible lists.
Not all languages interpret the same characters as equivalent. If a character's
case is changed it is not always a one-to-one mapping. Accent differences, the
presence or absence of certain characters, and even spelling differences might
be insignificant when determining whether two strings are equal. The[
Collator](collation/index.md) services provide significant help in this area.
be insignificant when determining whether two strings are equal. The
[Collator](collation/index.md) services provide significant help in this area.
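The difference between code-point order and a collator can be seen in a few lines. This sketch assumes the JDK's `java.text.Collator` as a stand-in for ICU's `Collator`; `SortComparison` and its methods are hypothetical names.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Sketch only: java.text.Collator stands in for ICU's Collator (assumption).
final class SortComparison {

    // Plain String ordering compares UTF-16 code units, so 'B' sorts before 'a'.
    static List<String> byCodePoint(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    // A collator applies the locale's notion of alphabetical order.
    static List<String> byCollator(List<String> input, Locale locale) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cherry", "apple", "Banana");
        System.out.println(byCodePoint(words));           // [Banana, apple, cherry]
        System.out.println(byCollator(words, Locale.US)); // [apple, Banana, cherry]
    }
}
```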
#### Characters
@ -226,9 +236,9 @@ When checking characters for membership in a particular class, do not list the
specific characters you are interested in, and do not assume they come in any
particular order in the encoding scheme. For example, /A-Za-z/ does not mean all
letters in most European languages, and /0-9/ does not mean all digits in many
writing systems. This also holds true when using C interfaces such as isupper()
and islower. ICU provides a large group of utility functions for testing
character properties, such as u_isupper and u_islower().
writing systems. This also holds true when using C interfaces such as `isupper()`
and `islower()`. ICU provides a large group of utility functions for testing
character properties, such as `u_isupper()` and `u_islower()`.
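A short sketch of the difference between a hard-coded range and a property lookup. It uses the JDK's `Character` methods as a stand-in for ICU's `u_isdigit()` and `u_isupper()` family (an assumption made to keep the example dependency-free); `CharacterChecks` is a hypothetical name.

```java
// Sketch only: java.lang.Character stands in for ICU's character property functions (assumption).
final class CharacterChecks {

    // Hard-coded range: only recognizes ASCII digits.
    static boolean asciiRangeDigit(int codePoint) {
        return codePoint >= '0' && codePoint <= '9';
    }

    // Property lookup: recognizes decimal digits in any script.
    static boolean unicodeDigit(int codePoint) {
        return Character.isDigit(codePoint);
    }

    // Hard-coded range: only recognizes A-Z.
    static boolean asciiRangeUpper(int codePoint) {
        return codePoint >= 'A' && codePoint <= 'Z';
    }

    public static void main(String[] args) {
        int arabicIndicThree = 0x0663; // ARABIC-INDIC DIGIT THREE
        int eAcute = 0x00C9;           // LATIN CAPITAL LETTER E WITH ACUTE
        System.out.println(asciiRangeDigit(arabicIndicThree)); // false
        System.out.println(unicodeDigit(arabicIndicThree));    // true
        System.out.println(asciiRangeUpper(eAcute));           // false
        System.out.println(Character.isUpperCase(eAcute));     // true
    }
}
```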
#### Text Input and Layout
@ -243,7 +253,7 @@ characters and keystrokes all occur in real text in some languages.
Do not assume that all textual data, which the program stores and manipulates,
is in any particular language or writing system. ICU provides many methods that
help with text storage. The UnicodeString class and u_strxxx functions are
help with text storage. The `UnicodeString` class and `u_strxxx` functions are
provided for Unicode-based character manipulation. For example, when appending
an existing Unicode character buffer, characters can be removed or extracted out
of the buffer.
@ -256,13 +266,10 @@ correctly process multi-lingual text such as this correctly.
Time can be determined in many units, such as the lengths of months or years,
which day is the first day of the week, or the allowable range of values like
month and year (with DateFormat). It can also determine the time zone you are in
(with TimeZone), or when daylight-savings time starts. ICU provides the Calendar
month and year (with `DateFormat`). It can also determine the time zone you are in
(with `TimeZone`), or when daylight-savings time starts. ICU provides the Calendar
services needed to handle these issues.
This example shows how a user interface element can be used to increment or
decrement the time field value.
#### Distributed Locale Support
In most server applications, do not assume that all clients connected to the


@ -26,30 +26,32 @@ Java SE 6 introduced a new feature which allows Java user code to extend locale
support in Java runtime environment. JREs shipped by Oracle or IBM come with
decent locale coverage, but some users may want more locale support. Java SE 6
includes abstract classes extending
[java.util.spi.LocaleServiceProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/LocaleServiceProvider.html).
[`java.util.spi.LocaleServiceProvider`](http://download.oracle.com/javase/6/docs/api/java/util/spi/LocaleServiceProvider.html).
Java SE 6 users can create subclasses of these abstract classes to supply their
own locale support for text break, collation, date/number formatting or
providing translations for currency, locale and time zone names.
ICU4J has been providing more comprehensive locale coverage than standard JREs.
However, Java programmers have to use ICU4J's own internationalization service
APIs (com.ibm.icu.\*) to utilize the rich locale support. Sometimes, the
APIs (`com.ibm.icu.*`) to utilize the rich locale support. Sometimes, the
migration is not an option for various reasons. For example, your code may
depend on existing Java libraries utilizing JDK internationalization service
APIs, but you have no access to the source code. In this case, it is not
possible to modify the libraries to use ICU4J APIs.
ICU4J Locale Service Provider is a component consisting of classes implementing
the Java SE 6 locale sensitive service provider interfaces. Available service
providers are -
providers are:
* [BreakIteratorProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/BreakIteratorProvider.html)
* [CollatorProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/CollatorProvider.html)
* [DateFormatProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/DateFormatProvider.html)
* [DateFormatSymbolsProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/DateFormatSymbolsProvider.html)
* [DecimalFormatSymbolsProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/DecimalFormatSymbolsProvider.html)
* [NumberFormatProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/NumberFormatProvider.html)
* [CurrencyNameProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/CurrencyNameProvider.html)
* [LocaleNameProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/LocaleNameProvider.html)
* [TimeZoneNameProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/TimeZoneNameProvider.html)
* [`BreakIteratorProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/BreakIteratorProvider.html)
* [`CollatorProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/CollatorProvider.html)
* [`DateFormatProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/DateFormatProvider.html)
* [`DateFormatSymbolsProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/DateFormatSymbolsProvider.html)
* [`DecimalFormatSymbolsProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/DecimalFormatSymbolsProvider.html)
* [`NumberFormatProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/NumberFormatProvider.html)
* [`CurrencyNameProvider`](http://download.oracle.com/javase/6/docs/api/java/util/spi/CurrencyNameProvider.html)
* [`LocaleNameProvider`](http://download.oracle.com/javase/6/docs/api/java/util/spi/LocaleNameProvider.html)
* [`TimeZoneNameProvider`](http://download.oracle.com/javase/6/docs/api/java/util/spi/TimeZoneNameProvider.html)
ICU4J Locale Service Provider is designed to work as installed extensions in a
JRE. Once the component is configured properly, Java application running on the
@ -65,14 +67,12 @@ optional package to extend the functionality of the Java core platform. To
install an optional package, its JAR files must be placed in the Java extension
directory. The standard location is *<java-home>/lib/ext*. You can alternatively
use the system property *java.ext.dirs* to specify one or more locations where
optional packages are installed. For example, if the JRE root directry is
JAVA_HOME and you put ICU4J Locale Service Provider files in ICU_SPI_DIR, the
ICU4J Locale Service Provider is enabled by the following command.
optional packages are installed. For example, if the JRE root directory is
*JAVA_HOME* and you put ICU4J Locale Service Provider files in *ICU_SPI_DIR*, the
ICU4J Locale Service Provider is enabled by the following command:
java -Djava.ext.dirs=%JAVA_HOME%\\lib\\ext;%ICU_SPI_DIR% <your_java_app>
\[Microsoft Windows\]
java -Djava.ext.dirs=$JAVA_HOME/lib/ext:$ICU_SPI_DIR <your_java_app> \[Linux,
Solaris and other unix like platforms\]
* `java -Djava.ext.dirs=%JAVA_HOME%\lib\ext;%ICU_SPI_DIR% <your_java_app>` \[Microsoft Windows\]
* `java -Djava.ext.dirs=$JAVA_HOME/lib/ext:$ICU_SPI_DIR <your_java_app>` \[Linux, Solaris and other Unix-like platforms\]
The ICU4J's implementations of Java SE 6 locale sensitive service provider
interfaces and configuration files are packaged in a single JAR file
@ -80,22 +80,22 @@ interfaces and configuration files are packaged in a single JAR file
classes and data are in the ICU4J core JAR file (*icu4j-<version>.jar*). So you
need to put the localespi JAR file along with the core JAR file in the Java
extension directory.
Once the ICU4J Locale Service Provider is installed properly, factory methods in
JDK internationalization classes look for the implementation provided by ICU4J
when a requested locale is not supported by the JDK service class. For example,
locale *af_ZA* (Afrikaans - South Africa) is not supported by JDK DateFormat in
Oracle Java SE 6. The following code snippet returns an instance of DateFormat
locale *af_ZA* (Afrikaans - South Africa) is not supported by JDK `DateFormat` in
Oracle Java SE 6. The following code snippet returns an instance of `DateFormat`
from ICU4J Locale Service Provider and prints out the current date localized for
af_ZA.
DateFormat df = DateFormat.getDateInstance(DateFormat.LONG, new Locale("af",
"ZA"));
System.out.println(df.format(new Date()));
DateFormat df = DateFormat.getDateInstance(DateFormat.LONG, new Locale("af", "ZA"));
System.out.println(df.format(new Date()));
Sample output:
2008 Junie 19 \[With ICU4J Locale Service Provider enabled\]
June 19, 2008 \[Without ICU4J Locale Service Provider\]
* `2008 Junie 19` \[With ICU4J Locale Service Provider enabled\]
* `June 19, 2008` \[Without ICU4J Locale Service Provider\]
## Optional Configuration
@ -108,10 +108,10 @@ configuration files from *META-INF/services* in the localespi JAR file. For
example, if you do not want to use ICU's time zone name service at all, you can
remove the file: *META-INF/services/java.util.spi.TimeZoneNameProvider* from the
JAR file.
**Note:** Disabling DateFormatSymbolsProvider/DecimalFormatSymbolsProvider won't
affect the localized symbols actually used by
DateFormatProvider/NumberFormatProvider by the current implementation. These
services are implemented independently.
**Note:** Disabling `DateFormatSymbolsProvider/DecimalFormatSymbolsProvider` won't
affect the localized symbols actually used by `DateFormatProvider/NumberFormatProvider`
by the current implementation. These services are implemented independently.
### Configuring the behavior of ICU4J Locale Service Provider
@ -119,34 +119,10 @@ services are implemented independently.
localespi JAR file is used for configuring the behavior of the ICU4J Locale
Service Provider implementation. There are some configuration properties
available. See the table below for each configuration in detail.
**Property** **Value** **Default** **Description**
com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIcuVariants "true" or
"false" "true" Whether if Locales with ICU's variant suffix will be included in
getAvailableLocales. The current Java SE 6 locale sensitive service does not
allow user provided provider implementations to override locales supported by
JRE itself. When this property is "true"(default), ICU4J Locale Service Provider
includes Locales with the
suffix(com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.icuVariantSuffix) in
the variant field. For example, the ICU4J provider includes locales fr_FR and
fr_FR_ICU4J in the available locale list. So JDK API user can still access the
internationalization service object created by the ICU4J provider by the special
locale fr_FR_ICU4J.
com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.icuVariantSuffix *Any String*
"ICU4J" (49 or later)
"ICU" (before 49)
Suffix string used in Locale's variant field to specify the ICU implementation.
com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIso3Languages "true" or
"false" "true" Whether if 3-letter language Locales are included in
getAvailabeLocales. Use of 3-letter language codes in java.util.Locale is not
supported by the API reference document. However, the implementation does not
check the length of language code, so there is no practical problem with it.
com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.useDecimalFormat "true" or
"false" "false" Whether if java.text.DecimalFormat subclass is used for
NumberFormat#getXXXInstance.
DecimalFormat#format(Object,StringBuffer,FieldPosition) is declared as final, so
ICU cannot override the implementation. As a result, some number types such as
BigInteger/BigDecimal are not handled by the ICU implementation. If a client
expects NumberFormat#getXXXInstance returns a DecimalFormat (for example, need
to manipulate decimal format patterns), he/she can set true to this setting.
However, in this case, BigInteger/BigDecimal support is not done by ICU's
implementation.
|**Property**|**Value**|**Default**|**Description**|
|:---|:---:|:---:|:---|
|`com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIcuVariants`|`"true"` or `"false"`|`"true"`|Whether if Locales with ICU's variant suffix will be included in `getAvailableLocales`. The current Java SE 6 locale sensitive service does not allow user provided provider implementations to override locales supported by JRE itself. When this property is `"true"` (default), ICU4J Locale Service Provider includes Locales with the suffix (`com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.icuVariantSuffix`) in the variant field. For example, the ICU4J provider includes locales fr_FR and fr_FR_ICU4J in the available locale list. So JDK API user can still access the internationalization service object created by the ICU4J provider by the special locale fr_FR_ICU4J|
|`com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.icuVariantSuffix`|*Any String*|`"ICU4J"` (49 or later) `"ICU"` (before 49)|Suffix string used in Locale's variant field to specify the ICU implementation.|
|`com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIso3Languages`|`"true"` or `"false"`|`"true"`|Whether locales with 3-letter language codes are included in `getAvailableLocales`. Use of 3-letter language codes in `java.util.Locale` is not supported by the API reference document. However, the implementation does not check the length of the language code, so there is no practical problem with it.|
|`com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.useDecimalFormat`|`"true"` or `"false"`|`"false"`|Whether a `java.text.DecimalFormat` subclass is used for `NumberFormat#getXXXInstance`. `DecimalFormat#format(Object,StringBuffer,FieldPosition)` is declared as final, so ICU cannot override the implementation. As a result, some number types such as `BigInteger`/`BigDecimal` are not handled by the ICU implementation. If a client expects `NumberFormat#getXXXInstance` to return a `DecimalFormat` (for example, because it needs to manipulate decimal format patterns), this setting can be set to `"true"`. In that case, however, `BigInteger`/`BigDecimal` values are not handled by ICU's implementation.|
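
As a rough illustration of the variant-suffix mechanism described in the table, the sketch below builds a `java.util.Locale` whose variant field carries the `ICU4J` suffix. It uses only JDK classes; the class and method names are hypothetical, and whether the resulting locale is actually served by ICU depends on the ICU4J Locale Service Provider being installed, which this example does not check.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical helper for illustration; not part of ICU4J.
final class IcuVariantLocaleSketch {

    // Returns a copy of 'base' with 'suffix' placed in the variant field.
    static Locale withVariant(Locale base, String suffix) {
        return new Locale.Builder().setLocale(base).setVariant(suffix).build();
    }

    public static void main(String[] args) {
        Locale icuFrench = withVariant(Locale.FRANCE, "ICU4J");
        System.out.println(icuFrench); // fr_FR_ICU4J

        // Without the ICU4J provider on the extension path, the JDK falls
        // back to its own data for the closest locale it supports.
        NumberFormat nf = NumberFormat.getInstance(icuFrench);
        System.out.println(nf.format(1234.5));
    }
}
```

`Locale.Builder` validates the variant against BCP 47 syntax, so the older three-letter suffix `"ICU"` would be rejected by this helper and would need a different construction path.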


@ -132,16 +132,16 @@ happens only once per process, at the time an ICU data item is first requested.
3. Dynamically load (memory map, typically) a common format (.dat) file
containing the default ICU data. Loading is described in the section
[How Data Loading Works](icudata.md#how-data-loading-works). The path to
[How Data Loading Works](icudata#how-data-loading-works). The path to
the data is of the form "icudt\<version\>\<flag\>", where \<version\> is
the two-digit ICU version number, and \<flag\> is a letter indicating the
internal format of the file (see the
[Sharing ICU Data Between Platforms](icudata.md#sharing-icu-data-between-platforms)
[Sharing ICU Data Between Platforms](icudata#sharing-icu-data-between-platforms)
section).
Once the default ICU data has been located, loading of individual data items
proceeds as described in the section
[How Data Loading Works](icudata.md#how-data-loading-works).
[How Data Loading Works](icudata#how-data-loading-works).
## Building and Linking against ICU data
@ -252,7 +252,7 @@ and set with `udata_setCommonData()` or `udata_setAppData()`, or they must be
in a known location that is included in the ICU data directory string. This
requires the application installer, or the application itself at runtime, to
locate the ICU and/or application data by setting the ICU data directory (see
the [ICU Data Directory](icudata.md#icu-data-directory) section above) or by
the [ICU Data Directory](icudata#icu-data-directory) section above) or by
loading the data and providing it to one of the `udata_setXYZData()` functions.
Unlike shared libraries, .dat package files can be taken apart into separate
@ -261,7 +261,7 @@ modification of a package file. The `gencmn` and `pkgdata` ICU tools can then be
used to reassemble the .dat package file.
For more information about .dat package files see the section [Sharing ICU Data
Between Platforms](icudata.md#sharing-icu-data-between-platforms) below.
Between Platforms](icudata#sharing-icu-data-between-platforms) below.
### Data Overriding vs. Loading Performance
@ -275,7 +275,7 @@ all directories and matching .dat files mentioned before checking in
already-loaded package files. This allows overriding of packaged data items with
single files after installation but costs some time for filesystem accesses.
This is usually done only once per data item; see
[User Data Caching](icudata.md#user-data-caching) below.
[User Data Caching](icudata#user-data-caching) below.
### Single Data Files vs. Packages
@ -311,7 +311,7 @@ are not handled here.)
> :point_right: **Note**: The ICU data directory string itself may
contain multiple directories and path/filenames to .dat package files. See the
[ICU Data Directory](icudata.md#icu-data-directory) section.
[ICU Data Directory](icudata#icu-data-directory) section.
It is recommended to not include the directory in the path argument but to make
sure via setting the application data or the ICU data directory string that the
@ -403,8 +403,8 @@ There is no mechanism for removing or updating cached data files.
## Directory Separator Characters
If a directory separator (generally '/' or '\\') is needed in a path parameter,
use the form that is native to the platform. The ICU header "putil.h" defines
U_FILE_SEP_CHAR appropriately for the platform.
use the form that is native to the platform. The ICU header `"putil.h"` defines
`U_FILE_SEP_CHAR` appropriately for the platform.
> :point_right: **Note**: On Windows, the directory separator must be '\\' for
any paths passed to ICU APIs. This is different from native Windows APIs, which
@ -584,7 +584,7 @@ above. The ICU project provides a large number of additional locales in its
[locale
repository](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/locales/)
on the web. Most of this locale data is derived from the CLDR ([Common Locale
Data Repository](http://www.unicode.org/cldr/) ) project.
Data Repository](http://www.unicode.org/cldr/)) project.
Dropping the txt file into the correct place in the source tree is sufficient to
add it to your ICU build. You will need to re-configure in order to pick it up.
@ -768,7 +768,7 @@ loadable data objects.)
#### ICU.dat package files
* Source format: (list of files provided as input to the icupkg tool, or
on the gencmn tool command line)
* Binary format: .dat: [source/tools/toolutil/pkg_gencmn.c](../../icu4c/source/tools/toolutil/pkg_gencmn.cpp)
* Binary format: .dat: [source/tools/toolutil/pkg_gencmn.cpp](../../icu4c/source/tools/toolutil/pkg_gencmn.cpp)
* Generator tool: [icupkg](../../icu4c/source/tools/icupkg) or
[gencmn](../../icu4c/source/tools/gencmn)


@ -62,7 +62,7 @@ Time Zone Data Version: 2011g
#### I'm using ICU4J X, but planning to upgrade ICU4J version to X+1 soon. What should I do for the migration?
See the user guide section
[Version Numbers in ICU](../design.md#version-numbers-in-icu)
[Version Numbers in ICU](../design#version-numbers-in-icu)
for the details about the meaning of the version number parts and how the ICU
version number changes.
@ -70,7 +70,7 @@ In general, two different reference releases are not binary compatible (i.e.
drop-in jar file replacement would not work). To use a new reference version of
ICU4J, you should rebuild your application with the new ICU4J library. ICU
project has the
[API compatibility policy](../design.md#icu-api-compatibility)
[API compatibility policy](../design#icu-api-compatibility)
and, as long as you're using ICU APIs marked as @stable in the API reference
documentation, your application should successfully compile with the new
reference version of ICU4J library without any source code modifications. (Note:


@ -88,7 +88,7 @@ your topic.
#### What is the ICU binary compatibility policy?
Please see the section on
[binary compatibility](../design.md#icu-binary-compatibility-using-icu-as-an-operating-system-level-library)
[binary compatibility](../design#icu-binary-compatibility)
in the [design chapter](../design.md).
#### How is ICU licensed?
@ -120,7 +120,7 @@ upgrade-friendly.
* **API:** ensure that you are not using draft APIs which may have changed in
a future release. See the section on
[API compatibility](../design.md#icu-api-compatibility) in the
[API compatibility](../design#icu-api-compatibility) in the
[design chapter](../design.md).
* **Unicode:** See the release notes for particular versions of Unicode to
ensure that your code is not affected by property changes or other
@ -133,7 +133,7 @@ upgrade-friendly.
currencies, types of calendars
* **Building/Deploying your Application (ICU4C):** ICU4C usually builds with
symbol renaming (See:
[binary compatibility](../design.md#icu-binary-compatibility-using-icu-as-an-operating-system-level-library)
[binary compatibility](../design#icu-binary-compatibility)
in the [design chapter](../design.md)). Be sure that you build your
application with the updated ICU header files, so that it will link against
the current ICU. Also, don't hard-code the names of ICU libraries in your
@ -185,7 +185,7 @@ Support List](http://icu-project.org/contacts.html) .
Use the [Data Customizer](https://unicode-org.atlassian.net/browse/ICU-12835)
or see
[Customizing ICU's Data Library](../icudata.md#customizing-icus-data-library)
[Customizing ICU's Data Library](../icudata#customizing-icus-data-library)
in the [ICU Data Management](../icudata.md) chapter of this User's Guide.
#### Why am I seeing a small ( only a few K ) instead of a large ( several megabytes ) data shared library (icudt)?
@ -203,7 +203,7 @@ Studio.
#### Can I add or remove a converter from ICU?
Yes. Please see [Customizing ICU's Data Library](../icudata.md#customizing-icus-data-library)
Yes. Please see [Customizing ICU's Data Library](../icudata#customizing-icus-data-library)
in the [ICU Data Management](../icudata.md) chapter of this User's Guide. You can also
get extra converters from <http://www.icu-project.org/charts/charset/> or use
the [ICU Data Customizer](https://unicode-org.atlassian.net/browse/ICU-12835)
@ -268,7 +268,7 @@ ICU4C (ICU) is written in C and C++, and ICU4J is written in Java™.
#### How are the APIs documented for deprecation?
Please read the [ICU API compatibility](../design.md#icu-api-compatibility)
Please read the [ICU API compatibility](../design#icu-api-compatibility)
section in the [ICU Design](../design.md) chapter.
#### What version of Unicode standard does ICU support?


@ -96,7 +96,7 @@ Download ICU in one of the following ways:
1. From the download page, <http://www.icu-project.org/download/>, for
packaged stable releases of ICU.
2. From the source code repository, <http://www.icu-project.org/repository/>,
2. From the source code repository, <https://github.com/unicode-org/icu/releases/>,
for the latest development versions.
After downloading, see the included README file for information on what is
@ -106,4 +106,4 @@ included, building, installing, etc.
Current license: <https://github.com/unicode-org/icu/blob/master/icu4c/LICENSE>
See also <https://github.com/unicode-org/icu/blob/userguide-migration/docs/userguide/icufaq/index.md#how-is-the-icu-licensed>
See also the FAQ section ["How is ICU licensed?"](icufaq/index#how-is-icu-licensed)


@ -126,7 +126,7 @@ Features:
Defined at: [icu-docs/master/design/bnf_rb.txt](https://raw.githubusercontent.com/unicode-org/icu-docs/master/design/bnf_rb.txt)
To use with ICU4C, see the [Resource Bundle APIs](resources.md#resource-bundle-apis) section of this userguide.
To use with ICU4C, see the [Resource Bundle APIs](resources#resource-bundle-apis) section of this userguide.
Example: (`de.txt`)
@ -145,7 +145,7 @@ capabilities of the .txt and binary ICU4C resource bundles in XML form. However,
we have decided to drop the format for lack of use and instead adopt standard
XLIFF format for localization. For more information on XLIFF format, see the
following section. For examples on using ICU tools to produce and read XLIFF
format see the XLIFF Usage (§) section in the [resource management chapter](resources.md#using-xliff-for-localization).
format see the XLIFF Usage section in the [resource management chapter](resources#using-xliff-for-localization).
## XLIFF
@ -203,7 +203,7 @@ http://www.oasis-open.org/committees/xliff/documents/xliff-core-1.1.xsd'>
```
For examples on using ICU tools to produce and read XLIFF format see the XLIFF
Usage (§) section in the [resource management chapter](resources.md#using-xliff-for-localization).
Usage section in the [resource management chapter](resources#using-xliff-for-localization).
## DITA


@ -44,7 +44,7 @@ with -DU_STATIC_IMPLEMENTATION. Also see [How To Use ICU](../howtouseicu.md).
### Reduce the number of libraries used
ICU consists of a number of different libraries. The library dependency chart in the [Design](../design.md#Library_Dependencies_C)
ICU consists of a number of different libraries. The library dependency chart in the [Design](../design#library-dependencies-c)
chapter can be used to understand and
determine the exact set of libraries needed.
@ -207,5 +207,5 @@ DLLs will be copied with names such as 'icuuc55.dll'.
The services which are now known as ICU were written to provide operating
system-level and application environment-level services. Several operating
systems include ICU as a standard or optional package.
See [ICU Binary Compatibility](../design.md#ICU_Binary_Compatibility) for
See [ICU Binary Compatibility](../design#icu-binary-compatibility) for
more details.


@ -45,7 +45,7 @@ in the future.
#### ISO C
The ISO C standard provides two basic character types (char and wchar_t) and
The ISO C standard provides two basic character types (`char` and `wchar_t`) and
defines strings as arrays of units of these types. The standard allows nearly
arbitrary character and string character sets and encodings, which was necessary
when there was no single character set that worked everywhere.
@ -80,7 +80,7 @@ character and string literals.
few of the string encodings used with it are documented.
5. See also [What size wchar_t do I need for
Unicode?](http://icu-project.org/docs/papers/unicode_wchar_t.html) .
Unicode?](http://icu-project.org/docs/papers/unicode_wchar_t.html)
6. A program based on this model must be recompiled for each platform. Usually,
it must be recompiled for each supported language or family of languages.
@ -97,26 +97,24 @@ character and string literals.
ICU always processes Unicode text. Unicode covers all languages and allows safe
hard coding of character codes, in addition to providing many standard or
recommended algorithms and a lot of useful character property data. See the
chapters about [Unicode Basics](unicode.md) and [Strings](strings/index.md) and
others.
chapters about [Unicode Basics](unicode.md) and [Strings](strings/index.md) and others.
ICU uses the 16-bit encoding form of Unicode (UTF-16) for processing, making it
fully interoperable with most Unicode-aware software. (See [UTF-16 for
Processing](http://www.unicode.org/notes/tn12/) .) In the case of ICU4J, this is
fully interoperable with most Unicode-aware software. See [UTF-16 for
Processing](http://www.unicode.org/notes/tn12/). In the case of ICU4J, this is
naturally the case because the Java language and the JDK use UTF-16.
ICU uses and/or provides direct access to all of the [Unicode
properties](strings/properties.md) which provide a much finer-grained
classification of characters than [C/POSIX character
classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html)
.
classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html).
In C/C++ source code character and string literals, ICU uses only "invariant"
characters. They are the subset of graphic ASCII characters that are almost
always encoded with the same byte values on all systems. (One set of byte values
for ASCII-based systems, and another such set of byte values for EBCDIC
systems.) See
[utypes.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utypes.h)
[`utypes.h`](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utypes.h)
for the set of "invariant" characters.
With the use of Unicode, the implementation of many of the Unicode standard
@ -127,7 +125,7 @@ portable, and reliable text processing.
#### ISO C
The standard C functions tolower(), towupper(), etc. take and return one
The standard C functions `tolower()`, `toupper()`, etc. take and return one
character code each.
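
The one-code-in, one-code-out limitation is not specific to C. As a minimal sketch (in Java, using only JDK classes, with hypothetical class and method names), the per-character API below cannot expand a character, while the string-level API can:

```java
import java.util.Locale;

// Hypothetical demo class; contrasts per-character and per-string case mapping.
final class CaseMappingSketch {

    // One char in, one char out: cannot express a one-to-many mapping.
    static char upperChar(char c) {
        return Character.toUpperCase(c);
    }

    // String-level mapping: the result may be longer than the input.
    static String upperString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        char sharpS = '\u00DF'; // LATIN SMALL LETTER SHARP S
        System.out.println(upperChar(sharpS) == sharpS);           // true: left unchanged
        System.out.println(upperString(String.valueOf(sharpS)));   // SS
    }
}
```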
#### Problems
@ -201,9 +199,8 @@ locale's respective character class.
this repertoire is not possible with such an implementation.
7. For more about the problems with POSIX character classes in a Unicode
context see [Annex C: Compatibility Properties in Unicode Technical Standard
#18: Unicode Regular
Expressions](http://www.unicode.org/reports/tr18/#Compatibility_Properties)
context see [Annex C: Compatibility Properties in Unicode
Technical Standard #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/#Compatibility_Properties)
and see the mailing list archives for the unicode list (on unicode.org). See
also the ICU design document about [C/POSIX character
classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html).
@ -212,8 +209,8 @@ locale's respective character class.
ICU provides locale-independent access to all [Unicode
properties](strings/properties.md) (except Unihan.txt properties), as well as to
the POSIX character classes, via functions defined in uchar.h and in ICU4J's
UCharacter class (see API references) as well as via UnicodeSet. The POSIX
the POSIX character classes, via functions defined in `uchar.h` and in ICU4J's
`UCharacter` class (see API references) as well as via `UnicodeSet`. The POSIX
character classes are implemented according to the recommendations in UTS #18.
The Unicode Character Database defines more than 70 character properties, their
@ -228,10 +225,10 @@ of Unicode properties.
#### POSIX
The strfmon() function is used to format monetary values. The default format and
The `strfmon()` function is used to format monetary values. The default format and
the currency display symbol or display name are selected by the LC_MONETARY
locale ID. The number formatting can also be controlled with a formatting string
resembling what printf() uses.
resembling what `printf()` uses.
#### Problems
@ -244,7 +241,7 @@ resembling what printf() uses.
2. Using a single locale ID for the whole format makes it very difficult to
format values for multiple currencies with the same number format (for
example, for an exchange rate list or for showing the price of an item
adjusted for several currencies). strfmon() allows to specify the number
adjusted for several currencies). `strfmon()` lets the caller specify the number
format fully, but then the application cannot use a country's default number
format.
@ -258,4 +255,4 @@ resembling what printf() uses.
ICU number formatting APIs have separate, orthogonal settings for the number
format, which can be selected with a locale ID, and the currency, which is
specified with an ISO code. See the [Formatting
Numbers](formatparse/numbers/index.md) chapter for details.
Numbers](format_parse/numbers/index.md) chapter for details.
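
The separation of number format and currency can be sketched as follows. To stay runnable without ICU on the classpath, this example uses the JDK's `java.text.NumberFormat` and `java.util.Currency` rather than the ICU4J classes of the same names; the helper class and method are hypothetical.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

// Hypothetical helper; the locale picks the number format, the ISO code picks the currency.
final class CurrencyFormatSketch {

    static NumberFormat currencyFormat(Locale numberLocale, String isoCode) {
        NumberFormat nf = NumberFormat.getCurrencyInstance(numberLocale);
        nf.setCurrency(Currency.getInstance(isoCode)); // independent of the locale
        return nf;
    }

    public static void main(String[] args) {
        // One number format (US conventions), several currencies.
        for (String code : new String[] {"USD", "EUR", "JPY"}) {
            System.out.println(code + ": " + currencyFormat(Locale.US, code).format(1234.5));
        }
    }
}
```

Here the grouping and decimal conventions stay those of the chosen locale while only the currency changes, which is the exchange-rate-list case that a single `LC_MONETARY` setting handles poorly.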


@ -35,17 +35,17 @@ ICU provides basic Unicode support for the following:
* [Unicode strings](strings/index.md)
ICU includes type definitions for UTF-16 strings and code points. It also
contains many C u_string functions and the C++ UnicodeString class with many
contains many C `u_string` functions and the C++ `UnicodeString` class with many
additional string functions.
* [Unicode properties](strings/properties.md)
ICU includes the C definitions and functions found in uchar.h as well as
some macros found in utf.h. It also includes the C++ Unicode class.
ICU includes the C definitions and functions found in `uchar.h` as well as
some macros found in `utf.h`. It also includes the C++ Unicode class.
* [Unicode string iteration](strings/characteriterator.md)
In C, ICU uses the macros in utf.h for the iteration of strings. In C++, ICU
In C, ICU uses the macros in `utf.h` for the iteration of strings. In C++, ICU
uses the `CharacterIterator` class and its subclasses.
### Conversion Basics
@ -70,15 +70,15 @@ is an abstract concept that is typically expressed by one of the following:
A locale ID specifies a language and region enabling the software to support
culturally and linguistically appropriate information for each user. A locale
object represents a specific geographical, political, or cultural region. As a
programmatic expression of locale IDs, ICU provides the C++ locale class. In C,
Application Programming Interfaces (APIs) use simple C strings for locale IDs.
programmatic expression of locale IDs, ICU provides the C++ `Locale` class. In C,
Application Programming Interfaces (APIs) use simple C strings for locale IDs.
ICU stores locale-specific data in resource bundles, which provide a general
mechanism to access strings and other objects for ICU services to perform
according to locale conventions. ICU contains data for its services to support
many locales. Resource bundles contain the locale data of applications that use
ICU. In C++, the **ResourceBundle** implements the locale data. In C, this
feature is provided by the **ures_** interface.
ICU. In C++, the `ResourceBundle` class implements the locale data. In C, this
feature is provided by the `ures_` interface.
In addition to storing system-level data in ICU's resource bundles, applications
typically also need to use resource bundles of their own to store
@ -91,7 +91,7 @@ separate file instead of being embedded in the lines of the program.*
### Locales and Services
The interaction between locales and services is fundamental to ICU. Please refer
to [Locales and Services](./locale/index.md#Locales_and_Services).
to [Locales and Services](./locale/index#locales-and-services).
### Transliteration
@ -105,10 +105,12 @@ each step of the process can use a UnicodeSet to restrict the characters that
are affected. There are two basic types of transliterators:
Most natural language transliterators (such as Greek-Latin) are written as
rule-based transliterators. Transliterators can be written as text files using a
rule-based transliterators.
Transliterators can be written as text files using a
simple language that is similar to regular expression syntax.
### Date and Time Classes
### `Date` and `Time` Classes
Date and time routines manage independent date and time functions in
milliseconds since January 1, 1970 (0:00:00.000 UTC). Points in time before then
@ -117,19 +119,21 @@ are represented as negative numbers.
ICU provides the following [classes](datetime/index.md) to support calendars and
time zones:
* [Calendar](datetime/calendar/index.md#calendar)
The abstract superclass for extracting calendar-related attributes from a
Date value.
* [`Calendar`](datetime/calendar/index#calendar)
The abstract superclass for extracting calendar-related attributes from a `Date` value.
* [Gregorian Calendar](datetime/calendar/index.md#gregorian-calendar)
* [`GregorianCalendar`](datetime/calendar/index#gregoriancalendar)
A concrete class for representing a Gregorian calendar.
* [TimeZone](datetime/timezone/index.md)
* [`TimeZone`](datetime/timezone/index.md)
An abstract superclass for representing a time zone.
* [SimpleTimeZone](datetime/timezone/index.md)
A concrete class for representing a time zone for use with a Gregorian
calendar.
* [`SimpleTimeZone`](datetime/timezone/index.md)
A concrete class for representing a time zone for use with a Gregorian calendar.
> :point_right: **Note**: *C classes provide the same functionality as the C++ classes with the exception
of subclassing.*
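
As a small sketch of the time model described above (milliseconds since January 1, 1970 UTC, negative before that), the following uses the JDK's `java.util` calendar classes, which share their names with the ICU4J classes in `com.ibm.icu.util`. The helper class and method names are hypothetical.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical helper; extracts a calendar field from a millisecond time value.
final class EpochMillisSketch {

    // Interprets 'epochMillis' in UTC on the Gregorian calendar and returns the year.
    static int utcYear(long epochMillis) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(epochMillis);
        return cal.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        System.out.println(utcYear(0L));  // 1970: the epoch itself
        System.out.println(utcYear(-1L)); // 1969: one millisecond before the epoch
    }
}
```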
@ -152,55 +156,60 @@ numbers, formatting dates and times, and formatting messages:
#### General Formatting
See [Formatting and Parsing Classes](formatparse/index.md#formatting-and-parsing-classes) for an introduction to the following:
See [Formatting and Parsing Classes](format_parse/index#formatting-and-parsing-classes) for an introduction to the following:
* Format
* FieldPosition
* ParsePosition
* Formattable
* `Format`
* `FieldPosition`
* `ParsePosition`
* `Formattable`
#### Formatting Numbers
* [NumberFormat](formatparse/numbers/index.md#numberformat)
* [`NumberFormat`](format_parse/numbers/index#formatting-numbers)
NumberFormat provides the basic fields and methods to format number objects
and number primitives into localized strings and parse localized strings to
number objects.
* [DecimalFormat](formatparse/numbers/index.md#decimalformat)
* [`DecimalFormat`](format_parse/numbers/index#decimalformat)
DecimalFormat provides the methods used to format number objects and number
primitives into localized strings and parse localized strings into number
objects in base 10.
* [DecimalFormatSymbols](formatparse/numbers/index.md#decimalformatsymbols)
* [`DecimalFormatSymbols`](format_parse/numbers/index#decimalformatsymbols)
DecimalFormatSymbols is a concrete class used by DecimalFormat to access
localized number strings such as the grouping separators, the decimal
separator, and the percent sign.
#### Formatting Dates and Times
* [DateFormat](formatparse/datetime/index.md) (§)
DateFormat provides the basic fields and methods for formatting date objects
* [`DateFormat`](formatparse/datetime/index.md)
`DateFormat` provides the basic fields and methods for formatting date objects
to localized strings and parsing date and time strings to date objects.
* [SimpleDateFormat](formatparse/datetime/index.md) (§)
SimpleDateFormat is a concrete class used to format date objects to
* [`SimpleDateFormat`](formatparse/datetime/index.md)
`SimpleDateFormat` is a concrete class used to format date objects to
localized strings and to parse date and time strings to date objects using a
GregorianCalendar.
`GregorianCalendar`.
* [DateFormatSymbols](formatparse/datetime/index.md) (§)
DateFormatSymbols is a concrete class used to access localized date and time
* [`DateFormatSymbols`](formatparse/datetime/index.md)
`DateFormatSymbols` is a concrete class used to access localized date and time
formatting strings, such as names of the months, days of the week, and the
time zone.
#### Formatting Messages
* [MessageFormat](formatparse/messages/index.md) (§)
MessageFormat is a concrete class used to produce a language-specific user
* [`MessageFormat`](formatparse/messages/index.md)
`MessageFormat` is a concrete class used to produce a language-specific user
message that contains numbers, currency, percentages, date, time, and string
variables.
* [ChoiceFormat](formatparse/messages/index.md) (§)
ChoiceFormat is a concrete class used to map strings to ranges of numbers
* [`ChoiceFormat`](formatparse/messages/index.md)
`ChoiceFormat` is a concrete class used to map strings to ranges of numbers
and to handle plural words and name series in user messages.
> :point_right: **Note**: *C classes provide the same functionality as the C++ classes with the exception
@ -223,34 +232,34 @@ within the same language:
* In some cases, an accented letter is treated as if it were two letters. In
traditional German, for example, ä is compared as if it were ae.
Searching and sorting is done through collation using the Collator class and its
sub-classes RuleBasedCollator and CollationElementIterator as well as the
CollationKey object. Collation determines the proper sort sequence for two or
Searching and sorting is done through collation using the `Collator` class and its
sub-classes `RuleBasedCollator` and `CollationElementIterator` as well as the
`CollationKey` object. Collation determines the proper sort sequence for two or
more natural language strings. It also can determine if two strings are
equivalent for the purpose of searching.
The Collator class and its sub-class RuleBasedCollator perform locale-sensitive
The `Collator` class and its sub-class `RuleBasedCollator` perform locale-sensitive
string comparisons to create sorting and searching routines for natural language
text. Collator and RuleBasedCollator can distinguish between characters
text. `Collator` and `RuleBasedCollator` can distinguish between characters
associated with base characters (such as 'a' and 'b'), accent marks (such as
'ò', 'ó'), and uppercase or lowercase properties (such as 'a' and 'A').
ICU provides the following collation classes for sorting and searching natural
language text according to locale-specific rules:
* [Collator](collation/architecture.md) is the abstract base class of all classes that compare strings.
* [`Collator`](collation/architecture.md) is the abstract base class of all classes that compare strings.
* [CollationElementIterator](collation/architecture.md) is a concrete iterator class that provides an
* [`CollationElementIterator`](collation/architecture.md) is a concrete iterator class that provides an
iterator for stepping through each character of a locale-specific string
according to the rules of a specific collator object.
* [RuleBasedCollator](collation/architecture.md) is the only built-in
* [`RuleBasedCollator`](collation/architecture.md) is the only built-in
implementation of the collator. It
provides a sophisticated mechanism for comparing strings in a
language-specific manner, and an interface that allows the user to
specifically customize the sorting order.
* [CollationKey](collation/architecture.md) is an object that enables the fast sorting of strings by
* [`CollationKey`](collation/architecture.md) is an object that enables the fast sorting of strings by
representing a string as a sort key under the rules of a specific collator
object.
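
A minimal sketch of locale-sensitive comparison and sorting, assuming the JDK's `java.text.Collator` as a stand-in for the ICU4J class of the same name so that the example runs without ICU. The helper class and its methods are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; sorts and compares strings with a locale's collator.
final class CollationSketch {

    // Returns a sorted copy of 'input' using the collation rules of 'locale'.
    static List<String> sorted(List<String> input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> out = new ArrayList<>(input);
        out.sort(collator); // Collator is a Comparator
        return out;
    }

    // True if the two strings differ at most in accents or case (primary strength).
    static boolean sameBaseLetters(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        List<String> words = List.of("banana", "apple", "Cherry");
        System.out.println(sorted(words, Locale.ENGLISH));            // [apple, banana, Cherry]
        System.out.println(sameBaseLetters("a", "A", Locale.ENGLISH)); // true
    }
}
```

A plain code-point sort would place `Cherry` first because uppercase letters precede lowercase ones in Unicode order; the collator instead orders by base letter first.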
@ -288,28 +297,31 @@ The BreakIterator services were designed and developed around an "iterator" or
text. You can move the pointer forward or backward to search the text for
boundaries.
The BreakIterator class makes it possible to iterate over user characters. A
BreakIterator can find the location of a character, word, sentence or potential
The `BreakIterator` class makes it possible to iterate over user characters. A
`BreakIterator` can find the location of a character, word, sentence or potential
line-break boundary. This makes it possible for a software program to properly
select characters for text operations such as highlighting a character, cutting
a word, moving to the next sentence, or wrapping words at a line ending.
BreakIterator performs these operations in a locale-sensitive manner, meaning
`BreakIterator` performs these operations in a locale-sensitive manner, meaning
that it recognizes text boundaries according to the particular locale ID.
ICU provides the following classes for iterating over locale-specific text:
* [BreakIterator](boundaryanalysis/index.md)
* [`BreakIterator`](boundaryanalysis/index.md)
The abstract base class that defines the operations for finding and getting
the positions of logical breaks in a string of text: characters, words,
sentences, and potential line breaks.
* [CharacterIterator](strings/characteriterator.md)
* [`CharacterIterator`](strings/characteriterator.md)
The abstract base class for forward and backward iteration over a string of
Unicode characters.
* [StringCharacterIterator](strings/index.md)
* [`StringCharacterIterator`](strings/index.md)
A concrete class for forward and backward iteration over a string of Unicode
characters. StringCharacterIterator inherits from CharacterIterator.
characters. `StringCharacterIterator` inherits from `CharacterIterator`.
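
The iterator model can be sketched as below. The example assumes the JDK's `java.text.BreakIterator`, which follows the same pattern as the ICU4J class of the same name, so that it runs without ICU; the helper class and method are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; collects every word boundary offset in a string.
final class WordBoundarySketch {

    static List<Integer> wordBoundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        // first() returns the first boundary; next() advances until DONE.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            boundaries.add(pos);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        System.out.println(wordBoundaries("Hello world", Locale.ENGLISH)); // [0, 5, 6, 11]
    }
}
```

The boundaries delimit both the words and the space between them, so a caller that wants only words still has to filter the segments.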
### Paragraph Layout
@ -322,26 +334,26 @@ different one for each locale.
| C API | C++ Class | Description |
|----------|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ubrk_ | BreakIterator | The BreakIterator class implements methods to find the location of boundaries in the text. |
| ucal_ | Calendar | The Calendar class is an abstract base class that converts between a UDate object and a set of integer fields such as YEAR, MONTH, DAY, HOUR, and so on. |
| umsg.h | ChoiceFormat | A ChoiceFormat class enables you to attach a format to a range of numbers. |
| ucol_ | CollationElementIterator | The CollationElementIterator class is used as an iterator to walk through each character of an international string. |
| ucol_ | CollationKey | The Collator class generates the Collation keys. |
| ucol_ | Collator | The Collator class performs locale-sensitive string comparison. |
| udat_ | DateFormat | DateFormat is an abstract class for a family of classes. DateFormat converts dates and times from their internal representations to a textual form that is language-independent, and then back to their internal representations. |
| udat_ | DateFormatSymbols | DateFormatSymbols is a public class that encapsulates localized date and time formatting data. This information includes time zone information. |
| unum_ | DecimalFormatSymbols | This class represents the set of symbols needed by DecimalFormat to format numbers. |
| umsg.h | Format | The Format class is the base class for all formats. |
| ucal_ | GregorianCalendar | GregorianCalendar is a concrete class that provides the standard calendar used in many locations. |
| uloc_ | Locale | A Locale object represents a specific geographical, political, or cultural region. |
| umsg.h | MessageFormat | MessageFormat provides a means to produce concatenated messages in language-neutral way. |
| unum_ | NumberFormat | NumberFormat is an abstract base class for all number formats. |
| ures_ | ResourceBundle | ResourceBundle provides a means to access a collection of locale-specific information. |
| ucol_ | RuleBasedCollator | The RuleBasedCollator provides the implementation of the Collator class using data-driven tables. |
| udat_ | SimpleDateFormat | SimpleDateFormat is a concrete class used to format and parse dates in a language-independent way. |
| ucal_ | SimpleTimeZone | SimpleTimeZone is a concrete subclass of TimeZone that represents a time zone for use with a Gregorian calendar. |
| usearch_ | StringSearch | StringSearch provides a way to search text in a locale sensitive manner. |
| ucal_ | TimeZone | TimeZone represents a time zone offset, and also determines daylight savings time settings. |
| `ubrk_` | `BreakIterator` | The `BreakIterator` class implements methods to find the location of boundaries in the text. |
| `ucal_` | `Calendar` | The `Calendar` class is an abstract base class that converts between a `UDate` object and a set of integer fields such as `YEAR`, `MONTH`, `DAY`, `HOUR`, and so on. |
| `umsg.h` | `ChoiceFormat` | A `ChoiceFormat` class enables you to attach a format to a range of numbers. |
| `ucol_` | `CollationElementIterator` | The `CollationElementIterator` class is used as an iterator to walk through each character of an international string. |
| `ucol_` | `CollationKey` | The `Collator` class generates the Collation keys. |
| `ucol_` | `Collator` | The `Collator` class performs locale-sensitive string comparison. |
| `udat_` | `DateFormat` | `DateFormat` is an abstract class for a family of classes. `DateFormat` converts dates and times from their internal representations to a textual form that is language-independent, and then back to their internal representations. |
| `udat_` | `DateFormatSymbols` | `DateFormatSymbols` is a public class that encapsulates localized date and time formatting data. This information includes time zone information. |
| `unum_` | `DecimalFormatSymbols` | This class represents the set of symbols needed by `DecimalFormat` to format numbers. |
| `umsg.h` | `Format` | The `Format` class is the base class for all formats. |
| `ucal_` | `GregorianCalendar` | `GregorianCalendar` is a concrete class that provides the standard calendar used in many locations. |
| `uloc_` | `Locale` | A `Locale` object represents a specific geographical, political, or cultural region. |
| `umsg.h` | `MessageFormat` | `MessageFormat` provides a means to produce concatenated messages in a language-neutral way. |
| `unum_` | `NumberFormat` | `NumberFormat` is an abstract base class for all number formats. |
| `ures_` | `ResourceBundle` | `ResourceBundle` provides a means to access a collection of locale-specific information. |
| `ucol_` | `RuleBasedCollator` | The `RuleBasedCollator` provides the implementation of the `Collator` class using data-driven tables. |
| `udat_` | `SimpleDateFormat` | `SimpleDateFormat` is a concrete class used to format and parse dates in a language-independent way. |
| `ucal_` | `SimpleTimeZone` | `SimpleTimeZone` is a concrete subclass of `TimeZone` that represents a time zone for use with a Gregorian calendar. |
| `usearch_` | `StringSearch` | `StringSearch` provides a way to search text in a locale sensitive manner. |
| `ucal_` | `TimeZone` | `TimeZone` represents a time zone offset, and also determines daylight saving time settings. |
## Locale-Independent Operations
@ -350,18 +362,18 @@ locale-independent services and users do not need to specify a locale ID:
| C API | C++ Class | Description |
|-----------|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ubidi_ | | UBiDi is used for implementing the Unicode BiDi algorithm. |
| utf.h | CharacterIterator | CharacterIterator is an abstract class that defines an API for iteration on text objects. It is an interface for forward and backward iteration and for the random access of a text object. Also, it provides backward compatibility to the Java and older ICU CharacterIterator classes. |
| n/a | Formattable | Formattable is a thin wrapper class that converts between the primitive numeric types (double, long, and so on) and the UDate and UnicodeString classes. Formattable objects can be passed to the Format class or its subclasses for formatting. |
| unorm_ | Normalizer | Normalizer transforms Unicode text into an equivalent composed or decomposed form to allow for easier sorting and searching of text. |
| n/a | ParsePosition | ParsePosition is a simple class used by the Format class and its subclasses to keep track of the current position during parsing. |
| uidna_ | | An implementation of the IDNA protocol as defined in RFC 3490. |
| utf.h | StringCharacterIterator | A concrete subclass of CharacterIterator that iterates over the characters (code units or code points) in a UnicodeString. |
| utf.h | UCharCharacterIterator | A concrete subclass of CharacterIterator that iterates over the characters (code units or code points) in a UChar array. |
| uchar.h | | The Unicode character properties API allows you to query the properties associated with individual Unicode character values. |
| uregex_ | RegexMatcher | RegexMatcher is a regular expressions implementation. This allows you to perform string matching based upon a pattern. |
| utrans_ | Transliterator | Transliterator is an abstract class that transliterates text from one format to another. The most common type of transliterator is a script, or an alphabet. |
| uset_ | UnicodeSet | Objects of the UnicodeSet class represent character classes used in regular expressions. These classes specify a subset of the set of all Unicode characters. This is a mutable set of Unicode characters. |
| ustring.h | UnicodeString | UnicodeString is a string class that stores Unicode characters directly. This class is a concrete implementation of the abstract class Replaceable. |
| ushape.h | | Provides operations to transform (shape) between Arabic characters and their presentation forms. |
| ucnv_ | | The Unicode conversion API allows you to convert data written in one codepage/encoding to and from UTF-16. |
| `ubidi_` | | `UBiDi` is used for implementing the Unicode BiDi algorithm. |
| `utf.h` | `CharacterIterator` | `CharacterIterator` is an abstract class that defines an API for iteration on text objects. It is an interface for forward and backward iteration and for the random access of a text object. Also, it provides backward compatibility to the Java and older ICU `CharacterIterator` classes. |
| n/a | `Formattable` | `Formattable` is a thin wrapper class that converts between the primitive numeric types (`double`, `long`, and so on) and the `UDate` and `UnicodeString` classes. `Formattable` objects can be passed to the `Format` class or its subclasses for formatting. |
| `unorm_` | `Normalizer` | `Normalizer` transforms Unicode text into an equivalent composed or decomposed form to allow for easier sorting and searching of text. |
| n/a | `ParsePosition` | `ParsePosition` is a simple class used by the `Format` class and its subclasses to keep track of the current position during parsing. |
| `uidna_` | | An implementation of the IDNA protocol as defined in RFC 3490. |
| `utf.h` | `StringCharacterIterator` | A concrete subclass of `CharacterIterator` that iterates over the characters (code units or code points) in a `UnicodeString`. |
| `utf.h` | `UCharCharacterIterator` | A concrete subclass of `CharacterIterator` that iterates over the characters (code units or code points) in a `UChar` array. |
| `uchar.h` | | The Unicode character properties API allows you to query the properties associated with individual Unicode character values. |
| `uregex_` | `RegexMatcher` | `RegexMatcher` is a regular expressions implementation. This allows you to perform string matching based upon a pattern. |
| `utrans_` | `Transliterator` | `Transliterator` is an abstract class that transliterates text from one format to another. The most common type of transliterator is a script, or an alphabet. |
| `uset_` | `UnicodeSet` | Objects of the `UnicodeSet` class represent character classes used in regular expressions. These classes specify a subset of the set of all Unicode characters. This is a mutable set of Unicode characters. |
| `ustring.h` | `UnicodeString` | `UnicodeString` is a string class that stores Unicode characters directly. This class is a concrete implementation of the abstract class `Replaceable`. |
| `ushape.h` | | Provides operations to transform (shape) between Arabic characters and their presentation forms. |
| `ucnv_` | | The Unicode conversion API allows you to convert data written in one codepage/encoding to and from UTF-16. |


@ -39,7 +39,7 @@ large block of text that can be iterated as a whole, or can be broken into
smaller blocks for the purpose of iteration.
> :point_right: **Note**: *CharacterIterator is different from
[Normalizer](../transforms/normalization/index.md) in that CharacterIterator
[Normalizer](../transforms/normalization/index) in that CharacterIterator
walks through the Unicode characters without interpretation.*
Prior to ICU release 1.6, the CharacterIterator class allowed access to a single


@ -215,7 +215,7 @@ Notes:
ICU does not provide the means to modify properties at runtime. The properties
are provided exactly as specified by a recent version of the Unicode Standard
(as published in the [Character
Database](http://www.unicode.org/unicode/onlinedat/online.html) ).
Database](http://www.unicode.org/onlinedat/online.html)).
For custom sets and maps, it is easiest to make UnicodeSet or
UCPTrie/CodePointTrie objects with the desired values.


@ -151,7 +151,7 @@ complete a complete description of the API.
| \\W | ✓ | ✓ | Match a non-word character.
| \\x{hhhh} | ✓ | ✓ | Match the character with hex value hhhh. From one to six hex digits may be supplied.
| \\xhh | ✓ | ✓ | Match the character with two digit hex value hh.
| \\X | ✓ | | Match a [Grapheme Cluster](http://www.unicode.org/unicode/reports/tr29/#Grapheme_Cluster_Boundaries).
| \\X | ✓ | | Match a [Grapheme Cluster](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries).
| \\Z | ✓ | | Match if the current position is at the end of input, but before the final line terminator, if one exists.
| \\z | ✓ | | Match if the current position is at the end of input.
| \\*n* | ✓ | | Back Reference. Match whatever the nth capturing group matched. n must be a number >= 1 and <= the total number of capture groups in the pattern.
@ -219,7 +219,7 @@ complete a complete description of the API.
Case insensitive matching is specified by the UREGEX_CASE_INSENSITIVE flag
during pattern compilation, or by the (?i) flag within a pattern itself. Unicode
case insensitive matching is complicated by the fact that changing the case of a
string may change its length. See <http://unicode.org/faq/casemap_charprop.html>
string may change its length. See <http://www.unicode.org/faq/casemap_charprop.html>
for more information on Unicode casing operations.
Full case-insensitive matching handles situations where the number of characters
@ -506,5 +506,5 @@ processes, can be a problem with some regular expression packages.
matches any of ```!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~```. From that list,
ICU omits ```$+<=>^\`|~``` &nbsp; &nbsp;
ICU follows the recommendations from Unicode UTS-18,
<http://unicode.org/reports/tr18/#Compatibility_Properties>. See also
<http://www.unicode.org/reports/tr18/#Compatibility_Properties>. See also
<https://unicode-org.atlassian.net/browse/ICU-20095>.


@ -123,7 +123,7 @@ reordering. See the ubidi.h header file and the [BiDi API
References](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html) .
See [Unicode Standard Annex #9: The Bidirectional
Algorithm](http://www.unicode.org/unicode/reports/tr9/) .
Algorithm](http://www.unicode.org/reports/tr9/) .
## Programming Examples in C and C++


@ -63,7 +63,7 @@ character map.
A character is considered to have a lowercase, uppercase, or title case
equivalent if there is a respective "simple" case mapping specified for the
character in the [Unicode Character Database](http://unicode.org/ucd/) (UnicodeData.txt).
character in the [Unicode Character Database](http://www.unicode.org/ucd/) (UnicodeData.txt).
If a character has no mapping equivalent, the result is the character itself.
The APIs provided for the general case mapping, located in the `uchar.h` file, handle


@ -154,9 +154,9 @@ is the source and "Latin" is the target. The source specifier describes the
characters or strings that the transform will modify. The target specifier
describes the result of the modification. If the source is not given, then the
source is "Any", the set of all characters. Source and Target specifiers can be
[Script IDs](http://unicode.org/cldr/utility/properties.jsp#Script) (long like
[Script IDs](http://www.unicode.org/cldr/utility/properties.jsp#Script) (long like
"Latin" or short like "Latn"), [Unicode language
Identifiers](http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers)
Identifiers](http://www.unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers)
(like fr, en_US, or zh_Hant), or special tags (like Any or Hex). For example:
1. Katakana-Latin


@ -34,7 +34,7 @@ in particular. For further information, see:
1. [The Web site of the Unicode consortium](http://www.unicode.org/)
2. [What is
Unicode?](http://www.unicode.org/unicode/standard/WhatIsUnicode.html)
Unicode?](https://www.unicode.org/standard/WhatIsUnicode.html)
3. [IBM® Globalization](http://www.ibm.com/software/globalization/)
@ -138,7 +138,7 @@ more complexity into the Unicode standard, but far less than managing a large
number of different encodings.
Starting with Unicode 2.0 (published in 1996), the Unicode standard began
assigning numbers from 0 to 10ffff16, which requires 21 bits but does not use
assigning numbers from 0 to 10ffff<sub>16</sub>, which requires 21 bits but does not use
them completely. This gives more than enough room for all written languages in
the world. The original repertoire covered all major languages commonly used in
computing. Unicode continues to grow, and it includes more scripts.
@ -181,7 +181,7 @@ syntactically important characters are the same.*
## Character Encoding Forms and Schemes for Unicode
Unicode assigns characters a number from 0 to 10FFFF16, giving enough elbow room
Unicode assigns characters a number from 0 to 10FFFF<sub>16</sub>, giving enough elbow room
to allow for unambiguous encoding of every character in common use. Such a
character number is called a "code point".
@ -195,9 +195,9 @@ differ in their byte ordering. The difference between an encoding form and an
encoding scheme is that an encoding form maps the character set codes to values
that fit into internal data types (like a short in C), while an encoding scheme
maps to bits and bytes. For traditional encodings, they are the same since the
encoding forms already map to bytes
encoding forms already map to bytes.
. The different Unicode encoding forms are optimized for a variety of different
The different Unicode encoding forms are optimized for a variety of different
uses:
1. UTF-16, the default encoding form, maps a character code point to either one
@ -207,18 +207,18 @@ uses:
ASCII-based, byte-oriented APIs and protocols. A character is stored with 1,
2, 3, or 4 bytes.
3. UTF-32 is the simplest but most memory-intensive encoding form: It uses one
3. UTF-32 is the simplest, but most memory-intensive encoding form: It uses one
32-bit integer per Unicode character.
4. SCSU is an encoding scheme that provides a simple compression of Unicode
text. It is designed only for input and output, not for internal use.
ICU uses UTF-16 internally. ICU 2.0 fully supports supplementary characters
(with code points 1000016..10FFFF16. Older versions of ICU provided only partial
(with code points 10000<sub>16</sub>..10FFFF<sub>16</sub>). Older versions of ICU provided only partial
support for supplementary characters.
For input/output, character encoding schemes define a byte serialization of
text. UTF-8 is itself both an encoding form and an encoding scheme because it is
text. UTF-8 is itself both an encoding form and an encoding scheme, because it is
byte-based. For each of UTF-16 and UTF-32, there are two variants defined: one
that serializes the code units in big-endian byte order (most significant byte
first), and one that serializes the code units in little-endian byte order
@ -230,24 +230,23 @@ either to character encoding forms where 16/32-bit words are processed and are
naturally stored in the platform endianness, or they refer to the
IANA-registered charset names, i.e., to character encoding schemes or byte
serializations. In addition to simple byte serialization, the charsets with
these names also use optional Byte Order Marks (see Serialized Formats (§)
below).*
these names also use optional Byte Order Marks (see [Serialized Formats](#serialized-formats) below).*
## Overview of UTF-16
The default encoding form of the Unicode Standard uses 16-bit code units. Code
point values for the most common characters are in the range of 0 to FFFF16 and
point values for the most common characters are in the range of 0 to FFFF<sub>16</sub> and
are encoded with just one 16-bit unit of the same value. Code points from
1000016 to 10FFFF16 are encoded with two code units that are often called
10000<sub>16</sub> to 10FFFF<sub>16</sub> are encoded with two code units that are often called
"surrogates", and they are called a "surrogate pair" when, together, they
correctly encode one Unicode character. The first surrogate in a pair must be in
the range D80016 to DBFF16, and the second one must be in the range DC0016 to
DFFF16. Every Unicode code point has only one possible UTF-16 encoding with
the range D800<sub>16</sub> to DBFF<sub>16</sub>, and the second one must be in the range DC00<sub>16</sub> to
DFFF<sub>16</sub>. Every Unicode code point has only one possible UTF-16 encoding with
either one code unit that is not a surrogate or with a correct pair of
surrogates. The code point values D80016 to DFFF16 are set aside just for this
surrogates. The code point values D800<sub>16</sub> to DFFF<sub>16</sub> are set aside just for this
mechanism and will never, by themselves, be assigned any characters.
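The surrogate arithmetic described above can be sketched in a few lines of plain C. This is an illustration only, not ICU's implementation: the function name `utf16_encode` and the use of `uint16_t`/`uint32_t` in place of ICU's own types are assumptions made for the example. Production code would normally use the macros in ICU's `utf16.h` (for example `U16_LEAD` and `U16_TRAIL`) instead.

```c
#include <stdint.h>

/*
 * Hypothetical helper: encode one code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written.
 * Returns 0 if cp cannot be encoded (a surrogate code point, or a
 * value above 0x10FFFF).
 */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;                  /* one unit, same value */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* first (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* second (trail) surrogate */
    return 2;
}
```

For example, U+1F600 becomes the pair D83D<sub>16</sub> DE00<sub>16</sub>, and the highest code point, 10FFFF<sub>16</sub>, becomes DBFF<sub>16</sub> DFFF<sub>16</sub>.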
Most commonly used characters have code points below FFFF16, but Unicode 3.1
Most commonly used characters have code points below FFFF<sub>16</sub>, but Unicode 3.1
assigns more than 40,000 supplementary characters that make use of surrogate
pairs in UTF-16.
@ -265,26 +264,26 @@ To meet the requirements of byte-oriented, ASCII-based systems, the Unicode
Standard defines UTF-8. UTF-8 is a variable-length, byte-based encoding that
preserves ASCII transparency.
UTF-8 maintains transparency for all of the ASCII code values (0..127). These
UTF-8 maintains transparency for all the ASCII code values (0..127). These
values do not appear in any byte of a transformed result except as the direct
representation of the ASCII values. Thus, ASCII text is also UTF-8 text.
Characteristics of UTF-8 include:
1. Unicode code points 0 to 7F16 are each encoded with a single byte of the
1. Unicode code points 0 to 7F<sub>16</sub> are each encoded with a single byte of the
same value. Therefore, ASCII characters take up 50% less space with UTF-8
encoding than with UTF-16.
2. All other code points are encoded with multibyte sequences, with the first
byte (lead byte) indicating the number of bytes that follow (trail bytes).
This results in very efficient parsing. The lead bytes are in the range c016
to fd16, the trail bytes are in the range 8016 to bf16. The byte values fe16
and FF16 are never used.
This results in very efficient parsing. The lead bytes are in the range c0<sub>16</sub>
to fd<sub>16</sub>, the trail bytes are in the range 80<sub>16</sub> to bf<sub>16</sub>. The byte values fe<sub>16</sub>
and FF<sub>16</sub> are never used.
3. UTF-8 is relatively compact and resource conservative in its use of the
bytes required for encoding text in European scripts, but uses 50% more
space than UTF-16 for East Asian text. Code points up to 7FF16 take up two
bytes, code points up to FFFF16 take up three (50% more memory than UTF-16),
space than UTF-16 for East Asian text. Code points up to 7FF<sub>16</sub> take up two
bytes, code points up to FFFF<sub>16</sub> take up three (50% more memory than UTF-16),
and all others four.
4. Binary comparisons of UTF-8 strings based on their bytes result in the same
@ -297,12 +296,12 @@ point. This results in a very simple encoding.
The drawback is its memory consumption: Since code point values use only 21
bits, one-third of the memory is always unused, and since most commonly used
characters have code point values of up to FFFF16, they take up only one 16-bit
characters have code point values of up to FFFF<sub>16</sub>, they take up only one 16-bit
unit in UTF-16 (50% less) and up to three bytes in UTF-8 (25% less).
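The memory figures above can be checked with a small sketch that reports how many bytes one code point occupies in each encoding form. The function names (`utf8_bytes`, `utf16_bytes`, `utf32_bytes`) are hypothetical and exist only for this illustration; they are not ICU APIs.

```c
#include <stdint.h>

/* Returns 1 if cp is a Unicode scalar value (encodable in every UTF). */
static int is_scalar_value(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Bytes needed for cp in UTF-8, or 0 if cp is not encodable. */
int utf8_bytes(uint32_t cp)
{
    if (!is_scalar_value(cp)) return 0;
    if (cp <= 0x7F)   return 1;
    if (cp <= 0x7FF)  return 2;
    if (cp <= 0xFFFF) return 3;
    return 4;
}

/* Bytes needed for cp in UTF-16: one or two 16-bit code units. */
int utf16_bytes(uint32_t cp)
{
    if (!is_scalar_value(cp)) return 0;
    return cp <= 0xFFFF ? 2 : 4;
}

/* Bytes needed for cp in UTF-32: always one 32-bit code unit. */
int utf32_bytes(uint32_t cp)
{
    return is_scalar_value(cp) ? 4 : 0;
}
```

A code point such as U+4E2D takes 3 bytes in UTF-8, 2 in UTF-16, and 4 in UTF-32, which matches the 25% and 50% savings quoted above.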
UTF-32 is mainly used in APIs that are defined with the same data type for both
code points and code units. Modern versions of the C standard library that
support Unicode use a 32-bit wchar_t with UTF-32 semantics.
support Unicode use a 32-bit `wchar_t` with UTF-32 semantics.
## Overview of SCSU
@ -316,7 +315,7 @@ makes it unsuitable for internal processing. It also uses all possible byte
values, which might require additional processing for protocols such as SMTP
(email).
See also <http://www.unicode.org/unicode/reports/tr6/> .
See also <https://www.unicode.org/reports/tr6/> .
## Other Unicode Encodings
@ -376,7 +375,7 @@ lists these implications:
   points for string searching. False matches never occur since the end of one
sequence is never the same as the start of another sequence. Overlap is one
of the biggest problems with common multi-byte encodings like Shift-JIS. All
of the UTFs avoid this problem.
the UTFs avoid this problem.
4. Uses simple iteration. Getting the next or previous code point is
straightforward, and only takes a small number of machine instructions.
@ -397,11 +396,11 @@ Conversion between different UTFs is very fast. Unlike converting to and from
legacy encodings like Latin-2, conversion between UTFs does not require table
look-ups.
ICU provides two basic data type definitions for Unicode. UChar32 is a 32-bit
ICU provides two basic data type definitions for Unicode. `UChar32` is a 32-bit
type for code points, and used for single Unicode characters. It may be signed
or unsigned. It is the same as wchar_t if it is 32 bits wide. UChar is an
or unsigned. It is the same as `wchar_t` if it is 32 bits wide. `UChar` is an
unsigned 16-bit integer for UTF-16 code units. It is the base type for strings
(`UChar *`), and it is the same as wchar_t if it is 16 bits wide.
(`UChar *`), and it is the same as `wchar_t` if it is 16 bits wide.
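To show how the two types relate, here is a minimal sketch of walking a UTF-16 string one code point at a time. It does not use ICU: `unit16` and `cp32` are stand-in typedefs assumed to play the roles of `UChar` and `UChar32`, and `next_code_point` is a hypothetical helper. In real ICU code the `U16_NEXT` macro from `utf16.h` serves this purpose.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unit16; /* stand-in for UChar: one UTF-16 code unit */
typedef int32_t  cp32;   /* stand-in for UChar32: one code point */

/*
 * Read the code point that starts at s[*i] and advance *i past it.
 * The caller must ensure *i < length. An unpaired surrogate is
 * returned unchanged as its own code point value.
 */
cp32 next_code_point(const unit16 *s, size_t length, size_t *i)
{
    cp32 c = s[(*i)++];
    if (c >= 0xD800 && c <= 0xDBFF && *i < length) {
        unit16 trail = s[*i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++*i;
            c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return c;
}

/* Count code points (not code units) in a UTF-16 string. */
size_t count_code_points(const unit16 *s, size_t length)
{
    size_t i = 0, n = 0;
    while (i < length) {
        (void)next_code_point(s, length, &i);
        ++n;
    }
    return n;
}
```

A string of four code units that contains one surrogate pair therefore counts as three code points.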
Some higher-level APIs, used especially for formatting, use characters closer to
a representation for a glyph. Such "user characters" are also called "graphemes"
@ -428,7 +427,7 @@ format-dependent way. However, the Unicode standard offers a mechanism for
tagging text files with a "signature" for cases where protocols do not identify
character encoding schemes.
The character ZERO WIDTH NO-BREAK SPACE (FEFF16) can be used as a signature by
The character ZERO WIDTH NO-BREAK SPACE (FEFF<sub>16</sub>) can be used as a signature by
prepending it to a file or stream. The alternative function of U+FEFF as a
format control character has been copied to U+2060 WORD JOINER, and U+FEFF
should only be used for Unicode signatures.
@ -454,7 +453,7 @@ sequences for U+FEFF:
9. UTF-EBCDIC: DD 73 66 73
ICU provides the function ucnv_detectUnicodeSignature() for Unicode signature
ICU provides the function `ucnv_detectUnicodeSignature()` for Unicode signature
detection.
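For illustration, signature detection for the most common UTFs can be sketched without ICU as a prefix comparison. This is a simplified stand-in, not how `ucnv_detectUnicodeSignature()` is implemented: the function name `detect_signature` is hypothetical, and only the UTF-8, UTF-16, and UTF-32 signatures are covered.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the name of the encoding whose signature
 * starts buf, or NULL if none matches. If sig_length is not NULL, it
 * receives the signature length in bytes (0 when nothing matches).
 * The UTF-32LE entry must come before UTF-16LE because FF FE 00 00
 * also begins with the UTF-16LE signature FF FE.
 */
const char *detect_signature(const unsigned char *buf, size_t length,
                             size_t *sig_length)
{
    static const struct {
        const char *name;
        size_t n;
        unsigned char bytes[4];
    } table[] = {
        { "UTF-32BE", 4, { 0x00, 0x00, 0xFE, 0xFF } },
        { "UTF-32LE", 4, { 0xFF, 0xFE, 0x00, 0x00 } },
        { "UTF-8",    3, { 0xEF, 0xBB, 0xBF } },
        { "UTF-16BE", 2, { 0xFE, 0xFF } },
        { "UTF-16LE", 2, { 0xFF, 0xFE } },
    };
    size_t k;

    for (k = 0; k < sizeof table / sizeof table[0]; ++k) {
        if (length >= table[k].n &&
            memcmp(buf, table[k].bytes, table[k].n) == 0) {
            if (sig_length != NULL) *sig_length = table[k].n;
            return table[k].name;
        }
    }
    if (sig_length != NULL) *sig_length = 0;
    return NULL;
}
```

Note that the bytes FF FE 00 00 are ambiguous in principle (a UTF-16LE signature followed by U+0000); this sketch chooses to report them as UTF-32LE.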
*There is no signature for CESU-8 separate from the one for UTF-8. UTF-8 and
@ -465,7 +464,7 @@ CESU-8 should only be used in limited, closed, specific environments.*
In UTF-16 and UTF-32, where the signature also distinguishes between big-endian
and little-endian byte orders, it is also called a byte order mark (BOM). The
signature works for UTF-16 since the code point that has the byte-swapped
encoding, FFFE16, will never be a valid Unicode character. (It is a
encoding, FFFE<sub>16</sub>, will never be a valid Unicode character. (It is a
"non-character" code point.) In Internet protocols, if an encoding specification
of "UTF-16" or "UTF-32" is used, it is expected that there is a signature byte
sequence (BOM) that identifies the byte ordering, which is not the case for the
@ -510,9 +509,9 @@ standard. Both standards have the same character repertoire and the same
encoding forms and schemes.
One difference used to be that the ISO standard defined code point values to be
from 0 to 7FFFFFFF16, not just up to 10FFFF16. The ISO work group decided to add
from 0 to 7FFFFFFF<sub>16</sub>, not just up to 10FFFF<sub>16</sub>. The ISO work group decided to add
an amendment to the standard. The amendment removes this difference by declaring
that no characters will ever be assigned code points above 10FFFF16. The main
that no characters will ever be assigned code points above 10FFFF<sub>16</sub>. The main
reason for the ISO work group's decision is interoperability between the UTFs.
UTF-16 cannot encode any code points above this limit.
@ -526,7 +525,7 @@ specifies sequences of five and six bytes to cover that whole range.
Another difference is that the ISO standard defines encoding forms "UCS-4" and
"UCS-2". UCS-4 is essentially UTF-32 with a theoretical upper limit of
7FFFFFFF16, using 31 out of the 32 bits. However, in practice, the ISO committee
7FFFFFFF<sub>16</sub>, using 31 out of the 32 bits. However, in practice, the ISO committee
has accepted that the characters above 10FFFF will not be encoded, so there is
essentially no difference between the forms. The "4" stands for "four-byte
form".
@ -542,5 +541,5 @@ The standards differ in what kind of information they provide: The Unicode
standard provides more character properties and describes algorithms etc., while
the ISO standard defines collections, subsets and similar.
The standards are synchronized and the respective committees work together to
The standards are synchronized, and the respective committees work together to
add new characters and assign code point values.