ICU-20088 fix code, tables, typos, broken links, add code styling
This commit is contained in:
parent
9c965ff757
commit
5a714ae8f7
@ -8,7 +8,7 @@ Updating ICU's built-in Break Iterator rules
|
||||
|
||||
Here are instructions for updating ICU's built-in break iterator rules, for Grapheme, Word, Line and Sentence breaks.
|
||||
|
||||
The ICU rules implement the boundary behavior from Unicode [UAX-14](https://unicode.org/reports/tr14/) and [UAX-29](https://unicode.org/reports/tr29/), with tailorings from CLDR and some ICU-specific enhancements. ICU rules updates are needed in response to changes from Unicode or CLDR, or for bug fixes. Often ideas for CLDR or UAX updates are prototyped in ICU first, before becoming official.
|
||||
The ICU rules implement the boundary behavior from Unicode [UAX-14](https://www.unicode.org/reports/tr14/) and [UAX-29](https://www.unicode.org/reports/tr29/), with tailorings from CLDR and some ICU-specific enhancements. ICU rules updates are needed in response to changes from Unicode or CLDR, or for bug fixes. Often ideas for CLDR or UAX updates are prototyped in ICU first, before becoming official.
|
||||
|
||||
This is not a cookbook process. Familiarity with ICU break iterator behavior and rules is needed. Sets of break rules often interact in subtle and difficult-to-understand ways. Expect some bumps.
|
||||
|
||||
@ -50,7 +50,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
|
||||
|
||||
Add tests to spot check the basics of the changes, to verify that some simple, straightforward cases work as expected. There is no need to thoroughly check corner cases; the goal at this step is a quick sanity check that will fail before the rule update and pass afterwards.
|
||||
|
||||
The [Unicode Utilities](http://unicode.org/cldr/utility/) can be very helpful at this point, for showing what characters
|
||||
The [Unicode Utilities](http://www.unicode.org/cldr/utility/) can be very helpful at this point, for showing what characters
|
||||
match a UnicodeSet expression, and for listing the properties of a particular character.
|
||||
|
||||
Tests added for the above example:
|
||||
@ -146,7 +146,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
|
||||
cd icu4c/source/test/intltest
|
||||
LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH ./intltest rbbi/RBBITest/TestUnicodeFiles
|
||||
|
||||
The test files are from the Unicode Consortium. The official, released versions are at https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/ . The files are copied, unmodified, into the ICU source tree to make them accessible to the ICU tests.
|
||||
The test files are from the Unicode Consortium. The official, released versions are at https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/. The files are copied, unmodified, into the ICU source tree to make them accessible to the ICU tests.
|
||||
|
||||
If the update is for a new Unicode version, or for a new CLDR tailoring of the root Unicode rules, it should include updated test data files. If they're missing, ask whoever is requesting or providing the updated rules for help. The test data is generated by CLDR tooling.
|
||||
|
||||
@ -209,7 +209,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
|
||||
by this test have not yet been.
|
||||
|
||||
Make the updates to the test rules and re-run. The rule syntax is described in
|
||||
icu4c/source/test/testdata/break_rules/README.md.
|
||||
[icu4c/source/test/testdata/break_rules/README.md](https://github.com/unicode-org/icu/blob/master/icu4c/source/test/testdata/break_rules/README.md)
|
||||
The test reference rules are in this same directory.
|
||||
|
||||
Again, after everything appears to be working, let the test run for an extended length of time. Long runs are especially important with the more complex break rule sets, such as line break.
|
||||
|
@ -25,8 +25,8 @@ License & terms of use: http://www.unicode.org/copyright.html
|
||||
ICU locates boundary positions within text by means of rules, which are a form
|
||||
of regular expressions. The form of the rules is similar, but not identical,
|
||||
to the boundary rules from the Unicode specifications
|
||||
[ [UAX-14](https://unicode.org/reports/tr14/),
|
||||
[UAX-29](https://unicode.org/reports/tr29/) ], and there is a reasonably close
|
||||
[[UAX-14](https://www.unicode.org/reports/tr14/),
|
||||
[UAX-29](https://www.unicode.org/reports/tr29/)], and there is a reasonably close
|
||||
correspondence between the two.
|
||||
|
||||
Taken as a set, the ICU rules describe how to move forward to the next boundary,
|
||||
|
@ -66,8 +66,8 @@ ICU `BreakIterator`s can be used to locate the following kinds of text boundarie
|
||||
|
||||
Each type of boundary is found in accordance with the rules specified by Unicode
|
||||
Standard Annex #29, *Unicode Text Segmentation*
|
||||
(<https://unicode.org/reports/tr29/> ) or Unicode Standard Annex #14, *Unicode
|
||||
Line Breaking Algorithm* (<https://unicode.org/reports/tr14/>)
|
||||
(<https://www.unicode.org/reports/tr29/>) or Unicode Standard Annex #14, *Unicode
|
||||
Line Breaking Algorithm* (<https://www.unicode.org/reports/tr14/>)
|
||||
|
||||
### Character Boundary
|
||||
|
||||
@ -237,7 +237,7 @@ French, Italian and Portuguese.
|
||||
|
||||
ICU's break iterators are based on the default boundary rules described in the
|
||||
Unicode Standard Annexes [14](https://www.unicode.org/reports/tr14/) and
|
||||
[29](https://www.unicode.org/unicode/reports/tr29/) . These are relatively
|
||||
[29](https://www.unicode.org/reports/tr29/). These are relatively
|
||||
simple boundary rules that can be implemented efficiently, and are sufficient
|
||||
for many purposes and languages. However, some languages and applications will
|
||||
require a more sophisticated linguistic analysis of the text in order to find
|
||||
|
@ -112,7 +112,7 @@ try {
|
||||
|
||||
If the ICU predefined collators are not appropriate for your intended usage, you
|
||||
can define your own set of rules and instantiate a collator that uses them. For more
|
||||
details, please see [the section on collation customization](customization/index.md).
|
||||
details, please see [the section on collation customization](customization/index).
|
||||
|
||||
This example demonstrates the instantiation of a collator.
|
||||
|
||||
@ -272,7 +272,7 @@ The Java API provides only the `Collator.getCollationKey` method, which produces
|
||||
|
||||
Sort keys are generally only useful in databases or other circumstances where
|
||||
function calls are extremely expensive. See [Sortkeys vs
|
||||
Comparison](concepts.md#sortkeys-vs-comparison).
|
||||
Comparison](concepts#sortkeys-vs-comparison).
|
||||
|
||||
### Sort Key Features
|
||||
|
||||
@ -354,7 +354,7 @@ of memory needed to store the resulting sort key. This mode is automatically
|
||||
activated if the output buffer size passed is set to zero. Should the sort key
|
||||
become longer than the buffer provided, the function again slips into preflighting
|
||||
mode. The overall performance is poorer than if the function is called with a
|
||||
zero output buffer . If the size of the sort key returned is greater than the
|
||||
zero output buffer. If the size of the sort key returned is greater than the
|
||||
size of the buffer provided, the content of the result buffer is undefined. In
|
||||
that case, the result buffer could be reallocated to its proper size and the
|
||||
sort key generator function can be used again.
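
The retry pattern described above can be sketched as follows. This is a minimal illustration, not ICU code: `fakeGetSortKey` is a hypothetical stand-in that only imitates the contract described here (it always reports the size the full key needs and writes the key only when the buffer is large enough), and `makeSortKey` is a hypothetical helper name.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a sort key generator (NOT an ICU function).
// It always returns the number of bytes the full key needs and writes
// the key only if the supplied buffer is large enough.
int32_t fakeGetSortKey(const std::string &source, uint8_t *result, int32_t resultLength) {
    const int32_t needed = static_cast<int32_t>(source.size()) + 1;  // bytes plus terminator
    if (result != nullptr && resultLength >= needed) {
        std::memcpy(result, source.data(), source.size());
        result[source.size()] = 0;
    }
    return needed;
}

// Try a caller-sized buffer first; resize and call again only if the key did not fit.
std::vector<uint8_t> makeSortKey(const std::string &source, int32_t initialCapacity) {
    std::vector<uint8_t> key(static_cast<std::size_t>(initialCapacity));
    const int32_t needed = fakeGetSortKey(source, key.data(), initialCapacity);
    if (needed > initialCapacity) {
        // The buffer content is undefined at this point: reallocate and retry.
        key.resize(static_cast<std::size_t>(needed));
        const int32_t written = fakeGetSortKey(source, key.data(), needed);
        assert(written == needed);
        (void)written;  // silences the unused-variable warning when NDEBUG is set
    } else {
        key.resize(static_cast<std::size_t>(needed));  // trim to the actual key length
    }
    return key;
}
```

Calling `makeSortKey("abc", 0)` exercises the preflight path (zero-sized buffer, then a second call), while `makeSortKey("abc", 16)` succeeds on the first call.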
|
||||
@ -693,7 +693,7 @@ API is provided. Each attribute has its own setter API of the form
|
||||
## References
|
||||
|
||||
1. Ken Whistler, Markus Scherer: "Unicode Technical Standard #10, Unicode Collation
|
||||
Algorithm" (<http://www.unicode.org/unicode/reports/tr10/>)
|
||||
Algorithm" (<http://www.unicode.org/reports/tr10/>)
|
||||
|
||||
2. ICU Design doc: "Collation v2" (<http://site.icu-project.org/design/collation/v2>)
|
||||
|
||||
@ -701,11 +701,11 @@ API is provided. Each attribute has its own setter API of the form
|
||||
(<https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/collation/ICU_collation_design.htm>)
|
||||
|
||||
3. The Unicode Standard, chapter 5, "Implementation guidelines"
|
||||
(<http://www.unicode.org/unicode/uni2book/ch05.pdf>)
|
||||
(<http://www.unicode.org/uni2book/ch05.pdf>)
|
||||
|
||||
4. Laura Werner: "Efficient text searching in Java: Finding the right string in
|
||||
any language"
|
||||
(<http://icu-project.org/docs/papers/efficient_text_searching_in_java.html>)
|
||||
|
||||
5. Mark Davis, Martin Dürst: "Unicode Standard Annex #15: Unicode Normalization
|
||||
Forms" (<http://www.unicode.org/unicode/reports/tr15/>).
|
||||
Forms" (<http://www.unicode.org/reports/tr15/>).
|
||||
|
@ -223,7 +223,7 @@ generated sort keys are longer than supplied buffers too often
|
||||
A good strategy
|
||||
for this problem would be to manually manage a large buffer for storing sortkeys
|
||||
and keep a list of indices to sort keys in this buffer (see the "large buffers"
|
||||
[Collation Example](examples.md#using-large-buffers-to-manage-sort-keys)
|
||||
[Collation Example](examples#using-large-buffers-to-manage-sort-keys)
|
||||
for more details).
|
||||
|
||||
Here are some rules of thumb; please do not rely on them. If you are looking
|
||||
|
@ -131,7 +131,7 @@ Following is a list of the names for each level and an example usage:
|
||||
5. Identical Level: When all other levels are equal, the identical level is
|
||||
used as a tiebreaker. The Unicode code point values of the NFD form of each
|
||||
string are compared at this level, just in case there is no difference at
|
||||
levels 1-4 . For example, Hebrew cantillation marks are only distinguished
|
||||
levels 1-4. For example, Hebrew cantillation marks are only distinguished
|
||||
at this level. This level should be used sparingly, because strings that differ
only in code point values are an extremely rare occurrence.
|
||||
Using this level substantially decreases the performance for
|
||||
|
@ -48,7 +48,7 @@ See also
|
||||
* [UCA: Variable
|
||||
Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting)
|
||||
* [LDML: Setting
|
||||
Options](http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-collation.html#Setting_Options)
|
||||
Options](https://htmlpreview.github.io/?https://github.com/unicode-org/cldr/blob/master/docs/ldml/tr35-collation.html#Setting_Options)
|
||||
|
||||
Here is an overview of the sorting results with these options.
|
||||
|
||||
|
@ -820,10 +820,10 @@ These do not cause **c** to sort as if it were **ae**, nor should they.
|
||||
The Unicode Collation Algorithm specifies that all text sort as if it were first
|
||||
normalized into NFD. For performance reasons, ICU collation data is
|
||||
pre-processed so that there is no need to perform normalization on strings that
|
||||
are in [FCD](http://unicode.org/notes/tn5/#FCD) and do not contain any composite
|
||||
are in [FCD](http://www.unicode.org/notes/tn5/#FCD) and do not contain any composite
|
||||
combining marks. Composite combining marks are: { U+0344, U+0F73, U+0F75, U+0F81
|
||||
}
|
||||
[`[[:^lccc=0:]&[:toNFD=/../:]]`](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3A%5Elccc%3D0%3A%5D%26%5B%3AtoNFD%3D%2F..%2F%3A%5D&abb=on&g=)
|
||||
[`[[:^lccc=0:]&[:toNFD=/../:]]`](http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3A%5Elccc%3D0%3A%5D%26%5B%3AtoNFD%3D%2F..%2F%3A%5D&abb=on&g=)
|
||||
(These characters must be decomposed for discontiguous contractions to work
|
||||
properly. Use of these characters is discouraged by the Unicode Standard.). The
|
||||
vast majority of strings are in this form.
|
||||
|
@ -27,7 +27,7 @@ are also provided the facilities to customize orderings.
|
||||
|
||||
Starting in release 1.8, the ICU Collation Service is compliant with the Unicode
|
||||
Collation Algorithm (UCA) ([Unicode Technical Standard
|
||||
#10](http://www.unicode.org/unicode/reports/tr10/)) and based on the Default
|
||||
#10](http://www.unicode.org/reports/tr10/)) and based on the Default
|
||||
Unicode Collation Element Table (DUCET) which defines the same sort order as ISO
|
||||
14651.
|
||||
|
||||
@ -132,15 +132,15 @@ The basic ICU Collation Service is provided by two main categories of APIs:
|
||||
use index tables of sort keys to index strings quickly. Note, however, that
|
||||
this only improves performance for large numbers of strings because sorting
|
||||
via the comparison functions is very fast. For more information, see
|
||||
[Sortkeys vs Comparison](concepts.md#sortkeys-vs-comparison).
|
||||
[Sortkeys vs Comparison](concepts#sortkeys-vs-comparison).
|
||||
|
||||
ICU provides an AlphabeticIndex API for generating language-appropriate
|
||||
sorted-section labels like in dictionaries and phone books.
|
||||
|
||||
ICU also provides a higher-level [string search](string-search.md)
|
||||
ICU also provides a higher-level [string search](string-search)
|
||||
API which can be used, for example, for case-insensitive or accent-insensitive
|
||||
search in an editor or in a web page. ICU string search is based on the
|
||||
low-level [collation element iteration](architecture.md).
|
||||
low-level [collation element iteration](architecture).
|
||||
|
||||
## Programming Examples
|
||||
|
||||
|
@ -155,7 +155,7 @@ match is returned. However, the behavior can be overwritten by supplying a word
|
||||
|
||||
The minimum unit of match is aligned to an extended grapheme cluster in the ICU
|
||||
string search service implementation defined by [UAX #29 Unicode Text
|
||||
Segmentation](http://unicode.org/reports/tr29/). Therefore, all matches will
|
||||
Segmentation](http://www.unicode.org/reports/tr29/). Therefore, all matches will
|
||||
begin and end on extended grapheme cluster boundaries. If the given input search
|
||||
pattern starts with a non-base character, no matches will be returned.
|
||||
When there are contractions in the collation sequence and the contraction
|
||||
@ -264,7 +264,7 @@ service. Therefore, all the performance implications that apply to a collator
|
||||
are also applicable to the string search service. To obtain the best
|
||||
performance, use the default collator attributes described in the Performance
|
||||
and Storage Implications on Attributes section in the [Collation Service
|
||||
Architecture](architecture.md#-performance-and-storage-implications-on-attributes)
|
||||
Architecture](architecture#performance-and-storage-implications-of-attributes)
|
||||
chapter. In addition, users need to be aware of
|
||||
the following `StringSearch` specific considerations:
|
||||
|
||||
|
@ -38,7 +38,7 @@ Unicode (usually in the Basic Latin range).
|
||||
For additional detail about the compression algorithm, which has been approved
|
||||
by the Unicode Consortium, please refer to [Unicode Technical Report #6 (A
|
||||
Standard Compression Scheme for
|
||||
Unicode)](https://www.unicode.org/unicode/reports/tr6/).
|
||||
Unicode)](https://www.unicode.org/reports/tr6/).
|
||||
|
||||
The Standard Compression Scheme for Unicode (SCSU) is used to:
|
||||
|
||||
|
@ -607,7 +607,7 @@ character preceding a Variation Selector.
|
||||
Unicode has a character property to identify such characters, as well as
|
||||
currently-unassigned code points that are intended to be used for similar
|
||||
purposes: Default_Ignorable_Code_Point, or "DI" for short:
|
||||
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:DI:]
|
||||
http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=[:DI:]
|
||||
|
||||
Most charsets do not have most or any of these characters.
|
||||
|
||||
|
@ -118,10 +118,10 @@ used in ICU. For related material, please see:
|
||||
|
||||
1. [ICU character set collection](http://icu-project.org/charts/charset/)
|
||||
|
||||
2. [Unicode Technical Report 22](http://www.unicode.org/unicode/reports/tr22/)
|
||||
2. [Unicode Technical Report 22](http://www.unicode.org/reports/tr22/)
|
||||
|
||||
3. "Cross Mapping Tables" in [Unicode Online
|
||||
Data](http://www.unicode.org/unicode/onlinedat/online.html)
|
||||
Data](http://www.unicode.org/onlinedat/online.html)
|
||||
|
||||
## ICU Mapping Table Data Files
|
||||
|
||||
|
@ -91,7 +91,7 @@ Unicode.
|
||||
of such problems, see for example ["Character Conversions and Mapping
|
||||
Tables"](http://icu-project.org/docs/papers/conversions_and_mappings_iuc19.ppt)
|
||||
on <http://icu-project.org/docs/> and the [XML Japanese
|
||||
Profile](http://www.w3.org/TR/japanese-xml/) .)
|
||||
Profile](http://www.w3.org/TR/japanese-xml/)).
|
||||
|
||||
1. Use UTF-8 and UTF-16.
|
||||
|
||||
@ -141,7 +141,7 @@ Unicode.
|
||||
references](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html).
|
||||
|
||||
7. For data exchange (rather than pure display), turn off fallback
|
||||
mappings: ucnv_setFallback(cnv, FALSE);
|
||||
mappings: `ucnv_setFallback(cnv, FALSE)`;
|
||||
|
||||
8. For some text formats, especially XML and HTML, it is possible to set an
|
||||
"escape callback" function that turns unmappable Unicode code points
|
||||
|
@ -25,15 +25,17 @@ License & terms of use: http://www.unicode.org/copyright.html
|
||||
ICU has two main calendar classes used for parsing and formatting Calendar
|
||||
information correctly:
|
||||
|
||||
1. Calendar
|
||||
1. `Calendar`
|
||||
|
||||
An abstract base class that defines the calendar API. This API supports
|
||||
UDate to fields conversion and field arithmetic.
|
||||
|
||||
2. GregorianCalendar
|
||||
A concrete subclass of Calendar that implements the standard calendar used
|
||||
2. `GregorianCalendar`
|
||||
|
||||
A concrete subclass of `Calendar` that implements the standard calendar used
|
||||
today internationally.
|
||||
|
||||
In addition to these, ICU has other Calendar sub classes to support
|
||||
In addition to these, ICU has other `Calendar` subclasses to support
|
||||
non-gregorian calendars including:
|
||||
|
||||
* Japanese
|
||||
@ -56,55 +58,51 @@ non-gregorian calendars including:
|
||||
|
||||
* Ethiopic
|
||||
|
||||
The Calendar class is designed to support additional calendar systems in the
|
||||
future.
|
||||
The `Calendar` class is designed to support additional calendar systems in the future.
|
||||
|
||||
> :point_right: **Note**: *Calendar classes are related to UDate, the TimeZone classes, and the DateFormat
|
||||
> :point_right: **Note**: *`Calendar` classes are related to `UDate`, the `TimeZone` classes, and the `DateFormat`
|
||||
classes.*
|
||||
|
||||
### Calendar locale and keyword handling
|
||||
|
||||
When a calendar object is created, via either Calendar::create(), or
|
||||
ucal_open(), or indirectly within a date formatter, ICU looks up the 'default'
|
||||
When a calendar object is created, via either `Calendar::create()`, or
|
||||
`ucal_open()`, or indirectly within a date formatter, ICU looks up the 'default'
|
||||
calendar type for that locale. At present, all locales default to a Gregorian
|
||||
calendar, except for the compatibility locales th_TH_TRADITIONAL and
|
||||
ja_JP_TRADITIONAL. If the "calendar" keyword is supplied, this value will
|
||||
override the default for that locale.
|
||||
|
||||
For instance, Calendar::createInstance("fr_FR", status) will create a Gregorian
|
||||
calendar, but Calendar::createInstance("fr_FR@calendar=buddhist") will create a
|
||||
Buddhist calendar.
|
||||
For instance, `Calendar::createInstance("fr_FR", status)` will create a Gregorian calendar,
|
||||
but `Calendar::createInstance("fr_FR@calendar=buddhist")` will create a Buddhist calendar.
|
||||
|
||||
It is an error to use an invalid calendar type. It will produce a missing
|
||||
resource error.
|
||||
It is an error to use an invalid calendar type. It will produce a missing resource error.
|
||||
|
||||
> :point_right: **Note**: *As of ICU 2.8, the above description applies to ICU4J only. ICU4J will have
|
||||
this behavior in 3.0*
|
||||
|
||||
## Usage
|
||||
|
||||
This section discusses how to use the Calendar class and the GregorianCalendar
|
||||
subclass.
|
||||
This section discusses how to use the `Calendar` class and the `GregorianCalendar` subclass.
|
||||
|
||||
### Calendar
|
||||
|
||||
Calendar is an abstract base class. It defines common protocols for a hierarchy
|
||||
of classes. Concrete subclasses of Calendar, for example the GregorianCalendar
|
||||
`Calendar` is an abstract base class. It defines common protocols for a hierarchy
|
||||
of classes. Concrete subclasses of `Calendar`, for example the `GregorianCalendar`
|
||||
class, define specific operations that correspond to a real-world calendar
|
||||
system. Calendar objects (instantiations of concrete subclasses of Calendar),
|
||||
system. `Calendar` objects (instantiations of concrete subclasses of `Calendar`),
|
||||
embody state that represents a specific context. They correspond to a real-world
|
||||
locale. They also contain state that specifies a moment in time.
|
||||
|
||||
The API defined by Calendar encompasses multiple functions:
|
||||
The API defined by `Calendar` encompasses multiple functions:
|
||||
|
||||
1. Representation of a specific time as a UDate
|
||||
1. Representation of a specific time as a `UDate`
|
||||
|
||||
2. Representation of a specific time as a set of integer fields, such as YEAR,
|
||||
MONTH, HOUR, etc.
|
||||
2. Representation of a specific time as a set of integer fields, such as `YEAR`,
|
||||
`MONTH`, `HOUR`, etc.
|
||||
|
||||
3. Conversion from UDate to fields
|
||||
3. Conversion from `UDate` to fields
|
||||
|
||||
4. Conversion from fields to UDate
|
||||
4. Conversion from fields to `UDate`
|
||||
|
||||
5. Field arithmetic, including adding, rolling, and field difference
|
||||
|
||||
@ -116,49 +114,49 @@ The API defined by Calendar encompasses multiple functions:
|
||||
|
||||
#### Representation and Conversion
|
||||
|
||||
The basic function of the Calendar class is to convert between a UDate value and
|
||||
a set of integer fields. A UDate value is stored as UTC time in milliseconds,
|
||||
which means it is calendar and time zone independent. UDate is the most compact
|
||||
and portable way to store and transmit a date and time. Integer field values, on
|
||||
The basic function of the `Calendar` class is to convert between a `UDate` value and
|
||||
a set of integer fields. A `UDate` value is stored as UTC time in milliseconds,
|
||||
which means it is calendar and time zone independent. `UDate` is the most compact
|
||||
and portable way to store and transmit a date and time. `Integer` field values, on
|
||||
the other hand, depend on the calendar system (that is, the concrete subclass of
|
||||
Calendar) and the calendar object's context state.
|
||||
`Calendar`) and the calendar object's context state.
|
||||
|
||||
> :point_right: **Note**: *Integer field values are needed when implementing a human interface that must
|
||||
> :point_right: **Note**: *`Integer` field values are needed when implementing a human interface that must
|
||||
display or input a date and/or time.*
|
||||
|
||||
At any given time, a calendar object uses (when DateFormat is not sufficient)
|
||||
either its internal UDate or its integer fields (depending on which has been set
|
||||
most recently via setTime() or set()), to represent a specific date and time.
|
||||
Whatever the current internal representation, when the caller requests a UDate
|
||||
At any given time, a calendar object uses (when `DateFormat` is not sufficient)
|
||||
either its internal `UDate` or its integer fields (depending on which has been set
|
||||
most recently via `setTime()` or `set()`), to represent a specific date and time.
|
||||
Whatever the current internal representation, when the caller requests a `UDate`
|
||||
or an integer field it is computed if necessary. The caller need never trigger
|
||||
the conversion explicitly. The caller must perform a conversion to set either
|
||||
the UDate or the integer fields, and then retrieve the desired data. This also
|
||||
the `UDate` or the integer fields, and then retrieve the desired data. This also
|
||||
applies in situations where the caller has some integer fields and wants to
|
||||
obtain others.
|
||||
|
||||
#### Field Arithmetic
|
||||
|
||||
Arithmetic with UDate values is straightforward. Since the values are
|
||||
Arithmetic with `UDate` values is straightforward. Since the values are
|
||||
millisecond scalar values, direct addition and subtraction is all that is
|
||||
required. Arithmetic with integer fields is more complicated. For example, what
|
||||
is the date June 4, 1999 plus 300 days? Calendar defines three basic methods (in
|
||||
several variants) that perform field arithmetic: add(), roll(), and
|
||||
fieldDifference().
|
||||
is the date June 4, 1999 plus 300 days? `Calendar` defines three basic methods (in
|
||||
several variants) that perform field arithmetic: `add()`, `roll()`, and
|
||||
`fieldDifference()`.
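
As a rough sketch of why the scalar case is simple: a date stored as milliseconds needs only addition and subtraction. The alias `DateMillis` and the helpers below are hypothetical names used for illustration, not ICU API. The sketch works on raw milliseconds, so it does not answer the calendar-field question, such as which month and day result.

```c++
// Hypothetical alias for a millisecond timestamp stored as a double.
typedef double DateMillis;

constexpr double kMillisPerDay = 24.0 * 60.0 * 60.0 * 1000.0;

// Shift a timestamp by a whole number of 24-hour days.
constexpr DateMillis addDays(DateMillis date, int days) {
    return date + days * kMillisPerDay;
}

// Elapsed 24-hour days between two timestamps (may be fractional or negative).
constexpr double daysBetween(DateMillis from, DateMillis to) {
    return (to - from) / kMillisPerDay;
}
```

Turning the shifted timestamp back into year, month, and day fields is the part that depends on the calendar system, which is why the field arithmetic methods exist.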
|
||||
|
||||
The add() method adds positive or negative values to a specified field. For
|
||||
example, calling add(Calendar::MONTH, 2) on a GregorianCalendar object set to
|
||||
March 15, 1999 sets the calendar to May 15, 1999. The roll() method is similar,
|
||||
The `add()` method adds positive or negative values to a specified field. For
|
||||
example, calling `add(Calendar::MONTH, 2)` on a `GregorianCalendar` object set to
|
||||
March 15, 1999 sets the calendar to May 15, 1999. The `roll()` method is similar,
|
||||
but does not modify fields that are larger. For example, calling
|
||||
roll(Calendar::HOUR, n) changes the hour that a calendar is set to without
|
||||
changing the day. Calling roll(Calendar::MONTH, n) changes the month without
|
||||
`roll(Calendar::HOUR, n)` changes the hour that a calendar is set to without
|
||||
changing the day. Calling `roll(Calendar::MONTH, n)` changes the month without
|
||||
changing the year.
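
The difference between the two behaviors can be illustrated with a small stand-alone sketch. It does not use ICU: `YearMonth`, `addMonths`, and `rollMonths` are hypothetical names, and the sketch ignores details such as day-of-month adjustment and years before 1.

```c++
// Hypothetical year/month pair; month is zero-based (0 = January).
struct YearMonth {
    int year;
    int month;
};

// add-style arithmetic: overflow in the month carries into the year.
// Assumes the resulting month count stays non-negative (positive years).
constexpr YearMonth addMonths(YearMonth ym, int delta) {
    return YearMonth{(ym.year * 12 + ym.month + delta) / 12,
                     (ym.year * 12 + ym.month + delta) % 12};
}

// roll-style arithmetic: the month wraps within 0..11 and the year is untouched.
constexpr YearMonth rollMonths(YearMonth ym, int delta) {
    return YearMonth{ym.year, ((ym.month + delta) % 12 + 12) % 12};
}
```

Starting from November 1999 (`{1999, 10}`) and moving three months forward, `addMonths` yields February 2000, while `rollMonths` yields February 1999.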
|
||||
|
||||
The fieldDifference() method is the inverse of the add() method. It computes the
|
||||
difference between a calendar's currently set time and a specified UDate in
|
||||
terms of a specified field. Repeated calls to fieldDifference() compute the
|
||||
difference between two UDates in terms of whatever fields the caller specifies
|
||||
(for example, years, months, days, and hours). If the add() method is called
|
||||
with the results of fieldDifference(when, n) , then the calendar is moved toward
|
||||
The `fieldDifference()` method is the inverse of the `add()` method. It computes the
|
||||
difference between a calendar's currently set time and a specified `UDate` in
|
||||
terms of a specified field. Repeated calls to `fieldDifference()` compute the
|
||||
difference between two `UDate` objects in terms of whatever fields the caller specifies
|
||||
(for example, years, months, days, and hours). If the `add()` method is called
|
||||
with the results of `fieldDifference(when, n)`, then the calendar is moved toward
|
||||
`when`, field by field.
|
||||
|
||||
This is demonstrated in the following example:
|
||||
@ -176,9 +174,9 @@ int dayDiff = cal.fieldDifference(date, Calendar.DAY_OF_MONTH); // dayDiff ;<=19
|
||||
|
||||
#### Context Management
|
||||
|
||||
A calendar object performs its computations within a specific context. The
|
||||
A `Calendar` object performs its computations within a specific context. The
|
||||
context affects the results of conversions and arithmetic computations. When a
|
||||
calendar object is created, it establishes its context using either default
|
||||
`Calendar` object is created, it establishes its context using either default
|
||||
values or values specified by the caller:
|
||||
|
||||
1. Locale-specific week data, including the first day of the week and the
|
||||
@ -186,29 +184,28 @@ values or values specified by the caller:
|
||||
resource data for the specified locale, or if none is specified, for the
|
||||
default locale.
|
||||
|
||||
2. A TimeZone object. Initially, this is set to the specified zone object, or
|
||||
if none is specified, the default TimeZone.
|
||||
2. A `TimeZone` object. Initially, this is set to the specified zone object, or
|
||||
if none is specified, the default `TimeZone`.
|
||||
|
||||
The context of a calendar object can be queried after the calendar is created
|
||||
using calls such as getMinimalDaysInFirstWeek(), getFirstDayOfWeek(), and
|
||||
getTimeZone(). The context can be changed using calls such as
|
||||
setMinimalDaysInFirstWeek(), setFirstDayOfWeek(), and setTimeZone().
|
||||
The context of a `Calendar` object can be queried after the calendar is created
|
||||
using calls such as `getMinimalDaysInFirstWeek()`, `getFirstDayOfWeek()`, and
|
||||
`getTimeZone()`. The context can be changed using calls such as
|
||||
`setMinimalDaysInFirstWeek()`, `setFirstDayOfWeek()`, and `setTimeZone()`.
|
||||
|
||||
#### Factory Methods
|
||||
|
||||
Like other format classes, the best way to create a calendar object is by using
|
||||
one of the factory methods. These are static methods on the Calendar class that
|
||||
one of the factory methods. These are static methods on the `Calendar` class that
|
||||
create and return an instance of a concrete subclass. Factory methods should be
|
||||
used to enable the code to obtain the correct calendar for a locale without
|
||||
having to know specific details. The factory methods on Calendar are named
|
||||
createInstance().
|
||||
having to know specific details. The factory methods on `Calendar` are named
|
||||
`createInstance()`.
|
||||
|
||||
***MONTH field***
|
||||
> :point_right: **Note**: *Calendar numbers months starting from zero, so calling cal.set(1998, 3, 5)
|
||||
***`MONTH` field***
|
||||
> :point_right: **Note**: *Calendar numbers months starting from zero, so calling `cal.set(1998, 3, 5)`
|
||||
sets cal to April 5, 1998, not March 5, 1998. This follows the Java
|
||||
convention. To avoid mistakes, use the constants defined in the Calendar class
|
||||
for the months and days of the week. For example, cal.set(1998, Calendar::APRIL,
|
||||
15).*
|
||||
convention. To avoid mistakes, use the constants defined in the `Calendar` class
|
||||
for the months and days of the week. For example, `cal.set(1998, Calendar::APRIL, 15)`.*
|
||||
|
||||
#### Ambiguous Wall Clock Time Resolution
|
||||
|
||||
@ -227,34 +224,34 @@ this case, the ICU Calendar resolves the time using the UTC offset after the
|
||||
transition by default. For example, 1:30 AM on the date is resolved as 1:30 AM
|
||||
standard time.
|
||||
Ambiguous wall clock time resolution behaviors can be customized by Calendar
|
||||
APIs setRepeatedWallTimeOption() and setSkippedWallTimeOption(). These APIs are
|
||||
APIs `setRepeatedWallTimeOption()` and `setSkippedWallTimeOption()`. These APIs are
|
||||
available in ICU 49 or later versions.
|
||||
|
||||
### Gregorian Calendar
|
||||
### `GregorianCalendar`
|
||||
|
||||
The GregorianCalendar class implements two calendar systems, the Gregorian
|
||||
The `GregorianCalendar` class implements two calendar systems, the Gregorian
|
||||
calendar and the Julian calendar. These calendar systems are closely related,
|
||||
differing mainly in their definition of the leap year. The Julian calendar has
|
||||
leap years every four years; the Gregorian calendar refines this by excluding
|
||||
century years that are not divisible by 400. GregorianCalendar defines two eras,
|
||||
century years that are not divisible by 400. `GregorianCalendar` defines two eras,
|
||||
BC (B.C.E.) and AD (C.E.).
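
The two leap year rules described above can be written out as a short sketch. The function names are hypothetical and are not part of the `GregorianCalendar` API. The sketch assumes positive (C.E.) year numbers.

```c++
// Julian rule: every fourth year is a leap year.
constexpr bool isJulianLeapYear(int year) {
    return year % 4 == 0;
}

// Gregorian rule: as Julian, except century years not divisible by 400.
constexpr bool isGregorianLeapYear(int year) {
    return year % 4 == 0 && (year % 100 != 0 || year % 400 == 0);
}
```

For example, 1900 is a leap year under the Julian rule but not under the Gregorian rule, while 2000 is a leap year under both.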
|
||||
|
||||
Historically, most western countries used the Julian calendar until the 16th to
|
||||
20th century, depending on the country. They then switched to the Gregorian
|
||||
calendar. The GregorianCalendar class mirrors this behavior by defining a
|
||||
calendar. The `GregorianCalendar` class mirrors this behavior by defining a
|
||||
cut-over date. Before this date, the Julian calendar algorithms are used. After
|
||||
it, the Gregorian calendar algorithms are used. By default, the cut-over date is
|
||||
set to October 4, 1582 C.E., which reflects the time when countries first began
|
||||
adopting the Gregorian calendar. The GregorianCalendar class does not attempt
|
||||
adopting the Gregorian calendar. The `GregorianCalendar` class does not attempt
|
||||
historical accuracy beyond this behavior, and does not vary its cut-over date by
|
||||
locale. However, users can modify the cut-over date by using the
|
||||
setGregorianChange() method.
|
||||
`setGregorianChange()` method.
|
||||
|
||||
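The two leap-year rules and the cut-over can be sketched as follows. This is a simplified illustration, not ICU's implementation: it assumes positive (C.E.) year numbers and switches rules on a whole cut-over year, whereas the real class switches at a cut-over date. The function names are made up for this example.

```c++
// Gregorian rule: every fourth year, except century years not divisible by 400.
constexpr bool isGregorianLeapYear(int year) {
    return (year % 4 == 0) && ((year % 100 != 0) || (year % 400 == 0));
}

// Julian rule: every fourth year, with no century exception.
constexpr bool isJulianLeapYear(int year) {
    return year % 4 == 0;
}

// Applies the Julian rule before the cut-over year and the Gregorian rule
// from the cut-over year onward. 1582 matches the default described above.
constexpr bool isLeapYear(int year, int cutoverYear = 1582) {
    return year >= cutoverYear ? isGregorianLeapYear(year)
                               : isJulianLeapYear(year);
}
```

For example, 1900 is not a leap year under the default cut-over, but it is one if the cut-over year is moved to 1918, because the Julian rule then still applies to 1900.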
Code that is written correctly instantiates calendar objects using the Calendar
|
||||
factory methods, and therefore holds a Calendar* pointer, Such code can not
|
||||
directly access the GregorianCalendar-specific methods not present in Calendar.
|
||||
factory methods, and therefore holds a `Calendar*` pointer. Such code cannot
|
||||
directly access the `GregorianCalendar`-specific methods not present in `Calendar`.
|
||||
The correct way to handle this is to perform a dynamic cast, after testing the
|
||||
type of the object using getDynamicClassID(). For example:
|
||||
type of the object using `getDynamicClassID()`. For example:
|
||||
|
||||
```c++
|
||||
void setCutover(Calendar *cal, UDate myCutover) {
|
||||
@ -270,8 +267,8 @@ with the factory methods.*
|
||||
|
||||
### Disambiguation
|
||||
|
||||
When computing a UDate from fields, some special circumstances can arise. There
|
||||
might be insufficient information to compute the UDate (such as only year and
|
||||
When computing a `UDate` from fields, some special circumstances can arise. There
|
||||
might be insufficient information to compute the `UDate` (such as only year and
|
||||
month but no day in the month), there might be inconsistent information (such as
|
||||
"Tuesday, July 15, 1996" (July 15, 1996, is actually a Monday), or the input
|
||||
time might be ambiguous because of time zone transition.
|
||||
@ -279,21 +276,21 @@ time might be ambiguous because of time zone transition.
|
||||
1. **Insufficient Information**
|
||||
ICU Calendar uses the default field values to specify missing fields. The
|
||||
default for a field is the same as that of the start of the epoch (that is,
|
||||
YEAR = 1970, MONTH = JANUARY, DAY_OF_MONTH = 1).
|
||||
`YEAR = 1970`, `MONTH = JANUARY`, `DAY_OF_MONTH = 1`).
|
||||
|
||||
2. **Inconsistent Information**
|
||||
If fields conflict, the calendar gives preference to fields set more
|
||||
recently. For example, when determining the day, the calendar looks for one
|
||||
of the following combinations of fields:
|
||||
MONTH + DAY_OF_MONTH
|
||||
MONTH + WEEK_OF_MONTH + DAY_OF_WEEK
|
||||
MONTH + DAY_OF_WEEK_IN_MONTH + DAY_OF_WEEK
|
||||
DAY_OF_YEAR
|
||||
DAY_OF_WEEK + WEEK_OF_YEAR
|
||||
`MONTH + DAY_OF_MONTH`
|
||||
`MONTH + WEEK_OF_MONTH + DAY_OF_WEEK`
|
||||
`MONTH + DAY_OF_WEEK_IN_MONTH + DAY_OF_WEEK`
|
||||
`DAY_OF_YEAR`
|
||||
`DAY_OF_WEEK + WEEK_OF_YEAR`
|
||||
For the time of day, the calendar looks for one of the following
|
||||
combinations of fields:
|
||||
HOUR_OF_DAY
|
||||
AM_PM + HOUR
|
||||
`HOUR_OF_DAY`
|
||||
`AM_PM + HOUR`
|
||||
|
||||
3. **Ambiguous Wall Clock Time**
|
||||
When the time offset from UTC changes, it produces an ambiguous time slot
|
||||
@ -312,14 +309,14 @@ time might be ambiguous because of time zone transition.
|
||||
1:30 AM standard time.
|
||||
|
||||
***Options for Ambiguous Time Resolution***
|
||||
> :point_right: **Note**: *Ambiguous wall clock time resolution behaviors can be customized by Calendar APIs setRepeatedTimeOption() and setSkippedTimeOption(). These methods are available in ICU 49 or later versions.*
|
||||
> :point_right: **Note**: *Ambiguous wall clock time resolution behaviors can be customized by Calendar APIs `setRepeatedWallTimeOption()` and `setSkippedWallTimeOption()`. These methods are available in ICU 49 or later versions.*
|
||||
|
||||
***WEEK_OF_YEAR field***
|
||||
> :point_right: **Note**: *Values calculated for the WEEK_OF_YEAR field range from 1 to 53. Week 1 for a year is the first week that contains at least getMinimalDaysInFirstWeek() days from that year. It depends on the values of getMinimalDaysInFirstWeek(), getFirstDayOfWeek(), and the day of the week of January 1. Weeks between week 1 of one year and week 1 of the following year are numbered sequentially from 2 to 52 or 53 (if needed).
|
||||
For example, January 1, 1998 was a Thursday. If getFirstDayOfWeek() is MONDAY
|
||||
and getMinimalDaysInFirstWeek() is 4 (these are the values reflecting ISO 8601
|
||||
***`WEEK_OF_YEAR` field***
|
||||
> :point_right: **Note**: *Values calculated for the `WEEK_OF_YEAR` field range from 1 to 53. Week 1 for a year is the first week that contains at least `getMinimalDaysInFirstWeek()` days from that year. It depends on the values of `getMinimalDaysInFirstWeek()`, `getFirstDayOfWeek()`, and the day of the week of January 1. Weeks between week 1 of one year and week 1 of the following year are numbered sequentially from 2 to 52 or 53 (if needed).
|
||||
For example, January 1, 1998 was a Thursday. If `getFirstDayOfWeek()` is `MONDAY`
|
||||
and `getMinimalDaysInFirstWeek()` is `4` (these are the values reflecting ISO 8601
|
||||
and many national standards), then week 1 of 1998 starts on December 29, 1997,
|
||||
and ends on January 4, 1998. However, if getFirstDayOfWeek() is SUNDAY, then
|
||||
and ends on January 4, 1998. However, if `getFirstDayOfWeek()` is `SUNDAY`, then
|
||||
week 1 of 1998 starts on January 4, 1998, and ends on January 10, 1998. The
|
||||
first three days of 1998 are then part of week 53 of 1997.*
|
||||
|
||||
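The week-numbering rule in the `WEEK_OF_YEAR` note can be checked with a short sketch. It is an illustration under stated assumptions, not ICU's code: the `DayOfWeek` enum and `weekOfYear()` function are hypothetical, a result of `0` stands for "last week of the previous year", and days at the end of a year that belong to week 1 of the next year are not handled.

```c++
// Sunday-based day numbering, used only by this sketch.
enum DayOfWeek {
    SUNDAY = 0, MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY
};

// dayOfYear is 1-based (January 1 is day 1). jan1 is the weekday of January 1.
// Returns the week number, or 0 if the day falls in a leading partial week
// that is too short to count as week 1.
inline int weekOfYear(int dayOfYear, DayOfWeek jan1,
                      DayOfWeek firstDayOfWeek, int minimalDaysInFirstWeek) {
    // Days between the start of the week containing January 1 and January 1.
    const int offset = (jan1 - firstDayOfWeek + 7) % 7;
    // Days of the new year that fall into that first (possibly partial) week.
    const int daysInFirstWeek = 7 - offset;
    // Zero-based index of the week containing dayOfYear.
    const int weekIndex = (dayOfYear - 1 + offset) / 7;
    if (daysInFirstWeek >= minimalDaysInFirstWeek) {
        return weekIndex + 1;
    }
    return weekIndex;
}
```

With January 1, 1998 on a Thursday, `weekOfYear(1, THURSDAY, MONDAY, 4)` yields week 1, while `weekOfYear(3, THURSDAY, SUNDAY, 4)` yields 0 (January 3 still belongs to the previous year's last week) and `weekOfYear(4, THURSDAY, SUNDAY, 4)` yields week 1, matching the example in the note.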
|
@ -30,7 +30,7 @@ point at which the offset changes and the amount by which it changes. Thus, a
|
||||
time zone is described by the following information:
|
||||
|
||||
* An identifying string, or ID. This consists only of invariant characters
|
||||
(see the file utypes.h). It typically has the format continent / city. The
|
||||
(see the file `utypes.h`). It typically has the format continent / city. The
|
||||
city chosen is not the only city in which the zone applies, but rather a
|
||||
representative city for the region. Some IDs consist of three or four
|
||||
uppercase letters; these are legacy zone names that are aliases to standard
|
||||
@ -54,7 +54,7 @@ If DST is observed, then three additional pieces of information are needed:
|
||||
|
||||
### System and User Time Zones
|
||||
|
||||
ICU supports local time zones through the classes TimeZone and SimpleTimeZone in
|
||||
ICU supports local time zones through the classes `TimeZone` and `SimpleTimeZone` in
|
||||
the C++ API. In the C API, time zones are designated by their ID strings.
|
||||
|
||||
Users can construct their own time zone objects by specifying the above
|
||||
@ -65,12 +65,12 @@ GMT offset and in alphabetical order of ID.
|
||||
|
||||
Since this list changes one or more times a year, *this document only represents
|
||||
a snapshot*. For the most current list of ICU system zones, use the method
|
||||
TimeZone::getAvailableIDs().
|
||||
`TimeZone::getAvailableIDs()`.
|
||||
|
||||
*The zones are listed in binary sort order (that is, 'A' through 'Z' come before
|
||||
'a' through 'z'). This is the same order in which the zones are stored
|
||||
internally, and the same order in which they are returned by
|
||||
TimeZone::getAvailableIDs(). The reason for this is that ICU locates zones using
|
||||
`TimeZone::getAvailableIDs()`. The reason for this is that ICU locates zones using
|
||||
a binary search, and the binary search relies on this sort order.*
|
||||
*You might notice that zones such as Etc/GMT+1 appear to have the wrong sign for
|
||||
their GMT offset. In fact, their sign is inverted since the Etc zones follow
|
||||
@ -81,72 +81,76 @@ details.*
|
||||
### References
|
||||
|
||||
The ICU system time zones are derived from the tz database (also known as the
|
||||
“Olson” database) at [ftp://elsie.nci.nih.gov/pub](ftp://elsie.nci.nih.gov/pub)
|
||||
. This is the data used across much of the industry, including by UNIX systems,
|
||||
“Olson” database) at [ftp://elsie.nci.nih.gov/pub](ftp://elsie.nci.nih.gov/pub).
|
||||
This is the data used across much of the industry, including by UNIX systems,
|
||||
and is usually updated several times each year. ICU (since version 2.8) and base
|
||||
Java (since Java 1.4) contain code and tz data supporting both current and
|
||||
historic time zone usage.
|
||||
|
||||
## How ICU Represents Dates/Times
|
||||
|
||||
ICU represents dates and times using UDates. A UDate is a scalar value that
|
||||
ICU represents dates and times using `UDate`s. A `UDate` is a scalar value that
|
||||
indicates a specific point in time, independent of calendar system and local
|
||||
time zone. It is stored as the number of milliseconds from a reference point
|
||||
known as the epoch. The epoch is midnight Universal Time Coordinated (UTC)
|
||||
January 1, 1970 A.D. Negative UDate values indicate times before the epoch.
|
||||
January 1, 1970 A.D. Negative `UDate` values indicate times before the epoch.
|
||||
|
||||
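To make the "milliseconds from the epoch" representation concrete, here is a minimal sketch that computes such a value for a UTC calendar date. It does not use ICU: `daysFromCivil()` and `millisSinceEpoch()` are hypothetical helpers, the month is 1-based in this sketch, the proleptic Gregorian calendar is assumed for all dates, and time zones are ignored. The day count uses a widely known days-from-civil computation.

```c++
#include <cstdint>

constexpr double kMillisPerDay = 86400000.0;

// Days between 1970-01-01 and the given proleptic Gregorian date.
// month is 1..12 and day is 1..31 in this sketch.
inline int64_t daysFromCivil(int64_t year, unsigned month, unsigned day) {
    // Treat January and February as the end of the previous year,
    // so that the leap day is the last day of the shifted year.
    year -= (month <= 2) ? 1 : 0;
    const int64_t era = (year >= 0 ? year : year - 399) / 400;
    const unsigned yearOfEra = static_cast<unsigned>(year - era * 400);
    const unsigned dayOfYear =
        (153 * (month > 2 ? month - 3 : month + 9) + 2) / 5 + day - 1;
    const unsigned dayOfEra =
        yearOfEra * 365 + yearOfEra / 4 - yearOfEra / 100 + dayOfYear;
    return era * 146097 + static_cast<int64_t>(dayOfEra) - 719468;
}

// Milliseconds from the epoch to midnight UTC on the given date,
// held in a double as described above. Negative before the epoch.
inline double millisSinceEpoch(int64_t year, unsigned month, unsigned day) {
    return static_cast<double>(daysFromCivil(year, month, day)) * kMillisPerDay;
}
```

Under these assumptions, January 1, 1970 maps to `0`, December 31, 1969 maps to `-86400000`, and January 1, 2000 maps to `946684800000`.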
*These classes have the same architecture as the Java classes.*
|
||||
|
||||
Most people only need to use the DateFormat classes for parsing and formatting
|
||||
Most people only need to use the `DateFormat` classes for parsing and formatting
|
||||
dates and times. However, for those who need to convert dates and times or
|
||||
perform numeric calculations, the services described in this section can be very
|
||||
useful.
|
||||
|
||||
To translate a UDate to a useful form, a calendar system and local time zone
|
||||
must be specified. These are specified in the form of objects of the Calendar
|
||||
and TimeZone classes. Once these two objects are specified, they can be used to
|
||||
convert the UDate to and from its corresponding calendar fields. The different
|
||||
fields are defined in the Calendar class and include the year, month, day, hour,
|
||||
To translate a `UDate` to a useful form, a calendar system and local time zone
|
||||
must be specified. These are specified in the form of objects of the `Calendar`
|
||||
and `TimeZone` classes. Once these two objects are specified, they can be used to
|
||||
convert the `UDate` to and from its corresponding calendar fields. The different
|
||||
fields are defined in the `Calendar` class and include the year, month, day, hour,
|
||||
minute, second, and so on.
|
||||
|
||||
Specific Calendar objects correspond to calendar systems (such as Gregorian) and
|
||||
Specific `Calendar` objects correspond to calendar systems (such as Gregorian) and
|
||||
conventions (such as the first day of the week) in use in different parts of the
|
||||
world. To obtain a Calendar object for France, for example, call
|
||||
Calendar::createInstance(Locale::getFrance(), status).
|
||||
world. To obtain a `Calendar` object for France, for example, call
|
||||
`Calendar::createInstance(Locale::getFrance(), status)`.
|
||||
|
||||
The TimeZone class defines the conversion between universal coordinated time
|
||||
(UTC),, and local time, according to real-world rules. Different TimeZone
|
||||
The `TimeZone` class defines the conversion between universal coordinated time
|
||||
(UTC), and local time, according to real-world rules. Different `TimeZone`
|
||||
objects correspond to different real-world time zones. For example, call
|
||||
TimeZone::createTimeZone("America/Los_Angeles") to obtain an object that
|
||||
`TimeZone::createTimeZone("America/Los_Angeles")` to obtain an object that
|
||||
implements the U.S. Pacific time zone, both Pacific Standard Time (PST) and
|
||||
Pacific Daylight Time (PDT).
|
||||
|
||||
As previously mentioned, the Calendar and TimeZone objects must be specified
|
||||
As previously mentioned, the `Calendar` and `TimeZone` objects must be specified
|
||||
correctly together. One way of doing so is to create each independently, then
|
||||
use the Calendar::setTimeZone() method to associate the time zone with the
|
||||
calendar. Another is to use the Calendar::createInstance() method that takes a
|
||||
TimeZone object. For example, call Calendar::createInstance(
|
||||
TimeZone::createInstance( "America/Los_Angeles"), Locale:getUS(), status) to
|
||||
obtain a Calendar appropriate for use in the U.S. Pacific time zone.
|
||||
use the `Calendar::setTimeZone()` method to associate the time zone with the
|
||||
calendar. Another is to use the `Calendar::createInstance()` method that takes a
|
||||
`TimeZone` object. For example, call `Calendar::createInstance(
|
||||
TimeZone::createTimeZone("America/Los_Angeles"), Locale::getUS(), status)` to
|
||||
obtain a `Calendar` appropriate for use in the U.S. Pacific time zone.
|
||||
|
||||
ICU has four classes pertaining to calendars and timezones:
|
||||
|
||||
* [Calendar](calendar/index.md)
|
||||
Calendar is an abstract base class that represents a calendar system.
|
||||
Calendar objects map UDate values to and from the individual fields used in
|
||||
a particular calendar system. Calendar also performs field computations such
|
||||
* [`Calendar`](calendar/index.md)
|
||||
|
||||
`Calendar` is an abstract base class that represents a calendar system.
|
||||
`Calendar` objects map `UDate` values to and from the individual fields used in
|
||||
a particular calendar system. `Calendar` also performs field computations such
|
||||
as advancing a date by two months.
|
||||
|
||||
* [Gregorian Calendar](calendar/index.md) (§)
|
||||
GregorianCalendar is a concrete subclass of Calendar that implements the
|
||||
* [`GregorianCalendar`](calendar/index.md)
|
||||
|
||||
`GregorianCalendar` is a concrete subclass of `Calendar` that implements the
|
||||
rules of the Julian calendar and the Gregorian calendar, which is the common
|
||||
calendar in use internationally today.
|
||||
|
||||
* [TimeZone](timezone/index.md)
|
||||
TimeZone is an abstract base class that represents a time zone. TimeZone
|
||||
* [`TimeZone`](timezone/index.md)
|
||||
|
||||
`TimeZone` is an abstract base class that represents a time zone. `TimeZone`
|
||||
objects map between universal coordinated time (UTC) and local time.
|
||||
|
||||
* [SimpleTimeZone](timezone/index.md) (§)
|
||||
SimpleTimeZone is a concrete subclass of TimeZone that implements standard
|
||||
* [`SimpleTimeZone`](timezone/index.md)
|
||||
|
||||
`SimpleTimeZone` is a concrete subclass of `TimeZone` that implements standard
|
||||
time and daylight savings time according to real-world rules. Individual
|
||||
SimpleTimeZone objects correspond to real-world time zones.
|
||||
`SimpleTimeZone` objects correspond to real-world time zones.
|
||||
|
@ -36,23 +36,23 @@ Binary datetimes differ in a number of ways: the data type, the unit, and the
|
||||
epoch (origin). We'll refer to these as time scales. For example: (Sorted by
|
||||
epoch and unit, descending. In Java, `int64_t`=`long` and `int32_t`=`int`.)
|
||||
|
||||
| Source | Data Type | Epoch | Unit |
|
||||
| ------------------------------------------ | -------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------- |
|
||||
| MacOS X (CFDate/NSDate) | double (1.0=1s but fractional seconds are used as well; imprecise for 0.1s etc.) | 2001-Jan-01 | seconds (and fractions thereof) |
|
||||
| Unix time_t | int32_t or int64_t (signed int32_t limited to 1970..2038) | 1970-Jan-01 | seconds |
|
||||
| Java Date | int64_t | 1970-Jan-01 | milliseconds |
|
||||
| Joda DateTime | int64_t | 1970-Jan-01 | milliseconds |
|
||||
| ICU4C UDate | double (does not use fractional milliseconds) | 1970-Jan-01 | milliseconds |
|
||||
| JavaScript Date | double (does not use fractional milliseconds; JavaScript Number stores a double) | 1970-Jan-01 | milliseconds |
|
||||
| Unix struct timeval (as in gettimeofday) | struct: time_t (seconds); suseconds_t (microseconds) | 1970-Jan-01 | microseconds |
|
||||
| Gnome g_get_real_time() | gint64 | 1970-Jan-01 | microseconds |
|
||||
| Unix struct timespec (as in clock_gettime) | struct: time_t (seconds); long (nanoseconds) | 1970-Jan-01 | nanoseconds |
|
||||
| MacOS (old) | uint32_t (1904..2040) | 1904-Jan-01 | seconds |
|
||||
| Excel | ? | 1899-Dec-31 | days |
|
||||
| DB2 | ? | 1899-Dec-31 | days |
|
||||
| Windows FILETIME | int64_t | 1601-Jan-01 | ticks (100 nanoseconds; finest granularity in industry) |
|
||||
| .NET DateTime | uint62 (only 0001-9999; only 62 bits; also 2-bit field for UTC/local) | 0001-Jan-01 | ticks (100 nanoseconds; finest granularity in industry) |
|
||||
| ICU Universal Time Scale | int64_t | 0001-Jan-01 | same as .Net but allows 29000BC..29000AD |
|
||||
| Source | Data Type | Epoch | Unit |
|
||||
| ---------------------------------------------- | ---------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------- |
|
||||
| MacOS X (`CFDate/NSDate`) | `double` (1.0=1s but fractional seconds are used as well; imprecise for 0.1s etc.) | 2001-Jan-01 | seconds (and fractions thereof) |
|
||||
| Unix `time_t` | `int32_t` or `int64_t` (`signed int32_t` limited to 1970..2038) | 1970-Jan-01 | seconds |
|
||||
| Java `Date` | `int64_t` | 1970-Jan-01 | milliseconds |
|
||||
| Joda `DateTime` | `int64_t` | 1970-Jan-01 | milliseconds |
|
||||
| ICU4C `UDate` | `double` (does not use fractional milliseconds) | 1970-Jan-01 | milliseconds |
|
||||
| JavaScript `Date` | `double` (does not use fractional milliseconds; JavaScript Number stores a double) | 1970-Jan-01 | milliseconds |
|
||||
| Unix `struct timeval` (as in `gettimeofday`)   | struct: `time_t` (seconds); `suseconds_t` (microseconds)                           | 1970-Jan-01 | microseconds                                            |
|
||||
| Gnome `g_get_real_time()` | `gint64` | 1970-Jan-01 | microseconds |
|
||||
| Unix `struct timespec` (as in `clock_gettime`) | struct: `time_t` (seconds); `long` (nanoseconds)                                   | 1970-Jan-01 | nanoseconds                                             |
|
||||
| MacOS (old) | `uint32_t` (1904..2040) | 1904-Jan-01 | seconds |
|
||||
| Excel | ? | 1899-Dec-31 | days |
|
||||
| DB2 | ? | 1899-Dec-31 | days |
|
||||
| Windows `FILETIME` | `int64_t` | 1601-Jan-01 | ticks (100 nanoseconds; finest granularity in industry) |
|
||||
| .NET `DateTime` | `uint62` (only 0001-9999; only 62 bits; also 2-bit field for UTC/local) | 0001-Jan-01 | ticks (100 nanoseconds; finest granularity in industry) |
|
||||
| ICU Universal Time Scale                       | `int64_t`                                                                          | 0001-Jan-01 | same as .NET but allows 29000BC..29000AD                |
|
||||
|
||||
All of the epochs start at 00:00 am (the earliest possible time on the day in
|
||||
question), and are usually assumed to be UTC.
|
||||
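Since the time scales in the table differ only in epoch and unit, converting between two of them is a shift and a scale. The sketch below converts Java-style milliseconds (epoch 1970-Jan-01) to 100-nanosecond ticks counted from 0001-Jan-01 and back. The constant and function names are invented for this example, range checks are omitted, and the reverse direction simply truncates; it is not the ICU conversion API.

```c++
#include <cstdint>

// Number of 100-nanosecond ticks in one millisecond.
constexpr int64_t kTicksPerMillisecond = 10000;

// Milliseconds from 0001-Jan-01 to 1970-Jan-01 in the proleptic
// Gregorian calendar (719162 days).
constexpr int64_t kUnixEpochOffsetMillis = 62135596800000LL;

// Shift to the 0001-Jan-01 epoch first, then scale to ticks.
constexpr int64_t javaMillisToTicks(int64_t millis) {
    return (millis + kUnixEpochOffsetMillis) * kTicksPerMillisecond;
}

// Scale back to milliseconds (dropping sub-millisecond ticks),
// then shift back to the 1970-Jan-01 epoch.
constexpr int64_t ticksToJavaMillis(int64_t ticks) {
    return ticks / kTicksPerMillisecond - kUnixEpochOffsetMillis;
}
```

For instance, `javaMillisToTicks(0)` gives `621355968000000000`, the tick count of the Unix epoch on the 0001-Jan-01 scale.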
@ -81,7 +81,7 @@ Gregorian calendar. That is, the Gregorian calendar's leap year rules are used
|
||||
for all times, even before 1582 when it was introduced. (This is different from
|
||||
the default ICU calendar which switches from the Julian to the Gregorian
|
||||
calendar in 1582. See `GregorianCalendar::setGregorianChange()` and
|
||||
`ucal_setGregorianChange()`.)
|
||||
`ucal_setGregorianChange()`).
|
||||
|
||||
ICU provides conversion functions to and from all other major time scales,
|
||||
allowing datetimes in any time scale to be converted to the universal time
|
||||
@ -91,12 +91,12 @@ scale, safely manipulated, and converted back to any other datetime time scale.
|
||||
|
||||
So how did we decide what to use for the universal time scale? Java time has
|
||||
plenty of range, but cannot represent a .NET `System.DateTime` value without
|
||||
severe loss of precision. ICU4C time addresses this by using a double that is
|
||||
severe loss of precision. ICU4C time addresses this by using a `double` that is
|
||||
otherwise equivalent to the Java time. However, there are disadvantages with
|
||||
doubles. They provide for much more graceful degradation in arithmetic
|
||||
operations. But they only have 53 bits of accuracy, which means that they will
|
||||
lose precision when converting back and forth to ticks. What would really be
|
||||
nice would be a long double (80 bits -- 64 bit mantissa), but that is not
|
||||
nice would be a `long double` (80 bits -- 64 bit mantissa), but that is not
|
||||
supported on most systems.
|
||||
|
||||
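The 53-bit limit mentioned above is easy to demonstrate. The helper below is a hypothetical name for this sketch, and it assumes `double` is an IEEE 754 binary64 value evaluated without extended precision, as on typical 64-bit platforms. It checks whether a tick count survives a round trip through `double`.

```c++
#include <cstdint>

// Returns true if the tick count comes back unchanged after being
// converted to double and back to a 64-bit integer.
inline bool survivesDoubleRoundTrip(int64_t ticks) {
    const double asDouble = static_cast<double>(ticks);
    return static_cast<int64_t>(asDouble) == ticks;
}
```

Under that assumption, the tick count of the Unix epoch (`621355968000000000`) survives because it is a multiple of a large power of two, but the very next tick (`621355968000000001`) does not: at that magnitude adjacent `double` values are 128 ticks apart.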
The Unix extended time uses a structure with two components: time in seconds and
|
||||
@ -109,7 +109,7 @@ have a fixed size.
|
||||
Because of these issues, we concluded that the .NET `System.DateTime` is the best
|
||||
timescale to use. However, we use the full range allowed by the data type,
|
||||
allowing for datetimes back to 29,000 BC and up to 29,000 AD. (`System.DateTime`
|
||||
uses only 62 bits and only supports dates from 0001 AD to 9999 AD.) This time
|
||||
uses only 62 bits and only supports dates from 0001 AD to 9999 AD). This time
|
||||
scale is very fine grained, does not lose precision, and covers a range that
|
||||
will meet almost all requirements. It will not handle the range that Java times
|
||||
do, but frankly, being able to handle dates before 29,000 BC or after 29,000 AD
|
||||
@ -123,31 +123,31 @@ information about a particular timescale. In all of these routines, the
|
||||
timescales are referenced using an integer constant, according to the following
|
||||
table:
|
||||
|
||||
| Source | ICU4C | ICU4J |
|
||||
| ---------------------- | --------------------------- | ---------------------- |
|
||||
| Java | UDTS_JAVA_TIME | JAVA_TIME |
|
||||
| Unix | UDTS_UNIX_TIME | UNIX_TIME |
|
||||
| ICU4C | UDTS_ICU4C_TIME | ICU4C_TIME |
|
||||
| Windows FILETIME | UDTS_WINDOWS_FILE_TIME | WINDOWS_FILE_TIME |
|
||||
| .NET DateTime | UDTS_DOTNET_DATE_TIME | DOTNET_DATE_TIME |
|
||||
| Macintosh (old) | UDTS_MAC_OLD_TIME | MAC_OLD_TIME |
|
||||
| Macintosh | UDTS_MAC_TIME | MAC_TIME |
|
||||
| Excel | UDTS_EXCEL_TIME | EXCEL_TIME |
|
||||
| DB2 | UDTS_DB2_TIME | DB2_TIME |
|
||||
| Unix with microseconds | UDTS_UNIX_MICROSECONDS_TIME | UNIX_MICROSECONDS_TIME |
|
||||
| Source | ICU4C | ICU4J |
|
||||
| ---------------------- | ----------------------------- | ------------------------ |
|
||||
| Java | `UDTS_JAVA_TIME` | `JAVA_TIME` |
|
||||
| Unix | `UDTS_UNIX_TIME` | `UNIX_TIME` |
|
||||
| ICU4C | `UDTS_ICU4C_TIME` | `ICU4C_TIME` |
|
||||
| Windows FILETIME | `UDTS_WINDOWS_FILE_TIME` | `WINDOWS_FILE_TIME` |
|
||||
| .NET DateTime | `UDTS_DOTNET_DATE_TIME` | `DOTNET_DATE_TIME` |
|
||||
| Macintosh (old) | `UDTS_MAC_OLD_TIME` | `MAC_OLD_TIME` |
|
||||
| Macintosh | `UDTS_MAC_TIME` | `MAC_TIME` |
|
||||
| Excel | `UDTS_EXCEL_TIME` | `EXCEL_TIME` |
|
||||
| DB2 | `UDTS_DB2_TIME` | `DB2_TIME` |
|
||||
| Unix with microseconds | `UDTS_UNIX_MICROSECONDS_TIME` | `UNIX_MICROSECONDS_TIME` |
|
||||
|
||||
The routine that gets a particular piece of information about a timescale takes
|
||||
an integer constant that identifies the particular piece of information,
|
||||
according to the following table:
|
||||
|
||||
| Value | ICU4C | ICU4J |
|
||||
| -------------------- | ----------------------- | ------------------ |
|
||||
| Precision | UTSV_UNITS_VALUE | UNITS_VALUE |
|
||||
| Epoch offet | UTSV_EPOCH_OFFSET_VALUE | EPOCH_OFFSET_VALUE |
|
||||
| Minimum "from" value | UTSV_FROM_MIN_VALUE | FROM_MIN_VALUE |
|
||||
| Maximum "from" value | UTSV_FROM_MAX_VALUE | FROM_MAX_VALUE |
|
||||
| Minimum "to" value | UTSV_TO_MIN_VALUE | TO_MIN_VALUE |
|
||||
| Maximum "to" value | UTSV_TO_MAX_VALUE | TO_MAX_VALUE |
|
||||
| Value | ICU4C | ICU4J |
|
||||
| -------------------- | -------------------------- | -------------------- |
|
||||
| Precision | `UTSV_UNITS_VALUE` | `UNITS_VALUE` |
|
||||
| Epoch offset | `UTSV_EPOCH_OFFSET_VALUE` | `EPOCH_OFFSET_VALUE` |
|
||||
| Minimum "from" value | `UTSV_FROM_MIN_VALUE` | `FROM_MIN_VALUE` |
|
||||
| Maximum "from" value | `UTSV_FROM_MAX_VALUE` | `FROM_MAX_VALUE` |
|
||||
| Minimum "to" value | `UTSV_TO_MIN_VALUE` | `TO_MIN_VALUE` |
|
||||
| Maximum "to" value | `UTSV_TO_MAX_VALUE` | `TO_MAX_VALUE` |
|
||||
|
||||
Here is what the values mean:
|
||||
|
||||
|
@ -30,13 +30,13 @@ the introduction of namespace in C++.
|
||||
The JDK internationalization components and ICU components both share the same
|
||||
common basic architectures with regard to the following:
|
||||
|
||||
1. locales
|
||||
2. data-driven services
|
||||
3. ICU threading models and the open and close model
|
||||
4. cloning customization
|
||||
5. error handling
|
||||
6. extensibility
|
||||
7. resource bundle inheritance model
|
||||
1. [Locales](#locales)
|
||||
2. [Data-driven services](#data-driven-services)
|
||||
3. [ICU threading models and the open and close model](#icu-threading-model-and-open-and-close-model)
|
||||
4. [Cloning customization](#cloning-customization)
|
||||
5. [Error handling](#error-handling)
|
||||
6. [Extensibility](#extensibility)
|
||||
7. [Resource bundle inheritance model](#resource-bundle-inheritance-model)
|
||||
|
||||
There are design features in ICU4C that are not in the Java Development Kit
|
||||
(JDK) due
|
||||
@ -46,10 +46,10 @@ to programming language restrictions. These features include the following:
|
||||
|
||||
Locale IDs are composed of language, country, and variant information. The
|
||||
following links provide additional useful information regarding ISO standards:
|
||||
[ISO-639](http://lcweb.loc.gov/standards/iso639-2/englangn.html) , and an ISO
|
||||
[ISO-639](http://lcweb.loc.gov/standards/iso639-2/englangn.html), and an ISO
|
||||
Country Code,
|
||||
[ISO-3166](http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html)
|
||||
. For example, Italian, Italy, and Euro are designated as: it_IT_EURO.
|
||||
[ISO-3166](http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html).
|
||||
For example, Italian, Italy, and Euro are designated as: it_IT_EURO.
|
||||
|
||||
### Data-driven Services
|
||||
|
||||
@ -150,7 +150,7 @@ Thus, the normal mode of operation is to:
|
||||
> :point_right: **Note**: These service instances may be closed in any sequence.
|
||||
The preceding steps are given as an example.
|
||||
|
||||
#### Cloning Customization
|
||||
### Cloning Customization
|
||||
|
||||
Typically, the services supplied with ICU cover the vast majority of usages.
|
||||
However, there are circumstances where the service needs to be customized for a
|
||||
@ -206,14 +206,14 @@ service objects, organized according to locale. Then, if a particular locale's
|
||||
formatter is in high demand, that formatter can be used, and then returned to
|
||||
the pool.
|
||||
|
||||
### ICU Memory Usage
|
||||
#### ICU Memory Usage
|
||||
|
||||
ICU4C APIs are designed to allow separate heaps for its libraries vs. the
|
||||
application. This is achieved by providing functions to allocate and release
|
||||
objects owned by ICU4C using only ICU4C library functions. For more details see
|
||||
the Memory Usage section in the [Coding Guidelines](dev/codingguidelines.md).
|
||||
|
||||
### ICU Initialization and Termination
|
||||
#### ICU Initialization and Termination
|
||||
|
||||
The ICU library does not normally require any explicit initialization prior to
|
||||
use. An application begins use simply by calling any ICU API in the usual way.
|
||||
@ -389,21 +389,21 @@ methods available:
|
||||
|
||||
**Lookup chain** : Searching for a resource bundle.
|
||||
|
||||
1. `en_US_<some-variant>`
|
||||
2. `en_US`
|
||||
3. `en`
|
||||
4. `<defaultLang>_<defaultCountry>`
|
||||
5. `<defaultLang>`
|
||||
6. `root`
|
||||
en_US_<some-variant>
|
||||
en_US
|
||||
en
|
||||
<defaultLang>_<defaultCountry>
|
||||
<defaultLang>
|
||||
root
|
||||
|
||||
**Lookup chain** : Searching for a \<key, value> pair after
|
||||
`en_US_<some-variant>` has been loaded. ICU does not use the default locale in
|
||||
this case.
|
||||
|
||||
1. `en_US_<some-variant>`
|
||||
2. `en_US`
|
||||
3. `en`
|
||||
4. `root`
|
||||
en_US_<some-variant>
|
||||
en_US
|
||||
en
|
||||
root
|
||||
|
||||
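The second lookup chain above (the one that skips the default locale) can be sketched as a simple truncation loop. `fallbackChain()` is a hypothetical helper, not an ICU function, and it models only the plain "strip the last field, end at root" behavior shown in the list; any special cases in the real implementation are not covered.

```c++
#include <string>
#include <vector>

// Builds the fallback sequence for a locale ID by repeatedly removing the
// last "_"-separated field, and finishes with "root".
inline std::vector<std::string> fallbackChain(std::string localeId) {
    std::vector<std::string> chain;
    while (!localeId.empty() && localeId != "root") {
        chain.push_back(localeId);
        const std::string::size_type pos = localeId.rfind('_');
        if (pos == std::string::npos) {
            break;
        }
        localeId.erase(pos);
    }
    chain.push_back("root");
    return chain;
}
```

Calling `fallbackChain("en_US_POSIX")` produces `en_US_POSIX`, `en_US`, `en`, `root`, in that order.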
## Other ICU Design Principles
|
||||
|
||||
@ -413,7 +413,7 @@ usage.
|
||||
### Version Numbers in ICU
|
||||
|
||||
Version changes show clients when parts of ICU change. ICU; its components (such
|
||||
as Collator); each resource bundle, including all the locale data resource
|
||||
as `Collator`); each resource bundle, including all the locale data resource
|
||||
bundles; and individual tagged items within a resource bundle, have their own
|
||||
version numbers. Version numbers numerically and lexically increase as changes
|
||||
are made.
|
||||
@ -443,7 +443,7 @@ The interpretation of version numbers depends on what is being described.
|
||||
The first version number field contains the ICU release version number, for
|
||||
example 49. Each new version might contain new features, new locale data, and
|
||||
modified behavior. (See below for more information on
|
||||
[ICU Binary Compatibility](###icu-binary-compatibility).)
|
||||
[ICU Binary Compatibility](#icu-binary-compatibility)).
|
||||
|
||||
The second field is 1 for the initial release (e.g., 49.1). The second and
|
||||
sometimes third fields are incremented for binary compatible maintenance
|
||||
@ -475,7 +475,7 @@ published semi-formal “enhancement” releases with odd second-field numbers
|
||||
Library filenames and some other internal uses already used a concatenation of
|
||||
the first two fields ("48" for 4.8).
|
||||
|
||||
Resource Bundles and Elements
|
||||
#### Resource Bundles and Elements
|
||||
|
||||
The data stored in resource bundles is tagged with version numbers. A resource
|
||||
bundle can contain a tagged string named "Version" that declares the version
|
||||
@ -483,8 +483,8 @@ number in dotted-integer format. For example,
|
||||
|
||||
```text
|
||||
en {
|
||||
Version { "1.0.3.5" }
|
||||
...
|
||||
Version { "1.0.3.5" }
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
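Dotted-integer version strings such as the one above need to be compared field by field as numbers, because a plain string comparison would rank "1.0.10.0" before "1.0.3.5". The following sketch shows one way to do that. `Version`, `parseVersion()` and `isOlder()` are names invented for this example; the code assumes at most four fields, treats missing fields as 0, and lets `std::stoi` throw on malformed input.

```c++
#include <array>
#include <cstddef>
#include <sstream>
#include <string>

// Four numeric fields, most significant first.
using Version = std::array<int, 4>;

// Parses up to four dot-separated integers; missing fields stay 0.
inline Version parseVersion(const std::string& text) {
    Version version{};  // all fields start at zero
    std::istringstream in(text);
    std::string field;
    for (std::size_t i = 0;
         i < version.size() && std::getline(in, field, '.'); ++i) {
        version[i] = std::stoi(field);
    }
    return version;
}

// True if version a precedes version b when compared field by field.
inline bool isOlder(const std::string& a, const std::string& b) {
    return parseVersion(a) < parseVersion(b);
}
```

Here `isOlder("1.0.3.5", "1.0.10.0")` is true, whereas comparing the raw strings would give the opposite order.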
@ -652,7 +652,7 @@ list members a chance to review upcoming changes, and to discuss them. A
|
||||
proposal often changes significantly as a result of discussion. Most proposals
|
||||
will eventually find consensus among list members; otherwise, the ICU-TC decides
|
||||
what to do. If the addition or change of APIs would affect you, please subscribe
|
||||
to the main [icu-design mailing list](http://icu-project.org/contacts.html) .
|
||||
to the main [icu-design mailing list](http://icu-project.org/contacts.html).
|
||||
|
||||
When a **new API** is added to ICU, it **is marked as draft with a `@draft ICU
|
||||
x.y` label in the API documentation, **where x.y is the ICU version when the
|
||||
|
@ -113,7 +113,7 @@ If the API function is non-const, then it should have a `UErrorCode` parameter.
|
||||
Default C++ assignment operators and copy constructors should not be used (they
|
||||
should be declared private and not implemented). Instead, define an `assign(Class
|
||||
&other, UErrorCode &errorCode)` function. Normal constructors are fine, and
|
||||
should have a UErrorCode parameter.
|
||||
should have a `UErrorCode` parameter.
|
||||
|
||||
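As a rough sketch of the pattern described above, the class below hides the default copy operations and offers an explicit `assign()` that reports problems through an error code. Everything here is hypothetical: `ErrorCode` and `failed()` are stand-ins so the example compiles without ICU headers, `Label` is an invented class, and taking `other` by const reference is a choice made for this sketch.

```c++
#include <string>

// Hypothetical stand-in for an ICU-style error code.
enum ErrorCode { kNoError = 0, kIllegalArgument = 1 };

inline bool failed(ErrorCode code) {
    return code > kNoError;
}

class Label {
public:
    // Normal constructor with an error code parameter.
    Label(const std::string& text, ErrorCode& errorCode) {
        if (failed(errorCode)) {
            return;  // do nothing if an earlier call already failed
        }
        if (text.empty()) {
            errorCode = kIllegalArgument;
            return;
        }
        text_ = text;
    }

    // Explicit replacement for operator=.
    Label& assign(const Label& other, ErrorCode& errorCode) {
        if (failed(errorCode)) {
            return *this;
        }
        text_ = other.text_;
        return *this;
    }

    const std::string& text() const {
        return text_;
    }

private:
    // Declared private and never defined, so accidental copies do not build.
    Label(const Label&);
    Label& operator=(const Label&);

    std::string text_;
};
```

A caller would construct two `Label` objects with the same error code variable, call `b.assign(a, errorCode)`, and check the error code once at the end.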
### Warning Codes
|
||||
|
||||
@ -241,7 +241,7 @@ Example in `ucnv.h`:
|
||||
ucnv_open( ... ) ...
|
||||
```
|
||||
|
||||
This cites code in icu4c/source/samples/ucnv/convsamp.cpp as follows:
|
||||
This cites code in `icu4c/source/samples/ucnv/convsamp.cpp` as follows:
|
||||
|
||||
```c++
|
||||
//! [ucnv_open]
|
||||
@ -261,12 +261,12 @@ ICU C++ classes and methods as well as the ICU C methods.
|
||||
|
||||
In C/C++, we enclose `@draft` and such APIs with `#ifndef U_HIDE_DRAFT_API` or
|
||||
similar as appropriate. When a draft API becomes stable, we need to remove the
|
||||
surrounding #ifndef.
|
||||
surrounding `#ifndef`.
|
||||
|
||||
Note: The `@system` tag is *in addition to* the
|
||||
`@draft`/`@stable`/`@deprecated`/`@obsolete` status tag.
|
||||
|
||||
Copy/paste the appropriate #ifndef..#endif pair from the following:
|
||||
Copy/paste the appropriate `#ifndef..#endif` pair from the following:
|
||||
|
||||
```c++
|
||||
#ifndef U_HIDE_DRAFT_API
|
||||
@ -289,7 +289,7 @@ We `#ifndef` `@draft`/`@deprecated`/... APIs as much as possible, including C
|
||||
functions, many C++ class methods (see exceptions below), enum constants (see
|
||||
exceptions below), whole enums, whole classes, etc.
|
||||
|
||||
We do not #ifndef APIs where that would be problematic:
|
||||
We do not `#ifndef` APIs where that would be problematic:
|
||||
|
||||
* struct/class members where that would modify the object layout (non-static
|
||||
struct/class fields, virtual methods)
|
||||
@ -478,19 +478,19 @@ Use Java doc-style in-file documentation created with
|
||||
|
||||
#### Multiple Statements
|
||||
|
||||
Place multiple statements in multiple lines. if() or loop heads must not be
|
||||
Place multiple statements in multiple lines. `if()` or loop heads must not be
|
||||
followed by their bodies on the same line.
|
||||
|
||||
#### Placements of {} Curly Braces
|
||||
#### Placements of `{}` Curly Braces
|
||||
|
||||
Place curly braces {} in reasonable and consistent locations. Each of us
|
||||
Place curly braces `{}` in reasonable and consistent locations. Each of us
|
||||
subscribes to different philosophies. It is recommended to use the style of a
|
||||
file, instead of mixing different styles. It is requested, however, to not have
|
||||
if() and loop bodies without curly braces.
|
||||
`if()` and loop bodies without curly braces.
|
||||
|
||||
#### if() {...} and Loop Bodies
|
||||
#### `if() {...}` and Loop Bodies
|
||||
|
||||
Use curly braces for if() and else as well as loop bodies, etc., even if there
|
||||
Use curly braces for `if()` and else as well as loop bodies, etc., even if there
|
||||
is only one statement.
|
||||
|
||||
#### Function Declarations
|
||||
@ -563,7 +563,7 @@ U_CDECL_END
|
||||
|
||||
Determine if two headers are needed. If the same functionality is provided with
|
||||
both a C and a C++ API, then there can be two headers, one for each language,
|
||||
even if one uses the other. For example, there can be umsg.h for C and `msgfmt.h`
|
||||
even if one uses the other. For example, there can be `umsg.h` for C and `msgfmt.h`
|
||||
for C++.
|
||||
|
||||
Not all functionality has or needs both kinds of API. More and more
|
||||
@ -638,19 +638,19 @@ happened!)
|
||||
When casting an enum value to an integer type, make sure that the enum value's
|
||||
numeric value is within range of the integer type.
|
||||
|
||||
#### Do not check for this!=NULL, do not check for NULL references
|
||||
#### Do not check for `this!=NULL`, do not check for `NULL` references
|
||||
|
||||
In public APIs, assume this!=0 and assume that references are not 0. In C code,
|
||||
"this" is the "service object" pointer, such as `set` in
|
||||
`uset_add(USet* set, UChar32 c)` — don't check for set!=NULL.
|
||||
In public APIs, assume `this!=0` and assume that references are not 0. In C code,
|
||||
`"this"` is the "service object" pointer, such as `set` in
|
||||
`uset_add(USet* set, UChar32 c)` — don't check for `set!=NULL`.
|
||||
|
||||
We do usually check all other (non-this) pointers for NULL, in those cases when
|
||||
NULL is not valid. (Many functions allow a NULL string or buffer pointer if the
|
||||
We do usually check all other (non-this) pointers for `NULL`, in those cases when
|
||||
`NULL` is not valid. (Many functions allow a `NULL` string or buffer pointer if the
|
||||
length or capacity is 0.)
|
||||
|
||||
Rationale: "this" is not really an argument, and checking it costs a little bit
|
||||
Rationale: `"this"` is not really an argument, and checking it costs a little bit
|
||||
of code size and runtime. Other libraries also commonly do not check for valid
|
||||
"this", and resulting failures are fairly obvious.
|
||||
`"this"`, and resulting failures are fairly obvious.
|
||||
|
||||
### Memory Usage
|
||||
|
||||
@ -666,7 +666,7 @@ details.
|
||||
|
||||
Exception: Most C++ API functions that return a `StringEnumeration` (by pointer
|
||||
which the caller must delete) are named `getXyz()` rather than `createXyz()`
|
||||
because "get" is much more natural. (These are not factory methods in the sense
|
||||
because `"get"` is much more natural. (These are not factory methods in the sense
|
||||
of `NumberFormat::createScientificInstance()`.) For example,
|
||||
`static StringEnumeration *Collator::getKeywords(UErrorCode &)`. We should document
|
||||
clearly in the API comments that the caller must delete the returned
|
||||
@ -870,7 +870,7 @@ U_NAMESPACE_USE
|
||||
Locale loc("fi");
|
||||
```
|
||||
|
||||
U_NAMESPACE_USE (expands to using namespace icu_M_N; when available) is
|
||||
`U_NAMESPACE_USE` (expands to `using namespace icu_M_N;` when available) is
|
||||
automatically done when `utypes.h` is included, so that all ICU classes are
|
||||
immediately usable. However, we recommend that you turn this off via
|
||||
`CXXFLAGS="-DU_USING_ICU_NAMESPACE=0"`.
|
||||
@ -954,7 +954,7 @@ implementation.
|
||||
subclass as well (copy implementations from existing C++ APIs).
|
||||
* If a class is a new, immediate subclass of `UObject` (e.g.,
|
||||
`Normalizer2`), creating a whole new class hierarchy, then declare a
|
||||
*private* `getDynamicClassID()` and define it to return NULL (to
|
||||
*private* `getDynamicClassID()` and define it to return `NULL` (to
|
||||
override the pure virtual version in `UObject`); copy the relevant lines
|
||||
from `normalizer2.h` and `normalizer2.cpp`
|
||||
(`UOBJECT_DEFINE_NO_RTTI_IMPLEMENTATION(className)`). Do not add any
|
||||
@ -1012,7 +1012,7 @@ adopt-on-success):
|
||||
must be set to `U_MEMORY_ALLOCATION_ERROR`.)
|
||||
|
||||
**Pitfall**: If you allocate/construct via "`ClassName *p = new ClassName(adoptee);`"
|
||||
and the memory allocation failed (p==NULL), then the
|
||||
and the memory allocation failed (`p==NULL`), then the
|
||||
constructor has not been called, the adoptee has not been adopted, and you
|
||||
are still responsible for deleting it!
|
||||
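The pitfall above can be sketched in plain C++. This is only an illustration under stated assumptions: `Owner`, `Adoptee`, `createOwner()` and the `failNextAllocation` switch are hypothetical names that are not part of ICU, and the class-level `operator new` merely imitates an allocator that returns a null pointer instead of throwing.

```cpp
#include <cstddef>
#include <new>

// Hypothetical adoptee; liveCount lets us observe leaks.
struct Adoptee {
    static int liveCount;
    Adoptee() { ++liveCount; }
    ~Adoptee() { --liveCount; }
};
int Adoptee::liveCount = 0;

// Hypothetical owner class that adopts (takes ownership of) an Adoptee.
class Owner {
public:
    explicit Owner(Adoptee* adoptee) : adoptee_(adoptee) {}
    ~Owner() { delete adoptee_; }
    Owner(const Owner&) = delete;
    Owner& operator=(const Owner&) = delete;

    // Test switch (assumption): makes the next allocation fail.
    static bool failNextAllocation;

    // Non-throwing allocation: returns nullptr on failure, so the
    // constructor is never run when memory is exhausted.
    static void* operator new(std::size_t size, const std::nothrow_t&) noexcept {
        if (failNextAllocation) {
            return nullptr;
        }
        return ::operator new(size, std::nothrow);
    }
    static void operator delete(void* p) noexcept { ::operator delete(p); }
    static void operator delete(void* p, const std::nothrow_t&) noexcept {
        ::operator delete(p);
    }

private:
    Adoptee* adoptee_;
};
bool Owner::failNextAllocation = false;

// Factory that keeps the adoption contract even when allocation fails.
Owner* createOwner(Adoptee* adoptee) {
    Owner* p = new (std::nothrow) Owner(adoptee);
    if (p == nullptr) {
        // The constructor never ran, so nothing adopted the adoptee:
        // the caller of `new` is still responsible for deleting it.
        delete adoptee;
    }
    return p;
}
```

Without the `delete adoptee;` in the failure branch, the adoptee would leak whenever the allocation of the owner fails.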
|
||||
@ -1183,7 +1183,7 @@ constructor relies on loading data), then either it must use and set a
|
||||
like `UnicodeString` and `UnicodeSet`, and the constructor needs to set the object
|
||||
to bogus if it fails.
|
||||
|
||||
#### UVector, UVector32, or UVector64
|
||||
#### `UVector`, `UVector32`, or `UVector64`
|
||||
|
||||
Use `UVector` to store arrays of `void *`; use `UVector32` to store arrays of
|
||||
`int32_t`; use `UVector64` to store arrays of `int64_t`. Historically, `UVector`
|
||||
@ -1438,8 +1438,8 @@ ICU Java classes and methods.
|
||||
|
||||
The standard order for modifier keywords on APIs is:
|
||||
|
||||
* public static final synchronized strictfp
|
||||
* public abstract
|
||||
* `public static final synchronized strictfp`
|
||||
* `public abstract`
|
||||
|
||||
Do not use wild card import, such as "`import java.util.*`". The sort order of
|
||||
import statements is `java` / `javax` / `org` / `com`. Within each top level package
|
||||
@ -1493,7 +1493,7 @@ Avoid creating new interfaces unless you know you need to mix the interface into
|
||||
two or more classes that have separate inheritance. Interfaces are impossible to
|
||||
modify later in a backwards-compatible way. Abstract classes, on the other hand,
|
||||
can add new methods with default behavior. Use interfaces only if it is required
|
||||
by the arcitecture, not just for expediency.
|
||||
by the architecture, not just for expediency.
|
||||
|
||||
Current releases of ICU4J (since ICU 63) are restricted to use Java SE 7 APIs
|
||||
and language features.
|
||||
@ -1842,7 +1842,7 @@ Type just `intltest -h` to see the usage:
|
||||
|
||||
The "Fake Time" capability allows ICU4C to be tested as if the hardware clock is
|
||||
set to a specific time. This section documents how to use this facility.
|
||||
Note that this facility requires the POSIX 'gettimeofday' function to be
|
||||
Note that this facility requires the POSIX `'gettimeofday'` function to be
|
||||
operable.
|
||||
|
||||
This facility affects all ICU 'current time' calculations, including date,
|
||||
@ -2013,10 +2013,10 @@ format).
|
||||
* Example: **`CINTLTST_OPTS=/tscoll`** in the cintltst directory provides
|
||||
arguments to the `cintltst` test upon `make check`, to only run collation
|
||||
tests.
|
||||
* intltest: INTLTEST_OPTS
|
||||
* cintltst: CINTLTST_OPTS
|
||||
* iotest: IOTEST_OPTS
|
||||
* icuinfo: ICUINFO_OPTS
|
||||
* intltest: `INTLTEST_OPTS`
|
||||
* cintltst: `CINTLTST_OPTS`
|
||||
* iotest: `IOTEST_OPTS`
|
||||
* icuinfo: `ICUINFO_OPTS`
|
||||
* (letest does not have an OPTS variable as of ICU 4.6.)
|
||||
|
||||
### Windows/Microsoft Visual Studio
|
||||
|
@ -41,37 +41,37 @@ synchronization primitives.
|
||||
|
||||
**Low Level Atomics**, a set of platform or compiler dependent typedefs and
|
||||
inlines. Provided in the internal header file
|
||||
[umutex.h](../../../../icu4c/source/common/umutex.h).
|
||||
[`umutex.h`](../../../../icu4c/source/common/umutex.h).
|
||||
|
||||
| Type/Function | Description |
|
||||
|-------------------------------------------------------|-----------------------------------------------------------------------------|
|
||||
| typedef u_atomic_int32_t | A 32 bit integer that will work with low level atomic operations. (typedef) |
|
||||
| umtx_loadAcquire(u_atomic_int32_t &var) | |
|
||||
| umtx_storeRelease(u_atomic_int32_t &var, int32_t val) | |
|
||||
| umtx_atomic_inc(u_atomic_int32_t &var) | |
|
||||
| umtx_atomic_dec(u_atomic_int32_t &var) | |
|
||||
| Type/Function | Description |
|
||||
|---------------------------------------------------------|-------------------------------------------------------------------------------|
|
||||
| `typedef u_atomic_int32_t` | A 32 bit integer that will work with low level atomic operations. (`typedef`) |
|
||||
| `umtx_loadAcquire(u_atomic_int32_t &var)` | |
|
||||
| `umtx_storeRelease(u_atomic_int32_t &var, int32_t val)` | |
|
||||
| `umtx_atomic_inc(u_atomic_int32_t &var)` | |
|
||||
| `umtx_atomic_dec(u_atomic_int32_t &var)` | |
|
||||
|
||||
**Mutexes**. Type declarations for ICU mutex wrappers. Provided in a header file.
|
||||
|
||||
| Type | Description |
|
||||
|---------------------|---------------------------------------------------------------------------------------------------|
|
||||
| struct UMutex | An ICU mutex. All instances will be static. Typically just contains an underlying platform mutex. |
|
||||
| U_MUTEX_INITIALIZER | A C style initializer for a static instance of a UMutex. |
|
||||
| Type | Description |
|
||||
|-----------------------|---------------------------------------------------------------------------------------------------|
|
||||
| `struct UMutex` | An ICU mutex. All instances will be static. Typically just contains an underlying platform mutex. |
|
||||
| `U_MUTEX_INITIALIZER` | A C style initializer for a static instance of a `UMutex`. |
|
||||
|
||||
**Mutex and InitOnce implementations**. Out-of-line platform-specific code.
|
||||
Provided in a .cpp file.
|
||||
|
||||
| Function | Description |
|
||||
|---------------------------------------|------------------------------------------|
|
||||
| umtx_lock(UMutex *mutex) | Lock a mutex. |
|
||||
| umtx_unlock(UMutex* mutex) | Unlock a mutex. |
|
||||
| umtx_initImplPreInit(UInitOnce &uio) | umtx_initOnce() implementation function. |
|
||||
| umtx_initImplPostInit(UInitOnce &uio) | umtx_initOnce() implementation function. |
|
||||
| Function | Description |
|
||||
|-----------------------------------------|--------------------------------------------|
|
||||
| `umtx_lock(UMutex *mutex)` | Lock a mutex. |
|
||||
| `umtx_unlock(UMutex* mutex)` | Unlock a mutex. |
|
||||
| `umtx_initImplPreInit(UInitOnce &uio)` | `umtx_initOnce()` implementation function. |
|
||||
| `umtx_initImplPostInit(UInitOnce &uio)` | `umtx_initOnce()` implementation function. |
|
||||
|
||||
`UInitOnce` and `umtx_initOnce()` are used internally by ICU for thread-safe
|
||||
one-time initialization. Their implementation is split into a
|
||||
platform-independent part (contained in
|
||||
[umutex.h](../../../../icu4c/source/common/umutex.h)),
|
||||
[`umutex.h`](../../../../icu4c/source/common/umutex.h)),
|
||||
and the pair of platform-dependent implementation functions listed above.
|
||||
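The split between a platform-independent fast path and a platform-dependent slow path can be sketched with C++11 primitives. This is a simplified illustration, not ICU's implementation: `InitOnce` and `initOnce()` are hypothetical stand-ins for `UInitOnce` and `umtx_initOnce()`, the state values are assumptions, and this sketch holds a lock while the initializer runs.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical stand-in for UInitOnce.
struct InitOnce {
    std::atomic<int> state{0};  // 0 = not initialized, 2 = initialization done
    std::mutex mutex;           // guards the slow path only
};

// Hypothetical stand-in for umtx_initOnce(): runs fn at most once per InitOnce.
template <typename Fn>
void initOnce(InitOnce& once, Fn fn) {
    // Fast path: a single acquire load once initialization has completed.
    if (once.state.load(std::memory_order_acquire) == 2) {
        return;
    }
    // Slow path: serialize competing threads and re-check under the lock.
    std::lock_guard<std::mutex> lock(once.mutex);
    if (once.state.load(std::memory_order_relaxed) != 2) {
        fn();
        // Release store publishes everything fn() wrote to later callers.
        once.state.store(2, std::memory_order_release);
    }
}
```

Because the lock is held during `fn()`, an initializer in this sketch must not call `initOnce()` on the same `InitOnce` object again.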
|
||||
**Build Setup**
|
||||
@ -81,11 +81,11 @@ in the ICU build. If defined, the files are included at the top of the normal
|
||||
platform `#ifdef` chains in the ICU sources, and effectively define a new
|
||||
platform.
|
||||
|
||||
| Macro | Description |
|
||||
|------------------|-------------------------------------------------------|
|
||||
| U_USER_ATOMICS_H | Set to the name of the low level atomics header file. |
|
||||
| U_USER_MUTEX_H | Mutexes header file. |
|
||||
| U_USER_MUTEX_CPP | Mutexes and InitOnce implementation file. |
|
||||
| Macro | Description |
|
||||
|--------------------|---------------------------------------------------------|
|
||||
| `U_USER_ATOMICS_H` | Set to the name of the low level atomics header file. |
|
||||
| `U_USER_MUTEX_H` | Mutexes header file. |
|
||||
| `U_USER_MUTEX_CPP` | Mutexes and `InitOnce` implementation file. |
|
||||
|
||||
It is possible (and reasonable) to supply only the two mutex files, while
|
||||
retaining the ICU default implementation for the low level atomics.
|
||||
@ -97,7 +97,7 @@ Example ICU configure with user mutexes specified:
|
||||
**Stability**
|
||||
|
||||
This interface may change between ICU releases. The required set of functions
|
||||
may be be extended, or details of the behavior required may be altered.
|
||||
may be extended, or details of the behavior required may be altered.
|
||||
|
||||
The types and functions defined by this interface reach deeply into the ICU
|
||||
implementation, and we need to retain the ability to make changes should the
|
||||
@ -111,7 +111,7 @@ This implementation uses C++11 language mutexes and atomics. These make for a
|
||||
convenient reference implementation because the C++11 constructs are well
|
||||
defined and straightforward to use.
|
||||
|
||||
Similar implementations for POSIX and WIndows can be found in files
|
||||
Similar implementations for POSIX and Windows can be found in files
|
||||
`common/umutex.h` and `common/umutex.cpp`, in the platform `#ifdef` chains; these are
|
||||
part of the standard ICU distribution.
|
||||
|
||||
|
@ -59,28 +59,28 @@ For examples of use, search the ICU library code.
|
||||
|
||||
**Low Level Atomics**
|
||||
|
||||
| Type/Function | Description |
|
||||
|----------------------------------------|-----------------------------------------------------------------|
|
||||
| typedef u_atomic_int32_t | A 32 bit integer type for use with low level atomic operations. |
|
||||
| umtx_atomic_inc(u_atomic_int32_t &var) | |
|
||||
| umtx_atomic_dec(u_atomic_int32_t &var) | |
|
||||
| Type/Function | Description |
|
||||
|------------------------------------------|-----------------------------------------------------------------|
|
||||
| `typedef u_atomic_int32_t` | A 32 bit integer type for use with low level atomic operations. |
|
||||
| `umtx_atomic_inc(u_atomic_int32_t &var)` | |
|
||||
| `umtx_atomic_dec(u_atomic_int32_t &var)` | |
|
||||
|
||||
**Mutexes**
|
||||
|
||||
| Type/Function | Description |
|
||||
|----------------------------|--------------------------------------------------------------------|
|
||||
| struct UMutex | An ICU mutex. All instances must be static. |
|
||||
| U_MUTEX_INITIALIZER | A C style initializer for a UMutex. |
|
||||
| umtx_lock(UMutex *mutex) | Lock a mutex. |
|
||||
| umtx_unlock(UMutex* mutex) | Unlock a mutex. |
|
||||
| class Mutex | C++ Mutex wrapper withautomatic lock & unlock. See header mutex.h. |
|
||||
| Type/Function | Description |
|
||||
|------------------------------|-----------------------------------------------------------------------|
|
||||
| `struct UMutex` | An ICU mutex. All instances must be `static`. |
|
||||
| `U_MUTEX_INITIALIZER` | A C style initializer for a `UMutex`. |
|
||||
| `umtx_lock(UMutex *mutex)` | Lock a mutex. |
|
||||
| `umtx_unlock(UMutex* mutex)` | Unlock a mutex. |
|
||||
| `class Mutex` | C++ Mutex wrapper with automatic lock & unlock. See header `mutex.h`. |
|
||||
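The idea behind a C++ wrapper with automatic lock and unlock can be sketched with standard C++11 types. This is an analogy only: `ScopedLock`, `gCounterMutex` and `incrementCounter()` are hypothetical names, and `std::mutex` stands in for `UMutex`; the real wrapper is the `Mutex` class in `mutex.h`.

```cpp
#include <mutex>

// Hypothetical scoped-lock wrapper: locks in the constructor and
// unlocks in the destructor, so every exit path releases the mutex.
class ScopedLock {
public:
    explicit ScopedLock(std::mutex& m) : mutex_(m) { mutex_.lock(); }
    ~ScopedLock() { mutex_.unlock(); }
    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;

private:
    std::mutex& mutex_;
};

// The mutex itself is static, mirroring the rule for UMutex instances.
static std::mutex gCounterMutex;
static int gCounter = 0;

int incrementCounter() {
    ScopedLock lock(gCounterMutex);  // released automatically on return
    return ++gCounter;
}
```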
|
||||
**One Time Initialization**
|
||||
|
||||
| Type/Function | Description |
|
||||
|-------------------------------|-----------------------------------------------------------------------------------------|
|
||||
| struct UInitOnce | Provides an efficient facility for one-time initialization of static or global objects. |
|
||||
| umtx_initOnce(UInitOnce, ...) | A family of initialization functions. |
|
||||
| Type/Function | Description |
|
||||
|---------------------------------|-----------------------------------------------------------------------------------------|
|
||||
| `struct UInitOnce` | Provides an efficient facility for one-time initialization of static or global objects. |
|
||||
| `umtx_initOnce(UInitOnce, ...)` | A family of initialization functions. |
|
||||
|
||||
All of these functions are for internal ICU implementation use only. They are
|
||||
not exported, and not intended for external use.
|
||||
|
@ -30,7 +30,7 @@ Converting from the internal representation (milliseconds since midnight,
|
||||
January 1, 1970) to text is known as "formatting," and converting from text to
|
||||
milliseconds is known as "parsing." These processes involve two mappings:
|
||||
|
||||
* A mapping between a point in time (UDate) and a set of calendar fields,
|
||||
* A mapping between a point in time (`UDate`) and a set of calendar fields,
|
||||
which in turn depends on:
|
||||
* The rules of a particular calendar system (e.g. Gregorian, Buddhist,
|
||||
Chinese Lunar)
|
||||
@ -39,17 +39,17 @@ milliseconds is known as "parsing." These processes involve two mappings:
|
||||
representation, which depends on the fields selected for display, their
|
||||
display style, and the conventions of a particular locale.
|
||||
|
||||
## DateFormat
|
||||
## `DateFormat`
|
||||
|
||||
DateFormat helps format and parse dates for any locale. Your code can be
|
||||
`DateFormat` helps format and parse dates for any locale. Your code can be
|
||||
completely independent of the locale conventions for months, days of the week,
|
||||
or calendar format.
|
||||
|
||||
### Formatting Dates
|
||||
|
||||
The DateFormat interface in ICU enables you to format a Date in milliseconds
|
||||
The `DateFormat` interface in ICU enables you to format a `Date` in milliseconds
|
||||
into a string representation of the date. It also parses the string back to the
|
||||
internal Date representation in milliseconds.
|
||||
internal `Date` representation in milliseconds.
|
||||
|
||||
```cpp
|
||||
DateFormat* df = DateFormat::createDateInstance();
|
||||
@ -61,7 +61,7 @@ for (int32_t i = 0; i < 3; ++i) {
|
||||
}
|
||||
```
|
||||
|
||||
To format a date for a different Locale, specify it in the call to:
|
||||
To format a date for a different `Locale`, specify it in the call to:
|
||||
|
||||
```cpp
|
||||
DateFormat* df = DateFormat::createDateInstance
|
||||
@ -70,7 +70,7 @@ DateFormat* df = DateFormat::createDateInstance
|
||||
|
||||
### Parsing Dates
|
||||
|
||||
Use a DateFormat to parse also:
|
||||
You can also use a `DateFormat` to parse dates:
|
||||
|
||||
```cpp
|
||||
UErrorCode status = U_ZERO_ERROR;
|
||||
@ -79,7 +79,7 @@ UDate myDate = df->parse(myString, status);
|
||||
|
||||
When numeric fields abut one another directly, with no intervening delimiter
|
||||
characters, they constitute a run of abutting numeric fields. Such runs are
|
||||
parsed specially. For example, the format "HHmmss" parses the input text
|
||||
parsed specially. For example, the format `"HHmmss"` parses the input text
|
||||
"123456" to 12:34:56, parses the input text "12345" to 1:23:45, and fails to
|
||||
parse "1234". In other words, the leftmost field of the run is flexible, while
|
||||
the others keep a fixed width. If the parse fails anywhere in the run, then the
|
||||
@ -90,81 +90,80 @@ the run fails.
|
||||
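The behavior of abutting numeric fields can be sketched in standalone C++. This is a simplified model, not ICU's parser: `parseAbuttingRun()` is a hypothetical helper, and it assumes the retry rule is that the leftmost field is shortened by one digit and the whole run is parsed again until the parse succeeds or the leftmost field is one digit long.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Parses a run of abutting numeric fields. `widths` holds the pattern width
// of each field, e.g. {2, 2, 2} for "HHmmss". On success, fills `values`
// with one number per field and returns true.
bool parseAbuttingRun(const std::string& text,
                      const std::vector<int>& widths,
                      std::vector<int>& values) {
    if (widths.empty()) {
        return false;
    }
    // Only the leftmost field is flexible: try its full width first,
    // then shorten it by one digit per attempt.
    for (int first = widths[0]; first >= 1; --first) {
        std::vector<int> result;
        std::size_t pos = 0;
        bool ok = true;
        for (std::size_t i = 0; i < widths.size() && ok; ++i) {
            int width = (i == 0) ? first : widths[i];
            int value = 0;
            for (int k = 0; k < width; ++k) {
                if (pos >= text.size() ||
                    !std::isdigit(static_cast<unsigned char>(text[pos]))) {
                    ok = false;  // not enough digits for this fixed-width field
                    break;
                }
                value = value * 10 + (text[pos] - '0');
                ++pos;
            }
            if (ok) {
                result.push_back(value);
            }
        }
        if (ok) {
            values = result;
            return true;
        }
    }
    return false;  // the whole run fails
}
```

With widths `{2, 2, 2}`, the input "12345" fails on the first attempt (12, 34, then only one digit left) and succeeds on the second (1, 23, 45), while "1234" fails on both attempts.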
|
||||
### Producing Normal Date Formats for a Locale
|
||||
|
||||
Use createDateInstance to produce the normal date format for that country. There
|
||||
are other static factory methods available. Use createTimeInstance to produce
|
||||
the normal time format for that country. Use createDateTimeInstance to produce a
|
||||
DateFormat that formats both date and time. You can pass different options to
|
||||
these factory methods to control the length of the result; from SHORT to MEDIUM
|
||||
to LONG to FULL. The exact result depends on the locale, but generally:
|
||||
Use `createDateInstance` to produce the normal date format for that country. There
|
||||
are other static factory methods available. Use `createTimeInstance` to produce
|
||||
the normal time format for that country. Use `createDateTimeInstance` to produce a
|
||||
`DateFormat` that formats both date and time. You can pass different options to
|
||||
these factory methods to control the length of the result; from `SHORT` to `MEDIUM`
|
||||
to `LONG` to `FULL`. The exact result depends on the locale, but generally:
|
||||
|
||||
1. SHORT is numeric, such as 12/13/52 or 3:30pm
|
||||
1. `SHORT` is numeric, such as 12/13/52 or 3:30pm
|
||||
|
||||
2. MEDIUM is longer, such as Jan. 12, 1952
|
||||
2. `MEDIUM` is longer, such as Jan. 12, 1952
|
||||
|
||||
3. LONG is longer, such as January 12, 1952 or 3:30:32pm
|
||||
3. `LONG` is longer, such as January 12, 1952 or 3:30:32pm
|
||||
|
||||
4. FULL is completely specified, such as Tuesday, April 12, 1952 AD or
|
||||
4. `FULL` is completely specified, such as Tuesday, April 12, 1952 AD or
|
||||
3:30:42pm PST
|
||||
|
||||
For more general flexibility, the [DateTimePatternGenerator](index.md) can map a
|
||||
custom selection of time and date fields, along with various display styles for
|
||||
those fields, to a locale-appropriate format that can then be set as the format
|
||||
to use by the DateFormat.
|
||||
For more general flexibility, the [`DateTimePatternGenerator`](index#datetimepatterngenerator)
|
||||
can map a custom selection of time and date fields, along with various display styles for
|
||||
those fields, to a locale-appropriate format that can then be set as the format to use by the `DateFormat`.
|
||||
|
||||
### Producing Relative Date Formats for a Locale
|
||||
|
||||
ICU currently provides limited support for formatting dates using a “relative”
|
||||
style, specified using RELATIVE_SHORT, RELATIVE_MEDIUM, RELATIVE_LONG. or
|
||||
RELATIVE_FULL. As currently implemented, relative date formatting only affects
|
||||
style, specified using `RELATIVE_SHORT`, `RELATIVE_MEDIUM`, `RELATIVE_LONG` or
|
||||
`RELATIVE_FULL`. As currently implemented, relative date formatting only affects
|
||||
the formatting of dates within a limited range of calendar days before or after
|
||||
the current date, based on the CLDR `<field type="day">`/`<relative>` data: For
|
||||
example, in English, "Yesterday", "Today", and "Tomorrow". Within this range,
|
||||
the specific relative style currently makes no difference. Outside of this
|
||||
range, relative dates are formatted using the corresponding non-relative style
|
||||
(SHORT, MEDIUM, etc.). Relative time styles are not currently supported, and
|
||||
(`SHORT`, `MEDIUM`, etc.). Relative time styles are not currently supported, and
|
||||
behave just like the corresponding non-relative style.
|
||||
|
||||
### Setting Time Zones
|
||||
|
||||
You can set the time zone on the format. If you want more control over the
|
||||
format or parsing, cast the DateFormat you get from the factory methods to a
|
||||
SimpleDateFormat. This works for the majority of countries.
|
||||
format or parsing, cast the `DateFormat` you get from the factory methods to a
|
||||
`SimpleDateFormat`. This works for the majority of countries.
|
||||
|
||||
> :point_right: **Note**: *Remember to check getDynamicClassID() before carrying out the cast.*
|
||||
> :point_right: **Note**: *Remember to check `getDynamicClassID()` before carrying out the cast.*
|
||||
|
||||
### Working with Positions
|
||||
|
||||
You can also use forms of the parse and format methods with ParsePosition and
|
||||
FieldPosition to enable you to:
|
||||
You can also use forms of the parse and format methods with `ParsePosition` and
|
||||
`FieldPosition` to enable you to:
|
||||
|
||||
1. Progressively parse through pieces of a string.
|
||||
|
||||
2. Align any particular field, or find out where it is for selection on the
|
||||
screen.
|
||||
|
||||
## SimpleDateFormat
|
||||
## `SimpleDateFormat`
|
||||
|
||||
SimpleDateFormat is a concrete class used for formatting and parsing dates in a
|
||||
`SimpleDateFormat` is a concrete class used for formatting and parsing dates in a
|
||||
language-independent manner. It allows for formatting, parsing, and
|
||||
normalization. It formats or parses a date or time, which is the standard
|
||||
milliseconds since 00:00 GMT, Jan. 1, 1970.
|
||||
|
||||
SimpleDateFormat is the only built-in implementation of DateFormat. It provides
|
||||
`SimpleDateFormat` is the only built-in implementation of `DateFormat`. It provides
|
||||
a programmable interface that can be used to produce formatted dates and times
|
||||
in a wide variety of formats. The formats include almost all of the most common
|
||||
ones.
|
||||
|
||||
Create a date-time formatter using the following methods rather than
|
||||
constructing an instance of SimpleDateFormat. In this way, the program is
|
||||
constructing an instance of `SimpleDateFormat`. In this way, the program is
|
||||
guaranteed to get an appropriate formatting pattern of the locale.
|
||||
|
||||
1. DateFormat::getInstance()
|
||||
1. `DateFormat::getInstance()`
|
||||
|
||||
2. getDateInstance()
|
||||
2. `getDateInstance()`
|
||||
|
||||
3. getDateTimeInstance()
|
||||
3. `getDateTimeInstance()`
|
||||
|
||||
If you need a more unusual pattern, construct a SimpleDateFormat directly and
|
||||
If you need a more unusual pattern, construct a `SimpleDateFormat` directly and
|
||||
give it an appropriate pattern.
|
||||
|
||||
### Date/Time Format Syntax
|
||||
@ -174,11 +173,11 @@ are replaced with date and time data from a calendar when formatting or used to
|
||||
generate data for a calendar when parsing.
|
||||
|
||||
The Date Field Symbol Table below contains the characters used in patterns to
|
||||
show the appropriate formats for a given locale, such as yyyy for the year.
|
||||
Characters may be used multiple times. For example, if y is used for the year,
|
||||
'yy' might produce '99', whereas 'yyyy' produces '1999'. For most numerical
|
||||
fields, the number of characters specifies the field width. For example, if h is
|
||||
the hour, 'h' might produce '5', but 'hh' produces '05'. For some characters,
|
||||
show the appropriate formats for a given locale, such as `yyyy` for the year.
|
||||
Characters may be used multiple times. For example, if `y` is used for the year,
|
||||
`'yy'` might produce '99', whereas `'yyyy'` produces '1999'. For most numerical
|
||||
fields, the number of characters specifies the field width. For example, if `h` is
|
||||
the hour, `'h'` might produce '5', but `'hh'` produces '05'. For some characters,
|
||||
the count specifies whether an abbreviated or full form should be used, but may
|
||||
have other choices, as given below.
|
||||
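The effect of the letter count on numeric fields can be sketched as follows. This is an illustration of the rule only, not ICU code: `formatNumericField()` and `formatYearField()` are hypothetical helpers, and they assume non-negative values.

```cpp
#include <cstddef>
#include <string>

// Formats a non-negative numeric field, zero-padded to at least `count`
// digits. Models 'h' -> "5" and 'hh' -> "05".
std::string formatNumericField(int value, int count) {
    std::string digits = std::to_string(value);
    if (digits.size() < static_cast<std::size_t>(count)) {
        digits.insert(0, static_cast<std::size_t>(count) - digits.size(), '0');
    }
    return digits;
}

// Year fields: a count of 2 ('yy') keeps only the two low-order digits,
// while other counts behave like a normal padded numeric field.
std::string formatYearField(int year, int count) {
    if (count == 2) {
        return formatNumericField(year % 100, 2);
    }
    return formatNumericField(year, count);
}
```

For example, `formatNumericField(5, 2)` yields "05" and `formatYearField(1999, 2)` yields "99", matching the `hh` and `yy` cases described above.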
|
||||
@ -187,14 +186,14 @@ single quotes. Text within single quotes is not interpreted in any way (except
|
||||
for two adjacent single quotes). Otherwise all ASCII letters from a to z and A to
|
||||
Z are reserved as syntax characters, and require quoting if they are to
|
||||
represent literal characters. In addition, certain ASCII punctuation characters
|
||||
may become variable in the future (eg ":" being interpreted as the time
|
||||
separator and '/' as a date separator, and replaced by respective
|
||||
may become variable in the future (e.g. `":"` being interpreted as the time
|
||||
separator and `'/'` as a date separator, and replaced by respective
|
||||
locale-sensitive characters in display).
|
||||
|
||||
"Stand Alone" values refer to those designed to stand on their own, as opposed
|
||||
to being with other formatted values. "2nd quarter" would use the stand alone
|
||||
format (QQQQ), whereas "2nd quarter 2007" would use the regular format (qqqq
|
||||
yyyy).
|
||||
to being with other formatted values. "2nd quarter" would use the standalone
|
||||
format (`qqqq`), whereas "2nd quarter 2007" would use the regular format
|
||||
(`QQQQ yyyy`).
|
||||
|
||||
The pattern characters used in the Date Field Symbol Table are defined by CLDR;
|
||||
for more information see [CLDR Date Field Symbol Table](https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table).
|
||||
@ -204,52 +203,52 @@ Note that the examples may not reflect current CLDR data.
|
||||
|
||||
| Symbol | Meaning | Pattern | Example Output |
|
||||
| --- | --- | --- | --- |
|
||||
| G | era designator | G, GG, or GGG<br/>GGGG<br/>GGGGG | AD<br/>Anno Domini<br/>A |
|
||||
| y | year | yy<br/>y or yyyy | 96<br/>1996 |
|
||||
| Y | year of "Week of Year" | Y | 1997 |
|
||||
| u | extended year | u | 4601 |
|
||||
| U | cyclic year name, as in Chinese lunar calendar | U | 甲子 |
|
||||
| r | related Gregorian year | r | 1996 |
|
||||
| Q | quarter | Q<br/>QQ<br/>QQQ<br/>QQQQ<br/>QQQQQ | 2<br/>02<br/>Q2<br/>2nd quarter<br/>2 |
|
||||
| q | Stand Alone quarter | q<br/>qq<br/>qqq<br/>qqqq<br/>qqqqq | 2<br/>02<br/>Q2<br/>2nd quarter<br/>2 |
|
||||
| M | month in year | M<br/>MM<br/>MMM<br/>MMMM<br/>MMMMM | 9<br/>09<br/>Sep<br/>September<br/>S |
|
||||
| L | Stand Alone month in year | L<br/>LL<br/>LLL<br/>LLLL<br/>LLLLL | 9<br/>09<br/>Sep<br/>September<br/>S |
|
||||
| w | week of year | w<br/>ww | 27<br/>27 |
|
||||
| W | week of month | W | 2 |
|
||||
| d | day in month | d<br/>dd | 2<br/>02 |
|
||||
| D | day of year | D | 189 |
|
||||
| F | day of week in month | F | 2 (2nd Wed in July) |
|
||||
| g | modified julian day | g | 2451334 |
|
||||
| E | day of week | E, EE, or EEE<br/>EEEE<br/>EEEEE<br/>EEEEEE | Tue<br/>Tuesday<br/>T<br/>Tu |
|
||||
| e | local day of week<br/>example: if Monday is 1st day, Tuesday is 2nd ) | e or ee<br/>eee<br/>eeee<br/>eeeee<br/>eeeeee | 2<br/>Tue<br/>Tuesday<br/>T<br/>Tu |
|
||||
| c | Stand Alone local day of week | c or cc<br/>ccc<br/>cccc<br/>ccccc<br/>cccccc | 2<br/>Tue<br/>Tuesday<br/>T<br/>Tu |
|
||||
| a | am/pm marker | a | pm |
|
||||
| h | hour in am/pm (1~12) | h<br/>hh | 7<br/>07 |
|
||||
| H | hour in day (0~23) | H<br/>HH | 0<br/>00 |
|
||||
| k | hour in day (1~24) | k<br/>kk | 24<br/>24 |
|
||||
| K | hour in am/pm (0~11) | K<br/>KK | 0<br/>00 |
|
||||
| m | minute in hour | m<br/>mm | 4<br/>04 |
|
||||
| s | second in minute | s<br/>ss | 5<br/>05 |
|
||||
| S | fractional second - truncates (like other time fields)<br/>to the count of letters when formatting. Appends<br/>zeros if more than 3 letters specified. Truncates at<br/>three significant digits when parsing. | S<br/>SS<br/>SSS<br/>SSSS | 2<br/>23<br/>235<br/>2350 |
|
||||
| A | milliseconds in day | A | 61201235 |
|
||||
| z | Time Zone: specific non-location | z, zz, or zzz<br/>zzzz | PDT<br/>Pacific Daylight Time |
|
||||
| Z | Time Zone: ISO8601 basic hms? / RFC 822<br/>Time Zone: long localized GMT (=OOOO)<br/>TIme Zone: ISO8601 extended hms? (=XXXXX) | Z, ZZ, or ZZZ<br/>ZZZZ<br/>ZZZZZ | -0800<br/>GMT-08:00<br/>-08:00, -07:52:58, Z |
|
||||
| O | Time Zone: short localized GMT<br/>Time Zone: long localized GMT (=ZZZZ) | O<br/>OOOO | GMT-8<br/>GMT-08:00 |
|
||||
| v | Time Zone: generic non-location<br/>(falls back first to VVVV) | v<br/>vvvv | PT<br/>Pacific Time or Los Angeles Time |
|
||||
| V | Time Zone: short time zone ID<br/>Time Zone: long time zone ID<br/>Time Zone: time zone exemplar city<br/>Time Zone: generic location (falls back to OOOO) | V<br/>VV<br/>VVV<br/>VVVV | uslax<br/>America/Los_Angeles<br/>Los Angeles<br/>Los Angeles Time |
|
||||
| X | Time Zone: ISO8601 basic hm?, with Z for 0<br/>Time Zone: ISO8601 basic hm, with Z<br/>Time Zone: ISO8601 extended hm, with Z<br/>Time Zone: ISO8601 basic hms?, with Z<br/>Time Zone: ISO8601 extended hms?, with Z | X<br/>XX<br/>XXX<br/>XXXX<br/>XXXXX | -08, +0530, Z<br/>-0800, Z<br/>-08:00, Z<br/>-0800, -075258, Z<br/>-08:00, -07:52:58, Z |
|
||||
| x | Time Zone: ISO8601 basic hm?, without Z for 0<br/>Time Zone: ISO8601 basic hm, without Z<br/>Time Zone: ISO8601 extended hm, without Z<br/>Time Zone: ISO8601 basic hms?, without Z<br/>Time Zone: ISO8601 extended hms?, without Z | x<br/>xx<br/>xxx<br/>xxxx<br/>xxxxx | -08, +0530<br/>-0800<br/>-08:00<br/>-0800, -075258<br/>-08:00, -07:52:58 |
|
||||
| ' | escape for text | ' | (nothing) |
|
||||
| ' ' | two single quotes produce one | ' ' | ' |
|
||||
| `G` | era designator | `G`, `GG`, or `GGG`<br/>`GGGG`<br/>`GGGGG` | AD<br/>Anno Domini<br/>A |
|
||||
| `y` | year | yy<br/>`y` or `yyyy` | 96<br/>1996 |
|
||||
| `Y` | year of "Week of Year" | `Y` | 1997 |
|
||||
| `u` | extended year | `u` | 4601 |
|
||||
| `U` | cyclic year name, as in Chinese lunar calendar | `U` | 甲子 |
|
||||
| `r` | related Gregorian year | `r` | 1996 |
|
||||
| `Q` | quarter | `Q`<br/>`QQ`<br/>`QQQ`<br/>`QQQQ`<br/>`QQQQQ` | 2<br/>02<br/>Q2<br/>2nd quarter<br/>2 |
|
||||
| `q` | Stand Alone quarter | `q`<br/>`qq`<br/>`qqq`<br/>`qqqq`<br/>`qqqqq` | 2<br/>02<br/>Q2<br/>2nd quarter<br/>2 |
|
||||
| `M` | month in year | `M`<br/>`MM`<br/>`MMM`<br/>`MMMM`<br/>`MMMMM` | 9<br/>09<br/>Sep<br/>September<br/>S |
|
||||
| `L` | Stand Alone month in year | `L`<br/>`LL`<br/>`LLL`<br/>`LLLL`<br/>`LLLLL` | 9<br/>09<br/>Sep<br/>September<br/>S |
|
||||
| `w` | week of year | `w`<br/>`ww` | 27<br/>27 |
|
||||
| `W` | week of month | `W` | 2 |
|
||||
| `d` | day in month | `d`<br/>`dd` | 2<br/>02 |
|
||||
| `D` | day of year | `D` | 189 |
|
||||
| `F` | day of week in month | `F` | 2 (2nd Wed in July) |
|
||||
| `g` | modified julian day | `g` | 2451334 |
|
||||
| `E` | day of week | `E`, `EE`, or `EEE`<br/>`EEEE`<br/>`EEEEE`<br/>`EEEEEE` | Tue<br/>Tuesday<br/>T<br/>Tu |
|
||||
| `e` | local day of week<br/>(example: if Monday is 1st day, Tuesday is 2nd) | `e` or `ee`<br/>`eee`<br/>`eeee`<br/>`eeeee`<br/>`eeeeee` | 2<br/>Tue<br/>Tuesday<br/>T<br/>Tu |
|
||||
| `c` | Stand Alone local day of week | `c` or `cc`<br/>`ccc`<br/>`cccc`<br/>`ccccc`<br/>`cccccc` | 2<br/>Tue<br/>Tuesday<br/>T<br/>Tu |
|
||||
| `a` | am/pm marker | `a` | pm |
|
||||
| `h` | hour in am/pm (1~12) | `h`<br/>`hh` | 7<br/>07 |
|
||||
| `H` | hour in day (0~23) | `H`<br/>`HH` | 0<br/>00 |
|
||||
| `k` | hour in day (1~24) | `k`<br/>`kk` | 24<br/>24 |
|
||||
| `K` | hour in am/pm (0~11) | `K`<br/>`KK` | 0<br/>00 |
|
||||
| `m` | minute in hour | `m`<br/>`mm` | 4<br/>04 |
|
||||
| `s` | second in minute | `s`<br/>`ss` | 5<br/>05 |
|
||||
| `S` | fractional second - truncates (like other time fields)<br/>to the count of letters when formatting. Appends<br/>zeros if more than 3 letters specified. Truncates at<br/>three significant digits when parsing. | `S`<br/>`SS`<br/>`SSS`<br/>`SSSS` | 2<br/>23<br/>235<br/>2350 |
|
||||
| `A` | milliseconds in day | `A` | 61201235 |
|
||||
| `z` | Time Zone: specific non-location | `z`, `zz`, or `zzz`<br/>`zzzz` | PDT<br/>Pacific Daylight Time |
|
||||
| `Z` | Time Zone: ISO8601 basic hms? / RFC 822<br/>Time Zone: long localized GMT (=OOOO)<br/>Time Zone: ISO8601 extended hms? (=XXXXX) | `Z`, `ZZ`, or `ZZZ`<br/>`ZZZZ`<br/>`ZZZZZ` | -0800<br/>GMT-08:00<br/>-08:00, -07:52:58, Z |
|
||||
| `O` | Time Zone: short localized GMT<br/>Time Zone: long localized GMT (=ZZZZ) | `O`<br/>`OOOO` | GMT-8<br/>GMT-08:00 |
|
||||
| `v` | Time Zone: generic non-location<br/>(falls back first to VVVV) | `v`<br/>`vvvv` | PT<br/>Pacific Time or Los Angeles Time |
|
||||
| `V` | Time Zone: short time zone ID<br/>Time Zone: long time zone ID<br/>Time Zone: time zone exemplar city<br/>Time Zone: generic location (falls back to OOOO) | `V`<br/>`VV`<br/>`VVV`<br/>`VVVV` | uslax<br/>America/Los_Angeles<br/>Los Angeles<br/>Los Angeles Time |
|
||||
| `X` | Time Zone: ISO8601 basic hm?, with Z for 0<br/>Time Zone: ISO8601 basic hm, with Z<br/>Time Zone: ISO8601 extended hm, with Z<br/>Time Zone: ISO8601 basic hms?, with Z<br/>Time Zone: ISO8601 extended hms?, with Z | `X`<br/>`XX`<br/>`XXX`<br/>`XXXX`<br/>`XXXXX` | -08, +0530, Z<br/>-0800, Z<br/>-08:00, Z<br/>-0800, -075258, Z<br/>-08:00, -07:52:58, Z |
|
||||
| `x` | Time Zone: ISO8601 basic hm?, without Z for 0<br/>Time Zone: ISO8601 basic hm, without Z<br/>Time Zone: ISO8601 extended hm, without Z<br/>Time Zone: ISO8601 basic hms?, without Z<br/>Time Zone: ISO8601 extended hms?, without Z | `x`<br/>`xx`<br/>`xxx`<br/>`xxxx`<br/>`xxxxx` | -08, +0530<br/>-0800<br/>-08:00<br/>-0800, -075258<br/>-08:00, -07:52:58 |
|
||||
| `'` | escape for text | `'` | (nothing) |
|
||||
| `' '` | two single quotes produce one | `' '` | ' |
|
||||
|
||||
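The `S` (fractional second) row above describes truncation rather than rounding, and zero padding beyond three letters. The following is a minimal stand-alone sketch of that rule, not ICU code: `formatFraction` is a hypothetical helper name, and it assumes the time value is available as whole milliseconds in the range 0 to 999.

```cpp
#include <cassert>
#include <string>

// Formats the fractional-second field for a millisecond value (0..999)
// using `letters` pattern letters, e.g. 3 for "SSS".
// Hypothetical helper for illustration; this is not an ICU function.
std::string formatFraction(int millis, int letters) {
    assert(millis >= 0 && millis <= 999 && letters >= 1);
    std::string digits = std::to_string(millis);
    // Zero-pad on the left to exactly three digits: 5 -> "005".
    digits = std::string(3 - digits.size(), '0') + digits;
    if (letters <= 3) {
        // Fewer letters than digits: truncate, never round.
        return digits.substr(0, letters);
    }
    // More than three letters: append zeros past millisecond precision.
    return digits + std::string(letters - 3, '0');
}
```

With a value of 235 milliseconds, one to four letters reproduce the table's example column: `2`, `23`, `235`, `2350`.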
> :point_right: **Note**: *Any characters in the pattern that are not in the ranges of
|
||||
['a'..'z'] and ['A'..'Z'] will be treated as quoted text. For instance,
|
||||
characters like ':', '.', ' ', '#' and '@' will appear in the resulting time
|
||||
text even they are not enclosed within single quotes.The single quote is used to
|
||||
characters like `':'`, `'.'`, `' '`, `'#'` and `'@'` will appear in the resulting time
|
||||
text even if they are not enclosed within single quotes. The single quote is used to
|
||||
'escape' letters. Two single quotes in a row, whether inside or outside a quoted
|
||||
sequence, represent a 'real' single quote.*
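The quoting rules in this note can be sketched as a small tokenizer. This is a simplified model written for illustration, assuming only the rules stated above (ASCII letters are pattern letters, everything else is literal text, a lone single quote toggles quoting, and two single quotes in a row stand for one). `Token` and `splitPattern` are hypothetical names, and this is not how ICU parses patterns internally.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One piece of a pattern: a run of a single pattern letter, or literal text.
struct Token {
    bool isField;       // true: pattern letters such as "yyyy"; false: literal text
    std::string text;
};

static bool isPatternLetter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Appends one literal character, merging it into a preceding literal token.
static void addLiteral(std::vector<Token>& out, char c) {
    if (out.empty() || out.back().isField) {
        out.push_back({false, ""});
    }
    out.back().text += c;
}

// Splits a date/time pattern into fields and literal text.
std::vector<Token> splitPattern(const std::string& pattern) {
    std::vector<Token> out;
    bool quoted = false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\'') {
            if (i + 1 < pattern.size() && pattern[i + 1] == '\'') {
                addLiteral(out, '\'');  // '' is one real quote, quoted or not
                ++i;
            } else {
                quoted = !quoted;       // a lone quote opens or closes quoted text
            }
        } else if (!quoted && isPatternLetter(c)) {
            // Extend the current field only when the same letter repeats.
            if (!out.empty() && out.back().isField && out.back().text[0] == c) {
                out.back().text += c;
            } else {
                out.push_back({true, std::string(1, c)});
            }
        } else {
            addLiteral(out, c);         // unquoted non-letters and all quoted text
        }
    }
    return out;
}
```

For the pattern `hh 'o''clock' a, zzzz`, this yields the field `hh`, the literal ` o'clock `, the field `a`, the literal `, ` and the field `zzzz`.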
|
||||
|
||||
> :point_right: **Note**: *A pattern containing any invalid pattern letter results in a failing UErrorCode
|
||||
> :point_right: **Note**: *A pattern containing any invalid pattern letter results in a failing `UErrorCode`
|
||||
result during formatting or parsing.*
|
||||
|
||||
| Format Pattern | Result |
|
||||
@ -281,10 +280,10 @@ date. So the display name used for “America/Indiana/Tell_City” before the da
|
||||
|
||||
> :point_right: **Note**: *Prior to ICU 3.8, a localized time zone name (except GMT format) and a time
|
||||
zone ID were always in a 1-to-1 relationship. Therefore, a time zone name produced
|
||||
by DateFormat can be parsed back to the original time zone. This assumption no
|
||||
by `DateFormat` can be parsed back to the original time zone. This assumption no
|
||||
longer applies to ICU 3.8 and later releases for all time zone format types. If
|
||||
your program needs to round-trip a specific time zone ID, you must use the
|
||||
generic location format (“VVVV”) explained below.*
|
||||
generic location format (`VVVV`) explained below.*
|
||||
|
||||
There are several different display name types available in the LDML
|
||||
specification.
|
||||
@ -301,41 +300,41 @@ specification.
|
||||
| RFC822 GMT | A constant, specific offset from GMT in a locale insensitive format. | -0800 |
|
||||
|
||||
Each format type in the above table is used as a primary type or a fallback in
|
||||
SimpleDateFormat. The table below explains how ICU time zone format pattern work
|
||||
`SimpleDateFormat`. The table below explains how the ICU time zone format patterns work
|
||||
and what their characteristics are.
|
||||
|
||||
#### Time Zone Pattern Usage
|
||||
|
||||
| Pattern | Behavior | Round-trip time at daylight transitions(\*) | Round-trip Time Zone | Suggested Usage |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| z, zz, zzz | Short specific non-location format (e.g. “PST”).If the localized data is not available or the short abbreviation is not commonly used for the locale, localized GMT format is used (e.g. GMT-08:00). | yes | no | For displaying a time with a user friendly time zone name. |
|
||||
| zzzz | Long specific non-location format (e.g. “Pacific Standard Time”).If the localized data is not available, localized GMT format is used (e.g. GMT-08:00). | yes | no | Same as “z”, but longer format. |
|
||||
| v | Short generic non-location format (e.g. “PT”).If the localized data is not available or the short abbreviation is not commonly used for the locale, generic location format (e.g. “United States(Los Angeles) Time”) is used.If the localized data comes from metazone and the GMT offset at the given time in the specified time zone differs from the preferred time zone of the metazone for the locale, generic partial location format (e.g. “PT (Canada)”) is used. | no | no | For displaying a recurring wall time (e.g. events, meetings) or anywhere people do not want to be overly specific. |
|
||||
| vvvv | Long generic non-location format (e.g. “Pacific Time”).If the localized data is not available, generic location format (e.g. “United States(Los Angeles) Time”) is used. | no | no | Same as “v”, but longer format. |
|
||||
| V | Same as “z”, except using the short abbreviation even it is not commonly used for the locale. | yes | no | Same as “z”. |
|
||||
| VVVV | Generic location format (e.g. “United States (Los Angeles) Time”). | no | yes | For populating a choice list for time zones, because it supports 1-to-1 name/zone ID mapping and is more uniform than other text formats.Also, this is only the pattern supporting time zone round-trip. If your program requires to preserve the original time zone information, use this pattern. |
|
||||
| Z, ZZ, ZZZ | Localized GMT format (e.g. “GMT-08:00”). | yes | no | For displaying a time in UI in a uniformed manner. |
|
||||
| ZZZZ | RFC822 GMT format (e.g. “-0800”). | yes | no | For formatting a time for non-user-facing data. |
|
||||
| `z`, `zz`, `zzz` | Short specific non-location format (e.g. “PST”). If the localized data is not available or the short abbreviation is not commonly used for the locale, localized GMT format is used (e.g. GMT-08:00). | yes | no | For displaying a time with a user friendly time zone name. |
|
||||
| `zzzz` | Long specific non-location format (e.g. “Pacific Standard Time”). If the localized data is not available, localized GMT format is used (e.g. GMT-08:00). | yes | no | Same as `“z”`, but longer format. |
|
||||
| `v` | Short generic non-location format (e.g. “PT”). If the localized data is not available or the short abbreviation is not commonly used for the locale, generic location format (e.g. “United States(Los Angeles) Time”) is used. If the localized data comes from metazone and the GMT offset at the given time in the specified time zone differs from the preferred time zone of the metazone for the locale, generic partial location format (e.g. “PT (Canada)”) is used. | no | no | For displaying a recurring wall time (e.g. events, meetings) or anywhere people do not want to be overly specific. |
|
||||
| `vvvv` | Long generic non-location format (e.g. “Pacific Time”). If the localized data is not available, generic location format (e.g. “United States(Los Angeles) Time”) is used. | no | no | Same as `“v”`, but longer format. |
|
||||
| `V` | Same as `“z”`, except using the short abbreviation even if it is not commonly used for the locale. | yes | no | Same as `“z”`. |
|
||||
| `VVVV` | Generic location format (e.g. “United States (Los Angeles) Time”). | no | yes | For populating a choice list for time zones, because it supports 1-to-1 name/zone ID mapping and is more uniform than other text formats. Also, this is the only pattern supporting time zone round-trip. If your program needs to preserve the original time zone information, use this pattern. |
|
||||
| `ZZZZ` | Localized GMT format (e.g. “GMT-08:00”). | yes | no | For displaying a time in UI in a uniform manner. |
|
||||
| `Z`, `ZZ`, `ZZZ` | RFC822 GMT format (e.g. “-0800”). | yes | no | For formatting a time for non-user-facing data. |
|
||||
|
||||
\* At a transition from daylight saving time to standard time, there is a wall
|
||||
time interval that occurs twice.
|
||||
|
||||
## DateTimePatternGenerator
|
||||
## `DateTimePatternGenerator`
|
||||
|
||||
The DateTimePatternGenerator class provides a way to map a request for a set of
|
||||
The `DateTimePatternGenerator` class provides a way to map a request for a set of
|
||||
date/time fields, along with their width, to a locale-appropriate format
|
||||
pattern. The request is in the form of a “skeleton” which just contains pattern
|
||||
letters for the desired fields using the representation for the desired width.
|
||||
In a skeleton, anything other than a pattern letter is ignored, field order is
|
||||
insignificant, and there are two special additional pattern letters that may be
|
||||
used: 'j' requests the preferred hour-cycle type for the locale (it gets mapped
|
||||
to one of 'H', 'h', 'k', or 'K'); 'J' is similar but requests no AM/PM marker
|
||||
even if the locale’s preferred hour-cycle type is 'h' or 'K'.
|
||||
used: `'j'` requests the preferred hour-cycle type for the locale (it gets mapped
|
||||
to one of `'H'`, `'h'`, `'k'`, or `'K'`); `'J'` is similar but requests no AM/PM marker
|
||||
even if the locale’s preferred hour-cycle type is `'h'` or `'K'`.
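The two skeleton properties named above (non-letters are ignored, field order is insignificant) can be sketched with a toy comparison. This is only an illustration of those two rules, assuming a skeleton is plain ASCII text; `normalizeSkeleton` and `sameRequest` are hypothetical names, the sorted form is not ICU's internal representation, and the `'j'`/`'J'` mapping is not modeled.

```cpp
#include <algorithm>
#include <string>

// Reduces a skeleton to a comparable form: keep only ASCII letters, then
// sort them so that the order of the fields no longer matters.
std::string normalizeSkeleton(const std::string& skeleton) {
    std::string letters;
    for (char c : skeleton) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
            letters += c;   // punctuation and spaces are dropped
        }
    }
    std::sort(letters.begin(), letters.end());
    return letters;
}

// Two skeletons ask for the same fields and widths when their
// normalized forms are equal.
bool sameRequest(const std::string& a, const std::string& b) {
    return normalizeSkeleton(a) == normalizeSkeleton(b);
}
```

Under this model `MMMMdjmm` and `jmm, d MMMM` are the same request, while `MMMdjmm` is a different one because the month width differs.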
|
||||
|
||||
For example, a skeleton of “MMMMdjmm” might result in the following format
|
||||
For example, a skeleton of `“MMMMdjmm”` might result in the following format
|
||||
patterns for different locales:
|
||||
|
||||
| locale | format pattern for skeleton “MMMMdjmm” | example |
|
||||
| locale | format pattern for skeleton `“MMMMdjmm”` | example |
|
||||
| ------ | -------------------------------------- | ------------------ |
|
||||
| en_US | "MMMM d 'at' h:mm a" | April 2 at 5:00 PM |
|
||||
| es_ES | "d 'de' MMMM, H:mm" | 2 de abril, 17:00 |
|
||||
@ -347,41 +346,41 @@ getBestPattern.
|
||||
Note that the fields in the format pattern may be adjusted as appropriate for
|
||||
the locale and may not exactly match those in the skeleton. For example:
|
||||
|
||||
* In Russian (locale "ru"), the skeleton "yMMMM" will produce the format
|
||||
pattern "LLLL y" (or "LLLL y 'г'.") since a month name without a day number
|
||||
must be in nominative form, as indicated by LLLL.
|
||||
* In Russian (locale "ru"), the skeleton `"yMMMM"` will produce the format
|
||||
pattern `"LLLL y"` (or "LLLL y 'г'.") since a month name without a day number
|
||||
must be in nominative form, as indicated by `LLLL`.
|
||||
* When using the Japanese calendar in the Japanese locale (locale
|
||||
"ja@calendar=japanese"), the skeleton "yMMMd" will produce the format
|
||||
"ja@calendar=japanese"), the skeleton `"yMMMd"` will produce the format
|
||||
pattern "Gy年M月d日" since the era must always be shown with the year in the
|
||||
Japanese calendar.
|
||||
|
||||
## DateFormatSymbols
|
||||
## `DateFormatSymbols`
|
||||
|
||||
DateFormatSymbols is a public class for encapsulating localizable date-time
|
||||
formatting data, including time zone data. DateFormatSymbols is used by
|
||||
DateFormat and SimpleDateFormat.
|
||||
`DateFormatSymbols` is a public class for encapsulating localizable date-time
|
||||
formatting data, including time zone data. `DateFormatSymbols` is used by
|
||||
`DateFormat` and `SimpleDateFormat`.
|
||||
|
||||
DateFormatSymbols specifies the exact character strings to use for various parts
|
||||
`DateFormatSymbols` specifies the exact character strings to use for various parts
|
||||
of a date or time. For example, the names of the months and days of the week, the
|
||||
strings for AM and PM and the day of the week considered to be the first day of
|
||||
the week (used in drawing calendar grids) are controlled by DateFormatSymbols.
|
||||
the week (used in drawing calendar grids) are controlled by `DateFormatSymbols`.
|
||||
|
||||
Create a date-time formatter using the `createTimeInstance`, `createDateInstance`,
|
||||
or `createDateTimeInstance` methods in DateFormat. Each of these methods can
|
||||
or `createDateTimeInstance` methods in `DateFormat`. Each of these methods can
|
||||
return a date/time formatter initialized with a default format pattern, along
|
||||
with the date-time formatting data for a given or default locale. After a
|
||||
formatter is created, modify the format pattern using `applyPattern`.
|
||||
|
||||
If you want to create a date-time formatter with a particular format pattern and
|
||||
locale, use one of the SimpleDateFormat constructors:
|
||||
locale, use one of the `SimpleDateFormat` constructors:
|
||||
|
||||
```cpp
|
||||
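UErrorCode status = U_ZERO_ERROR;
|
||||
UnicodeString aPattern("GyyyyMMddHHmmssSSZ", "");
|
||||
// The formatter adopts the DateFormatSymbols object; check status before use.
|
||||
SimpleDateFormat* fmt = new SimpleDateFormat(aPattern, new DateFormatSymbols(Locale::getUS(), status), status);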
|
||||
```
|
||||
|
||||
This loads the appropriate date-time formatting data from the locale.s
|
||||
This loads the appropriate date-time formatting data from the locale.
|
||||
|
||||
## Programming Examples
|
||||
|
||||
See [date and time formatting examples](examples.md) .
|
||||
See [date and time formatting examples](examples.md).
|
||||
|
@ -10,7 +10,7 @@ parent: Formatting Numbers
|
||||
License & terms of use: http://www.unicode.org/copyright.html
|
||||
-->
|
||||
|
||||
# FormattedValue
|
||||
# `FormattedValue`
|
||||
{: .no_toc }
|
||||
|
||||
## Contents
|
||||
@ -23,22 +23,22 @@ License & terms of use: http://www.unicode.org/copyright.html
|
||||
|
||||
## Overview
|
||||
|
||||
FormattedValue is an abstraction for localized strings with attributes
|
||||
returned by a number of ICU formatters. APIs for FormattedValue are available
|
||||
`FormattedValue` is an abstraction for localized strings with attributes
|
||||
returned by a number of ICU formatters. APIs for `FormattedValue` are available
|
||||
in Java, C++, and C. For more details and a list of all implementing classes,
|
||||
refer to the API docs:
|
||||
|
||||
- [C++ FormattedValue](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1FormattedValue.html)
|
||||
- [C UFormattedValue](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/globals_u.html) -- search for "resultAsValue"
|
||||
- [Java FormattedValue](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/FormattedValue.html)
|
||||
- [C++ `FormattedValue`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1FormattedValue.html)
|
||||
- [C `UFormattedValue`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/globals_u.html) -- search for "resultAsValue"
|
||||
- [Java `FormattedValue`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/FormattedValue.html)
|
||||
|
||||
## Nested Span Fields
|
||||
|
||||
Certain ICU formatters, like FormattedList and FormattedDateInterval, use
|
||||
Certain ICU formatters, like `FormattedList` and `FormattedDateInterval`, use
|
||||
*span fields* to return information about which spans of a string correspond
|
||||
to different input parameters. In C and C++, span fields are implemented
|
||||
using a field category, with the field being set to the input index; in Java,
|
||||
they are implemented by associating an Integer value with a SpanField
|
||||
they are implemented by associating an `Integer` value with a `SpanField`
|
||||
subclass.
|
||||
|
||||
For example, in C++, here is how you can determine which region in a formatted
|
||||
@ -86,6 +86,6 @@ assertFalse("No more than one occurrence of the field",
|
||||
```
|
||||
|
||||
A span may cover multiple primitive fields; in the above example, the span
|
||||
contains both a month and a date. Using FormattedValue, those primitive
|
||||
contains both a month and a date. Using `FormattedValue`, those primitive
|
||||
fields will also be present, and you can check their start and end indices to
|
||||
see if they are contained within a desired span.
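The containment check described here reduces to comparing index ranges. Below is a minimal sketch with a stand-in type; `FieldRange` and `contains` are hypothetical names, and the ranges are assumed to be half-open `[start, limit)` pairs such as the start and limit indices reported when iterating over a formatted value.

```cpp
#include <cassert>

// Half-open index range [start, limit) into the formatted string.
// Stand-in type for illustration; not an ICU class.
struct FieldRange {
    int start;
    int limit;
};

// Returns true if `inner` lies entirely within `outer`.
bool contains(const FieldRange& outer, const FieldRange& inner) {
    assert(outer.start <= outer.limit && inner.start <= inner.limit);
    return outer.start <= inner.start && inner.limit <= outer.limit;
}
```

A month field at `[0, 5)` is inside a span at `[0, 9)`, while a field at `[5, 12)` is not, because it extends past the span's limit.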
|
||||
|
@ -105,24 +105,22 @@ is displayed as 75%.
|
||||
|
||||
#### Customizing Number Formats
|
||||
|
||||
If you need to customize a number format you can use the DecimalFormat (§) and
|
||||
the DecimalFormatSymbols (§) classes in the [Formatting
|
||||
Numbers](formatparse/numbers/index.md) chapter. This not usually necessary and
|
||||
If you need to customize a number format you can use the `DecimalFormat` and
|
||||
the `DecimalFormatSymbols` classes in the [Formatting
|
||||
Numbers](numbers/index#formatting-numbers) chapter. This is not usually necessary and
|
||||
it makes your code much more complex, but it is available for those rare
|
||||
instances where you need it. In general, you would do this by explicitly
|
||||
specifying the number format pattern.
|
||||
|
||||
If you need to format or parse spelled-out numbers, you can use the
|
||||
RuleBasedNumberFormat class (§) (see the [Formatting
|
||||
Numbers](formatparse/numbers/index.md) chapter). You can instantiate a default
|
||||
formatter for a locale, or by using the RuleBasedNumberFormat rule syntax,
|
||||
specify your own.
|
||||
`RuleBasedNumberFormat` class (see the [Formatting Numbers](numbers/index#formatting-numbers) chapter).
|
||||
You can instantiate a default formatter for a locale, or by using the
|
||||
`RuleBasedNumberFormat` rule syntax, specify your own.
|
||||
|
||||
Using NumberFormat (§) class methods (see the [Formatting
|
||||
Numbers](formatparse/numbers/index.md) chapter) with a predefined locale is the
|
||||
easiest and the most accurate way to format numbers, and currencies.
|
||||
Using `NumberFormat` class methods (see the [Formatting Numbers](numbers/index#formatting-numbers) chapter)
|
||||
with a predefined locale is the easiest and the most accurate way to format numbers and currencies.
|
||||
|
||||
> :point_right: **Note**: *See [Properties and ICU Rule Syntax](strings/properties.md) for
|
||||
> :point_right: **Note**: *See [Properties and ICU Rule Syntax](../strings/properties) for
|
||||
information regarding syntax characters.*
|
||||
|
||||
### Date and Times
|
||||
@ -134,8 +132,8 @@ recognize 20.4.98 as a valid date, and Americans recognize 4/20/98.
|
||||
> :point_right: **Note**: *The appropriate Calendar support is required for different locales. For
|
||||
example, the Buddhist calendar is the official calendar in Thailand so the
|
||||
typical assumption of Gregorian Calendar usage should not be used. ICU will pick
|
||||
the appropriate Calendar based on the locale you supply when opening a Calendar
|
||||
or DateFormat.*
|
||||
the appropriate Calendar based on the locale you supply when opening a `Calendar`
|
||||
or `DateFormat`.*
|
||||
|
||||
### Messages
|
||||
|
||||
@ -154,72 +152,85 @@ and messages:
|
||||
### General Formatting
|
||||
|
||||
* `Format`:
|
||||
|
||||
The abstract superclass of all format classes. It provides the basic methods
|
||||
for formatting and parsing numbers, dates, strings and other objects.
|
||||
|
||||
* `FieldPosition`:
|
||||
|
||||
A concrete class for holding the field constant and the begin and end
|
||||
indices for number and date fields.
|
||||
|
||||
* `ParsePosition`:
|
||||
|
||||
A concrete class for holding the parse position in a string during parsing.
|
||||
|
||||
* `Formattable`:
|
||||
Formattable objects can be passed to the Format class or its subclasses for
|
||||
|
||||
`Formattable` objects can be passed to the `Format` class or its subclasses for
|
||||
formatting. It encapsulates a polymorphic piece of data to be formatted and
|
||||
is used with MessageFormat. Formattable is used by some formatting
|
||||
is used with `MessageFormat`. `Formattable` is used by some formatting
|
||||
operations to provide a single "type" that encompasses all formattable
|
||||
values (e.g., it can hold a number, a date, or a string, and so on).
|
||||
|
||||
* `UParseError`:
|
||||
UParseError is used to returned detailed information about parsing errors.
|
||||
|
||||
`UParseError` is used to return detailed information about parsing errors.
|
||||
It is used by the ICU parsing engines that parse long rules, patterns, or
|
||||
programs. This is helpful when the text being parsed is long enough that
|
||||
more information than a UErrorCode is needed to localize the error.
|
||||
more information than a `UErrorCode` is needed to localize the error.
|
||||
|
||||
**Formatting Numbers**
|
||||
|
||||
* [NumberFormat](numbers/legacy-numberformat) (§)
|
||||
* [`NumberFormat`](numbers/legacy-numberformat#numberformat)
|
||||
|
||||
The abstract superclass that provides the basic fields and methods for
|
||||
formatting Number objects and number primitives to localized strings and
|
||||
parsing localized strings to Number objects.
|
||||
formatting `Number` objects and number primitives to localized strings and
|
||||
parsing localized strings to `Number` objects.
|
||||
|
||||
* [DecimalFormat](numbers/legacy-numberformat#decimalformat) (§)
|
||||
A concrete class for formatting Number objects and number primitives to
|
||||
localized strings and parsing localized strings to Number objects, in base
|
||||
10.
|
||||
* [`DecimalFormat`](numbers/legacy-numberformat#decimalformat)
|
||||
|
||||
A concrete class for formatting `Number` objects and number primitives to
|
||||
localized strings and parsing localized strings to `Number` objects, in base 10.
|
||||
|
||||
* [RuleBasedNumberFormat](numbers/rbnf) (§)
|
||||
A concrete class for formatting Number objects and number primitives to
|
||||
* [`RuleBasedNumberFormat`](numbers/rbnf)
|
||||
|
||||
A concrete class for formatting `Number` objects and number primitives to
|
||||
localized text, especially spelled-out format such as found in check writing
|
||||
(e.g. "two hundred and thirty-four"), and parsing text into Number objects.
|
||||
(e.g. "two hundred and thirty-four"), and parsing text into `Number` objects.
|
||||
|
||||
* [DecimalFormatSymbols](numbers/legacy-numberformat#decimalformatsymbols) (§)
|
||||
* [`DecimalFormatSymbols`](numbers/legacy-numberformat#decimalformatsymbols)
|
||||
|
||||
A concrete class for accessing localized number strings, such as the
|
||||
grouping separators, decimal separator, and percent sign. Used by
|
||||
DecimalFormat.
|
||||
`DecimalFormat`.
|
||||
|
||||
**Formatting Dates and Times**
|
||||
|
||||
* [DateFormat](datetime/index#dateformat) (§)
|
||||
* [`DateFormat`](datetime/index#dateformat)
|
||||
|
||||
The abstract superclass that provides the basic fields and methods for
|
||||
formatting Date objects to localized strings and parsing date and time
|
||||
strings to Date objects.
|
||||
formatting `Date` objects to localized strings and parsing date and time
|
||||
strings to `Date` objects.
|
||||
|
||||
* [SimpleDateFormat](datetime/index#simpledateformat) (§)
|
||||
A concrete class for formatting Date objects to localized strings and
|
||||
parsing date and time strings to Date objects, using a GregorianCalendar.
|
||||
* [`SimpleDateFormat`](datetime/index#simpledateformat)
|
||||
|
||||
A concrete class for formatting `Date` objects to localized strings and
|
||||
parsing date and time strings to `Date` objects, using a `GregorianCalendar`.
|
||||
|
||||
* [DateFormatSymbols](datetime/index#dateformatsymbols) (§)
|
||||
* [`DateFormatSymbols`](datetime/index#dateformatsymbols)
|
||||
|
||||
A concrete class for accessing localized date-time formatting strings, such
|
||||
as names of the months, days of the week and the time zone.
|
||||
|
||||
**Formatting Messages**
|
||||
|
||||
* [MessageFormat](messages/index#messageformat) (§)
|
||||
* [`MessageFormat`](messages/index#messageformat)
|
||||
|
||||
A concrete class for producing a language-specific user message that
|
||||
contains numbers, currency, percentages, date, time and string variables.
|
||||
|
||||
* [ChoiceFormat](messages/index) (§)
|
||||
* [`ChoiceFormat`](messages/examples#choiceformat-class)
|
||||
|
||||
A concrete class for mapping strings to ranges of numbers and for handling
|
||||
plurals and names series in user messages.
|
||||
|
@ -21,9 +21,9 @@ License & terms of use: http://www.unicode.org/copyright.html
|
||||
|
||||
---
|
||||
|
||||
## MessageFormat Class
|
||||
## `MessageFormat` Class
|
||||
|
||||
ICU's MessageFormat class can be used to format messages in a locale-independent
|
||||
ICU's `MessageFormat` class can be used to format messages in a locale-independent
|
||||
manner to localize the user interface (UI) strings.
|
||||
|
||||
### C++
|
||||
@ -269,17 +269,19 @@ public class TestMessageFormat{
|
||||
}
|
||||
```
|
||||
|
||||
## ChoiceFormat Class
|
||||
## `ChoiceFormat` Class
|
||||
|
||||
**Important:** The following documentation is outdated. *ChoiceFormat is
|
||||
probably not what you need. Please use MessageFormat with plural arguments for
|
||||
**Important:** The following documentation is outdated. *`ChoiceFormat` is
|
||||
probably not what you need. Please use `MessageFormat` with plural arguments for
|
||||
proper plural selection, and select arguments for simple selection among a fixed
|
||||
set of choices!*
|
||||
|
||||
ICU's ChoiceFormat class provides more flexibility than the printf() and scanf()
|
||||
ICU's `ChoiceFormat` class provides more flexibility than the `printf()` and `scanf()`
|
||||
style functions for formatting UI strings. This interface can be useful if you
|
||||
would like a message to change according to the number of items you are
|
||||
displaying. Note: Some Asian languages do not have plural words or phrases.
|
||||
displaying.
|
||||
|
||||
Note: Some Asian languages do not have plural words or phrases.
|
||||
|
||||
### C++
|
||||
|
||||
|
@ -35,13 +35,13 @@ and formatted elements, then translators would not be able to rearrange the
|
||||
pieces, and they would have a hard time translating each of the string
|
||||
fragments.
|
||||
|
||||
## MessageFormat
|
||||
## `MessageFormat`
|
||||
|
||||
The ICU **MessageFormat** class uses message "pattern" strings with
|
||||
The ICU **`MessageFormat`** class uses message `"pattern"` strings with
|
||||
variable-element placeholders (called "arguments" in the API docs) enclosed in
|
||||
{curly braces}. The argument syntax can include formatting details, otherwise a
|
||||
default format is used. For details about the pattern syntax and the formatting
|
||||
behavior see the MessageFormat API docs
|
||||
behavior see the `MessageFormat` API docs
|
||||
([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/MessageFormat.html),
|
||||
[C++](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classMessageFormat.html#_details),
|
||||
[C](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/umsg_8h.html#_details)).
|
||||
@ -49,15 +49,15 @@ behavior see the MessageFormat API docs
|
||||
### Complex Argument Types
|
||||
|
||||
Certain types of arguments select among several choices which are nested
|
||||
MessageFormat pattern strings. Keeping these choices together in one message
|
||||
`MessageFormat` pattern strings. Keeping these choices together in one message
|
||||
pattern string facilitates translation in context, by one single translator.
|
||||
(Commercial translation systems often distribute different messages to different
|
||||
translators.)
|
||||
|
||||
* Use a "plural" argument to select sub-messages based on a numeric value,
|
||||
* Use a `"plural"` argument to select sub-messages based on a numeric value,
|
||||
together with the plural rules for the specified language.
|
||||
* Use a "select" argument to select sub-messages via a fixed set of keywords.
|
||||
* Use of the old "choice" argument type is discouraged. It cannot handle
|
||||
* Use a `"select"` argument to select sub-messages via a fixed set of keywords.
|
||||
* Use of the old `"choice"` argument type is discouraged. It cannot handle
|
||||
plural rules for many languages, and is clumsy for simple selection.
|
||||
|
||||
It is tempting to cover only a minimal part of a message string with a complex
|
||||
@ -115,7 +115,7 @@ language](http://cldr.unicode.org/index/cldr-spec/plural-rules).*
|
||||
|
||||
If syntax characters occur in the text portions, then they need to be quoted by
|
||||
enclosing the syntax in pairs of ASCII apostrophes. A pair of ASCII apostrophes
|
||||
always represents one ASCII apostrophe, similar to %% in printf representing one %,
|
||||
always represents one ASCII apostrophe, similar to `%%` in `printf` representing one `%`,
|
||||
although this rule still applies inside quoted text. ("`This '{isn''t}' obvious`" → "`This {isn't} obvious`")
|
||||
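The quoting rule above can be tried out with a small sketch. It uses the JDK's `java.text.MessageFormat` as a stand-in, on the assumption that ICU4J's `com.ibm.icu.text.MessageFormat` gives the same result for this particular pattern; the class name `ApostropheQuoting` and the helper `render()` are made up for illustration.

```java
import java.text.MessageFormat;

public class ApostropheQuoting {
    // Formats a pattern that takes no arguments, so only the quoting rules matter.
    static String render(String pattern) {
        return new MessageFormat(pattern).format(new Object[0]);
    }

    public static void main(String[] args) {
        // The quoted braces are literal text; the doubled apostrophe becomes one apostrophe.
        System.out.println(render("This '{isn''t}' obvious"));
        // A doubled apostrophe outside quoted text also yields a single apostrophe.
        System.out.println(render("Don''t"));
    }
}
```

Patterns with unpaired apostrophes are deliberately left out of this sketch, since their handling may differ between implementations and versions.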
|
||||
* Before ICU 4.8, ASCII apostrophes always started quoted text and had
|
||||
@ -197,17 +197,17 @@ should include regional variants (e.g., “fr-CA”).
|
||||
|
||||
#### Custom Format Objects (discouraged)
|
||||
|
||||
The MessageFormat class allows setting custom Format objects to format
|
||||
The `MessageFormat` class allows setting custom Format objects to format
|
||||
arguments, overriding the arguments' pattern specification. This is discouraged:
|
||||
For custom formatting of some values it should normally suffice to format them
|
||||
externally and to provide the formatted strings to the `MessageFormat.format()`
|
||||
methods.
|
||||
|
||||
Only the top-level arguments are accessible and settable via setFormat(),
|
||||
Only the top-level arguments are accessible and settable via `setFormat()`,
|
||||
`getFormat()` etc. Arguments inside nested sub-messages, inside
|
||||
choice/plural/select arguments, are "invisible" via these API methods.
|
||||
|
||||
Some of these methods (the ones corresponding to the original JDK MessageFormat
|
||||
Some of these methods (the ones corresponding to the original JDK `MessageFormat`
|
||||
API) address the top-level arguments in their order of appearance in the pattern
|
||||
string, which is usually not useful because it varies with translations. Newer
|
||||
methods address arguments by argument number ("index") or name.
|
||||
@ -234,5 +234,5 @@ was a disturbance in the Force on planet 7."
|
||||
err);
|
||||
```
|
||||
|
||||
There are several more usage examples for the MessageFormat and ChoiceFormat
|
||||
There are several more usage examples for the `MessageFormat` and `ChoiceFormat`
|
||||
classes in [C, C++ and Java](examples.md).
|
||||
|
@ -13,8 +13,8 @@ License & terms of use: http://www.unicode.org/copyright.html
|
||||
# Formatting Numbers
|
||||
|
||||
Since ICU 60, the recommended mechanism for formatting numbers is
|
||||
[NumberFormatter](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/numberformatter_8h.html)
|
||||
([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/number/NumberFormatter.html)). NumberFormatter supports the formatting of:
|
||||
[`NumberFormatter`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/numberformatter_8h.html)
|
||||
([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/number/NumberFormatter.html)). `NumberFormatter` supports the formatting of:
|
||||
|
||||
- Decimal Formatting
|
||||
- Currencies
|
||||
@ -23,8 +23,8 @@ Since ICU 60, the recommended mechanism for formatting numbers is
|
||||
- Scientific Notation
|
||||
- Compact Notation
|
||||
|
||||
For number ranges, including currency and measurement unit ranges, see [NumberRangeFormatter](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/numberrangeformatter_8h.html) ([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/number/NumberRangeFormatter.html)).
|
||||
For number ranges, including currency and measurement unit ranges, see [`NumberRangeFormatter`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/numberrangeformatter_8h.html) ([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/number/NumberRangeFormatter.html)).
|
||||
|
||||
For rule-based number formatting, including spellout rules and support for traditional numbering systems not covered by base-10 decimal digits, see [rbnf.md](rbnf.md).
|
||||
|
||||
For the classic NumberFormat class, which also includes legacy parsing support for localized number strings, see [legacy-numberformat.md](legacy-numberformat.md).
|
||||
For the classic `NumberFormat` class, which also includes legacy parsing support for localized number strings, see [legacy-numberformat.md](legacy-numberformat.md).
|
||||
|
@ -10,7 +10,7 @@ parent: Formatting Numbers
|
||||
License & terms of use: http://www.unicode.org/copyright.html
|
||||
-->
|
||||
|
||||
# Legacy NumberFormat
|
||||
# Legacy `NumberFormat`
|
||||
{: .no_toc }
|
||||
|
||||
## Contents
|
||||
@ -27,12 +27,12 @@ License & terms of use: http://www.unicode.org/copyright.html
|
||||
>
|
||||
> This page is here for reference for the older NumberFormat hierarchy in ICU4C and ICU4J.
|
||||
|
||||
## NumberFormat
|
||||
## `NumberFormat`
|
||||
|
||||
[NumberFormat](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classNumberFormat.html) is
|
||||
[`NumberFormat`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classNumberFormat.html) is
|
||||
the abstract base class for all number formats. It provides an interface for
|
||||
formatting and parsing numbers. It also provides methods to determine which
|
||||
locales have number formats, and what their names are. NumberFormat helps format
|
||||
locales have number formats, and what their names are. `NumberFormat` helps format
|
||||
and parse numbers for any locale. Your program can be written to be completely
|
||||
independent of the locale conventions for decimal points or
|
||||
thousands-separators. It can also be written to be independent of the particular
|
||||
@ -48,42 +48,41 @@ $1234.50 //U.S. currency
|
||||
|
||||
### Usage
|
||||
|
||||
#### Formatting for a Locale
|
||||
#### Formatting for a `Locale`
|
||||
|
||||
To format a number for the current Locale, use one of the static factory methods
|
||||
To format a number for the current `Locale`, use one of the static factory methods
|
||||
to create a format, then call a format method to format it. To format a number
|
||||
for a different Locale, specify the Locale in the call to createInstance(). You
|
||||
for a different `Locale`, specify the `Locale` in the call to `createInstance()`. You
|
||||
can control the numbering system to be used for number formatting by creating a
|
||||
Locale that uses the @numbers keyword defined. For example, by default, the Thai
|
||||
`Locale` that has the `@numbers` keyword defined. For example, by default, the Thai
|
||||
locale "th" uses the western digits 0-9. To create a number format that uses the
|
||||
native Thai digits instead, first create a locale with "@numbers=thai" defined.
|
||||
native Thai digits instead, first create a locale with `"@numbers=thai"` defined.
|
||||
See [the description on Locales](../../locale/index.md) for details.
|
||||
|
||||
> :point_right: **Note**: If you are formatting multiple numbers, save processing time
|
||||
> by constructing the formatter once and then using it several times.
|
||||
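As a rough illustration of locale-specific formatting with a formatter that is built once and reused, here is a sketch using the JDK's `java.text.NumberFormat`. It assumes `NumberFormat.getInstance(Locale)` plays the role that `createInstance()` plays in ICU4C; the class `LocaleFormatting` and its method `formatAll()` are hypothetical names.

```java
import java.text.NumberFormat;
import java.util.Locale;

public class LocaleFormatting {
    // Builds the formatter once for the locale, then reuses it for every value.
    static String[] formatAll(Locale locale, double... values) {
        NumberFormat nf = NumberFormat.getInstance(locale);
        String[] out = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = nf.format(values[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : formatAll(Locale.US, 1234.5, 0.25)) {
            System.out.println(s);
        }
        for (String s : formatAll(Locale.GERMANY, 1234.5, 0.25)) {
            System.out.println(s);
        }
    }
}
```

The same value comes out with different grouping and decimal separators depending on the locale, while the calling code stays locale-independent.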
|
||||
#### Instantiating a NumberFormat
|
||||
#### Instantiating a `NumberFormat`
|
||||
|
||||
The following methods are used for instantiating NumberFormat objects:
|
||||
The following methods are used for instantiating `NumberFormat` objects:
|
||||
|
||||
1. **createInstance()**
|
||||
1. **`createInstance()`**
|
||||
Returns the normal number format for the current locale or for a specified
|
||||
locale.
|
||||
|
||||
2. **createCurrencyInstance()**
|
||||
2. **`createCurrencyInstance()`**
|
||||
Returns the currency format for the current locale or for a specified
|
||||
locale.
|
||||
|
||||
3. **createPercentInstance()**
|
||||
3. **`createPercentInstance()`**
|
||||
Returns the percentage format for the current locale or for a specified
|
||||
locale.
|
||||
|
||||
4. **createScientificInstance()**
|
||||
4. **`createScientificInstance()`**
|
||||
Returns the scientific number format for the current locale or for a
|
||||
specified locale.
|
||||
|
||||
To create a format for spelled-out numbers, use a constructor on
|
||||
RuleBasedNumberFormat (§).
|
||||
To create a format for spelled-out numbers, use a constructor on `RuleBasedNumberFormat`.
|
||||
|
||||
#### Currency Formatting
|
||||
|
||||
@ -94,7 +93,7 @@ based on the country code in the locale ID. However, for all but trivial uses,
|
||||
this is fragile because countries change currencies over time, and the locale
|
||||
data for a particular country may not be available.
|
||||
|
||||
For proper currency formatting, both the number and the currency must be
|
||||
For proper currency formatting, both number and currency must be
|
||||
specified. Aside from achieving reliably correct results, this also makes it possible to
|
||||
format monetary values in any currency with the format of any locale, like in
|
||||
exchange rate lists. If the locale data does not contain display symbols or
|
||||
@ -112,21 +111,20 @@ the Euro currency.
|
||||
|
||||
In terms of APIs, the currency code is set as an attribute on a number format
|
||||
object (on a currency instance), while the number value is passed into each
|
||||
format() call or returned from parse() as usual.
|
||||
`format()` call or returned from `parse()` as usual.
|
||||
|
||||
1. ICU4C (C++) NumberFormat.setCurrency() takes a Unicode string (const UChar
|
||||
\*) with the 3-letter code.
|
||||
1. ICU4C (C++) `NumberFormat::setCurrency()` takes a Unicode string (`const UChar*`) with the 3-letter code.
|
||||
|
||||
2. ICU4C (C API) allows to set the currency code via unum_setTextAttribute()
|
||||
using the UNUM_CURRENCY_CODE selector.
|
||||
2. ICU4C (C API) allows setting the currency code via `unum_setTextAttribute()`
|
||||
using the `UNUM_CURRENCY_CODE` selector.
|
||||
|
||||
3. ICU4J NumberFormat.setCurrency() takes an ICU Currency object which
|
||||
3. ICU4J `NumberFormat.setCurrency()` takes an ICU Currency object which
|
||||
encapsulates the 3-letter code.
|
||||
|
||||
4. The base JDK's NumberFormat.setCurrency() takes a JDK Currency object which
|
||||
4. The base JDK's `NumberFormat.setCurrency()` takes a JDK Currency object which
|
||||
encapsulates the 3-letter code.
|
||||
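The last item in the list can be sketched with the base JDK classes. This is a minimal example, not ICU code: the JDK's `setCurrency()` does not update the number of fraction digits, so the sketch sets them explicitly from `Currency.getDefaultFractionDigits()`. The class name `CurrencyFormatting` and the helper `format()` are assumptions made for illustration.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

public class CurrencyFormatting {
    // Formats an amount in the given currency using the conventions of the given locale.
    static String format(Locale locale, String isoCode, double amount) {
        NumberFormat nf = NumberFormat.getCurrencyInstance(locale);
        Currency currency = Currency.getInstance(isoCode);
        nf.setCurrency(currency);
        // The base JDK leaves the fraction digits unchanged, so adjust them by hand.
        int digits = currency.getDefaultFractionDigits();
        if (digits >= 0) {
            nf.setMinimumFractionDigits(digits);
            nf.setMaximumFractionDigits(digits);
        }
        return nf.format(amount);
    }

    public static void main(String[] args) {
        System.out.println(format(Locale.US, "EUR", 1234.5));
        System.out.println(format(Locale.US, "JPY", 1500));
    }
}
```

The exact currency symbol in the output depends on the locale data available to the runtime, which is why the number and the currency are both passed in explicitly.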
|
||||
The functionality of Currency and setCurrency() is more advanced in ICU than in
|
||||
The functionality of `Currency` and `setCurrency()` is more advanced in ICU than in
|
||||
the base JDK. When using ICU, setting the currency automatically adjusts the
|
||||
number format object appropriately, i.e., it sets not only the currency symbol
|
||||
and display name, but also the correct number of fraction digits and the correct
|
||||
@ -135,20 +133,20 @@ the API references for more details.
|
||||
|
||||
There is ICU4C sample code at
|
||||
[icu4c/source/samples/numfmt/main.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/numfmt/main.cpp)
|
||||
which illustrates the use of NumberFormat.setCurrency().
|
||||
which illustrates the use of `NumberFormat.setCurrency()`.
|
||||
|
||||
#### Displaying Numbers
|
||||
|
||||
You can also control the display of numbers with methods such as
|
||||
getMinimumFractionDigits. If you want even more control over the format or
|
||||
parsing, or want to give your users more control, cast the NumberFormat returned
|
||||
from the factory methods to a DecimalNumberFormat. This works for the vast
|
||||
`getMinimumFractionDigits()`. If you want even more control over the format or
|
||||
parsing, or want to give your users more control, cast the `NumberFormat` returned
|
||||
from the factory methods to a `DecimalFormat`. This works for the vast
|
||||
majority of countries.
|
||||
|
||||
#### Working with Positions
|
||||
|
||||
You can also use forms of the parse and format methods with ParsePosition and
|
||||
UFieldPosition to enable you to:
|
||||
You can also use forms of the parse and format methods with `ParsePosition` and
|
||||
`UFieldPosition` to enable you to:
|
||||
|
||||
1. progressively parse through pieces of a string.
|
||||
|
||||
@ -157,57 +155,57 @@ UFieldPosition to enable you to:
|
||||
For example, you can align numbers in two ways:
|
||||
|
||||
1. If you are using a mono-spaced font with spacing for alignment, pass the
|
||||
FieldPosition in your format call with field = INTEGER_FIELD. On output,
|
||||
getEndIndex is set to the offset between the last character of the integer
|
||||
and the decimal. Add (desiredSpaceCount - getEndIndex) spaces at the front
|
||||
`FieldPosition` in your format call with `field = INTEGER_FIELD`. On output,
|
||||
`getEndIndex` is set to the offset between the last character of the integer
|
||||
and the decimal. Add `(desiredSpaceCount - getEndIndex)` spaces at the front
|
||||
of the string. You can also use the space padding feature available in
|
||||
DecimalFormat.
|
||||
`DecimalFormat`.
|
||||
|
||||
2. If you are using proportional fonts, instead of padding with spaces, measure
|
||||
the width of the string in pixels from the start to getEndIndex. Then move
|
||||
the pen by (desiredPixelWidth - widthToAlignmentPoint) before drawing the
|
||||
the width of the string in pixels from the start to `getEndIndex`. Then move
|
||||
the pen by `(desiredPixelWidth - widthToAlignmentPoint)` before drawing the
|
||||
text. It also works where there is no decimal, but additional characters at
|
||||
the end (for example, with parentheses in negative numbers: "(12)" for -12).
|
||||
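The first alignment approach (space padding for a mono-spaced font) might look roughly like this with the JDK's `java.text` classes, which are assumed here to mirror the ICU API closely enough for illustration. The class `DecimalAlignment`, the method `alignAtDecimal()` and the parameter `integerColumns` are hypothetical.

```java
import java.text.FieldPosition;
import java.text.NumberFormat;
import java.util.Locale;

public class DecimalAlignment {
    // Left-pads the formatted number so that the integer part ends at column integerColumns.
    static String alignAtDecimal(NumberFormat nf, double value, int integerColumns) {
        FieldPosition pos = new FieldPosition(NumberFormat.INTEGER_FIELD);
        StringBuffer text = nf.format(value, new StringBuffer(), pos);
        // getEndIndex() is the offset just past the last character of the integer part.
        int padding = Math.max(0, integerColumns - pos.getEndIndex());
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < padding; i++) {
            out.append(' ');
        }
        return out.append(text).toString();
    }

    public static void main(String[] args) {
        NumberFormat nf = NumberFormat.getInstance(Locale.US);
        System.out.println(alignAtDecimal(nf, 12.5, 6));
        System.out.println(alignAtDecimal(nf, 1234.5, 6));
        System.out.println(alignAtDecimal(nf, 42, 6));
    }
}
```

With a proportional font, the same `getEndIndex()` value would instead be used to measure the pixel width of the text up to the alignment point.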
|
||||
#### Emulating printf
|
||||
#### Emulating `printf`
|
||||
|
||||
NumberFormat can produce many of the same formats as printf.
|
||||
`NumberFormat` can produce many of the same formats as `printf`.
|
||||
|
||||
| printf | ICU |
|
||||
|--------|-----|
|
||||
| Width specifier, e.g., "%5d" has a width of 5. | Use DecimalFormat. Either specify the padding, with can pad with any character, or specify a minimum integer count and a minimum fraction count, which will emit a specific number of digits, with zero padded to the left and right. |
|
||||
| Precision specifier for %f and %e, e.g. "%.6f" or "%.6e". This defines the number of digits to the right of the decimal point. | Use DecimalFormat. Specify the maximum fraction digits. |
|
||||
| General scientific notation, %g. This format uses either %f or %e, depending on the magnitude of the number being displayed. | Use ChoiceFormat with DecimalFormat. For example, for a typical %g, which has 6 significant digits, use a ChoiceFormat with thresholds of 1e-4 and 1e6. For values between the two thresholds, use a fixed DecimalFormat with the pattern "@#####". For values outside the thresholds, use a DecimalFormat with the pattern "@#####E0". |
|
||||
| Width specifier, e.g., `"%5d"` has a width of 5. | Use `DecimalFormat`. Either specify the padding, which can pad with any character, or specify a minimum integer count and a minimum fraction count, which will emit a specific number of digits, with zero padded to the left and right. |
|
||||
| Precision specifier for `%f` and `%e`, e.g. `"%.6f"` or `"%.6e"`. This defines the number of digits to the right of the decimal point. | Use `DecimalFormat`. Specify the maximum fraction digits. |
|
||||
| General scientific notation, `%g`. This format uses either `%f` or `%e`, depending on the magnitude of the number being displayed. | Use `ChoiceFormat` with `DecimalFormat`. For example, for a typical `%g`, which has 6 significant digits, use a `ChoiceFormat` with thresholds of 1e-4 and 1e6. For values between the two thresholds, use a fixed `DecimalFormat` with the pattern `"@#####"`. For values outside the thresholds, use a `DecimalFormat` with the pattern `"@#####E0"`. |
|
||||
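Two rows of the table can be sketched with the JDK's `java.text.NumberFormat`, used here as an assumed stand-in for the ICU classes. The `%g` row is not covered, because the `@` significant-digit patterns are not available in the JDK's `DecimalFormat`. The class `PrintfEmulation` and its methods are hypothetical names.

```java
import java.text.NumberFormat;
import java.util.Locale;

public class PrintfEmulation {
    // Roughly "%.Nf": a fixed number of digits to the right of the decimal point.
    static String fixedFraction(double value, int digits) {
        NumberFormat nf = NumberFormat.getInstance(Locale.US);
        nf.setGroupingUsed(false);
        nf.setMinimumFractionDigits(digits);
        nf.setMaximumFractionDigits(digits);
        return nf.format(value);
    }

    // Roughly "%0Nd": a minimum integer digit count, zero-padded on the left.
    static String zeroPadded(long value, int width) {
        NumberFormat nf = NumberFormat.getInstance(Locale.US);
        nf.setGroupingUsed(false);
        nf.setMinimumIntegerDigits(width);
        return nf.format(value);
    }

    public static void main(String[] args) {
        System.out.println(fixedFraction(3.14159, 2));
        System.out.println(zeroPadded(42, 5));
    }
}
```

Rounding of exact ties may differ from `printf`, since the formatter applies its own rounding mode.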
|
||||
## DecimalFormat
|
||||
## `DecimalFormat`
|
||||
|
||||
DecimalFormat is a NumberFormat that converts numbers into strings using the
|
||||
`DecimalFormat` is a `NumberFormat` that converts numbers into strings using the
|
||||
decimal numbering system. This is the formatter that provides standard number
|
||||
formatting and parsing services for most usage scenarios in most locales. In
|
||||
order to access features of DecimalFormat not exposed in the NumberFormat API,
|
||||
you may need to cast your NumberFormat object to a DecimalFormat. You may also
|
||||
construct a DecimalFormat directly, but this is not recommended because it can
|
||||
order to access features of `DecimalFormat` not exposed in the `NumberFormat` API,
|
||||
you may need to cast your `NumberFormat` object to a `DecimalFormat`. You may also
|
||||
construct a `DecimalFormat` directly, but this is not recommended because it can
|
||||
hinder proper localization.
|
||||
|
||||
For a complete description of DecimalFormat, including the pattern syntax,
|
||||
formatting and parsing behavior, and available API, see the [ICU4J DecimalFormat
|
||||
For a complete description of `DecimalFormat`, including the pattern syntax,
|
||||
formatting and parsing behavior, and available API, see the [ICU4J `DecimalFormat`
|
||||
API](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/DecimalFormat.html) or
|
||||
[ICU4C DecimalFormat
|
||||
[ICU4C `DecimalFormat`
|
||||
API](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classDecimalFormat.html) documentation.
|
||||
|
||||
## DecimalFormatSymbols
|
||||
## `DecimalFormatSymbols`
|
||||
|
||||
[DecimalFormatSymbols](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classDecimalFormatSymbols.html)
|
||||
specifies the exact characters a DecimalFormat uses for various parts of a
|
||||
[`DecimalFormatSymbols`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classDecimalFormatSymbols.html)
|
||||
specifies the exact characters a `DecimalFormat` uses for various parts of a
|
||||
number (such as the characters to use for the digits, the character to use as
|
||||
the decimal point, or the character to use as the minus sign).
|
||||
|
||||
This class represents the set of symbols needed by DecimalFormat to format
|
||||
numbers. DecimalFormat creates its own instance of DecimalFormatSymbols from its
|
||||
locale data. The DecimalFormatSymbols can be adopted by a DecimalFormat
|
||||
instance, or it can be specified when a DecimalFormat is created. If you need to
|
||||
change any of these symbols, can get the DecimalFormatSymbols object from your
|
||||
DecimalFormat and then modify it.
|
||||
This class represents the set of symbols needed by `DecimalFormat` to format
|
||||
numbers. `DecimalFormat` creates its own instance of `DecimalFormatSymbols` from its
|
||||
locale data. The `DecimalFormatSymbols` can be adopted by a `DecimalFormat`
|
||||
instance, or it can be specified when a `DecimalFormat` is created. If you need to
|
||||
change any of these symbols, you can get the `DecimalFormatSymbols` object from your
|
||||
`DecimalFormat` and then modify it.
|
||||
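Getting the symbols from a formatter, changing them, and handing them back could be sketched as follows. The example uses the JDK's `java.text.DecimalFormat` and `DecimalFormatSymbols`, assumed to be close analogues of the ICU classes; the class name `CustomSymbols` and the method `withSwappedSeparators()` are made up for illustration.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.Locale;

public class CustomSymbols {
    // Formats with en-US rules, but with the decimal and grouping separators swapped.
    static String withSwappedSeparators(double value) {
        DecimalFormat df = (DecimalFormat) NumberFormat.getInstance(Locale.US);
        DecimalFormatSymbols symbols = df.getDecimalFormatSymbols();
        symbols.setDecimalSeparator(',');
        symbols.setGroupingSeparator('.');
        // The formatter only sees the change once the modified symbols are set back.
        df.setDecimalFormatSymbols(symbols);
        return df.format(value);
    }

    public static void main(String[] args) {
        System.out.println(withSwappedSeparators(1234.5));
    }
}
```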
|
||||
## Additional Sample Code
|
||||
|
||||
|
@ -10,7 +10,7 @@ parent: Formatting Numbers
|
||||
License & terms of use: http://www.unicode.org/copyright.html
|
||||
-->
|
||||
|
||||
# RuleBasedNumberFormat Examples
|
||||
# `RuleBasedNumberFormat` Examples
|
||||
{: .no_toc }
|
||||
|
||||
## Contents
|
||||
@ -21,10 +21,10 @@ License & terms of use: http://www.unicode.org/copyright.html
|
||||
|
||||
---
|
||||
|
||||
## Annotated RuleBasedNumberFormat Example
|
||||
## Annotated `RuleBasedNumberFormat` Example
|
||||
|
||||
The following example provides a quick idea of how the rules work. The
|
||||
[RuleBasedNumberFormat API
|
||||
[`RuleBasedNumberFormat` API
|
||||
documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html)
|
||||
describes the rule syntax in more detail.
|
||||
|
||||
@ -103,9 +103,7 @@ Rulesets are invoked by first applying negative and fractional rules, then by
|
||||
finding the rule whose range includes the current value and applying that rule,
|
||||
recursing as directed by the rule. Again, a complete description of the rule
|
||||
syntax can be found in the [API
|
||||
Documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html)
|
||||
.
|
||||
Documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html).
|
||||
|
||||
More rule examples can be found in the RuleBasedNumberFormat [demo
|
||||
source](https://github.com/unicode-org/icu/blob/master/icu4j/demos/src/com/ibm/icu/dev/demo/rbnf/RbnfSampleRuleSets.java)
|
||||
.
|
||||
More rule examples can be found in the `RuleBasedNumberFormat` [demo
|
||||
source](https://github.com/unicode-org/icu/blob/master/icu4j/demos/src/com/ibm/icu/dev/demo/rbnf/RbnfSampleRuleSets.java).
|
||||
|
@ -133,6 +133,6 @@ numbers, and so are best suited for formatting numbers in ways that are
|
||||
inherently recursive.
|
||||
|
||||
A full explanation of this example can be found in the [RuleBasedNumberFormat
|
||||
examples](rbnf-examples.md) . A complete description of the rule syntax can be
|
||||
examples](rbnf-examples.md). A complete description of the rule syntax can be
|
||||
found in the [RuleBasedNumberFormat API
|
||||
Documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html).
|
||||
|
@ -23,11 +23,11 @@ License & terms of use: http://www.unicode.org/copyright.html
|
||||
|
||||
## Overview
|
||||
|
||||
Number skeletons are a locale-agnostic way to configure a NumberFormatter in
|
||||
ICU. Number skeletons work in MessageFormat.
|
||||
Number skeletons are a locale-agnostic way to configure a `NumberFormatter` in
|
||||
ICU. Number skeletons work in `MessageFormat`.
|
||||
|
||||
Number skeletons consist of case-sensitive tokens that correspond to settings
|
||||
in ICU NumberFormatter. For example, to format a currency in compact notation
|
||||
in ICU `NumberFormatter`. For example, to format a currency in compact notation
|
||||
with the sign always shown, you could use this skeleton:
|
||||
|
||||
sign-always compact-short currency/GBP
|
||||
@ -36,7 +36,7 @@ with the sign always shown, you could use this skeleton:
|
||||
|
||||
+! K currency/GBP
|
||||
|
||||
To use a skeleton in MessageFormat, use the "number" type and prefix the
|
||||
To use a skeleton in `MessageFormat`, use the `"number"` type and prefix the
|
||||
skeleton with `::`
|
||||
|
||||
{0, number, :: +! K currency/GBP}
|
||||
@ -48,9 +48,9 @@ concise skeletons.
|
||||
## Syntax
|
||||
|
||||
A token consists of a *stem* and zero or more *options*. The stem is what
|
||||
occurs before the first "/" character in a token, and the options are each of
|
||||
the subsequent "/"-delimited strings. For example, "compact-short" and
|
||||
"currency" are stems, and "GBP" is an option.
|
||||
occurs before the first `"/"` character in a token, and the options are each of
|
||||
the subsequent `"/"`-delimited strings. For example, `"compact-short"` and
|
||||
"currency" are stems, and `"GBP"` is an option.
|
||||
|
||||
Tokens are space-separated, with exceptions for concise skeletons listed at
|
||||
the end of this document.
|
||||
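To make the stem and option terminology concrete, here is a small sketch that splits a long-form skeleton into tokens, stems and options. It is plain string handling, not ICU's skeleton parser, and it ignores concise skeletons; the class `SkeletonTokens` and all of its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkeletonTokens {
    // Tokens are separated by whitespace.
    static List<String> tokens(String skeleton) {
        return new ArrayList<>(Arrays.asList(skeleton.trim().split("\\s+")));
    }

    // The stem is everything before the first "/" in a token.
    static String stem(String token) {
        int slash = token.indexOf('/');
        return slash < 0 ? token : token.substring(0, slash);
    }

    // The options are the "/"-delimited strings after the stem.
    static List<String> options(String token) {
        String[] parts = token.split("/");
        return new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
    }

    public static void main(String[] args) {
        for (String token : tokens("sign-always compact-short currency/GBP")) {
            System.out.println(stem(token) + " " + options(token));
        }
    }
}
```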
@ -61,7 +61,7 @@ digits, you could use the following stem:
|
||||
|
||||
@@#
|
||||
|
||||
A few examples of number skeletons are shown below. The list of available
|
||||
A few examples of number skeletons are shown below. The list of available
|
||||
stems and options can be found below in [Skeleton Stems and
|
||||
Options](#skeleton-stems-and-options).
|
||||
|
||||
@ -70,11 +70,11 @@ Options](#skeleton-stems-and-options).
|
||||
| Long Skeleton | Concise Skeleton | Input | en-US Output | Comments |
|
||||
|---|---|---|---|---|
|
||||
| `percent` | `%` | 25 | 25% |
|
||||
| `.00` | `.00` | 25 | 25.00 | Equivalent to Precision::fixedFraction(2) |
|
||||
| `.00` | `.00` | 25 | 25.00 | Equivalent to `Precision::fixedFraction(2)` |
|
||||
| `percent .00` | `% .00` | 25 | 25.00% |
|
||||
| `scale/100` | `scale/100` | 0.3 | 30 | Multiply by 100 before formatting |
|
||||
| `percent scale/100` | `%x100` | 0.3 | 30% |
|
||||
| `measure-unit/length-meter` | `unit/meter` | 5 | 5 m | UnitWidth defaults to Short |
|
||||
| `measure-unit/length-meter` | `unit/meter` | 5 | 5 m | `UnitWidth` defaults to `Short` |
|
||||
| `measure-unit/length-meter` <br/> `unit-width-full-name` | `unit/meter` <br/> `unit-width-full-name` | 5 | 5 meters |
|
||||
| `currency/CAD` | `currency/CAD` | 10 | CA$10.00 |
|
||||
| `currency/CAD` <br/> `unit-width-narrow` | `currency/CAD` <br/> `unit-width-narrow` | 10 | $10.00 | Use the narrow symbol variant |
|
||||
@ -92,8 +92,7 @@ Options](#skeleton-stems-and-options).
|
||||
|
||||
## Skeleton Stems and Options
|
||||
|
||||
The full set of features supported by number skeletons is listed by category
|
||||
below.
|
||||
The full set of features supported by number skeletons is listed by category below.
|
||||
|
||||
### Notation
|
||||
|
||||
@ -103,19 +102,17 @@ Use one of the following stems to select compact or simple notation:
|
||||
- `compact-long` or `KK` (concise)
|
||||
- `notation-simple` (or omit since this is default)
|
||||
|
||||
There are two ways to select scientific or engineering notation: using long-form
|
||||
syntax or concise syntax.
|
||||
There are two ways to select scientific or engineering notation: using long-form syntax or concise syntax.
|
||||
|
||||
#### Scientific and Engineering Notation: Long Form
|
||||
|
||||
Start with the stem `scientific` or `engineering`. Those stems take the
|
||||
following optional options:
|
||||
Start with the stem `scientific` or `engineering`. Those stems accept the following options, all of which may be omitted:
|
||||
|
||||
- `/sign-xxx` sets the sign display option for the exponent; see [Sign](#sign).
|
||||
- `/*ee` sets exponent digits to "at least 2"; use `/*eee` for at least 3 digits, etc.
|
||||
- ***Prior to ICU 67***, use `/+ee` instead of `/*ee`.
|
||||
|
||||
For example, all of the following skeletons are valid:
|
||||
For example, all the following skeletons are valid:
|
||||
|
||||
- `scientific`
|
||||
- `scientific/sign-always`
|
||||
@ -171,7 +168,7 @@ As with the `measure-unit` stem, pass the unit identifier as the option:
|
||||
|
||||
- `per-measure-unit/aaaa-bbbb`
|
||||
|
||||
Note that if the `unit` stem is used, the demonimator can be placed in the same
|
||||
Note that if the `unit` stem is used, the denominator can be placed in the same
|
||||
token as the numerator.
|
||||
|
||||
### Unit Width
|
||||
@ -185,17 +182,17 @@ The unit width can be specified by the following stems:
|
||||
- `unit-width-hidden`
|
||||
|
||||
For more details, see
|
||||
[UNumberUnitWidth](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
|
||||
[`UNumberUnitWidth`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
|
||||
|
||||
### Precision
|
||||
|
||||
The precision category has more blueprint stems than most other categories;
|
||||
they are documented in detail below. The following non-blueprint stems are
|
||||
they are documented in detail below. The following non-blueprint stems are
|
||||
accepted:
|
||||
|
||||
- `precision-integer` (round to the nearest integer) --- accepts fraction-precision options
|
||||
- `precision-unlimited` (do not perform rounding; display all digits)
|
||||
- `precision-increment/dddd` (round to *dddd*, a decimal number) --- see below
|
||||
- `precision-increment/dddd` (round to *`dddd`*, a decimal number) --- see below
|
||||
- `precision-currency-standard`
|
||||
- `precision-currency-cash`
|
||||
|
||||
@ -288,10 +285,10 @@ integer digits):
|
||||
| `integer-width/*` | - | Zero or more <br/> integer digits | `IntegerWidth::zeroFillTo(0)` |
|
||||
|
||||
The long-form option starts with either a single `*` symbol, signaling no limit
|
||||
on the number of integer digits (no *truncateAt*), or zero or more `#` symbols.
|
||||
on the number of integer digits (no *`truncateAt`*), or zero or more `#` symbols.
|
||||
It should then be followed by zero or more `0` symbols, indicating the minimum
|
||||
integer digits (the argument to *zeroFillTo*). If there is no `*` symbol, the
|
||||
maximum integer digits (the argument to *truncateAt*) is the number of `#`
|
||||
integer digits (the argument to *`zeroFillTo`*). If there is no `*` symbol, the
|
||||
maximum integer digits (the argument to *`truncateAt`*) is the number of `#`
|
||||
symbols plus the number of `0` symbols.
|
||||
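The long-form rule described above can be written down as a short sketch that derives the minimum and maximum integer digits from the option text. This is an illustration of the rule as stated here, not ICU's implementation; the class `IntegerWidthOption`, the method `parse()` and the use of `-1` for "no limit" are assumptions.

```java
public class IntegerWidthOption {
    // Returns {minimum integer digits, maximum integer digits}; a maximum of -1 means no limit.
    static int[] parse(String option) {
        int i = 0;
        boolean unlimited = false;
        int hashes = 0;
        if (option.startsWith("*")) {
            // A single "*" means there is no truncateAt limit.
            unlimited = true;
            i = 1;
        } else {
            while (i < option.length() && option.charAt(i) == '#') {
                hashes++;
                i++;
            }
        }
        int zeros = 0;
        while (i < option.length() && option.charAt(i) == '0') {
            zeros++;
            i++;
        }
        if (i != option.length()) {
            throw new IllegalArgumentException("Unexpected character in option: " + option);
        }
        // Without "*", the maximum is the number of '#' symbols plus the number of '0' symbols.
        return new int[] { zeros, unlimited ? -1 : hashes + zeros };
    }

    public static void main(String[] args) {
        int[] width = parse("##00");
        System.out.println(width[0] + " " + width[1]);
    }
}
```

For example, the option text `*000` corresponds to a minimum of 3 integer digits with no maximum, and `##00` to a minimum of 2 and a maximum of 4.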
|
||||
The concise skeleton is simply one or more `0` characters. This supports
|
||||
@ -305,14 +302,13 @@ To specify the scale, use the following stem and option:
|
||||
|
||||
- `scale/dddd`
|
||||
|
||||
where *dddd* is a decimal number. For example, the following are valid
|
||||
skeletons:
|
||||
where *`dddd`* is a decimal number. For example, the following are valid skeletons:
|
||||
|
||||
- `scale/100` (multiply by 100)
|
||||
- `scale/1E2` (same as above)
|
||||
- `scale/0.5` (multiply by 0.5)
|
||||
|
||||
The decimal number should conform to a standard decimal number syntax. In
|
||||
The decimal number should conform to a standard decimal number syntax. In
|
||||
C++, it is parsed using the decimal number library described in
|
||||
[LocalizedNumberFormatter::formatDecimal](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1number_1_1LocalizedNumberFormatter.html).
|
||||
In Java, it is parsed using
|
||||
@ -331,7 +327,7 @@ The grouping strategy can be specified by the following stems:
|
||||
- `group-thousands` (no concise equivalent)
|
||||
|
||||
For more details, see
|
||||
[UNumberGroupingStrategy](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
|
||||
[`UNumberGroupingStrategy`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
|
||||
|
||||
### Symbols
|
||||
|
||||
@ -340,7 +336,7 @@ The following stems are allowed for specifying the number symbols:
|
||||
- `latin` (use Latin-script digits)
|
||||
- `numbering-system/nnnn` (use the `nnnn` numbering system)
|
||||
|
||||
A custom NDecimalFormatSymbols instance is not supported at this time.
|
||||
A custom `DecimalFormatSymbols` instance is not supported at this time.
|
||||
|
||||
### Sign Display
|
||||
|
||||
@ -355,7 +351,7 @@ The following stems specify sign display:
|
||||
- `sign-accounting-except-zero` or `()?` (concise)
|
||||
|
||||
For more details, see
|
||||
[UNumberSignDisplay](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
|
||||
[`UNumberSignDisplay`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
|
||||
|
||||
### Decimal Separator Display
|
||||
|
||||
@ -365,4 +361,4 @@ The following stems specify decimal separator display:
|
||||
- `decimal-always`
|
||||
|
||||
For more details, see
|
||||
[UNumberDecimalSeparatorDisplay](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
|
||||
[`UNumberDecimalSeparatorDisplay`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
|
||||
|
@ -220,5 +220,5 @@ Any additional arguments will be passed to the program.
|
||||
|
||||
* `icurun test1.cpp *args...*`
|
||||
|
||||
*This feature is a work in progress. Please give feedback at [Ticket
|
||||
#8481](https://unicode-org.atlassian.net/browse/ICU-8481).*
|
||||
*Please give feedback to the [icu-support mailing list](http://site.icu-project.org/contacts),
|
||||
and refer to [Ticket #8481](https://unicode-org.atlassian.net/browse/ICU-8481).*
|
||||
|
@ -32,8 +32,8 @@ with the kinds of problems that typically arise during critical stages of the
|
||||
software life cycle.
|
||||
|
||||
In general, the standard process for creating globalized software includes
|
||||
"internationalization," which covers generic coding and design issues, and
|
||||
"localization," which involves translating and customizing a product for a
|
||||
"internationalization", which covers generic coding and design issues, and
|
||||
"localization", which involves translating and customizing a product for a
|
||||
specific market.
|
||||
|
||||
Software developers must understand the intricacies of internationalization
|
||||
@ -44,7 +44,7 @@ translated and customized. Therefore, software developers need to understand key
|
||||
localization concepts.
|
||||
|
||||
From a geographic perspective, a locale is a place. From a software perspective,
|
||||
a locale is an ID used to select information associated with a a language and/or
|
||||
a locale is an ID used to select information associated with a language and/or
|
||||
a place. ICU locale information includes the name and identifier of the spoken
|
||||
language, sorting and collating requirements, currency usage, numeric display
|
||||
preferences, and text direction (left-to-right or right-to-left, horizontal or
|
||||
@ -112,7 +112,7 @@ program code. Also ensure that the text which the program generates on the fly
|
||||
(such as numbers and dates) comes out in the right language. The text must be
|
||||
formatted correctly for the targeted user community.
|
||||
|
||||
Make sure that the analysis and manipulation of both text and kinds of data
|
||||
Make sure the analysis and manipulation of both text and other kinds of data
|
||||
(such as dates), is done in a manner that can be easily adapted for different
|
||||
languages and user communities. This includes tasks such as alphabetizing lists
|
||||
and looking for line-break positions.
|
||||
@ -162,25 +162,35 @@ work correctly in 75 languages and more than 200 locales. The key is
|
||||
understanding when, where, why, or how to use the APIs effectively.
|
||||
|
||||
The remainder of this section provides an overview of some cultural and hidden
|
||||
assumptions components. (See the Table of contents for a list of topics.)
|
||||
assumptions components. See a list of topics below:
|
||||
* [Numbers and Dates](#numbers-and-dates)
|
||||
* [Messages](#messages)
|
||||
* [Measuring Units](#measuring-units)
|
||||
* [Alphabetical Order of Characters](#alphabetical-order-of-characters)
|
||||
* [Characters](#characters)
|
||||
* [Text Input and Layout](#text-input-and-layout)
|
||||
* [Text Manipulation](#text-manipulation)
|
||||
* [Date/Time Formatting](#datetime-formatting)
|
||||
* [Distributed Locale Support](#distributed-locale-support)
|
||||
* [LayoutEngine](#layoutengine)
|
||||
|
||||
#### Numbers and Dates
|
||||
|
||||
Numbers and dates are represented in different languages. Do not implement
|
||||
routines for converting numbers into strings, and do not call low-level system
|
||||
interfaces like sprintf() that do not produce language-sensitive results.
|
||||
Instead, see how ICU's [NumberFormat](formatparse/numbers/index.md) and
|
||||
[DateFormat](formatparse/datetime/index.md) services can be used more
|
||||
interfaces like `sprintf()` that do not produce language-sensitive results.
|
||||
Instead, see how ICU's [NumberFormat](format_parse/numbers/index.md) and
|
||||
[DateFormat](format_parse/datetime/index.md) services can be used more
|
||||
effectively.
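
As a minimal sketch of the point above, the following formats the same value for two locales. It uses the JDK's `java.text.NumberFormat` as a stand-in so that it runs without extra JAR files. The class name `LocalizedNumbers` and the sample value are illustrative assumptions, not part of ICU.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class LocalizedNumbers {
    // Formats a value using the grouping and decimal conventions of the locale.
    static String format(double value, Locale locale) {
        return NumberFormat.getInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.891; // sample value, chosen for illustration
        System.out.println(format(value, Locale.US));      // 1,234,567.891
        System.out.println(format(value, Locale.GERMANY)); // 1.234.567,891
    }
}
```

A hand-written conversion routine would have to hard-code one of these two conventions; a locale-aware formatter picks the separators from locale data instead.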
|
||||
|
||||
#### Messages
|
||||
|
||||
Be careful when formulating assumptions about how individual pieces of text are
|
||||
used together to create a complete sentence (for example, when error messages
|
||||
are generated) . The elements might go together in a different order if the
|
||||
are generated). The elements might go together in a different order if the
|
||||
message is translated into a new language. ICU provides
|
||||
[MessageFormat](formatparse/messages/index.md) (§) and
|
||||
[ChoiceFormat](formatparse/messages/index.md) (§) to help with these
|
||||
[MessageFormat](format_parse/messages/index.md) (§) and
|
||||
[ChoiceFormat](format_parse/messages/index.md) (§) to help with these
|
||||
occurrences.
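
The reordering problem described above can be sketched with numbered placeholders. This example uses the JDK's `java.text.MessageFormat` as a stand-in. Both patterns and the class name `ReorderableMessage` are hypothetical; the second pattern merely plays the role of a translation that needs a different argument order.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class ReorderableMessage {
    // Fills a pattern whose numbered placeholders may appear in any order.
    static String render(String pattern, Object... args) {
        return new MessageFormat(pattern, Locale.ENGLISH).format(args);
    }

    public static void main(String[] args) {
        // The calling code passes the arguments in one fixed order ...
        System.out.println(render("{0} copied {1} files.", "Alice", 3));
        // ... while a (hypothetical) translated pattern is free to reorder them.
        System.out.println(render("{1} files were copied by {0}.", "Alice", 3));
    }
}
```

Because the order lives in the pattern rather than in string concatenation, a translator can change it without touching program code.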
|
||||
|
||||
> :point_right: **Note**: *There also might be situations where parts of the sentence change when other
|
||||
@ -195,7 +205,7 @@ the representation of $1,000 dollars. This amount can represent either U.S. or
|
||||
Canadian dollar values. US dollars can be displayed as USD while Canadian
|
||||
dollars can be displayed as CAD, depending on the locale. In this case, the
|
||||
displayed numerical quantity might change, and the number itself might also
|
||||
change. [NumberFormat](formatparse/numbers/index.md) provides some support for
|
||||
change. [NumberFormat](format_parse/numbers/index.md) provides some support for
|
||||
this.
|
||||
|
||||
#### Alphabetical Order of Characters
|
||||
@ -204,14 +214,14 @@ All languages (even those using the same alphabet) do not necessarily have the
|
||||
same concept of alphabetical order. Do not assume that alphabetical order is the
|
||||
same as the numerical order of the character's code-point values. In practice,
|
||||
'a' is distinct from 'A' and 'b' is distinct from 'B'. Each has a different code
|
||||
point . This means that you can not use a bit-wise lexical comparison (such as
|
||||
point. This means that you cannot use a bit-wise lexical comparison (such as
|
||||
what strcmp() provides), to sort user-visible lists.
|
||||
|
||||
Not all languages interpret the same characters as equivalent. If a character's
|
||||
case is changed it is not always a one-to-one mapping. Accent differences, the
|
||||
presence or absence of certain characters, and even spelling differences might
|
||||
be insignificant when determining whether two strings are equal. The[
|
||||
Collator](collation/index.md) services provide significant help in this area.
|
||||
be insignificant when determining whether two strings are equal. The
|
||||
[Collator](collation/index.md) services provide significant help in this area.
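
To illustrate the difference between code-value order and alphabetical order, here is a small sketch. It uses the JDK's `java.text.Collator` as a stand-in. The class name `LocaleAwareSort` and the word list are assumptions made for the example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class LocaleAwareSort {
    // Sorts by the numeric value of the UTF-16 code units (String's natural order).
    static List<String> byCodeUnitOrder(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Sorts using the collation rules of the given locale.
    static List<String> byCollator(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Zebra", "apple", "Banana");
        System.out.println(byCodeUnitOrder(words));            // [Banana, Zebra, apple]
        System.out.println(byCollator(words, Locale.ENGLISH)); // [apple, Banana, Zebra]
    }
}
```

The first result puts every uppercase word before every lowercase word, which is rarely what a user expects from a sorted list.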
|
||||
|
||||
#### Characters
|
||||
|
||||
@ -226,9 +236,9 @@ When checking characters for membership in a particular class, do not list the
|
||||
specific characters you are interested in, and do not assume they come in any
|
||||
particular order in the encoding scheme. For example, /A-Za-z/ does not mean all
|
||||
letters in most European languages, and /0-9/ does not mean all digits in many
|
||||
writing systems. This also holds true when using C interfaces such as isupper()
|
||||
and islower. ICU provides a large group of utility functions for testing
|
||||
character properties, such as u_isupper and u_islower().
|
||||
writing systems. This also holds true when using C interfaces such as `isupper()`
|
||||
and `islower()`. ICU provides a large group of utility functions for testing
|
||||
character properties, such as `u_isupper()` and `u_islower()`.
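
The following sketch contrasts hard-coded ranges with property-based checks. It uses the JDK's `java.lang.Character` methods as a stand-in for the ICU utility functions. The class and helper names are hypothetical.

```java
final class CharacterClasses {
    // Range checks that only recognize ASCII letters and digits.
    static boolean isAsciiUpper(char c) { return c >= 'A' && c <= 'Z'; }
    static boolean isAsciiDigit(char c) { return c >= '0' && c <= '9'; }

    public static void main(String[] args) {
        char eAcute = '\u00C9';      // LATIN CAPITAL LETTER E WITH ACUTE
        char arabicThree = '\u0663'; // ARABIC-INDIC DIGIT THREE

        System.out.println(isAsciiUpper(eAcute));               // false
        System.out.println(Character.isUpperCase(eAcute));      // true
        System.out.println(isAsciiDigit(arabicThree));          // false
        System.out.println(Character.isDigit(arabicThree));     // true
        System.out.println(Character.digit(arabicThree, 10));   // 3
    }
}
```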
|
||||
|
||||
#### Text Input and Layout
|
||||
|
||||
@ -243,7 +253,7 @@ characters and keystrokes all occur in real text in some languages.
|
||||
|
||||
Do not assume that all textual data, which the program stores and manipulates,
|
||||
is in any particular language or writing system. ICU provides many methods that
|
||||
help with text storage. The UnicodeString class and u_strxxx functions are
|
||||
help with text storage. The `UnicodeString` class and `u_strxxx` functions are
|
||||
provided for Unicode-based character manipulation. For example, when appending
|
||||
an existing Unicode character buffer, characters can be removed or extracted out
|
||||
of the buffer.
|
||||
@ -256,13 +266,10 @@ correctly process multi-lingual text such as this correctly.
|
||||
|
||||
Time can be determined in many units, such as the lengths of months or years,
|
||||
which day is the first day of the week, or the allowable range of values like
|
||||
month and year (with DateFormat). It can also determine the time zone you are in
|
||||
(with TimeZone), or when daylight-savings time starts. ICU provides the Calendar
|
||||
month and year (with `DateFormat`). It can also determine the time zone you are in
|
||||
(with `TimeZone`), or when daylight-savings time starts. ICU provides the Calendar
|
||||
services needed to handle these issues.
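
As one concrete case of the locale dependence described above, the first day of the week can be queried per locale. This sketch uses the JDK's `java.util.Calendar` as a stand-in for the ICU Calendar services. The class name `WeekStart` is an assumption.

```java
import java.util.Calendar;
import java.util.Locale;

final class WeekStart {
    // Returns the Calendar.DAY_OF_WEEK constant that starts the week in the locale.
    static int firstDayOfWeek(Locale locale) {
        return Calendar.getInstance(locale).getFirstDayOfWeek();
    }

    public static void main(String[] args) {
        System.out.println(firstDayOfWeek(Locale.US) == Calendar.SUNDAY);      // true
        System.out.println(firstDayOfWeek(Locale.GERMANY) == Calendar.MONDAY); // true
    }
}
```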
|
||||
|
||||
This example shows how a user interface element can be used to increment or
|
||||
decrement the time field value.
|
||||
|
||||
#### Distributed Locale Support
|
||||
|
||||
In most server applications, do not assume that all clients connected to the
|
||||
|
@ -26,30 +26,32 @@ Java SE 6 introduced a new feature which allows Java user code to extend locale
|
||||
support in Java runtime environment. JREs shipped by Oracle or IBM come with
|
||||
decent locale coverage, but some users may want more locale support. Java SE 6
|
||||
includes abstract classes extending
|
||||
[java.util.spi.LocaleServiceProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/LocaleServiceProvider.html).
|
||||
[`java.util.spi.LocaleServiceProvider`](http://download.oracle.com/javase/6/docs/api/java/util/spi/LocaleServiceProvider.html).
|
||||
Java SE 6 users can create subclasses of these abstract classes to supply their
|
||||
own locale support for text break, collation, date/number formatting or
|
||||
providing translations for currency, locale and time zone names.
|
||||
|
||||
ICU4J has been providing more comprehensive locale coverage than standard JREs.
|
||||
However, Java programmers have to use ICU4J's own internationalization service
|
||||
APIs (com.ibm.icu.\*) to utilize the rich locale support. Sometimes, the
|
||||
APIs (`com.ibm.icu.*`) to utilize the rich locale support. Sometimes, the
|
||||
migration is not an option for various reasons. For example, your code may
|
||||
depend on existing Java libraries utilizing JDK internationalization service
|
||||
APIs, but you have no access to the source code. In this case, it is not
|
||||
possible to modify the libraries to use ICU4J APIs.
|
||||
|
||||
ICU4J Locale Service Provider is a component consisting of classes implementing
|
||||
the Java SE 6 locale sensitive service provider interfaces. Available service
|
||||
providers are -
|
||||
providers are:
|
||||
|
||||
* [BreakIteratorProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/BreakIteratorProvider.html)
|
||||
* [CollatorProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/CollatorProvider.html)
|
||||
* [DateFormatProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/DateFormatProvider.html)
|
||||
* [DateFormatSymbolsProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/DateFormatSymbolsProvider.html)
|
||||
* [DecimalFormatSymbolsProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/DecimalFormatSymbolsProvider.html)
|
||||
* [NumberFormatProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/NumberFormatProvider.html)
|
||||
* [CurrencyNameProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/CurrencyNameProvider.html)
|
||||
* [LocaleNameProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/LocaleNameProvider.html)
|
||||
* [TimeZoneNameProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/TimeZoneNameProvider.html)
|
||||
* [`BreakIteratorProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/BreakIteratorProvider.html)
|
||||
* [`CollatorProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/CollatorProvider.html)
|
||||
* [`DateFormatProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/DateFormatProvider.html)
|
||||
* [`DateFormatSymbolsProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/DateFormatSymbolsProvider.html)
|
||||
* [`DecimalFormatSymbolsProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/DecimalFormatSymbolsProvider.html)
|
||||
* [`NumberFormatProvider`](http://download.oracle.com/javase/6/docs/api/java/text/spi/NumberFormatProvider.html)
|
||||
* [`CurrencyNameProvider`](http://download.oracle.com/javase/6/docs/api/java/util/spi/CurrencyNameProvider.html)
|
||||
* [`LocaleNameProvider`](http://download.oracle.com/javase/6/docs/api/java/util/spi/LocaleNameProvider.html)
|
||||
* [`TimeZoneNameProvider`](http://download.oracle.com/javase/6/docs/api/java/util/spi/TimeZoneNameProvider.html)
|
||||
|
||||
ICU4J Locale Service Provider is designed to work as installed extensions in a
|
||||
JRE. Once the component is configured properly, Java application running on the
|
||||
@ -65,14 +67,12 @@ optional package to extend the functionality of the Java core platform. To
|
||||
install an optional package, its JAR files must be placed in the Java extension
|
||||
directory. The standard location is *<java-home>/lib/ext*. You can alternatively
|
||||
use the system property *java.ext.dirs* to specify one or more locations where
|
||||
optional packages are installed. For example, if the JRE root directry is
|
||||
JAVA_HOME and you put ICU4J Locale Service Provider files in ICU_SPI_DIR, the
|
||||
ICU4J Locale Service Provider is enabled by the following command.
|
||||
optional packages are installed. For example, if the JRE root directory is
|
||||
*JAVA_HOME* and you put ICU4J Locale Service Provider files in *ICU_SPI_DIR*, the
|
||||
ICU4J Locale Service Provider is enabled by the following command:
|
||||
|
||||
java -Djava.ext.dirs=%JAVA_HOME%\\lib\\ext;%ICU_SPI_DIR% <your_java_app>
|
||||
\[Microsoft Windows\]
|
||||
java -Djava.ext.dirs=$JAVA_HOME/lib/ext:$ICU_SPI_DIR <your_java_app> \[Linux,
|
||||
Solaris and other unix like platforms\]
|
||||
* `java -Djava.ext.dirs=%JAVA_HOME%\lib\ext;%ICU_SPI_DIR% <your_java_app>` \[Microsoft Windows\]
|
||||
* `java -Djava.ext.dirs=$JAVA_HOME/lib/ext:$ICU_SPI_DIR <your_java_app>` \[Linux, Solaris and other Unix-like platforms\]
|
||||
|
||||
ICU4J's implementations of the Java SE 6 locale sensitive service provider
|
||||
interfaces and configuration files are packaged in a single JAR file
|
||||
@ -80,22 +80,22 @@ interfaces and configuration files are packaged in a single JAR file
|
||||
classes and data are in the ICU4J core JAR file (*icu4j-<version>.jar*). So you
|
||||
need to put the localespi JAR file along with the core JAR file in the Java
|
||||
extension directory.
|
||||
|
||||
Once the ICU4J Locale Service Provider is installed properly, factory methods in
|
||||
JDK internationalization classes look for the implementation provided by ICU4J
|
||||
when a requested locale is not supported by the JDK service class. For example,
|
||||
locale *af_ZA* (Afrikaans - South Africa) is not supported by JDK DateFormat in
|
||||
Oracle Java SE 6. The following code snippet returns an instance of DateFormat
|
||||
locale *af_ZA* (Afrikaans - South Africa) is not supported by JDK `DateFormat` in
|
||||
Oracle Java SE 6. The following code snippet returns an instance of `DateFormat`
|
||||
from ICU4J Locale Service Provider and prints out the current date localized for
|
||||
af_ZA.
|
||||
|
||||
DateFormat df = DateFormat.getDateInstance(DateFormat.LONG, new Locale("af",
|
||||
"ZA"));
|
||||
System.out.println(df.format(new Date()));
|
||||
DateFormat df = DateFormat.getDateInstance(DateFormat.LONG, new Locale("af", "ZA"));
|
||||
System.out.println(df.format(new Date()));
|
||||
|
||||
Sample output:
|
||||
|
||||
2008 Junie 19 \[With ICU4J Locale Service Provider enabled\]
|
||||
June 19, 2008 \[Without ICU4J Locale Service Provider\]
|
||||
* `2008 Junie 19` \[With ICU4J Locale Service Provider enabled\]
|
||||
* `June 19, 2008` \[Without ICU4J Locale Service Provider\]
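
For convenience, here is a self-contained version of the snippet above with the imports it needs. The class name `AfrikaansDate` is hypothetical. Which of the two sample outputs you see depends on the locale data available to the running JRE, so no fixed output is claimed here.

```java
import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;

final class AfrikaansDate {
    // Formats a date in the LONG style for the requested locale.
    static String longDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.LONG, locale).format(date);
    }

    public static void main(String[] args) {
        // Same locale as in the text: Afrikaans (South Africa).
        System.out.println(longDate(new Date(), new Locale("af", "ZA")));
    }
}
```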
|
||||
|
||||
## Optional Configuration
|
||||
|
||||
@ -108,10 +108,10 @@ configuration files from *META-INF/services* in the localespi JAR file. For
|
||||
example, if you do not want to use ICU's time zone name service at all, you can
|
||||
remove the file: *META-INF/services/java.util.spi.TimeZoneNameProvider* from the
|
||||
JAR file.
|
||||
**Note:** Disabling DateFormatSymbolsProvider/DecimalFormatSymbolsProvider won't
|
||||
affect the localized symbols actually used by
|
||||
DateFormatProvider/NumberFormatProvider by the current implementation. These
|
||||
services are implemented independently.
|
||||
|
||||
**Note:** Disabling `DateFormatSymbolsProvider/DecimalFormatSymbolsProvider` won't
|
||||
affect the localized symbols actually used by `DateFormatProvider/NumberFormatProvider`
|
||||
in the current implementation. These services are implemented independently.
|
||||
|
||||
### Configuring the behavior of ICU4J Locale Service Provider
|
||||
|
||||
@ -119,34 +119,10 @@ services are implemented independently.
|
||||
localespi JAR file is used for configuring the behavior of the ICU4J Locale
|
||||
Service Provider implementation. There are some configuration properties
|
||||
available. See the table below for each configuration in detail.
|
||||
**Property** **Value** **Default** **Description**
|
||||
com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIcuVariants "true" or
|
||||
"false" "true" Whether if Locales with ICU's variant suffix will be included in
|
||||
getAvailableLocales. The current Java SE 6 locale sensitive service does not
|
||||
allow user provided provider implementations to override locales supported by
|
||||
JRE itself. When this property is "true"(default), ICU4J Locale Service Provider
|
||||
includes Locales with the
|
||||
suffix(com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.icuVariantSuffix) in
|
||||
the variant field. For example, the ICU4J provider includes locales fr_FR and
|
||||
fr_FR_ICU4J in the available locale list. So JDK API user can still access the
|
||||
internationalization service object created by the ICU4J provider by the special
|
||||
locale fr_FR_ICU4J.
|
||||
com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.icuVariantSuffix *Any String*
|
||||
"ICU4J" (49 or later)
|
||||
"ICU" (before 49)
|
||||
Suffix string used in Locale's variant field to specify the ICU implementation.
|
||||
com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIso3Languages "true" or
|
||||
"false" "true" Whether if 3-letter language Locales are included in
|
||||
getAvailabeLocales. Use of 3-letter language codes in java.util.Locale is not
|
||||
supported by the API reference document. However, the implementation does not
|
||||
check the length of language code, so there is no practical problem with it.
|
||||
com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.useDecimalFormat "true" or
|
||||
"false" "false" Whether if java.text.DecimalFormat subclass is used for
|
||||
NumberFormat#getXXXInstance.
|
||||
DecimalFormat#format(Object,StringBuffer,FieldPosition) is declared as final, so
|
||||
ICU cannot override the implementation. As a result, some number types such as
|
||||
BigInteger/BigDecimal are not handled by the ICU implementation. If a client
|
||||
expects NumberFormat#getXXXInstance returns a DecimalFormat (for example, need
|
||||
to manipulate decimal format patterns), he/she can set true to this setting.
|
||||
However, in this case, BigInteger/BigDecimal support is not done by ICU's
|
||||
implementation.
|
||||
|
||||
|**Property**|**Value**|**Default**|**Description**|
|
||||
|:---|:---:|:---:|:---|
|
||||
|`com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIcuVariants`|`"true"` or `"false"`|`"true"`|Whether locales with ICU's variant suffix are included in `getAvailableLocales`. The current Java SE 6 locale sensitive service does not allow user-provided provider implementations to override locales supported by the JRE itself. When this property is `"true"` (default), ICU4J Locale Service Provider includes locales with the suffix (`com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.icuVariantSuffix`) in the variant field. For example, the ICU4J provider includes locales fr_FR and fr_FR_ICU4J in the available locale list, so a JDK API user can still access the internationalization service object created by the ICU4J provider through the special locale fr_FR_ICU4J.|
|
||||
|`com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.icuVariantSuffix`|*Any String*|`"ICU4J"` (49 or later) `"ICU"` (before 49)|Suffix string used in Locale's variant field to specify the ICU implementation.|
|
||||
|`com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIso3Languages`|`"true"` or `"false"`|`"true"`|Whether 3-letter language locales are included in `getAvailableLocales`. Use of 3-letter language codes in `java.util.Locale` is not supported by the API reference document. However, the implementation does not check the length of the language code, so there is no practical problem with it.|
|
||||
|`com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.useDecimalFormat`|`"true"` or `"false"`|`"false"`|Whether a `java.text.DecimalFormat` subclass is used for `NumberFormat#getXXXInstance`. `DecimalFormat#format(Object,StringBuffer,FieldPosition)` is declared as final, so ICU cannot override the implementation. As a result, some number types such as `BigInteger`/`BigDecimal` are not handled by the ICU implementation. If a client expects `NumberFormat#getXXXInstance` to return a `DecimalFormat` (for example, to manipulate decimal format patterns), this property can be set to `"true"`. However, in this case, `BigInteger`/`BigDecimal` support is not provided by ICU's implementation.|
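
As a rough illustration of how the keys in the table could appear in a properties file, the sketch below parses sample text with `java.util.Properties` and applies the documented defaults. Several things here are assumptions: that values are written without quotes, the sample content itself, and the class `ProviderConfigSketch`. It is not the code ICU4J uses to read its configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ProviderConfigSketch {
    // Common prefix of the property keys listed in the table.
    static final String PREFIX = "com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.";

    // Parses properties-file text; the text is sample data, not the shipped file.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Reads a boolean property, falling back to the given default when absent.
    static boolean flag(Properties props, String name, boolean defaultValue) {
        String raw = props.getProperty(PREFIX + name, String.valueOf(defaultValue));
        return Boolean.parseBoolean(raw);
    }

    public static void main(String[] args) {
        Properties props = parse(
            PREFIX + "enableIcuVariants = false\n"
            + PREFIX + "useDecimalFormat = true\n");
        System.out.println(flag(props, "enableIcuVariants", true));   // false (overridden)
        System.out.println(flag(props, "enableIso3Languages", true)); // true (default)
        System.out.println(flag(props, "useDecimalFormat", false));   // true (overridden)
    }
}
```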
|
||||
|
@ -132,16 +132,16 @@ happens only once per process, at the time an ICU data item is first requested.
|
||||
|
||||
3. Dynamically load (memory map, typically) a common format (.dat) file
|
||||
containing the default ICU data. Loading is described in the section
|
||||
[How Data Loading Works](icudata.md#how-data-loading-works). The path to
|
||||
[How Data Loading Works](icudata#how-data-loading-works). The path to
|
||||
the data is of the form "icudt\<version\>\<flag\>", where \<version\> is
|
||||
the two-digit ICU version number, and \<flag\> is a letter indicating the
|
||||
internal format of the file (see the
|
||||
[Sharing ICU Data Between Platforms](icudata.md#sharing-icu-data-between-platforms)
|
||||
[Sharing ICU Data Between Platforms](icudata#sharing-icu-data-between-platforms)
|
||||
section).
|
||||
|
||||
Once the default ICU data has been located, loading of individual data items
|
||||
proceeds as described in the section
|
||||
[How Data Loading Works](icudata.md#how-data-loading-works).
|
||||
[How Data Loading Works](icudata#how-data-loading-works).
|
||||
|
||||
## Building and Linking against ICU data
|
||||
|
||||
@ -252,7 +252,7 @@ and set with `udata_setCommonData()` or `udata_setAppData()`, or they must be
|
||||
in a known location that is included in the ICU data directory string. This
|
||||
requires the application installer, or the application itself at runtime, to
|
||||
locate the ICU and/or application data by setting the ICU data directory (see
|
||||
the [ICU Data Directory](icudata.md#icu-data-directory) section above) or by
|
||||
the [ICU Data Directory](icudata#icu-data-directory) section above) or by
|
||||
loading the data and providing it to one of the `udata_setXYZData()` functions.
|
||||
|
||||
Unlike shared libraries, .dat package files can be taken apart into separate
|
||||
@ -261,7 +261,7 @@ modification of a package file. The `gencmn` and `pkgdata` ICU tools can then be
|
||||
used to reassemble the .dat package file.
|
||||
|
||||
For more information about .dat package files see the section [Sharing ICU Data
|
||||
Between Platforms](icudata.md#sharing-icu-data-between-platforms) below.
|
||||
Between Platforms](icudata#sharing-icu-data-between-platforms) below.
|
||||
|
||||
### Data Overriding vs. Loading Performance
|
||||
|
||||
@ -275,7 +275,7 @@ all directories and matching .dat files mentioned before checking in
|
||||
already-loaded package files. This allows overriding of packaged data items with
|
||||
single files after installation but costs some time for filesystem accesses.
|
||||
This is usually done only once per data item; see
|
||||
[User Data Caching](icudata.md#user-data-caching) below.
|
||||
[User Data Caching](icudata#user-data-caching) below.
|
||||
|
||||
### Single Data Files vs. Packages
|
||||
|
||||
@ -311,7 +311,7 @@ are not handled here.)
|
||||
|
||||
> :point_right: **Note**: The ICU data directory string itself may
|
||||
contain multiple directories and path/filenames to .dat package files. See the
|
||||
[ICU Data Directory](icudata.md#icu-data-directory) section.
|
||||
[ICU Data Directory](icudata#icu-data-directory) section.
|
||||
|
||||
It is recommended to not include the directory in the path argument but to make
|
||||
sure via setting the application data or the ICU data directory string that the
|
||||
@ -403,8 +403,8 @@ There is no mechanism for removing or updating cached data files.
|
||||
## Directory Separator Characters
|
||||
|
||||
If a directory separator (generally '/' or '\\') is needed in a path parameter,
|
||||
use the form that is native to the platform. The ICU header "putil.h" defines
|
||||
U_FILE_SEP_CHAR appropriately for the platform.
|
||||
use the form that is native to the platform. The ICU header `"putil.h"` defines
|
||||
`U_FILE_SEP_CHAR` appropriately for the platform.
|
||||
|
||||
> :point_right: **Note**: On Windows, the directory separator must be '\\' for
|
||||
any paths passed to ICU APIs. This is different from native Windows APIs, which
|
||||
@ -584,7 +584,7 @@ above. The ICU project provides a large number of additional locales in its
|
||||
[locale
|
||||
repository](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/locales/)
|
||||
on the web. Most of this locale data is derived from the CLDR ([Common Locale
|
||||
Data Repository](http://www.unicode.org/cldr/) ) project.
|
||||
Data Repository](http://www.unicode.org/cldr/)) project.
|
||||
|
||||
Dropping the txt file into the correct place in the source tree is sufficient to
|
||||
add it to your ICU build. You will need to re-configure in order to pick it up.
|
||||
@ -768,7 +768,7 @@ loadable data objects.)
|
||||
#### ICU.dat package files
|
||||
* Source format: (list of files provided as input to the icupkg tool, or
|
||||
on the gencmn tool command line)
|
||||
* Binary format: .dat: [source/tools/toolutil/pkg_gencmn.c](../../icu4c/source/tools/toolutil/pkg_gencmn.cpp)
|
||||
* Binary format: .dat: [source/tools/toolutil/pkg_gencmn.cpp](../../icu4c/source/tools/toolutil/pkg_gencmn.cpp)
|
||||
* Generator tool: [icupkg](../../icu4c/source/tools/icupkg) or
|
||||
[gencmn](../../icu4c/source/tools/gencmn)
|
||||
|
||||
|
@ -62,7 +62,7 @@ Time Zone Data Version: 2011g
|
||||
#### I'm using ICU4J X, but planning to upgrade ICU4J version to X+1 soon. What should I do for the migration?
|
||||
|
||||
See the user guide section
|
||||
[Version Numbers in ICU](../design.md#version-numbers-in-icu)
|
||||
[Version Numbers in ICU](../design#version-numbers-in-icu)
|
||||
for the details about the meaning of the version number parts and how the ICU
|
||||
version number changes.
|
||||
|
||||
@ -70,7 +70,7 @@ In general, two different reference releases are not binary compatible (i.e.
|
||||
drop-in jar file replacement would not work). To use a new reference version of
|
||||
ICU4J, you should rebuild your application with the new ICU4J library. ICU
|
||||
project has the
|
||||
[API compatibility policy](../design.md#icu-api-compatibility)
|
||||
[API compatibility policy](../design#icu-api-compatibility)
|
||||
long as you're using ICU APIs marked as @stable in the API reference
|
||||
documentation, your application should successfully compile with the new
|
||||
reference version of ICU4J library without any source code modifications. (Note:
|
||||
|
@ -88,7 +88,7 @@ your topic.
|
||||
#### What is the ICU binary compatibility policy?
|
||||
|
||||
Please see the section on
|
||||
[binary compatibility](../design.md#icu-binary-compatibility-using-icu-as-an-operating-system-level-library)
|
||||
[binary compatibility](../design#icu-binary-compatibility)
|
||||
in the [design chapter](../design.md).
|
||||
|
||||
#### How is ICU licensed?
|
||||
@ -120,7 +120,7 @@ upgrade-friendly.
|
||||
|
||||
* **API:** ensure that you are not using draft APIs which may have changed in
|
||||
a future release. See the section on
|
||||
[API compatibility](../design.md#icu-api-compatibility) in the
|
||||
[API compatibility](../design#icu-api-compatibility) in the
|
||||
[design chapter](../design.md).
|
||||
* **Unicode:** See the release notes for particular versions of Unicode to
|
||||
ensure that your code is not affected by property changes or other
|
||||
@ -133,7 +133,7 @@ upgrade-friendly.
|
||||
currencies, types of calendars
|
||||
* **Building/Deploying your Application (ICU4C):** ICU4C usually builds with
|
||||
symbol renaming (See:
|
||||
[binary compatibility](../design.md#icu-binary-compatibility-using-icu-as-an-operating-system-level-library)
|
||||
[binary compatibility](../design#icu-binary-compatibility)
|
||||
in the [design chapter](../design.md)). Be sure that you build your
|
||||
application with the updated ICU header files, so that it will link against
|
||||
the current ICU. Also, don't hard-code the names of ICU libraries in your
|
||||
@ -185,7 +185,7 @@ Support List](http://icu-project.org/contacts.html) .
|
||||
|
||||
Use the [Data Customizer](https://unicode-org.atlassian.net/browse/ICU-12835)
|
||||
or see
|
||||
[Customizing ICU's Data Library](../icudata.md#customizing-icus-data-library)
|
||||
[Customizing ICU's Data Library](../icudata#customizing-icus-data-library)
|
||||
in the [ICU Data Management](../icudata.md) chapter of this User's Guide.
|
||||
|
||||
#### Why am I seeing a small ( only a few K ) instead of a large ( several megabytes ) data shared library (icudt)?
|
||||
@ -203,7 +203,7 @@ Studio.
|
||||
|
||||
#### Can I add or remove a converter from ICU?
|
||||
|
||||
Yes. Please see [Customizing ICU's Data Library](../icudata.md#customizing-icus-data-library)
|
||||
Yes. Please see [Customizing ICU's Data Library](../icudata#customizing-icus-data-library)
|
||||
in the [ICU Data Management](../icudata.md) chapter of this User's Guide. You can also
|
||||
get extra converters from <http://www.icu-project.org/charts/charset/> or use
|
||||
the [ICU Data Customizer](https://unicode-org.atlassian.net/browse/ICU-12835)
|
||||
@ -268,7 +268,7 @@ ICU4C (ICU) is written in C and C++, and ICU4J is written in Java™.
|
||||
|
||||
#### How are the APIs documented for deprecation?
|
||||
|
||||
Please read the [ICU API compatibility](../design.md#icu-api-compatibility)
|
||||
Please read the [ICU API compatibility](../design#icu-api-compatibility)
|
||||
section in the [ICU Design](../design.md) chapter.
|
||||
|
||||
#### What version of Unicode standard does ICU support?
|
||||
|
@ -96,7 +96,7 @@ Download ICU in one of the following ways:
|
||||
1. From the download page, <http://www.icu-project.org/download/>, for
|
||||
packaged stable releases of ICU.
|
||||
|
||||
2. From the source code repository, <http://www.icu-project.org/repository/>,
|
||||
2. From the source code repository, <https://github.com/unicode-org/icu/releases/>,
|
||||
for the latest development versions.
|
||||
|
||||
After downloading, see the included README file for information on what is
|
||||
@ -106,4 +106,4 @@ included, building, installing, etc.
|
||||
|
||||
Current license: <https://github.com/unicode-org/icu/blob/master/icu4c/LICENSE>
|
||||
|
||||
See also <https://github.com/unicode-org/icu/blob/userguide-migration/docs/userguide/icufaq/index.md#how-is-the-icu-licensed>
|
||||
See also the FAQ section ["How is ICU licensed?"](icufaq/index#how-is-icu-licensed)
|
||||
|
@ -126,7 +126,7 @@ Features:
|
||||
|
||||
Defined at: [icu-docs/master/design/bnf_rb.txt](https://raw.githubusercontent.com/unicode-org/icu-docs/master/design/bnf_rb.txt)
|
||||
|
||||
To use with ICU4C, see the [Resource Bundle APIs](resources.md#resource-bundle-apis) section of this userguide.
|
||||
To use with ICU4C, see the [Resource Bundle APIs](resources#resource-bundle-apis) section of this userguide.
|
||||
|
||||
Example: (`de.txt`)
|
||||
|
||||
@ -145,7 +145,7 @@ capabilities of the .txt and binary ICU4C resource bundles in XML form. However,
|
||||
we have decided to drop the format for lack of use and instead adopt standard
|
||||
XLIFF format for localization. For more information on XLIFF format, see the
|
||||
following section. For examples on using ICU tools to produce and read XLIFF
|
||||
format see the XLIFF Usage (§) section in the [resource management chapter](resources.md#using-xliff-for-localization).
|
||||
format see the XLIFF Usage section in the [resource management chapter](resources#using-xliff-for-localization).
|
||||
|
||||
## XLIFF
|
||||
|
||||
@ -203,7 +203,7 @@ http://www.oasis-open.org/committees/xliff/documents/xliff-core-1.1.xsd'>
|
||||
```
|
||||
|
||||
For examples on using ICU tools to produce and read XLIFF format see the XLIFF
|
||||
Usage (§) section in the [resource management chapter](resources.md#using-xliff-for-localization).
|
||||
Usage section in the [resource management chapter](resources#using-xliff-for-localization).
|
||||
|
||||
## DITA
|
||||
|
||||
|
@ -44,7 +44,7 @@ with -DU_STATIC_IMPLEMENTATION. Also see [How To Use ICU](../howtouseicu.md).
|
||||
|
||||
### Reduce the number of libraries used
|
||||
|
||||
ICU consists of a number of different libraries. The library dependency chart in the [Design](../design.md#Library_Dependencies_C)
|
||||
ICU consists of a number of different libraries. The library dependency chart in the [Design](../design#library-dependencies-c)
|
||||
chapter can be used to understand and
|
||||
determine the exact set of libraries needed.
|
||||
|
||||
@ -207,5 +207,5 @@ DLLs will be copied with names such as 'icuuc55.dll'.
|
||||
The services which are now known as ICU were written to provide operating
|
||||
system-level and application environment-level services. Several operating
|
||||
systems include ICU as a standard or optional package.
|
||||
See [ICU Binary Compatibility](../design.md#ICU_Binary_Compatibility) for
|
||||
See [ICU Binary Compatibility](../design#icu-binary-compatibility) for
|
||||
more details.
|
||||
|
@ -45,7 +45,7 @@ in the future.
|
||||
|
||||
#### ISO C
|
||||
|
||||
The ISO C standard provides two basic character types (char and wchar_t) and
|
||||
The ISO C standard provides two basic character types (`char` and `wchar_t`) and
|
||||
defines strings as arrays of units of these types. The standard allows nearly
|
||||
arbitrary character and string character sets and encodings, which was necessary
|
||||
when there was no single character set that worked everywhere.
|
||||
@ -80,7 +80,7 @@ character and string literals.
|
||||
few of the string encodings used with it are documented.
|
||||
|
||||
5. See also [What size wchar_t do I need for
|
||||
Unicode?](http://icu-project.org/docs/papers/unicode_wchar_t.html) .
|
||||
Unicode?](http://icu-project.org/docs/papers/unicode_wchar_t.html)
|
||||
|
||||
6. A program based on this model must be recompiled for each platform. Usually,
|
||||
it must be recompiled for each supported language or family of languages.
|
||||
@ -97,26 +97,24 @@ character and string literals.
|
||||
ICU always processes Unicode text. Unicode covers all languages and allows safe
|
||||
hard coding of character codes, in addition to providing many standard or
|
||||
recommended algorithms and a lot of useful character property data. See the
|
||||
chapters about [Unicode Basics](unicode.md) and [Strings](strings/index.md) and
|
||||
others.
|
||||
chapters about [Unicode Basics](unicode.md) and [Strings](strings/index.md) and others.
|
||||
|
||||
ICU uses the 16-bit encoding form of Unicode (UTF-16) for processing, making it
|
||||
fully interoperable with most Unicode-aware software. (See [UTF-16 for
|
||||
Processing](http://www.unicode.org/notes/tn12/) .) In the case of ICU4J, this is
|
||||
fully interoperable with most Unicode-aware software. See [UTF-16 for
|
||||
Processing](http://www.unicode.org/notes/tn12/). In the case of ICU4J, this is
|
||||
naturally the case because the Java language and the JDK use UTF-16.
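The practical effect of UTF-16 processing can be sketched with plain JDK `String` methods. This is a minimal illustration, not ICU4J code; the class name `Utf16Demo` is hypothetical. It shows that a supplementary code point occupies two 16-bit code units, so the code unit count and the code point count of a string can differ.

```java
public class Utf16Demo {
    // U+1F600 lies outside the Basic Multilingual Plane,
    // so UTF-16 stores it as a surrogate pair (two code units).
    static final String TEXT = "a" + new String(Character.toChars(0x1F600));

    // Number of 16-bit UTF-16 code units in the string.
    static int codeUnits() {
        return TEXT.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints() {
        return TEXT.codePointCount(0, TEXT.length());
    }

    public static void main(String[] args) {
        System.out.println("code units:  " + codeUnits());
        System.out.println("code points: " + codePoints());
    }
}
```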
|
||||
|
||||
ICU uses and/or provides direct access to all of the [Unicode
|
||||
properties](strings/properties.md) which provide a much finer-grained
|
||||
classification of characters than [C/POSIX character
|
||||
classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html)
|
||||
.
|
||||
classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html).
|
||||
|
||||
In C/C++ source code character and string literals, ICU uses only "invariant"
|
||||
characters. They are the subset of graphic ASCII characters that are almost
|
||||
always encoded with the same byte values on all systems. (One set of byte values
|
||||
for ASCII-based systems, and another such set of byte values for EBCDIC
|
||||
systems.) See
|
||||
[utypes.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utypes.h)
|
||||
[`utypes.h`](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utypes.h)
|
||||
for the set of "invariant" characters.
|
||||
|
||||
With the use of Unicode, the implementation of many of the Unicode standard
|
||||
@ -127,7 +125,7 @@ portable, and reliable text processing.
|
||||
|
||||
#### ISO C
|
||||
|
||||
The standard C functions tolower(), towupper(), etc. take and return one
|
||||
The standard C functions `tolower()`, `toupper()`, etc. take and return one
|
||||
character code each.
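The one-code-in, one-code-out shape of these functions can be illustrated in Java, whose `Character.toUpperCase(char)` has the same limitation. This is a sketch using only JDK calls as a stand-in for the C functions; the class name `CaseMappingDemo` is hypothetical. The German sharp s has no single-character uppercase form in the simple case mappings, so only a string-level mapping can expand it.

```java
import java.util.Locale;

public class CaseMappingDemo {
    // Per-character mapping: one code in, one code out.
    static char perCharacter(char c) {
        return Character.toUpperCase(c);
    }

    // String-level mapping: the result may be longer than the input.
    static String fullString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        char sharpS = '\u00DF'; // LATIN SMALL LETTER SHARP S
        // The per-character call cannot expand to two letters,
        // so the character comes back unchanged.
        System.out.println(perCharacter(sharpS) == sharpS);
        // The string-level call applies the full mapping.
        System.out.println(fullString("stra\u00DFe"));
    }
}
```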
|
||||
|
||||
#### Problems
|
||||
@ -201,9 +199,8 @@ locale's respective character class.
|
||||
this repertoire is not possible with such an implementation.
|
||||
|
||||
7. For more about the problems with POSIX character classes in a Unicode
|
||||
context see [Annex C: Compatibility Properties in Unicode Technical Standard
|
||||
#18: Unicode Regular
|
||||
Expressions](http://www.unicode.org/reports/tr18/#Compatibility_Properties)
|
||||
context see [Annex C: Compatibility Properties in Unicode
|
||||
Technical Standard #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/#Compatibility_Properties)
|
||||
and see the mailing list archives for the unicode list (on unicode.org). See
|
||||
also the ICU design document about [C/POSIX character
|
||||
classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html).
|
||||
@ -212,8 +209,8 @@ locale's respective character class.
|
||||
|
||||
ICU provides locale-independent access to all [Unicode
|
||||
properties](strings/properties.md) (except Unihan.txt properties), as well as to
|
||||
the POSIX character classes, via functions defined in uchar.h and in ICU4J's
|
||||
UCharacter class (see API references) as well as via UnicodeSet. The POSIX
|
||||
the POSIX character classes, via functions defined in `uchar.h` and in ICU4J's
|
||||
`UCharacter` class (see API references) as well as via `UnicodeSet`. The POSIX
|
||||
character classes are implemented according to the recommendations in UTS #18.
|
||||
|
||||
The Unicode Character Database defines more than 70 character properties, their
|
||||
@ -228,10 +225,10 @@ of Unicode properties.
|
||||
|
||||
#### POSIX
|
||||
|
||||
The strfmon() function is used to format monetary values. The default format and
|
||||
The `strfmon()` function is used to format monetary values. The default format and
|
||||
the currency display symbol or display name are selected by the LC_MONETARY
|
||||
locale ID. The number formatting can also be controlled with a formatting string
|
||||
resembling what printf() uses.
|
||||
resembling what `printf()` uses.
|
||||
|
||||
#### Problems
|
||||
|
||||
@ -244,7 +241,7 @@ resembling what printf() uses.
|
||||
2. Using a single locale ID for the whole format makes it very difficult to
|
||||
format values for multiple currencies with the same number format (for
|
||||
example, for an exchange rate list or for showing the price of an item
|
||||
adjusted for several currencies). strfmon() allows to specify the number
|
||||
   adjusted for several currencies). `strfmon()` lets you specify the number
|
||||
format fully, but then the application cannot use a country's default number
|
||||
format.
|
||||
|
||||
@ -258,4 +255,4 @@ resembling what printf() uses.
|
||||
ICU number formatting APIs have separate, orthogonal settings for the number
|
||||
format, which can be selected with a locale ID, and the currency, which is
|
||||
specified with an ISO code. See the [Formatting
|
||||
Numbers](formatparse/numbers/index.md) chapter for details.
|
||||
Numbers](format_parse/numbers/index.md) chapter for details.
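The separation of number format and currency can be sketched as follows. The example uses the JDK `java.text.NumberFormat` class as a stand-in, on the assumption that ICU4J's `com.ibm.icu.text.NumberFormat` is used in a similar way; the class name `CurrencyDemo` and the helper `format` are hypothetical. The exact currency symbol in the output depends on the locale data shipped with the runtime, so no output is shown.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

public class CurrencyDemo {
    // The locale selects grouping and decimal conventions;
    // the ISO 4217 code selects the currency independently.
    static String format(Locale numberLocale, String isoCurrency, double amount) {
        NumberFormat nf = NumberFormat.getCurrencyInstance(numberLocale);
        nf.setCurrency(Currency.getInstance(isoCurrency));
        return nf.format(amount);
    }

    public static void main(String[] args) {
        // German number conventions, US dollars.
        System.out.println(format(Locale.GERMANY, "USD", 1234.56));
        // US number conventions, euros.
        System.out.println(format(Locale.US, "EUR", 1234.56));
    }
}
```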
|
||||
|
@ -35,17 +35,17 @@ ICU provides basic Unicode support for the following:
|
||||
* [Unicode strings](strings/index.md)
|
||||
|
||||
ICU includes type definitions for UTF-16 strings and code points. It also
|
||||
contains many C u_string functions and the C++ UnicodeString class with many
|
||||
contains many C `u_string` functions and the C++ `UnicodeString` class with many
|
||||
additional string functions.
|
||||
|
||||
* [Unicode properties](strings/properties.md)
|
||||
|
||||
ICU includes the C definitions and functions found in uchar.h as well as
|
||||
some macros found in utf.h. It also includes the C++ Unicode class.
|
||||
ICU includes the C definitions and functions found in `uchar.h` as well as
|
||||
some macros found in `utf.h`. It also includes the C++ Unicode class.
|
||||
|
||||
* [Unicode string iteration](strings/characteriterator.md)
|
||||
|
||||
In C, ICU uses the macros in utf.h for the iteration of strings. In C++, ICU
|
||||
In C, ICU uses the macros in `utf.h` for the iteration of strings. In C++, ICU
|
||||
  uses the `CharacterIterator` class and its subclasses.
|
||||
|
||||
### Conversion Basics
|
||||
@ -70,15 +70,15 @@ is an abstract concept that is typically expressed by one of the following:
|
||||
A locale ID specifies a language and region enabling the software to support
|
||||
culturally and linguistically appropriate information for each user. A locale
|
||||
object represents a specific geographical, political, or cultural region. As a
|
||||
programmatic expression of locale IDs, ICU provides the C++ locale class. In C,
|
||||
Application Programming Interfaces (APIs) use simple C strings for locale IDs.
|
||||
programmatic expression of locale IDs, ICU provides the C++ `Locale` class. In C,
|
||||
Application Programming Interfaces (APIs) use simple C strings for locale IDs.
|
||||
|
||||
ICU stores locale-specific data in resource bundles, which provide a general
|
||||
mechanism to access strings and other objects for ICU services to perform
|
||||
according to locale conventions. ICU contains data for its services to support
|
||||
many locales. Resource bundles contain the locale data of applications that use
|
||||
ICU. In C++, the **ResourceBundle** implements the locale data. In C, this
|
||||
feature is provided by the **ures_** interface.
|
||||
ICU. In C++, the `ResourceBundle` class implements the locale data. In C, this
|
||||
feature is provided by the `ures_` interface.
|
||||
|
||||
In addition to storing system-level data in ICU's resource bundles, applications
|
||||
typically also need to use resource bundles of their own to store
|
||||
@ -91,7 +91,7 @@ separate file instead of being embedded in the lines of the program.*
|
||||
### Locales and Services
|
||||
|
||||
The interaction between locales and services is fundamental to ICU. Please refer
|
||||
to [Locales and Services](./locale/index.md#Locales_and_Services).
|
||||
to [Locales and Services](./locale/index#locales-and-services).
|
||||
|
||||
### Transliteration
|
||||
|
||||
@ -105,10 +105,12 @@ each step of the process can use a UnicodeSet to restrict the characters that
|
||||
are affected. There are two basic types of transliterators:
|
||||
|
||||
Most natural language transliterators (such as Greek-Latin) are written as
|
||||
rule-based transliterators. Transliterators can be written as text files using a
|
||||
rule-based transliterators.
|
||||
|
||||
Transliterators can be written as text files using a
|
||||
simple language that is similar to regular expression syntax.
|
||||
|
||||
### Date and Time Classes
|
||||
### `Date` and `Time` Classes
|
||||
|
||||
Date and time routines manage independent date and time functions in
|
||||
milliseconds since January 1, 1970 (0:00:00.000 UTC). Points in time before then
|
||||
@ -117,19 +119,21 @@ are represented as negative numbers.
|
||||
ICU provides the following [classes](datetime/index.md) to support calendars and
|
||||
time zones:
|
||||
|
||||
* [Calendar](datetime/calendar/index.md#calendar)
|
||||
The abstract superclass for extracting calendar-related attributes from a
|
||||
Date value.
|
||||
* [`Calendar`](datetime/calendar/index#calendar)
|
||||
|
||||
The abstract superclass for extracting calendar-related attributes from a `Date` value.
|
||||
|
||||
* [Gregorian Calendar](datetime/calendar/index.md#gregorian-calendar)
|
||||
* [`GregorianCalendar`](datetime/calendar/index#gregoriancalendar)
|
||||
|
||||
A concrete class for representing a Gregorian calendar.
|
||||
|
||||
* [TimeZone](datetime/timezone/index.md)
|
||||
* [`TimeZone`](datetime/timezone/index.md)
|
||||
|
||||
An abstract superclass for representing a time zone.
|
||||
|
||||
* [SimpleTimeZone](datetime/timezone/index.md)
|
||||
A concrete class for representing a time zone for use with a Gregorian
|
||||
calendar.
|
||||
* [`SimpleTimeZone`](datetime/timezone/index.md)
|
||||
|
||||
A concrete class for representing a time zone for use with a Gregorian calendar.
|
||||
|
||||
> :point_right: **Note**: *The C APIs provide the same functionality as the C++ classes with the exception
|
||||
of subclassing.*
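The millisecond time scale and the calendar classes above can be sketched with the JDK's `java.util` classes, which share names with the classes listed here. This assumes the ICU4J equivalents behave the same way for this simple case; the class name `EpochDemo` and its helpers are hypothetical.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class EpochDemo {
    // Interpret a millisecond count relative to 1970-01-01T00:00:00.000 UTC.
    static GregorianCalendar utcCalendar(long millis) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(millis);
        return cal;
    }

    // Extract a single calendar field from a point in time.
    static int yearOf(long millis) {
        return utcCalendar(millis).get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        // Zero is the epoch itself.
        System.out.println(yearOf(0L));
        // Negative values are points in time before the epoch.
        System.out.println(yearOf(-1L));
    }
}
```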
|
||||
@ -152,55 +156,60 @@ numbers, formatting dates and times, and formatting messages:
|
||||
|
||||
#### General Formatting
|
||||
|
||||
See [Formatting and Parsing Classes](formatparse/index.md#formatting-and-parsing-classes) for an introduction to the following:
|
||||
See [Formatting and Parsing Classes](format_parse/index#formatting-and-parsing-classes) for an introduction to the following:
|
||||
|
||||
* Format
|
||||
* FieldPosition
|
||||
* ParsePosition
|
||||
* Formattable
|
||||
* `Format`
|
||||
* `FieldPosition`
|
||||
* `ParsePosition`
|
||||
* `Formattable`
|
||||
|
||||
#### Formatting Numbers
|
||||
|
||||
* [NumberFormat](formatparse/numbers/index.md#numberformat)
|
||||
* [`NumberFormat`](format_parse/numbers/index#formatting-numbers)
|
||||
  `NumberFormat` provides the basic fields and methods to format number objects
|
||||
and number primitives into localized strings and parse localized strings to
|
||||
number objects.
|
||||
|
||||
* [DecimalFormat](formatparse/numbers/index.md#decimalformat)
|
||||
* [`DecimalFormat`](format_parse/numbers/index#decimalformat)
|
||||
  `DecimalFormat` provides the methods used to format number objects and number
|
||||
primitives into localized strings and parse localized strings into number
|
||||
objects in base 10.
|
||||
|
||||
* [DecimalFormatSymbols](formatparse/numbers/index.md#decimalformatsymbols)
|
||||
* [`DecimalFormatSymbols`](format_parse/numbers/index#decimalformatsymbols)
|
||||
  `DecimalFormatSymbols` is a concrete class used by `DecimalFormat` to access
|
||||
localized number strings such as the grouping separators, the decimal
|
||||
separator, and the percent sign.
|
||||
|
||||
#### Formatting Dates and Times
|
||||
|
||||
* [DateFormat](formatparse/datetime/index.md) (§)
|
||||
DateFormat provides the basic fields and methods for formatting date objects
|
||||
* [`DateFormat`](format_parse/datetime/index.md)
|
||||
|
||||
`DateFormat` provides the basic fields and methods for formatting date objects
|
||||
to localized strings and parsing date and time strings to date objects.
|
||||
|
||||
* [SimpleDateFormat](formatparse/datetime/index.md) (§)
|
||||
SimpleDateFormat is a concrete class used to format date objects to
|
||||
* [`SimpleDateFormat`](format_parse/datetime/index.md)
|
||||
|
||||
`SimpleDateFormat` is a concrete class used to format date objects to
|
||||
localized strings and to parse date and time strings to date objects using a
|
||||
GregorianCalendar.
|
||||
`GregorianCalendar`.
|
||||
|
||||
* [DateFormatSymbols](formatparse/datetime/index.md) (§)
|
||||
DateFormatSymbols is a concrete class used to access localized date and time
|
||||
* [`DateFormatSymbols`](format_parse/datetime/index.md)
|
||||
|
||||
`DateFormatSymbols` is a concrete class used to access localized date and time
|
||||
formatting strings, such as names of the months, days of the week, and the
|
||||
time zone.
|
||||
|
||||
#### Formatting Messages
|
||||
|
||||
* [MessageFormat](formatparse/messages/index.md) (§)
|
||||
MessageFormat is a concrete class used to produce a language-specific user
|
||||
* [`MessageFormat`](format_parse/messages/index.md)
|
||||
|
||||
`MessageFormat` is a concrete class used to produce a language-specific user
|
||||
message that contains numbers, currency, percentages, date, time, and string
|
||||
variables.
|
||||
|
||||
* [ChoiceFormat](formatparse/messages/index.md) (§)
|
||||
ChoiceFormat is a concrete class used to map strings to ranges of numbers
|
||||
* [`ChoiceFormat`](format_parse/messages/index.md)
|
||||
|
||||
`ChoiceFormat` is a concrete class used to map strings to ranges of numbers
|
||||
and to handle plural words and name series in user messages.
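The two message classes above can be sketched with their JDK namesakes in `java.text`, on the assumption that the ICU4J classes accept the same basic pattern syntax. The class name `MessageDemo`, the patterns, and the helper names are illustrative only.

```java
import java.text.ChoiceFormat;
import java.text.MessageFormat;
import java.util.Locale;

public class MessageDemo {
    // Substitute a string and a locale-formatted integer into a message pattern.
    static String uploaded(String user, int count) {
        MessageFormat mf =
            new MessageFormat("{0} uploaded {1,number,integer} files.", Locale.US);
        return mf.format(new Object[] { user, count });
    }

    // Map numeric ranges to strings: 0 -> "no files", 1 -> "one file", >1 -> "many files".
    static String describe(int count) {
        ChoiceFormat cf = new ChoiceFormat("0#no files|1#one file|1<many files");
        return cf.format(count);
    }

    public static void main(String[] args) {
        System.out.println(uploaded("Ana", 1234));
        System.out.println(describe(0));
        System.out.println(describe(1));
        System.out.println(describe(5));
    }
}
```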
|
||||
|
||||
> :point_right: **Note**: *The C APIs provide the same functionality as the C++ classes with the exception
|
||||
@ -223,34 +232,34 @@ within the same language:
|
||||
* In some cases, an accented letter is treated as if it were two letters. In
|
||||
traditional German, for example, ä is compared as if it were ae.
|
||||
|
||||
Searching and sorting is done through collation using the Collator class and its
|
||||
sub-classes RuleBasedCollator and CollationElementIterator as well as the
|
||||
CollationKey object. Collation determines the proper sort sequence for two or
|
||||
Searching and sorting is done through collation using the `Collator` class and its
|
||||
sub-classes `RuleBasedCollator` and `CollationElementIterator` as well as the
|
||||
`CollationKey` object. Collation determines the proper sort sequence for two or
|
||||
more natural language strings. It also can determine if two strings are
|
||||
equivalent for the purpose of searching.
|
||||
|
||||
The Collator class and its sub-class RuleBasedCollator perform locale-sensitive
|
||||
The `Collator` class and its sub-class `RuleBasedCollator` perform locale-sensitive
|
||||
string comparisons to create sorting and searching routines for natural language
|
||||
text. Collator and RuleBasedCollator can distinguish between characters
|
||||
text. `Collator` and `RuleBasedCollator` can distinguish between characters
|
||||
associated with base characters (such as 'a' and 'b'), accent marks (such as
|
||||
'ò', 'ó'), and uppercase or lowercase properties (such as 'a' and 'A').
|
||||
|
||||
ICU provides the following collation classes for sorting and searching natural
|
||||
language text according to locale-specific rules:
|
||||
|
||||
* [Collator](collation/architecture.md) is the abstract base class of all classes that compare strings.
|
||||
* [`Collator`](collation/architecture.md) is the abstract base class of all classes that compare strings.
|
||||
|
||||
* [CollationElementIterator](collation/architecture.md) is a concrete iterator class that provides an
|
||||
* [`CollationElementIterator`](collation/architecture.md) is a concrete iterator class that provides an
|
||||
iterator for stepping through each character of a locale-specific string
|
||||
according to the rules of a specific collator object.
|
||||
|
||||
* [RuleBasedCollator](collation/architecture.md) is the only built-in
|
||||
* [`RuleBasedCollator`](collation/architecture.md) is the only built-in
|
||||
implementation of the collator. It
|
||||
provides a sophisticated mechanism for comparing strings in a
|
||||
language-specific manner, and an interface that allows the user to
|
||||
specifically customize the sorting order.
|
||||
|
||||
* [CollationKey](collation/architecture.md) is an object that enables the fast sorting of strings by
|
||||
* [`CollationKey`](collation/architecture.md) is an object that enables the fast sorting of strings by
|
||||
representing a string as a sort key under the rules of a specific collator
|
||||
object.
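Locale-sensitive comparison can be sketched with the JDK's `java.text.Collator`, assuming that ICU4J's `Collator` is used in the same way for this simple case. The class name `CollationDemo` and its helper methods are hypothetical. The first helper sorts with a collator instead of raw code unit order; the second compares at primary strength, which ignores accent and case differences.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationDemo {
    // Sort a copy of the input using English collation rules.
    static String[] sorted(String... words) {
        String[] copy = words.clone();
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        Arrays.sort(copy, collator);
        return copy;
    }

    // Primary strength compares base letters only.
    static boolean sameBaseLetters(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // Code unit order would place "Banana" before "apple".
        System.out.println(Arrays.toString(sorted("cherry", "Banana", "apple")));
        System.out.println(sameBaseLetters("resume", "R\u00E9sum\u00E9"));
    }
}
```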
|
||||
|
||||
@ -288,28 +297,31 @@ The BreakIterator services were designed and developed around an "iterator" or
|
||||
text. You can move the pointer forward or backward to search the text for
|
||||
boundaries.
|
||||
|
||||
The BreakIterator class makes it possible to iterate over user characters. A
|
||||
BreakIterator can find the location of a character, word, sentence or potential
|
||||
The `BreakIterator` class makes it possible to iterate over user characters. A
|
||||
`BreakIterator` can find the location of a character, word, sentence or potential
|
||||
line-break boundary. This makes it possible for a software program to properly
|
||||
select characters for text operations such as highlighting a character, cutting
|
||||
a word, moving to the next sentence, or wrapping words at a line ending.
|
||||
BreakIterator performs these operations in a locale-sensitive manner, meaning
|
||||
`BreakIterator` performs these operations in a locale-sensitive manner, meaning
|
||||
that it recognizes text boundaries according to the particular locale ID.
|
||||
|
||||
ICU provides the following classes for iterating over locale-specific text:
|
||||
|
||||
* [BreakIterator](boundaryanalysis/index.md)
|
||||
* [`BreakIterator`](boundaryanalysis/index.md)
|
||||
|
||||
The abstract base class that defines the operations for finding and getting
|
||||
the positions of logical breaks in a string of text: characters, words,
|
||||
sentences, and potential line breaks.
|
||||
|
||||
* [CharacterIterator](strings/characteriterator.md)
|
||||
* [`CharacterIterator`](strings/characteriterator.md)
|
||||
|
||||
The abstract base class for forward and backward iteration over a string of
|
||||
Unicode characters.
|
||||
|
||||
* [StringCharacterIterator](strings/index.md)
|
||||
* [`StringCharacterIterator`](strings/index.md)
|
||||
|
||||
A concrete class for forward and backward iteration over a string of Unicode
|
||||
characters. StringCharacterIterator inherits from CharacterIterator.
|
||||
characters. `StringCharacterIterator` inherits from `CharacterIterator`.
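The iterator pattern described above can be sketched with the JDK's `java.text.BreakIterator`, on the assumption that ICU4J's class of the same name follows the same `first()`/`next()` protocol. The class name `WordBreakDemo` and the helper `words` are hypothetical. The loop walks successive boundaries and keeps only the segments that start with a letter or digit, skipping spaces and punctuation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakDemo {
    // Collect the word segments of a text using locale-sensitive word boundaries.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation; keep only real words.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world. Bye!"));
    }
}
```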
|
||||
|
||||
### Paragraph Layout
|
||||
|
||||
@ -322,26 +334,26 @@ different one for each locale.
|
||||
|
||||
| C API | C++ Class | Description |
|
||||
|----------|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| ubrk_ | BreakIterator | The BreakIterator class implements methods to find the location of boundaries in the text. |
|
||||
| ucal_ | Calendar | The Calendar class is an abstract base class that converts between a UDate object and a set of integer fields such as YEAR, MONTH, DAY, HOUR, and so on. |
|
||||
| umsg.h | ChoiceFormat | A ChoiceFormat class enables you to attach a format to a range of numbers. |
|
||||
| ucol_ | CollationElementIterator | The CollationElementIterator class is used as an iterator to walk through each character of an international string. |
|
||||
| ucol_ | CollationKey | The Collator class generates the Collation keys. |
|
||||
| ucol_ | Collator | The Collator class performs locale-sensitive string comparison. |
|
||||
| udat_ | DateFormat | DateFormat is an abstract class for a family of classes. DateFormat converts dates and times from their internal representations to a textual form that is language-independent, and then back to their internal representations. |
|
||||
| udat_ | DateFormatSymbols | DateFormatSymbols is a public class that encapsulates localized date and time formatting data. This information includes time zone information. |
|
||||
| unum_ | DecimalFormatSymbols | This class represents the set of symbols needed by DecimalFormat to format numbers. |
|
||||
| umsg.h | Format | The Format class is the base class for all formats. |
|
||||
| ucal_ | GregorianCalendar | GregorianCalendar is a concrete class that provides the standard calendar used in many locations. |
|
||||
| uloc_ | Locale | A Locale object represents a specific geographical, political, or cultural region. |
|
||||
| umsg.h | MessageFormat | MessageFormat provides a means to produce concatenated messages in language-neutral way. |
|
||||
| unum_ | NumberFormat | NumberFormat is an abstract base class for all number formats. |
|
||||
| ures_ | ResourceBundle | ResourceBundle provides a means to access a collection of locale-specific information. |
|
||||
| ucol_ | RuleBasedCollator | The RuleBasedCollator provides the implementation of the Collator class using data-driven tables. |
|
||||
| udat_ | SimpleDateFormat | SimpleDateFormat is a concrete class used to format and parse dates in a language-independent way. |
|
||||
| ucal_ | SimpleTimeZone | SimpleTimeZone is a concrete subclass of TimeZone that represents a time zone for use with a Gregorian calendar. |
|
||||
| usearch_ | StringSearch | StringSearch provides a way to search text in a locale sensitive manner. |
|
||||
| ucal_ | TimeZone | TimeZone represents a time zone offset, and also determines daylight savings time settings. |
|
||||
| `ubrk_` | `BreakIterator` | The `BreakIterator` class implements methods to find the location of boundaries in the text. |
|
||||
| `ucal_` | `Calendar` | The `Calendar` class is an abstract base class that converts between a `UDate` object and a set of integer fields such as `YEAR`, `MONTH`, `DAY`, `HOUR`, and so on. |
|
||||
| `umsg.h` | `ChoiceFormat` | A `ChoiceFormat` class enables you to attach a format to a range of numbers. |
|
||||
| `ucol_` | `CollationElementIterator` | The `CollationElementIterator` class is used as an iterator to walk through each character of an international string. |
|
||||
| `ucol_` | `CollationKey` | The `Collator` class generates the Collation keys. |
|
||||
| `ucol_` | `Collator` | The `Collator` class performs locale-sensitive string comparison. |
|
||||
| `udat_` | `DateFormat` | `DateFormat` is an abstract class for a family of classes. `DateFormat` converts dates and times from their internal representations to a textual form that is language-independent, and then back to their internal representations. |
|
||||
| `udat_` | `DateFormatSymbols` | `DateFormatSymbols` is a public class that encapsulates localized date and time formatting data. This information includes time zone information. |
|
||||
| `unum_` | `DecimalFormatSymbols` | This class represents the set of symbols needed by `DecimalFormat` to format numbers. |
|
||||
| `umsg.h` | `Format` | The `Format` class is the base class for all formats. |
|
||||
| `ucal_` | `GregorianCalendar` | `GregorianCalendar` is a concrete class that provides the standard calendar used in many locations. |
|
||||
| `uloc_` | `Locale` | A `Locale` object represents a specific geographical, political, or cultural region. |
|
||||
| `umsg.h`   | `MessageFormat`            | `MessageFormat` provides a means to produce concatenated messages in a language-neutral way.                                                                                                                                            |
|
||||
| `unum_` | `NumberFormat` | `NumberFormat` is an abstract base class for all number formats. |
|
||||
| `ures_` | `ResourceBundle` | `ResourceBundle` provides a means to access a collection of locale-specific information. |
|
||||
| `ucol_` | `RuleBasedCollator` | The `RuleBasedCollator` provides the implementation of the `Collator` class using data-driven tables. |
|
||||
| `udat_` | `SimpleDateFormat` | `SimpleDateFormat` is a concrete class used to format and parse dates in a language-independent way. |
|
||||
| `ucal_` | `SimpleTimeZone` | `SimpleTimeZone` is a concrete subclass of `TimeZone` that represents a time zone for use with a Gregorian calendar. |
|
||||
| `usearch_` | `StringSearch`             | `StringSearch` provides a way to search text in a locale-sensitive manner.                                                                                                                                                          |
|
||||
| `ucal_`    | `TimeZone`                 | `TimeZone` represents a time zone offset, and also determines daylight saving time settings.                                                                                                                                       |
|
||||
|
||||
## Locale-Independent Operations
|
||||
|
||||
@ -350,18 +362,18 @@ locale-independent services and users do not need to specify a locale ID:
|
||||
|
||||
| C API | C++ Class | Description |
|
||||
|-----------|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| ubidi_ | | UBiDi is used for implementing the Unicode BiDi algorithm. |
|
||||
| utf.h | CharacterIterator | CharacterIterator is an abstract class that defines an API for iteration on text objects. It is an interface for forward and backward iteration and for the random access of a text object. Also, it provides backward compatibility to the Java and older ICU CharacterIterator classes. |
|
||||
| n/a | Formattable | Formattable is a thin wrapper class that converts between the primitive numeric types (double, long, and so on) and the UDate and UnicodeString classes. Formattable objects can be passed to the Format class or its subclasses for formatting. |
|
||||
| unorm_ | Normalizer | Normalizer transforms Unicode text into an equivalent composed or decomposed form to allow for easier sorting and searching of text. |
|
||||
| n/a | ParsePosition | ParsePosition is a simple class used by the Format class and its subclasses to keep track of the current position during parsing. |
|
||||
| uidna_ | | An implementation of the IDNA protocol as defined in RFC 3490. |
|
||||
| utf.h | StringCharacterIterator | A concrete subclass of CharacterIterator that iterates over the characters (code units or code points) in a UnicodeString. |
|
||||
| utf.h | UCharCharacterIterator | A concrete subclass of CharacterIterator that iterates over the characters (code units or code points) in a UChar array. |
|
||||
| uchar.h | | The Unicode character properties API allows you to query the properties associated with individual Unicode character values. |
|
||||
| uregex_ | RegexMatcher | RegexMatcher is a regular expressions implementation. This allows you to perform string matching based upon a pattern. |
|
||||
| utrans_ | Transliterator | Transliterator is an abstract class that transliterates text from one format to another. The most common type of transliterator is a script, or an alphabet. |
|
||||
| uset_ | UnicodeSet | Objects of the UnicodeSet class represent character classes used in regular expressions. These classes specify a subset of the set of all Unicode characters. This is a mutable set of Unicode characters. |
|
||||
| ustring.h | UnicodeString | UnicodeString is a string class that stores Unicode characters directly. This class is a concrete implementation of the abstract class Replaceable. |
|
||||
| ushape.h | | Provides operations to transform (shape) between Arabic characters and their presentation forms. |
|
||||
| ucnv_ | | The Unicode conversion API allows you to convert data written in one codepage/encoding to and from UTF-16. |
|
||||
| `ubidi_` | | `UBiDi` is used for implementing the Unicode BiDi algorithm. |
|
||||
| `utf.h` | `CharacterIterator` | `CharacterIterator` is an abstract class that defines an API for iteration on text objects. It is an interface for forward and backward iteration and for the random access of a text object. Also, it provides backward compatibility to the Java and older ICU `CharacterIterator` classes. |
|
||||
| n/a | `Formattable` | `Formattable` is a thin wrapper class that converts between the primitive numeric types (`double`, `long`, and so on) and the `UDate` and `UnicodeString` classes. `Formattable` objects can be passed to the `Format` class or its subclasses for formatting. |
|
||||
| `unorm_` | `Normalizer` | `Normalizer` transforms Unicode text into an equivalent composed or decomposed form to allow for easier sorting and searching of text. |
|
||||
| n/a | `ParsePosition` | `ParsePosition` is a simple class used by the `Format` class and its subclasses to keep track of the current position during parsing. |
|
||||
| `uidna_` | | An implementation of the IDNA protocol as defined in RFC 3490. |
|
||||
| `utf.h` | `StringCharacterIterator` | A concrete subclass of `CharacterIterator` that iterates over the characters (code units or code points) in a `UnicodeString`. |
|
||||
| `utf.h` | `UCharCharacterIterator` | A concrete subclass of `CharacterIterator` that iterates over the characters (code units or code points) in a `UChar` array. |
|
||||
| `uchar.h` | | The Unicode character properties API allows you to query the properties associated with individual Unicode character values. |
|
||||
| `uregex_` | `RegexMatcher` | `RegexMatcher` is a regular expressions implementation. This allows you to perform string matching based upon a pattern. |
|
||||
| `utrans_` | `Transliterator` | `Transliterator` is an abstract class that transliterates text from one format to another. The most common type of transliterator is a script, or an alphabet. |
|
||||
| `uset_` | `UnicodeSet` | Objects of the `UnicodeSet` class represent character classes used in regular expressions. These classes specify a subset of the set of all Unicode characters. This is a mutable set of Unicode characters. |
|
||||
| `ustring.h` | `UnicodeString` | `UnicodeString` is a string class that stores Unicode characters directly. This class is a concrete implementation of the abstract class `Replaceable`. |
|
||||
| `ushape.h` | | Provides operations to transform (shape) between Arabic characters and their presentation forms. |
|
||||
| `ucnv_` | | The Unicode conversion API allows you to convert data written in one codepage/encoding to and from UTF-16. |
|
||||
|
@ -39,7 +39,7 @@ large block of text that can be iterated as a whole, or can be broken into
|
||||
smaller blocks for the purpose of iteration.
|
||||
|
||||
> :point_right: **Note**: *CharacterIterator is different from
|
||||
[Normalizer](../transforms/normalization/index.md) in that CharacterIterator
|
||||
[Normalizer](../transforms/normalization/index) in that CharacterIterator
|
||||
walks through the Unicode characters without interpretation.*
|
||||
|
||||
Prior to ICU release 1.6, the CharacterIterator class allowed access to a single
|
||||
|
@ -215,7 +215,7 @@ Notes:
|
||||
ICU does not provide the means to modify properties at runtime. The properties
|
||||
are provided exactly as specified by a recent version of the Unicode Standard
|
||||
(as published in the [Character
|
||||
Database](http://www.unicode.org/unicode/onlinedat/online.html) ).
|
||||
Database](http://www.unicode.org/onlinedat/online.html)).
|
||||
|
||||
For custom sets and maps, it is easiest to make UnicodeSet or
|
||||
UCPTrie/CodePointTrie objects with the desired values.
|
||||
|
@ -151,7 +151,7 @@ complete a complete description of the API.
|
||||
| \\W | ✓ | ✓ | Match a non-word character.
|
||||
| \\x{hhhh} | ✓ | ✓ | Match the character with hex value hhhh. From one to six hex digits may be supplied.
|
||||
| \\xhh | ✓ | ✓ | Match the character with two digit hex value hh.
|
||||
| \\X | ✓ | | Match a [Grapheme Cluster](http://www.unicode.org/unicode/reports/tr29/#Grapheme_Cluster_Boundaries).
|
||||
| \\X | ✓ | | Match a [Grapheme Cluster](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries).
|
||||
| \\Z | ✓ | | Match if the current position is at the end of input, but before the final line terminator, if one exists.
|
||||
| \\z | ✓ | | Match if the current position is at the end of input.
|
||||
| \\*n* | ✓ | | Back Reference. Match whatever the nth capturing group matched. n must be a number ≥ 1 and ≤ total number of capture groups in the pattern.
|
||||
@ -219,7 +219,7 @@ complete a complete description of the API.
|
||||
Case insensitive matching is specified by the UREGEX_CASE_INSENSITIVE flag
|
||||
during pattern compilation, or by the (?i) flag within a pattern itself. Unicode
|
||||
case insensitive matching is complicated by the fact that changing the case of a
|
||||
string may change its length. See <http://unicode.org/faq/casemap_charprop.html>
|
||||
string may change its length. See <http://www.unicode.org/faq/casemap_charprop.html>
|
||||
for more information on Unicode casing operations.
|
||||
|
||||
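The length change mentioned above can be seen with a short sketch. It uses Python's built-in Unicode case mappings rather than ICU, and the helper name `caseless_equal` is hypothetical, chosen only for illustration:

```python
# Illustrative only: Python's built-in case mappings, not the ICU regex engine.

def upper_lengths(text: str):
    """Return (original length, uppercased length) in code points."""
    return len(text), len(text.upper())


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using full case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()


word = "stra\u00dfe"  # "straße": U+00DF has no single-character uppercase form
print(word.upper())          # the sharp s expands to two letters
print(upper_lengths(word))   # the uppercased string is one code point longer
print(caseless_equal("Stra\u00dfe", "STRASSE"))
```

Because one character can map to two, a matcher that compares strings position by position after a simple case conversion would miss such pairs.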
Full case-insensitive matching handles situations where the number of characters
|
||||
@ -506,5 +506,5 @@ processes, can be a problem with some regular expression packages.
|
||||
matches any of ```!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~```. From that list,
|
||||
ICU omits ```$+<=>^\`|~```
|
||||
ICU follows the recommendations from Unicode UTS-18,
|
||||
<http://unicode.org/reports/tr18/#Compatibility_Properties>. See also
|
||||
<http://www.unicode.org/reports/tr18/#Compatibility_Properties>. See also
|
||||
<https://unicode-org.atlassian.net/browse/ICU-20095>.
|
||||
|
@ -123,7 +123,7 @@ reordering. See the ubidi.h header file and the [BiDi API
|
||||
References](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html) .
|
||||
|
||||
See [Unicode Standard Annex #9: The Bidirectional
|
||||
Algorithm](http://www.unicode.org/unicode/reports/tr9/) .
|
||||
Algorithm](http://www.unicode.org/reports/tr9/) .
|
||||
|
||||
## Programming Examples in C and C++
|
||||
|
||||
|
@ -63,7 +63,7 @@ character map.
|
||||
|
||||
A character is considered to have a lowercase, uppercase, or title case
|
||||
equivalent if there is a respective "simple" case mapping specified for the
|
||||
character in the [Unicode Character Database](http://unicode.org/ucd/) (UnicodeData.txt).
|
||||
character in the [Unicode Character Database](http://www.unicode.org/ucd/) (UnicodeData.txt).
|
||||
If a character has no mapping equivalent, the result is the character itself.
|
||||
|
||||
The APIs provided for general case mapping, located in the `uchar.h` file, handle
|
||||
|
@ -154,9 +154,9 @@ is the source and "Latin" is the target. The source specifier describes the
|
||||
characters or strings that the transform will modify. The target specifier
|
||||
describes the result of the modification. If the source is not given, then the
|
||||
source is "Any", the set of all characters. Source and Target specifiers can be
|
||||
[Script IDs](http://unicode.org/cldr/utility/properties.jsp#Script) (long like
|
||||
[Script IDs](http://www.unicode.org/cldr/utility/properties.jsp#Script) (long like
|
||||
"Latin" or short like "Latn"), [Unicode language
|
||||
Identifiers](http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers)
|
||||
Identifiers](http://www.unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers)
|
||||
(like fr, en_US, or zh_Hant), or special tags (like Any or Hex). For example:
|
||||
|
||||
1. Katakana-Latin
|
||||
|
@ -34,7 +34,7 @@ in particular. For further information, see:
|
||||
1. [The Web site of the Unicode consortium](http://www.unicode.org/)
|
||||
|
||||
2. [What is
|
||||
Unicode?](http://www.unicode.org/unicode/standard/WhatIsUnicode.html)
|
||||
Unicode?](https://www.unicode.org/standard/WhatIsUnicode.html)
|
||||
|
||||
3. [IBM® Globalization](http://www.ibm.com/software/globalization/)
|
||||
|
||||
@ -138,7 +138,7 @@ more complexity into the Unicode standard, but far less than managing a large
|
||||
number of different encodings.
|
||||
|
||||
Starting with Unicode 2.0 (published in 1996), the Unicode standard began
|
||||
assigning numbers from 0 to 10ffff16, which requires 21 bits but does not use
|
||||
assigning numbers from 0 to 10ffff<sub>16</sub>, which requires 21 bits but does not use
|
||||
them completely. This gives more than enough room for all written languages in
|
||||
the world. The original repertoire covered all major languages commonly used in
|
||||
computing. Unicode continues to grow, and it includes more scripts.
|
||||
@ -181,7 +181,7 @@ syntactically important characters are the same.*
|
||||
|
||||
## Character Encoding Forms and Schemes for Unicode
|
||||
|
||||
Unicode assigns characters a number from 0 to 10FFFF16, giving enough elbow room
|
||||
Unicode assigns characters a number from 0 to 10FFFF<sub>16</sub>, giving enough elbow room
|
||||
to allow for unambiguous encoding of every character in common use. Such a
|
||||
character number is called a "code point".
|
||||
|
||||
@ -195,9 +195,9 @@ differ in their byte ordering. The difference between an encoding form and an
|
||||
encoding scheme is that an encoding form maps the character set codes to values
|
||||
that fit into internal data types (like a short in C), while an encoding scheme
|
||||
maps to bits and bytes. For traditional encodings, they are the same since the
|
||||
encoding forms already map to bytes
|
||||
encoding forms already map to bytes.
|
||||
|
||||
. The different Unicode encoding forms are optimized for a variety of different
|
||||
The different Unicode encoding forms are optimized for a variety of different
|
||||
uses:
|
||||
|
||||
1. UTF-16, the default encoding form, maps a character code point to either one
|
||||
@ -207,18 +207,18 @@ uses:
|
||||
ASCII-based, byte-oriented APIs and protocols. A character is stored with 1,
|
||||
2, 3, or 4 bytes.
|
||||
|
||||
3. UTF-32 is the simplest but most memory-intensive encoding form: It uses one
|
||||
3. UTF-32 is the simplest, but most memory-intensive encoding form: It uses one
|
||||
32-bit integer per Unicode character.
|
||||
|
||||
4. SCSU is an encoding scheme that provides a simple compression of Unicode
|
||||
text. It is designed only for input and output, not for internal use.
|
||||
|
||||
ICU uses UTF-16 internally. ICU 2.0 fully supports supplementary characters
|
||||
(with code points 1000016..10FFFF16. Older versions of ICU provided only partial
|
||||
(with code points 10000<sub>16</sub>..10FFFF<sub>16</sub>). Older versions of ICU provided only partial
|
||||
support for supplementary characters.
|
||||
|
||||
For input/output, character encoding schemes define a byte serialization of
|
||||
text. UTF-8 is itself both an encoding form and an encoding scheme because it is
|
||||
text. UTF-8 is itself both an encoding form, and an encoding scheme because it is
|
||||
byte-based. For each of UTF-16 and UTF-32, there are two variants defined: one
|
||||
that serializes the code units in big-endian byte order (most significant byte
|
||||
first), and one that serializes the code units in little-endian byte order
|
||||
@ -230,24 +230,23 @@ either to character encoding forms where 16/32-bit words are processed and are
|
||||
naturally stored in the platform endianness, or they refer to the
|
||||
IANA-registered charset names, i.e., to character encoding schemes or byte
|
||||
serializations. In addition to simple byte serialization, the charsets with
|
||||
these names also use optional Byte Order Marks (see Serialized Formats (§)
|
||||
below).*
|
||||
these names also use optional Byte Order Marks (see [Serialized Formats](#serialized-formats) below).*
|
||||
|
||||
## Overview of UTF-16
|
||||
|
||||
The default encoding form of the Unicode Standard uses 16-bit code units. Code
|
||||
point values for the most common characters are in the range of 0 to FFFF16 and
|
||||
point values for the most common characters are in the range of 0 to FFFF<sub>16</sub> and
|
||||
are encoded with just one 16-bit unit of the same value. Code points from
|
||||
1000016 to 10FFFF16 are encoded with two code units that are often called
|
||||
10000<sub>16</sub> to 10FFFF<sub>16</sub> are encoded with two code units that are often called
|
||||
"surrogates", and they are called a "surrogate pair" when, together, they
|
||||
correctly encode one Unicode character. The first surrogate in a pair must be in
|
||||
the range D80016 to DBFF16, and the second one must be in the range DC0016 to
|
||||
DFFF16. Every Unicode code point has only one possible UTF-16 encoding with
|
||||
the range D800<sub>16</sub> to DBFF<sub>16</sub>, and the second one must be in the range DC00<sub>16</sub> to
|
||||
DFFF<sub>16</sub>. Every Unicode code point has only one possible UTF-16 encoding with
|
||||
either one code unit that is not a surrogate or with a correct pair of
|
||||
surrogates. The code point values D80016 to DFFF16 are set aside just for this
|
||||
surrogates. The code point values D800<sub>16</sub> to DFFF<sub>16</sub> are set aside just for this
|
||||
mechanism and will never, by themselves, be assigned any characters.
|
||||
|
||||
Most commonly used characters have code points below FFFF16, but Unicode 3.1
|
||||
Most commonly used characters have code points below FFFF<sub>16</sub>, but Unicode 3.1
|
||||
assigns more than 40,000 supplementary characters that make use of surrogate
|
||||
pairs in UTF-16.
|
||||
|
||||
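The surrogate ranges described above follow from simple arithmetic on the code point. The following is a minimal sketch in plain Python, not ICU code; the function names `to_surrogates` and `from_surrogates` are assumptions made for this example:

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into a (lead, trail) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = cp - 0x10000                 # 20 significant bits remain
    lead = 0xD800 + (offset >> 10)        # top 10 bits, range D800..DBFF
    trail = 0xDC00 + (offset & 0x3FF)     # low 10 bits, range DC00..DFFF
    return lead, trail


def from_surrogates(lead: int, trail: int) -> int:
    """Recombine a correct surrogate pair into its code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a correct surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


lead, trail = to_surrogates(0x1F600)
print(f"U+1F600 -> {lead:04X} {trail:04X}")
```

Each code point in the supplementary range maps to exactly one pair, which is why every code point has only one possible UTF-16 encoding.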
@ -265,26 +264,26 @@ To meet the requirements of byte-oriented, ASCII-based systems, the Unicode
|
||||
Standard defines UTF-8. UTF-8 is a variable-length, byte-based encoding that
|
||||
preserves ASCII transparency.
|
||||
|
||||
UTF-8 maintains transparency for all of the ASCII code values (0..127). These
|
||||
UTF-8 maintains transparency for all the ASCII code values (0..127). These
|
||||
values do not appear in any byte of a transformed result except as the direct
|
||||
representation of the ASCII values. Thus, ASCII text is also UTF-8 text.
|
||||
|
||||
Characteristics of UTF-8 include:
|
||||
|
||||
1. Unicode code points 0 to 7F16 are each encoded with a single byte of the
|
||||
1. Unicode code points 0 to 7F<sub>16</sub> are each encoded with a single byte of the
|
||||
same value. Therefore, ASCII characters take up 50% less space with UTF-8
|
||||
encoding than with UTF-16.
|
||||
|
||||
2. All other code points are encoded with multibyte sequences, with the first
|
||||
byte (lead byte) indicating the number of bytes that follow (trail bytes).
|
||||
This results in very efficient parsing. The lead bytes are in the range c016
|
||||
to fd16, the trail bytes are in the range 8016 to bf16. The byte values fe16
|
||||
and FF16 are never used.
|
||||
This results in very efficient parsing. The lead bytes are in the range c0<sub>16</sub>
|
||||
to fd<sub>16</sub>, the trail bytes are in the range 80<sub>16</sub> to bf<sub>16</sub>. The byte values fe<sub>16</sub>
|
||||
and FF<sub>16</sub> are never used.
|
||||
|
||||
3. UTF-8 is relatively compact and resource conservative in its use of the
|
||||
bytes required for encoding text in European scripts, but uses 50% more
|
||||
space than UTF-16 for East Asian text. Code points up to 7FF16 take up two
|
||||
bytes, code points up to FFFF16 take up three (50% more memory than UTF-16),
|
||||
space than UTF-16 for East Asian text. Code points up to 7FF<sub>16</sub> take up two
|
||||
bytes, code points up to FFFF<sub>16</sub> take up three (50% more memory than UTF-16),
|
||||
and all others four.
|
||||
|
||||
4. Binary comparisons of UTF-8 strings based on their bytes result in the same
|
||||
@ -297,12 +296,12 @@ point. This results in a very simple encoding.
|
||||
|
||||
The drawback is its memory consumption: Since code point values use only 21
|
||||
bits, one-third of the memory is always unused, and since most commonly used
|
||||
characters have code point values of up to FFFF16, they take up only one 16-bit
|
||||
characters have code point values of up to FFFF<sub>16</sub>, they take up only one 16-bit
|
||||
unit in UTF-16 (50% less) and up to three bytes in UTF-8 (25% less).
|
||||
|
||||
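The size comparison above can be checked with a small sketch that relies on Python's standard codecs rather than ICU converters. The helper name `encoded_sizes` is hypothetical:

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character occupies in each UTF."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The explicit "-be" variants avoid counting a byte order mark.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


# One sample each from the ASCII, two-byte, three-byte and supplementary ranges.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always takes four bytes, while the other two forms grow only for characters outside their compact ranges.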
UTF-32 is mainly used in APIs that are defined with the same data type for both
|
||||
code points and code units. Modern versions of the C standard library that
|
||||
support Unicode use a 32-bit wchar_t with UTF-32 semantics.
|
||||
support Unicode use a 32-bit `wchar_t` with UTF-32 semantics.
|
||||
|
||||
## Overview of SCSU
|
||||
|
||||
@ -316,7 +315,7 @@ makes it unsuitable for internal processing. It also uses all possible byte
|
||||
values, which might require additional processing for protocols such as SMTP
|
||||
(email).
|
||||
|
||||
See also <http://www.unicode.org/unicode/reports/tr6/> .
|
||||
See also <https://www.unicode.org/reports/tr6/> .
|
||||
|
||||
## Other Unicode Encodings
|
||||
|
||||
@ -376,7 +375,7 @@ lists these implications:
|
||||
points for string searching. False matches never occur since the end of one
|
||||
sequence is never the same as the start of another sequence. Overlap is one
|
||||
of the biggest problems with common multi-byte encodings like Shift-JIS. All
|
||||
of the UTFs avoid this problem.
|
||||
the UTFs avoid this problem.
|
||||
|
||||
4. Uses simple iteration. Getting the next or previous code point is
|
||||
straightforward, and only takes a small number of machine instructions.
|
||||
@ -397,11 +396,11 @@ Conversion between different UTFs is very fast. Unlike converting to and from
|
||||
legacy encodings like Latin-2, conversion between UTFs does not require table
|
||||
look-ups.
|
||||
|
||||
ICU provides two basic data type definitions for Unicode. UChar32 is a 32-bit
|
||||
ICU provides two basic data type definitions for Unicode. `UChar32` is a 32-bit
|
||||
type for code points and is used for single Unicode characters. It may be signed
|
||||
or unsigned. It is the same as wchar_t if it is 32 bits wide. UChar is an
|
||||
or unsigned. It is the same as `wchar_t` if it is 32 bits wide. `UChar` is an
|
||||
unsigned 16-bit integer for UTF-16 code units. It is the base type for strings
|
||||
(`UChar *`), and it is the same as wchar_t if it is 16 bits wide.
|
||||
(`UChar *`), and it is the same as `wchar_t` if it is 16 bits wide.
|
||||
|
||||
Some higher-level APIs, used especially for formatting, use characters closer to
|
||||
a representation for a glyph. Such "user characters" are also called "graphemes"
|
||||
@ -428,7 +427,7 @@ format-dependent way. However, the Unicode standard offers a mechanism for
|
||||
tagging text files with a "signature" for cases where protocols do not identify
|
||||
character encoding schemes.
|
||||
|
||||
The character ZERO WIDTH NO-BREAK SPACE (FEFF16) can be used as a signature by
|
||||
The character ZERO WIDTH NO-BREAK SPACE (FEFF<sub>16</sub>) can be used as a signature by
|
||||
prepending it to a file or stream. The alternative function of U+FEFF as a
|
||||
format control character has been copied to U+2060 WORD JOINER, and U+FEFF
|
||||
should only be used for Unicode signatures.
|
||||
@ -454,7 +453,7 @@ sequences for U+FEFF:
|
||||
|
||||
9. UTF-EBCDIC: DD 73 66 73
|
||||
|
||||
ICU provides the function ucnv_detectUnicodeSignature() for Unicode signature
|
||||
ICU provides the function `ucnv_detectUnicodeSignature()` for Unicode signature
|
||||
detection.
|
||||
|
||||
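To illustrate the idea behind signature detection, here is a minimal sketch in Python. It is not the ICU implementation and does not call `ucnv_detectUnicodeSignature()`; it only covers the UTF-8, UTF-16 and UTF-32 signatures, and the name `detect_signature` is an assumption made for this example:

```python
import codecs

# Longer signatures come first: the UTF-32LE signature (FF FE 00 00)
# begins with the UTF-16LE signature (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def detect_signature(data: bytes):
    """Return (charset name, signature length), or (None, 0) if none matches."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(detect_signature(b"\xef\xbb\xbfhello"))
print(detect_signature(b"\xff\xfeh\x00i\x00"))
```

Note that a UTF-16LE stream whose first character after the signature is U+0000 would be reported as UTF-32LE by this sketch, so the result is a hint rather than a guarantee.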
*There is no signature for CESU-8 separate from the one for UTF-8. UTF-8 and
|
||||
@ -465,7 +464,7 @@ CESU-8 should only be used in limited, closed, specific environments.*
|
||||
In UTF-16 and UTF-32, where the signature also distinguishes between big-endian
|
||||
and little-endian byte orders, it is also called a byte order mark (BOM). The
|
||||
signature works for UTF-16 since the code point that has the byte-swapped
|
||||
encoding, FFFE16, will never be a valid Unicode character. (It is a
|
||||
encoding, FFFE<sub>16</sub>, will never be a valid Unicode character. (It is a
|
||||
"non-character" code point.) In Internet protocols, if an encoding specification
|
||||
of "UTF-16" or "UTF-32" is used, it is expected that there is a signature byte
|
||||
sequence (BOM) that identifies the byte ordering, which is not the case for the
|
||||
@ -510,9 +509,9 @@ standard. Both standards have the same character repertoire and the same
|
||||
encoding forms and schemes.
|
||||
|
||||
One difference used to be that the ISO standard defined code point values to be
|
||||
from 0 to 7FFFFFFF16, not just up to 10FFFF16. The ISO work group decided to add
|
||||
from 0 to 7FFFFFFF<sub>16</sub>, not just up to 10FFFF<sub>16</sub>. The ISO work group decided to add
|
||||
an amendment to the standard. The amendment removes this difference by declaring
|
||||
that no characters will ever be assigned code points above 10FFFF16. The main
|
||||
that no characters will ever be assigned code points above 10FFFF<sub>16</sub>. The main
|
||||
reason for the ISO work group's decision is interoperability between the UTFs.
|
||||
UTF-16 cannot encode any code points above this limit.
|
||||
|
||||
@ -526,7 +525,7 @@ specifies sequences of five and six bytes to cover that whole range.
|
||||
|
||||
Another difference is that the ISO standard defines encoding forms "UCS-4" and
|
||||
"UCS-2". UCS-4 is essentially UTF-32 with a theoretical upper limit of
|
||||
7FFFFFFF16, using 31 out of the 32 bits. However, in practice, the ISO committee
|
||||
7FFFFFFF<sub>16</sub>, using 31 out of the 32 bits. However, in practice, the ISO committee
|
||||
has accepted that the characters above 10FFFF will not be encoded, so there is
|
||||
essentially no difference between the forms. The "4" stands for "four-byte
|
||||
form".
|
||||
@ -542,5 +541,5 @@ The standards differ in what kind of information they provide: The Unicode
|
||||
standard provides more character properties and describes algorithms etc., while
|
||||
the ISO standard defines collections, subsets and similar.
|
||||
|
||||
The standards are synchronized and the respective committees work together to
|
||||
The standards are synchronized, and the respective committees work together to
|
||||
add new characters and assign code point values.
|
||||
|