ICU data, when stored in files, is loaded from the file system
directory that is returned by u_getDataDirectory().
That directory is determined sequentially by
1. getenv("ICU_DATA") - the contents of the ICU_DATA environment variable
2. on Windows, the value "Path" of the registry key
   HKEY_LOCAL_MACHINE "SOFTWARE\\ICU\\Unicode\\Data"
3. a path relative to where icuuc.dll or libicu-uc.so or similar
   is loaded from: if it is loaded from /some/path/lib/libicu-uc.so, then
   the path will be /some/path/lib/../share/icu/1.3.1/
   where "1.3.1" is an example for the version of the ICU library that
   is trying to locate the data directory
4. a path relative to where icuuc.dll or libicu-uc.so or similar
   is found by searching the PATH or LIBPATH as appropriate;
   the relative path is determined as above
5. (system drive)/share/icu/1.3.1/,
   where (system drive) is empty or a path to the system drive, like
   "D:\" on Windows or OS/2
When ICU data is loaded using the udata
API functions, there is a defined sequence of file locations and
entry point names that are used to locate the data.
See the description in icu/source/common/udata.h for details.
Note that the exact data lookup depends on the implementation
of this API and may differ by platform and by build configuration.
See also icu/source/common/udata.c for implementation details.
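The lookup order above can be sketched in plain C. This is only an illustration under stated assumptions, not ICU's implementation: the function names data_dir_from_library() and find_data_dir() are hypothetical, only '/' separators are handled, and the registry and PATH/LIBPATH steps are left out.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build "<directory of libPath>/../share/icu/<version>/".
 * Returns 0 on success, -1 if libPath has no directory part or out is too small. */
static int data_dir_from_library(const char *libPath, const char *version,
                                 char *out, size_t capacity) {
    const char *slash = strrchr(libPath, '/');
    if (slash == NULL) {
        return -1;
    }
    int dirLen = (int)(slash - libPath);
    int n = snprintf(out, capacity, "%.*s/../share/icu/%s/",
                     dirLen, libPath, version);
    return (n < 0 || (size_t)n >= capacity) ? -1 : 0;
}

/* Hypothetical lookup: ICU_DATA first, then the library-relative path,
 * then the fixed path with an empty (system drive) part.
 * libPath may be NULL if the library location is unknown. */
static int find_data_dir(const char *libPath, const char *version,
                         char *out, size_t capacity) {
    const char *env = getenv("ICU_DATA");
    int n;
    if (env != NULL && env[0] != '\0') {
        n = snprintf(out, capacity, "%s", env);
        return (n < 0 || (size_t)n >= capacity) ? -1 : 0;
    }
    if (libPath != NULL &&
        data_dir_from_library(libPath, version, out, capacity) == 0) {
        return 0;
    }
    n = snprintf(out, capacity, "/share/icu/%s/", version);
    return (n < 0 || (size_t)n >= capacity) ? -1 : 0;
}
```

With ICU_DATA unset, calling find_data_dir("/some/path/lib/libicu-uc.so", "1.3.1", buf, sizeof buf) fills buf with the library-relative path from the example in the list.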
Data files for ICU, and for applications that load their data with ICU, should have a memory-mappable format. This means that the data should be laid out in the file in an immediately useful way, so that the code that uses the data does not need to parse it or copy it to allocated memory and build additional structures (like hash tables). Here are some points to consider:
- The data is aligned to a multiple of sizeof(double)
  (the largest scalar data type) if you use unewdata.h/.c to write the data.
  To be exact, unewdata writes the data 16-aligned,
  and it is 16-aligned in memory-mapped files. However, the build
  process forced us to insert a double before the
  binary data to get any alignment, thus only 8-aligning
  (sizeof(double)==8 on most machines) the data.
- Write integers with explicitly sized types such as int32_t,
  not using an ambiguous int.
- Avoid boolean (bool_t, bool) values
  and use explicitly sized integer values instead
  because the size of the boolean type may vary.
- In char[] strings, write only "invariant"
  characters - avoid anything that is not common among all ASCII-
  or EBCDIC-based encodings. This avoids incompatibilities and
  real, heavyweight codepage conversions.
  Even on the same platform, the default encoding may not always
  be the same one, and every "non-invariant" character
  may change.
Data files with formats as described above should be portable among machines with the same set of relevant properties:
- Byte order (big-endian or little-endian), which affects multi-byte
  integers such as uint16_t, int32_t.
- Character set family, which affects char[].
  Such strings should contain only "invariant characters", but
  are even so only portable among machines with the same character set
  family, i.e., they must share for example the ASCII or EBCDIC
  graphic characters.
- Unicode code unit size, which affects UChar[].
  In principle, Unicode characters are stored using UTF-8, UTF-16, or UTF-32.
  Thus, Unicode strings are directly compatible if the code unit size is the same.
  ICU uses only UTF-16 at this point.
All of these properties can be verified by checking the UDataInfo
structure of the data, which is best done in a UDataMemoryIsAcceptable()
function passed into the udata_openChoice() API function.
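The kind of check such a function performs can be sketched without ICU headers. The struct below is a simplified stand-in that borrows three field names from UDataInfo; it is not the real structure, and host_info() and is_acceptable() are hypothetical names. The charset family values (0 for ASCII, 1 for EBCDIC) are an assumption of this sketch.

```c
#include <stdint.h>

/* Simplified stand-in for three fields of UDataInfo; not the real struct. */
typedef struct {
    uint8_t isBigEndian;    /* 1 if multi-byte integers are big-endian */
    uint8_t charsetFamily;  /* assumed here: 0 = ASCII family, 1 = EBCDIC family */
    uint8_t sizeofUChar;    /* size of one Unicode code unit in bytes */
} ExampleDataInfo;

/* Detect the byte order of the running machine by looking at the
 * first byte of a known 16-bit value. */
static uint8_t host_is_big_endian(void) {
    const uint16_t probe = 0x0102;
    return (uint8_t)(*(const uint8_t *)&probe == 0x01);
}

/* Detect the charset family from an invariant character:
 * 'A' is 0x41 in ASCII-based encodings and 0xC1 in EBCDIC-based ones. */
static uint8_t host_charset_family(void) {
    return (uint8_t)(('A' == 0x41) ? 0 : 1);
}

/* Describe the running machine; UTF-16 code units are 2 bytes. */
static ExampleDataInfo host_info(void) {
    ExampleDataInfo info;
    info.isBigEndian = host_is_big_endian();
    info.charsetFamily = host_charset_family();
    info.sizeofUChar = 2;
    return info;
}

/* Accept a data file only if all three relevant properties match the host. */
static int is_acceptable(const ExampleDataInfo *file,
                         const ExampleDataInfo *host) {
    return file->isBigEndian == host->isBigEndian &&
           file->charsetFamily == host->charsetFamily &&
           file->sizeofUChar == host->sizeofUChar;
}
```

A file that contains no char[] strings could use a looser check that skips the charsetFamily comparison, since only the properties relevant to that file need to match.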
If a data file is loaded on a machine whose relevant properties differ from those of the machine where the file was generated, the code using the data could adapt by detecting the differences and reformatting the data on the fly or in a copy in memory. This would improve the portability of the data files but significantly decrease performance.
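As one illustration of reformatting in a copy, the sketch below reverses the byte order of integers while copying them out of the loaded data. The function names are hypothetical, and the caller is assumed to have already determined whether the file's byte order differs from the host's.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t v) {
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Reverse the four bytes of a 32-bit value. */
static uint32_t swap32(uint32_t v) {
    return (v >> 24) |
           ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) |
           (v << 24);
}

/* Copy count 32-bit values from src to dst, reversing the byte order
 * of each value when needsSwap is nonzero. */
static void copy_uint32_array(uint32_t *dst, const uint32_t *src,
                              size_t count, int needsSwap) {
    for (size_t i = 0; i < count; ++i) {
        dst[i] = needsSwap ? swap32(src[i]) : src[i];
    }
}
```

The extra pass over every value, plus the memory for the copy, is the performance cost mentioned above; a file whose byte order already matches is used in place with no copy at all.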
"Relevant" properties are those that affect the portability of the data in the particular file.
For example, a flat (memory-mapped) binary data file
that contains 16-bit and 32-bit integers and is
created for a typical, big-endian Unix machine, can be used
on an OS/390 system or any other big-endian machine.
If the file also contains char[] strings,
then it can be easily shared among all big-endian and
ASCII-based machines, but not with (e.g.) an OS/390.
OS/390 and OS/400 systems, however, could easily share such
a data file created on either of these systems.
To make sure that the relevant platform properties of
the data file and the loading machine match, the udata_openChoice()
API function should be used with a UDataMemoryIsAcceptable()
function that checks for these properties.
Some data file loading mechanisms prevent using data files generated on a different platform to begin with, especially data files packaged as DLLs (shared libraries).
This is a raw draft.
... Use icu/source/tools/toolutil/unewdata.h|.c to write data files,
can include a copyright statement or other comment...
See icu/source/tools/gennames ...
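The layout points from earlier in this section (explicitly sized integers, no boolean types, 16-byte alignment) can be sketched in plain C. This sketch does not use the unewdata API; ExampleHeader, align_up() and put_header() are hypothetical names chosen for illustration, and the data is written to a memory buffer in the host's byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout: only explicitly sized integers, no int or bool. */
typedef struct {
    uint32_t count;     /* number of entries */
    uint16_t flags;     /* bit 0 stands in for a boolean value */
    uint16_t reserved;  /* explicit padding so the struct has no hidden gaps */
} ExampleHeader;

/* Round offset up to the next multiple of alignment (alignment > 0). */
static size_t align_up(size_t offset, size_t alignment) {
    size_t rest = offset % alignment;
    return rest == 0 ? offset : offset + (alignment - rest);
}

/* Copy a header into buffer at the first 16-aligned offset at or after
 * offset, zero-filling the gap. Returns the offset just past the header,
 * or 0 if the buffer is too small. */
static size_t put_header(uint8_t *buffer, size_t capacity, size_t offset,
                         const ExampleHeader *header) {
    size_t start = align_up(offset, 16);
    if (start + sizeof *header > capacity) {
        return 0;
    }
    memset(buffer + offset, 0, start - offset);
    memcpy(buffer + start, header, sizeof *header);
    return start + sizeof *header;
}
```

Because every field has a fixed width and the padding is explicit, code that maps the file can read the header in place from a machine with the same byte order, without parsing or copying.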