ICU - Formats and API for Binary Data Files

Finding ICU data

ICU data, when stored in files, is loaded from the file system directory that is returned by u_getDataDirectory(). That directory is determined by trying the following sources in order and using the first one that yields a path:

getenv("ICU_DATA") - the contents of the ICU_DATA environment variable
on Windows, by the value named "Path" of the registry key HKEY_LOCAL_MACHINE "SOFTWARE\\IBM\\Unicode\\Data"
relative to the path where icuuc.dll or libicu-uc.so or similar is loaded from: if it is loaded from /some/path/lib/libicu-uc.so, then the path will be /some/path/lib/../share/icu/1.3.1/ where "1.3.1" is an example of the version of the ICU library that is trying to locate the data directory
relative to the path where icuuc.dll or libicu-uc.so or similar is found by searching the PATH or LIBPATH as appropriate; the relative path is determined as above
hardcoded to (system drive)/share/icu/1.3.1/, where (system drive) is empty or a path to the system drive, like "D:\" on Windows or OS/2
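
The first and third steps of this sequence can be sketched in a few lines of C. This is only an illustration and not the ICU implementation: dataDirFromLibraryPath() and findDataDir() are hypothetical helpers, the library path is passed in by the caller rather than discovered, an empty ICU_DATA value is assumed to count as "not set", and the registry, PATH/LIBPATH and hardcoded steps are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of ICU: derive the data directory from the
 * full path of the loaded library ("relative to the path where the library
 * is loaded from"). Returns 0 on success, -1 if libPath has no directory
 * part or the output buffer is too small.
 */
static int dataDirFromLibraryPath(const char *libPath, const char *version,
                                  char *out, size_t capacity) {
    const char *lastSlash;
    int written;

    assert(libPath != NULL && version != NULL && out != NULL);

    lastSlash = strrchr(libPath, '/');
    if (lastSlash == NULL) {
        return -1;
    }
    /* Keep the directory including its trailing '/', then append the
     * relative part "../share/icu/<version>/". */
    written = snprintf(out, capacity, "%.*s../share/icu/%s/",
                       (int)(lastSlash - libPath + 1), libPath, version);
    return (written < 0 || (size_t)written >= capacity) ? -1 : 0;
}

/*
 * Hypothetical helper: use ICU_DATA if it is set and non-empty (assumption),
 * otherwise fall back to the library-relative path.
 */
static int findDataDir(const char *libPath, const char *version,
                       char *out, size_t capacity) {
    const char *env = getenv("ICU_DATA");

    if (env != NULL && *env != '\0') {
        int written = snprintf(out, capacity, "%s", env);
        return (written < 0 || (size_t)written >= capacity) ? -1 : 0;
    }
    return dataDirFromLibraryPath(libPath, version, out, capacity);
}
```

With the library path from the example above, dataDirFromLibraryPath("/some/path/lib/libicu-uc.so", "1.3.1", ...) produces /some/path/lib/../share/icu/1.3.1/.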

When ICU data is loaded using the udata API functions, a defined sequence of file locations and entry point names is used to locate the data. See the description in icu/source/common/udata.h for details. Note that the exact lookup depends on the implementation of this API and may differ by platform and by build configuration. See also icu/source/common/udata.c for implementation details.

Binary Data File Formats

Data files for ICU, and for applications that load their data with ICU, should have a memory-mappable format. This means that the data should be laid out in the file in an immediately useful way, so that the code that uses the data does not need to parse it or copy it to allocated memory and build additional structures (like Hashtables). Here are some points to consider:

The data memory starts at an offset within the data file that is divisible by (at least) sizeof(double) (the largest scalar data type) if you use unewdata.h/.c to write the data. To be exact, unewdata writes the data 16-aligned, and it is 16-aligned in memory-mapped files. However, the build process forced us to insert a double before the binary data to get any alignment, thus only 8-aligning (sizeof(double)==8 on most machines) the data.
Write explicitly sized values: explicitly 32 bits with an int32_t, not using an ambiguous int.
Align all values according to their data type size: Align 16-bit integers on even offsets, 32-bit integers on offsets divisible by 4, etc.
Align structures according to their largest field.
When writing structures directly, avoid implicit field padding/alignment: if a field may not be aligned within the structure according to its size, then insert additional (reserved) fields to explicitly size-align that field.
Avoid floating point values if possible. Their size and structure may differ among platforms.
Avoid boolean (bool_t, bool) values and use explicitly sized integer values instead because the size of the boolean type may vary.
Write offsets to sub-structures at the beginning of the data so that those sub-structures can be accessed directly without parsing the data that precedes them.
If data needs to be read linearly, then precede it with its length rather than terminating it with a sentinel value.
When writing char[] strings, write only "invariant" characters - avoid anything that is not common among all ASCII- or EBCDIC-based encodings. This avoids incompatibilities and real, heavyweight codepage conversions. Even on the same platform, the default encoding may not always be the same one, and every "non-invariant" character may change.
(The term "invariant characters" is from Unicode Technical Report 16 (UTF-EBCDIC).)
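
Several of the points above (explicitly sized fields, explicit padding, offsets at the beginning, a length instead of a sentinel) can be illustrated with a small header structure. This is a minimal sketch of a made-up format: ExampleHeader and exampleValues() are hypothetical names and do not correspond to any ICU data file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header for an example data format; not an ICU structure. */
typedef struct {
    uint32_t namesOffset;   /* byte offset from the start of the data to the names block */
    uint32_t valuesOffset;  /* byte offset from the start of the data to the values block */
    uint16_t valueCount;    /* length prefix: number of int32_t values, no sentinel needed */
    uint8_t  hasDefault;    /* 0 or 1, stored as an explicitly sized integer instead of a boolean */
    uint8_t  reserved;      /* explicit padding so that no implicit padding is needed */
} ExampleHeader;

/* Compile-time check that the compiler added no hidden padding (4+4+2+1+1 bytes). */
typedef char ExampleHeaderSizeCheck[sizeof(ExampleHeader) == 12 ? 1 : -1];

/*
 * Return a pointer to the values block without parsing anything that
 * precedes it; the number of values is stored in *pCount.
 * 'data' must point to suitably aligned, memory-mapped or loaded data.
 */
static const int32_t *exampleValues(const void *data, uint16_t *pCount) {
    const ExampleHeader *header = (const ExampleHeader *)data;

    assert(data != NULL && pCount != NULL);

    *pCount = header->valueCount;
    return (const int32_t *)((const char *)data + header->valuesOffset);
}
```

Because every field sits at an offset divisible by its own size, the structure can be read directly from the mapped file on any platform with the same byte order.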

Platform-dependency of Binary Data Files

Data files with formats as described above should be portable among machines with the same set of relevant properties:

Byte ordering: If the data contains values other than byte arrays.
Example: uint16_t, int32_t.
Character set family: Some data files contain char[]. Such strings should contain only "invariant characters", but are even so only portable among machines with the same character set family, i.e., they must share for example the ASCII or EBCDIC graphic characters.
Unicode Character size: Some data files contain UChar[]. In principle, Unicode characters are stored using UTF-8, UTF-16, or UTF-32. Thus, Unicode strings are directly compatible if the code unit size is the same. ICU uses only UTF-16 at this point.

All of these properties can be verified by checking the UDataInfo structure of the data, which is best done in a UDataMemoryIsAcceptable() function passed into the udata_openChoice() API function.
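
A check of this kind might look roughly like the following. To keep the sketch compilable without the ICU headers, it uses a local stand-in, ExampleDataInfo, that mirrors the fields of UDataInfo as declared in udata.h, and a callback with the same parameter list as UDataMemoryIsAcceptable() but a plain int return type. The data format bytes ("Exmp"), the format version, and the helper names are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in mirroring the fields of UDataInfo; with ICU available,
 * include udata.h and use the real type instead. */
typedef struct {
    uint16_t size;
    uint16_t reservedWord;
    uint8_t  isBigEndian;
    uint8_t  charsetFamily;    /* 0 = ASCII family, 1 = EBCDIC family */
    uint8_t  sizeofUChar;
    uint8_t  reservedByte;
    uint8_t  dataFormat[4];
    uint8_t  formatVersion[4];
    uint8_t  dataVersion[4];
} ExampleDataInfo;

/* Byte order of the machine running this code, detected at run time. */
static uint8_t hostIsBigEndian(void) {
    const uint16_t probe = 1;
    return (uint8_t)(*(const uint8_t *)&probe == 0);
}

/* Character set family of the compiler's execution character set. */
static uint8_t hostCharsetFamily(void) {
    return (uint8_t)('A' == 0x41 ? 0 : 1);
}

/*
 * Hypothetical acceptance callback: accept the data only if byte order,
 * character set family, UChar size and data format all match.
 * The format bytes are written numerically so that the comparison does not
 * depend on the character set of the compiling machine.
 */
static int exampleIsAcceptable(void *context, const char *type,
                               const char *name, const ExampleDataInfo *pInfo) {
    (void)context;
    (void)type;
    (void)name;
    assert(pInfo != NULL);

    return pInfo->size >= 20 &&
           pInfo->isBigEndian == hostIsBigEndian() &&
           pInfo->charsetFamily == hostCharsetFamily() &&
           pInfo->sizeofUChar == 2 &&         /* UTF-16 code units */
           pInfo->dataFormat[0] == 0x45 &&    /* "Exmp", a made-up format */
           pInfo->dataFormat[1] == 0x78 &&
           pInfo->dataFormat[2] == 0x6d &&
           pInfo->dataFormat[3] == 0x70 &&
           pInfo->formatVersion[0] == 1;
}
```

A file that contains no char[] strings could skip the charsetFamily comparison, and one without UChar[] could skip sizeofUChar, since only the relevant properties need to match.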

If a data file is loaded on a machine whose relevant properties differ from those of the machine where the data file was generated, then the code that uses the data could adapt by detecting the differences and reformatting the data on the fly or in a copy in memory. This would improve portability of the data files but significantly decrease performance.
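
As an illustration of the "copy in memory" approach for the byte-ordering property, the following sketch copies an array of 32-bit values and swaps their bytes when the file and the host disagree. The function names are hypothetical and the byte-order flags are assumed to have been determined elsewhere, for example from the isBigEndian field of the UDataInfo structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t v) {
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Reverse the four bytes of a 32-bit value. */
static uint32_t swap32(uint32_t v) {
    return (v << 24) | ((v & 0xff00u) << 8) | ((v >> 8) & 0xff00u) | (v >> 24);
}

/*
 * Hypothetical helper: copy 'count' 32-bit values from the loaded data into
 * 'dest', swapping bytes only if the file's byte order differs from the
 * host's. The extra copy is exactly the cost that memory-mappable formats
 * are meant to avoid.
 */
static void copyInt32Array(const uint32_t *src, uint32_t *dest, size_t count,
                           int fileIsBigEndian, int hostIsBigEndian) {
    size_t i;
    const int mustSwap = (fileIsBigEndian != 0) != (hostIsBigEndian != 0);

    assert(src != NULL && dest != NULL);

    for (i = 0; i < count; ++i) {
        dest[i] = mustSwap ? swap32(src[i]) : src[i];
    }
}
```

A real converter would need to know the layout of every structure in the file, since 16-bit fields, 32-bit fields and byte arrays each require different handling.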

"Relevant" properties are those that affect the portability of the data in the particular file.

For example, a flat (memory-mapped) binary data file that contains 16-bit and 32-bit integers and is created for a typical big-endian Unix machine can be used on an OS/390 system or any other big-endian machine.
If the file also contains char[] strings, then it can be easily shared among all big-endian and ASCII-based machines, but not with (e.g.) an OS/390.
OS/390 and OS/400 systems, however, could easily share such a data file created on either of these systems.

To make sure that the relevant platform properties of the data file and the loading machine match, the udata_openChoice() API function should be used with a UDataMemoryIsAcceptable() function that checks for these properties.

Some data file loading mechanisms prevent using data files generated on a different platform to begin with, especially data files packaged as DLLs (shared libraries).

Writing a binary data file

This is a raw draft.

... Use icu/source/tools/toolutil/unewdata.h|.c to write data files, can include a copyright statement or other comment...See icu/source/tools/gennames...