ICU - Formats and API for Binary Data Files

This is a raw draft.

Finding ICU data

ICU data, when stored in files, is loaded from the file system directory that is returned by u_getDataDirectory(). That directory is determined sequentially by

When ICU data is loaded using the udata API functions, a defined sequence of file locations and entry point names is used to locate the data. See the description in icu/source/common/udata.h for details. Note that the exact lookup depends on the implementation of this API and may differ by platform and by build configuration. See also icu/source/common/udata.c for implementation details.

Binary Data File Formats

Data files for ICU, and for applications that load their data with ICU, should have a memory-mappable format. This means that the data should be laid out in the file in an immediately useful way, so that the code that uses the data does not need to parse it or copy it to allocated memory and build additional structures (like Hashtables). Here are some points to consider:
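As an illustration of what "immediately useful" can mean, here is a minimal sketch of a flat lookup table that is used in place. The layout (a count followed by sorted keys and then values) and the names FlatTable, flat_table_attach() and flat_table_lookup() are hypothetical and are not an ICU format or API. The sketch assumes the image is already in memory, 4-byte aligned, and in the host's byte order.

```c
#include <stdint.h>

/* Hypothetical flat layout (not an ICU format):
 *   uint32_t count;
 *   uint32_t keys[count];     sorted ascending
 *   uint32_t values[count];
 * The table is used where it lies: no parsing, no allocation, no hashtable. */
typedef struct {
    const uint32_t *keys;
    const uint32_t *values;
    uint32_t count;
} FlatTable;

/* Point a FlatTable at an already loaded (for example memory-mapped) image.
 * Only pointers into the image are stored; nothing is copied. */
static FlatTable flat_table_attach(const void *image) {
    const uint32_t *words = (const uint32_t *)image;
    FlatTable t;
    t.count = words[0];
    t.keys = words + 1;
    t.values = t.keys + t.count;
    return t;
}

/* Binary search directly over the keys in the image.
 * Returns 1 and stores the value if the key is present, otherwise 0. */
static int flat_table_lookup(const FlatTable *t, uint32_t key, uint32_t *value) {
    uint32_t lo = 0, hi = t->count;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (t->keys[mid] < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if (lo < t->count && t->keys[lo] == key) {
        *value = t->values[lo];
        return 1;
    }
    return 0;
}
```

Sorting the keys when the file is generated is the design choice that replaces a runtime hashtable: the lookup cost moves to a binary search over read-only memory.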

Platform-dependency of Binary Data Files

Data files with formats as described above should be portable among machines with the same set of relevant properties:

All of these properties can be verified by checking the UDataInfo structure of the data, which is best done in a UDataMemoryIsAcceptable() function passed to the udata_openChoice() API function.

If a data file is loaded on a machine whose relevant properties differ from those of the machine where the data file was generated, then the code using it could adapt by detecting the differences and reformatting the data on the fly or in a copy in memory. This would improve portability of the data files but significantly decrease performance.
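A minimal sketch of the "reformat in a copy" approach for the byte-order case follows. The helper names swap16(), swap32() and copy_words32() are hypothetical and are not ICU functions; the sketch assumes the caller has already determined that the file's byte order differs from the host's.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of a 16-bit value. */
static uint16_t swap16(uint16_t x) {
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t x) {
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) << 8)  |
           ((x & 0x00FF0000u) >> 8)  |
           ((x & 0xFF000000u) >> 24);
}

/* Copy `count` 32-bit words from a loaded file image into memory provided
 * by the caller, swapping each word when `needsSwap` is nonzero.
 * The copy and the per-word work are the performance cost that a
 * directly usable, memory-mapped file avoids. */
static void copy_words32(uint32_t *dest, const uint32_t *src,
                         size_t count, int needsSwap) {
    size_t i;
    for (i = 0; i < count; ++i) {
        dest[i] = needsSwap ? swap32(src[i]) : src[i];
    }
}
```

A real file usually mixes 16-bit and 32-bit fields, so the adapting code has to know the layout of each section in order to apply the matching swap.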

"Relevant" properties are those that affect the portability of the data in the particular file.

For example, a flat (memory-mapped) binary data file that contains 16-bit and 32-bit integers and is created for a typical big-endian Unix machine can be used on an OS/390 system or any other big-endian machine.
If the file also contains char[] strings, then it can easily be shared among all big-endian, ASCII-based machines, but not with, for example, an OS/390 system, which is EBCDIC-based.
OS/390 and OS/400 systems, however, could easily share such a data file with each other.
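The reasoning in this example can be sketched as a small check. The function names and the CHARSET_* constants below are hypothetical; the values 0 for ASCII and 1 for EBCDIC follow ICU's U_ASCII_FAMILY and U_EBCDIC_FAMILY. ICU itself determines these properties at compile time (U_IS_BIG_ENDIAN, U_CHARSET_FAMILY); the runtime probes here are only for illustration.

```c
#include <stdint.h>

enum { CHARSET_ASCII = 0, CHARSET_EBCDIC = 1, CHARSET_UNKNOWN = 255 };

/* Returns 1 on a big-endian host, 0 on a little-endian host,
 * by looking at the first byte of a known 16-bit value. */
static int host_is_big_endian(void) {
    const uint16_t probe = 0x0102;
    return *(const uint8_t *)&probe == 0x01;
}

/* Classify the host's execution character set by the code of 'A'. */
static int host_charset_family(void) {
    if ('A' == 0x41) return CHARSET_ASCII;
    if ('A' == 0xC1) return CHARSET_EBCDIC;
    return CHARSET_UNKNOWN;
}

/* A file with only integers needs matching byte order.
 * A file that also contains char[] strings needs a matching
 * charset family as well. */
static int file_is_usable(int fileIsBigEndian, int fileCharsetFamily,
                          int fileHasCharStrings) {
    if ((fileIsBigEndian != 0) != host_is_big_endian()) {
        return 0;
    }
    if (fileHasCharStrings && fileCharsetFamily != host_charset_family()) {
        return 0;
    }
    return 1;
}
```

The charset family is only "relevant" when the file actually contains char[] strings, which is why it is checked conditionally.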

To make sure that the relevant platform properties of the data file and the loading machine match, the udata_openChoice() API function should be used with a UDataMemoryIsAcceptable() function that checks for these properties.
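The following is a sketch of such a check, written so that it compiles without ICU. DataInfo is a local stand-in that mirrors the fields of UDataInfo from icu/source/common/udata.h; Expected and is_acceptable() are hypothetical names, and int is used where ICU uses UBool. The parameter list follows the shape of UDataMemoryIsAcceptable (context, type, name, info).

```c
#include <stdint.h>
#include <string.h>

/* Local stand-in mirroring the fields of ICU's UDataInfo,
 * so that this sketch builds without the ICU headers. */
typedef struct {
    uint16_t size;
    uint16_t reservedWord;
    uint8_t isBigEndian;
    uint8_t charsetFamily;
    uint8_t sizeofUChar;
    uint8_t reservedByte;
    uint8_t dataFormat[4];
    uint8_t formatVersion[4];
    uint8_t dataVersion[4];
} DataInfo;

/* Hypothetical context: what the loading code requires of the file. */
typedef struct {
    uint8_t isBigEndian;
    uint8_t charsetFamily;
    uint8_t sizeofUChar;
    uint8_t dataFormat[4];
    uint8_t formatMajor;
} Expected;

/* Accept the data only if its platform properties, format signature
 * and major format version all match what the caller expects.
 * Returns 1 to accept, 0 to reject. */
static int is_acceptable(void *context, const char *type, const char *name,
                         const DataInfo *pInfo) {
    const Expected *want = (const Expected *)context;
    (void)type;  /* not needed for this check */
    (void)name;
    return pInfo->size >= 20 &&
           pInfo->isBigEndian == want->isBigEndian &&
           pInfo->charsetFamily == want->charsetFamily &&
           pInfo->sizeofUChar == want->sizeofUChar &&
           memcmp(pInfo->dataFormat, want->dataFormat, 4) == 0 &&
           pInfo->formatVersion[0] == want->formatMajor;
}
```

With ICU itself, the callback would take a const UDataInfo * and return UBool, and it would be passed together with its context pointer to udata_openChoice(path, type, name, isAcceptable, context, &errorCode). The expected values would typically come from U_IS_BIG_ENDIAN, U_CHARSET_FAMILY and U_SIZEOF_UCHAR rather than from a context structure.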

Some data file loading mechanisms prevent using data files generated on a different platform to begin with, especially data files packaged as DLLs (shared libraries).

Writing a binary data file

... Use icu/source/tools/toolutil/unewdata.h|.c to write data files; these can include a copyright statement or other comment...