ICU - Formats and API for Binary Data Files

Finding ICU data

ICU data, when stored in files, is loaded from the file system directory that is returned by u_getDataDirectory(). That directory is determined sequentially by

Common data, single files, extensibility, and search sequence

ICU data consists of several hundred pieces of data like converter mapping tables, locale resource bundles, break iterator and collation rules and dictionaries, and so on. During the build process, they are compiled into binary, memory-mappable files with a general structure conforming to the recommendations below.

For performance and ease of installation, all of these elements are then typically combined into one single, common data file with a Table of Contents listing all of its elements. This data file can be in one of four formats:

  1. A binary, memory-mappable file with the same general structure and a Table of Contents with offsets to the data elements that are copied into this common file.
  2. A shared library (DLL) that contains one entry point with exactly the same structure as the above file.
  3. A shared library (DLL) that contains one entry point to a small structure with a Table of Contents with pointers to the other data elements that have been linked into the same library. The pointers are resolved by the linker and/or loader. Each data element may or may not also be exported with its own entry point.
  4. A shared library (DLL) that contains an entry point per data element but no explicit Table of Contents data structure. Instead, the list of entry points with the system API to get an address for an entry point serves implicitly as the Table of Contents mechanism.
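
As a rough illustration of format 1, the following sketch looks up a data element by name in an offset-based Table of Contents. The layout (a count followed by name/data offset pairs sorted by name) and the names TocEntry and toc_find() are assumptions made for this example; they are not taken from the ICU sources.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, for illustration only:
 *   uint32_t count;
 *   TocEntry entries[count];      sorted by element name
 *   ... NUL-terminated names and the data elements themselves ...
 * All offsets are relative to the start of the Table of Contents. */
typedef struct {
    uint32_t nameOffset;
    uint32_t dataOffset;
} TocEntry;

/* Return a pointer to the named element inside the image, or NULL. */
const void *toc_find(const void *toc, const char *name) {
    const uint8_t *base = (const uint8_t *)toc;
    uint32_t count;
    memcpy(&count, base, sizeof count);   /* memcpy avoids alignment assumptions */

    uint32_t lo = 0, hi = count;          /* binary search in [lo, hi) */
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        TocEntry e;
        memcpy(&e, base + sizeof count + (size_t)mid * sizeof e, sizeof e);

        int cmp = strcmp(name, (const char *)(base + e.nameOffset));
        if (cmp == 0) {
            return base + e.dataOffset;   /* used in place, nothing is copied */
        }
        if (cmp < 0) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    return NULL;
}
```

Because the entries hold offsets rather than pointers, the same image works wherever it is mapped; format 3 would store pointers instead and rely on the linker or loader to resolve them.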

Data is loaded using the udata API functions by first looking in the common data file. If no common file is loaded yet, then ICU tries to load it first as a shared library and, failing that, as a memory-mappable file. Separate data files can also be added; they are loaded if no data element with the same name is found in the common file. The entire process of finding and loading a data element on most platforms amounts to the following:

  1. Load or use the common data file as follows:
    1. Use previously loaded, cached common data. This may have been set by udata_setCommonData().
    2. Attempt to load the common data from a shared library (DLL); locate the shared library first in the folder u_getDataDirectory(), then without a folder specification.
    3. Attempt to load the common data by memory-mapping a common data file with a Table of Contents structure; locate the file first in the folder u_getDataDirectory(), then without a folder specification.
  2. If there is a common data file, then try to find the data element in its Table of Contents according to the format of the common file.
  3. If the data is not found in the common data, then attempt to load it directly by memory-mapping it as a separate file; locate the file first in the folder u_getDataDirectory(), then without a folder specification.
This process ends as soon as the data is found.
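
The order of the steps above can be modeled as a small decision function. This is only a sketch: the Availability structure describes what a hypothetical platform is able to find, and none of the names below are part of the udata API.

```c
#include <stdbool.h>

/* Hypothetical description of what can be found on a given system. */
typedef struct {
    bool cachedCommon;         /* step 1.1: common data already loaded or set */
    bool libraryInDataDir;     /* step 1.2: shared library in the data directory */
    bool libraryNoDir;         /* step 1.2: shared library without a folder */
    bool commonFileInDataDir;  /* step 1.3: mappable common file in the data directory */
    bool commonFileNoDir;      /* step 1.3: mappable common file without a folder */
    bool itemInToc;            /* step 2: element is listed in the Table of Contents */
    bool singleFileInDataDir;  /* step 3: separate file in the data directory */
    bool singleFileNoDir;      /* step 3: separate file without a folder */
} Availability;

typedef enum { COMMON_NONE, COMMON_CACHED, COMMON_LIBRARY, COMMON_MAPPED_FILE } CommonSource;
typedef enum { FOUND_NOWHERE, FOUND_IN_COMMON_DATA, FOUND_AS_SINGLE_FILE } Result;

/* Step 1: the first available source of common data wins. */
CommonSource pick_common(const Availability *a) {
    if (a->cachedCommon) return COMMON_CACHED;
    if (a->libraryInDataDir || a->libraryNoDir) return COMMON_LIBRARY;
    if (a->commonFileInDataDir || a->commonFileNoDir) return COMMON_MAPPED_FILE;
    return COMMON_NONE;
}

/* Steps 2 and 3: look in the common data first, then fall back to a single file. */
Result find_item(const Availability *a) {
    if (pick_common(a) != COMMON_NONE && a->itemInToc) return FOUND_IN_COMMON_DATA;
    if (a->singleFileInDataDir || a->singleFileNoDir) return FOUND_AS_SINGLE_FILE;
    return FOUND_NOWHERE;
}
```

In this model a separate file is consulted only when the element is missing from the common data, which matches the extensibility described above.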

If the data is not ICU's data itself, but application data like application-specific resource bundles, then the process is almost the same, except for

For more details, see icu/source/common/udata.h. Note that the exact data finding depends on the implementation of this API and may differ by platform. See also icu/source/common/udata.c for implementation details.

Setting the ICU data pointer

An application that uses ICU may choose to find and load the ICU data itself and provide the ICU library with a pointer to it. This may be useful in very restricted environments, when getenv(), LIBPATH and many system services may be unavailable. It also makes it possible for an application to have installation settings only for itself, without special installation for ICU, since ICU would then not rely on its own settings and capabilities.
The common data can be in any of the formats with an explicit Table of Contents described above; a shared library without a Table of Contents (with only entry-point-based lookup) cannot be used. For details, see the function udata_setCommonData() in udata.h.
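
A minimal sketch of the application side, assuming the common data is available as a plain file: the helper read_whole_file() and the path in the comment are made up for this example, and the call to udata_setCommonData() is shown only as a comment so that the block compiles without ICU.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a malloc()ed buffer. Returns NULL on failure or for
 * an empty file; on success the caller owns the buffer and *size is its length. */
void *read_whole_file(const char *path, size_t *size) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    void *buf = NULL;
    long len;
    if (fseek(f, 0, SEEK_END) == 0 && (len = ftell(f)) > 0 &&
        fseek(f, 0, SEEK_SET) == 0 && (buf = malloc((size_t)len)) != NULL) {
        if (fread(buf, 1, (size_t)len, f) == (size_t)len) {
            *size = (size_t)len;
        } else {
            free(buf);       /* short read: give up rather than hand over partial data */
            buf = NULL;
        }
    }
    fclose(f);
    return buf;
}

/* Hypothetical use in an application that links against ICU:
 *
 *   size_t n;
 *   void *data = read_whole_file("/opt/myapp/data/common.dat", &n);  // made-up path
 *   UErrorCode status = U_ZERO_ERROR;
 *   udata_setCommonData(data, &status);
 *   // keep the buffer allocated for as long as ICU may use it
 */
```

The pointer would presumably have to be set before ICU loads any data on its own; udata.h is the place to check the exact requirements.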

Porting the ICU data loading to more platforms - help wanted

The data loading as described above is complete for Windows (Win32) and a number of POSIX-style platforms. On platforms that do not support dynamic loading of shared libraries (DLLs), only memory-mapping is used.
Note that shared libraries can be easier to find because of the system support for them, while memory-mappable files are more portable.

Where memory-mapping is not available, ICU instead uses simple file access with fopen(), fread() and so on, which is much less efficient:
Loading a shared library or memory-mapping a file typically results in shared, demand-paged, virtual memory, while simple file access results in reading the entire file into each ICU-using process's memory.
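
For comparison, a platform-specific mapping function on a POSIX-style system might look roughly like the sketch below. The name map_file_readonly() is an assumption for this example and is not the function used in udata.c; a Win32 port would use CreateFileMapping() and MapViewOfFile() instead of mmap().

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only. Returns NULL on failure or for an empty file; on
 * success *size holds the mapped length and the caller later calls munmap(). */
const void *map_file_readonly(const char *path, size_t *size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    void *p = MAP_FAILED;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        /* Pages are read on demand and can be shared between processes. */
        p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    }
    close(fd);               /* the mapping stays valid after the descriptor is closed */

    if (p == MAP_FAILED) return NULL;
    *size = (size_t)st.st_size;
    return p;
}
```

Unlike the fopen()/fread() approach, nothing is copied into private memory here; only the pages that are actually touched are brought in.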

Similarly, the fastest way to build a shared library (DLL) is to build the common, memory-mappable file and to turn it into a .obj (.o) file directly to feed it into the linker. This is currently only done on Windows.

For best performance, ICU needs efficient mechanisms for finding and loading its own data and that of its applications. Right now, this means that we are looking for more implementations of the platform-specific functions to load shared libraries and to memory-map files. At build time, it is also desirable to build .o files directly from raw data on more platforms.

Binary Data File Formats

Data files for ICU, and for applications that load their data with ICU, should have a memory-mappable format. This means that the data should be laid out in the file in an immediately useful way, so that the code that uses the data does not need to parse it or copy it to allocated memory and build additional structures (like Hashtables). Here are some points to consider:

Platform-dependency of Binary Data Files

Data files with formats as described above should be portable among machines with the same set of relevant properties:

All of these properties can be verified by checking the UDataInfo structure of the data, which is best done in a UDataMemoryIsAcceptable() function passed into the udata_openChoice() API function.

If a data file is loaded on a machine whose relevant properties differ from those of the machine where the data file was generated, then the code that uses the data could adapt by detecting the differences and reformatting the data on the fly or in a copy in memory. This would improve portability of the data files but significantly decrease performance.

"Relevant" properties are those that affect the portability of the data in the particular file.

For example, a flat (memory-mapped) binary data file that contains 16-bit and 32-bit integers and is created for a typical big-endian Unix machine can be used on an OS/390 system or any other big-endian machine.
If the file also contains char[] strings, then it can easily be shared among all big-endian, ASCII-based machines, but not with an EBCDIC-based system such as OS/390.
OS/390 and OS/400 systems, however, could easily share such a data file created on either of these systems.

To make sure that the relevant platform properties of the data file and the loading machine match, the udata_openChoice() API function should be used with a UDataMemoryIsAcceptable() function that checks for these properties.
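
The kind of check such a function performs can be sketched as follows. To keep the example self-contained, PlatformInfo is a stand-in for the platform-related fields of UDataInfo (the real structure is declared in udata.h), and all names below are assumptions for illustration rather than ICU API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in values and fields; hypothetical names, modeled on the properties
 * discussed above (byte order and charset family). */
typedef enum { CHARSET_ASCII = 0, CHARSET_EBCDIC = 1 } CharsetFamily;

typedef struct {
    bool    isBigEndian;     /* byte order the file was written in */
    uint8_t charsetFamily;   /* one of CharsetFamily */
} PlatformInfo;

/* Detect the byte order of the running machine. */
static bool host_is_big_endian(void) {
    const uint16_t probe = 0x0102;
    return *(const uint8_t *)&probe == 0x01;
}

/* 'A' is 0x41 in ASCII-based charsets and 0xC1 in EBCDIC. */
static CharsetFamily host_charset_family(void) {
    return 'A' == 0x41 ? CHARSET_ASCII : CHARSET_EBCDIC;
}

/* Accept a file only if its relevant properties match the host.
 * usesCharStrings says whether the charset family is relevant for this format. */
bool is_acceptable(const PlatformInfo *file, bool usesCharStrings) {
    if (file->isBigEndian != host_is_big_endian()) {
        return false;
    }
    if (usesCharStrings && file->charsetFamily != host_charset_family()) {
        return false;
    }
    return true;
}
```

A file holding only integers would pass usesCharStrings as false, so that a big-endian ASCII file remains usable on a big-endian EBCDIC machine, as in the OS/390 example above.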

Some data file loading mechanisms prevent using data files generated on a different platform to begin with, especially data files packaged as DLLs (shared libraries).

Writing a binary data file

This is a raw draft.

... Use icu/source/tools/toolutil/unewdata.h|.c to write data files, can include a copyright statement or other comment...See icu/source/tools/gennames...