UnicodeTools

This file provides instructions for building and running the UnicodeTools, which
can be used to:

WARNING!!

Instructions:

0. You will need to get ICU4J on your system, using CVS.

The rest of these instructions assume that you have set up CVS to load the ICU4J project into C:\ICU4J

You need both the main icu4j and a subproject called unicodetools. See: http://ibm.com/software/globalization/icu/repository.jsp. Inside unicodetools, look at com/ibm/text. The main directories of interest are UCD, UCA and utility.

0a. If you are using Eclipse as your IDE, see the instructions at http://oss.software.ibm.com/icu/docs/eclipse_howto/eclipse_howto.html

Set up Eclipse to build two projects: ICU4J and UnicodeTools:

Project Name: ICU4J
Directory: C:\ICU4J\icu4j
Default Output Folder: ICU4J/classes

Project Name: UnicodeTools
Directory: C:\ICU4J\unicodetools
Default Output Folder: UnicodeTools/classes

After Eclipse is set up with these, exclude certain files from UnicodeTools:

Right-Click UnicodeTools > Properties > Java Build Path > Exclusions
com/ibm/rbm/
com/ibm/text/utility/UnicodeMapInt.java
com/ibm/text/utility/TestUtility.java
com/ibm/text/UCD/GenerateThaiBreaks-old.java
com/ibm/text/UCD/ProcessUnihan.java
com/ibm/text/UCA/WriteHTMLCollation.java

UnicodeTools must also include the ICU4J project, with

Right-Click UnicodeTools > Properties > Java Build Path > Projects

1. In UCD, you must edit UCD_Types.java at the top, to set the directories for the build:

public static final String DATA_DIR = "C:\\DATA\\";
public static final String UCD_DIR = BASE_DIR + "UCD\\";
public static final String BIN_DIR = DATA_DIR + "BIN\\";
public static final String GEN_DIR = DATA_DIR + "GEN\\";

Make sure that each of these directories exists. Also make sure that the following
exist:

<GEN_DIR>/DerivedData
<GEN_DIR>/DerivedData/ExtractedProperties
<UCD_DIR>/EXTRAS-Update
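
If you prefer not to create the directories by hand, something like the following could do it. This is only a sketch: the class MakeDataDirs and its method are hypothetical (they are not part of UnicodeTools), and the root C:\DATA is assumed from the default settings shown above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of UnicodeTools.
class MakeDataDirs {
    // Directories named in the instructions, relative to the data root.
    static final String[] REQUIRED = {
        "BIN",
        "GEN/DerivedData/ExtractedProperties",
        "UCD/EXTRAS-Update",
    };

    // Creates every required directory under root (parents included).
    // Directories that already exist are left untouched.
    static List<Path> create(Path root) throws IOException {
        List<Path> made = new ArrayList<>();
        for (String rel : REQUIRED) {
            made.add(Files.createDirectories(root.resolve(rel)));
        }
        return made;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default root; pass a different one as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "C:\\DATA");
        for (Path p : create(root)) {
            System.out.println("ok: " + p);
        }
    }
}
```

Creating GEN/DerivedData/ExtractedProperties also creates GEN and GEN/DerivedData, so the list does not need to name them separately.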

2. Download all of the UnicodeData files for each version into UCD_DIR.

The folder names must be of the form: "3.2.0-Update", so rename the folders that
you download from the Unicode site to this format.
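
A quick way to spot folders that still need renaming is sketched below. The class UpdateFolderName is hypothetical, and the pattern is only one assumed reading of the rule (three numeric fields followed by "-Update"); the tools themselves may accept or reject names differently.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of UnicodeTools.
class UpdateFolderName {
    // Assumed reading of the rule: three numeric fields, then "-Update".
    private static final Pattern FORM =
        Pattern.compile("\\d+\\.\\d+\\.\\d+-Update");

    // True if the folder name already has the form "3.2.0-Update".
    static boolean isValid(String folderName) {
        return FORM.matcher(folderName).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"3.2.0-Update", "4.0-Update"}) {
            System.out.println(name + " -> " + (isValid(name) ? "ok" : "rename"));
        }
    }
}
```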

2a. Ensure Complete Release

If you are downloading any "incomplete" release (one that does not contain a complete set of data files for that release), you also need to download the previous complete release. Most of the N.M-Update directories are complete, *except*:

4.0-Update, which does not contain a copy of Unihan.txt and some other files
3.1-Update, which does not contain a copy of BidiMirroring.txt
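
To see what a release folder lacks, you can compare its file names against those of the previous complete release. The sketch below is a hypothetical helper (ReleaseDiff is not part of UnicodeTools). It assumes that a data file is identified by the part of its name before the first "-" (or before the first "." if there is no "-"), and the file lists in main are illustrative only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of UnicodeTools.
class ReleaseDiff {
    // Strips the version suffix and extension:
    // "Unihan-3.2.0.txt" -> "Unihan", "UnicodeData.txt" -> "UnicodeData".
    static String baseName(String fileName) {
        int dash = fileName.indexOf('-');
        int dot = fileName.indexOf('.');
        int cut = dash >= 0 ? dash : (dot >= 0 ? dot : fileName.length());
        return fileName.substring(0, cut);
    }

    // Base names present in the previous release but absent from the current one.
    static Set<String> missing(List<String> previous, List<String> current) {
        Set<String> result = new TreeSet<>();
        for (String name : previous) {
            result.add(baseName(name));
        }
        for (String name : current) {
            result.remove(baseName(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative file lists only, not the real contents of any release.
        List<String> previous = Arrays.asList("Unihan-3.2.0.txt", "UnicodeData-3.2.0.txt");
        List<String> current = Arrays.asList("UnicodeData-4.0.0.txt");
        System.out.println(missing(previous, current));
    }
}
```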

Also, make the following changes to UnicodeData for 1.1.5:

Delete

3400;HANGUL SYLLABLE KIYEOK A;Lo;0;L;1100 1161;;;;N;;;;;
4DFF;HANGUL SYLLABLE MIEUM WEO RIEUL-THIEUTH;Lo;0;L;1106 116F 11B4;;;;N;;;;;
4E00;;Lo;0;L;;;;;N;;;;;

Add:

4E00;;Lo;0;L;;;;;N;;;;;
9FA5;;Lo;0;L;;;;;N;;;;;
E000;;Co;0;L;;;;;N;;;;;
F8FF;;Co;0;L;;;;;N;;;;;

And from a later version of Unicode, add:

F900;CJK COMPATIBILITY IDEOGRAPH-F900;Lo;0;L;8C48;;;;N;;;;;
...
FA2D;CJK COMPATIBILITY IDEOGRAPH-FA2D;Lo;0;L;9DB4;;;;N;;;;;

2b. UCA data

If you are building any of the UCA tools, you need to get a copy of the UCA data file
from http://www.unicode.org/reports/tr10/#AllKeys. The default location for this is:

BASE_DIR + "Collation\allkeys" + VERSION + ".txt".

If you have it in a different location, change the value of KEYS in UCA.java, and
the value of BASE_DIR.

2c. Here is an example of the default directory structure with files:

C:/DATA/
        BIN/
        Collation/
            allkeys-3.1.1.txt
        GEN/
            DerivedData/
                ExtractedProperties/
        UCD/
            3.0.0-Update/
                Unihan-3.2.0.txt
                ...
            3.0.1-Update/
                ...
            3.1.0-Update/
                ...
            3.1.1-Update/
                ...
            3.2.0-Update/
                ...
            4.0.0-Update/
                ArabicShaping-4.0.0d14b.txt
                BidiMirroring-4.0.0d1b.txt
                ...
            EXTRAS-Update/

3. Versions

All of the following have "version X" in the options you give to Java (either on the command line, or in the Eclipse 'run' options). If you want a specific version like 3.1.0, then you would write "version 3.1.0". If you want the latest version (4.1.0), you can omit the "version X".

4. To run UCD, use com.ibm.text.UCD.Main as your main class.

The working directory has to be C:\ICU4J\unicodetools\com\ibm\text\UCD
(In Eclipse you can also use ${workspace_loc:UnicodeTools/com/ibm/text/UCD}, which abstracts away the location.)

The same for UCA:

main: com.ibm.text.UCA.Main
directory: C:\ICU4J\unicodetools\com\ibm\text\UCA

4a. BIN

For each version, the tools build a set of binary data in BIN that contains the information for that release. This is done automatically, or you can do it manually with the options

version X build

This builds a compressed format of all the UCD data (except blocks and Unihan) into the BIN directory. Don't worry about the voluminous console messages, unless one says "FAIL".

You must do this manually if you change any of the data files in that version!

Note: if for any reason you modify the binary format of the BIN files, you also have to bump the value in that file:

static final byte BINARY_FORMAT = 8; // bumped if binary format of UCD changes
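
The reason for bumping the value is that a reader can then reject BIN files written in an older layout. The sketch below illustrates that idea only. It is not the tools' own reader or writer: the class FormatStamp is hypothetical, and storing the format byte at the very start of the file is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical illustration, not the tools' own code.
class FormatStamp {
    static final byte BINARY_FORMAT = 8; // value copied from the line above

    // Writes the format byte first (assumed position), then the payload.
    static byte[] write(byte format, byte[] payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(format);
            out.write(payload);
        }
        return bytes.toByteArray();
    }

    // True only if the stored format byte equals the expected one.
    static boolean isCurrent(byte[] file, byte expected) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(file))) {
            return in.readByte() == expected;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] stale = write((byte) 7, new byte[] {1, 2, 3});
        byte[] fresh = write(BINARY_FORMAT, new byte[] {1, 2, 3});
        System.out.println("stale accepted: " + isCurrent(stale, BINARY_FORMAT));
        System.out.println("fresh accepted: " + isCurrent(fresh, BINARY_FORMAT));
    }
}
```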

4b. To build the Unicode files for a particular version X, run the Main with the following argument:

MakeUnicodeFiles.generateFile

This will execute the commands in the file MakeUnicodeFiles.txt.

You will edit that file if you want a different 'd' version for the files, OR if you want to change which files are built. At the top of the file you will see the following text:

Generate: 
DeltaVersion: 7

4c. To change which files are built, put any number of regular expressions separated by spaces after Generate. For example:

Generate: .*line.* prop.*

The matching is case-insensitive.
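
The sketch below shows how such a line could select names. It is an illustration, not the tools' own code: the class GenerateFilter is hypothetical, and it assumes that an expression must match the whole name and that an empty Generate line selects everything. The names in main are examples only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not the tools' own code.
class GenerateFilter {
    // Splits the text after "Generate:" on whitespace and compiles each
    // piece as a case-insensitive regular expression.
    static List<Pattern> parse(String generateValue) {
        List<Pattern> patterns = new ArrayList<>();
        for (String piece : generateValue.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                patterns.add(Pattern.compile(piece, Pattern.CASE_INSENSITIVE));
            }
        }
        return patterns;
    }

    // Assumption: a name is selected when any expression matches the whole
    // name, and an empty list selects every name.
    static boolean selects(List<Pattern> patterns, String name) {
        if (patterns.isEmpty()) {
            return true;
        }
        for (Pattern p : patterns) {
            if (p.matcher(name).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = parse(".*line.* prop.*");
        for (String name : new String[] {"LineBreak", "PropList", "Scripts"}) {
            System.out.println(name + ": " + selects(patterns, name));
        }
    }
}
```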

4d. To change the 'd' number that is appended to the generated file names, change the DeltaVersion.

4e. To run basic consistency checking, run:

version X verify

Don't worry about any console messages except those that say FAIL.

4f. Output

The files will be generated in the GEN directories.

5. To run UCA, use com.ibm.text.UCA.Main as your main class.

5a. To build all the UCA files used by ICU, use the option:

java <UCA>Main ICU

6. To build all the charts, use the UCA project, with options: normalizationChart caseChart scriptChart indexChart