This file provides instructions for building and running the UnicodeTools, which
can be used to:
WARNING!!
The rest of this file assumes that you have set up CVS so that the ICU4J project is loaded into
C:\ICU4J
You need both the main icu4j and a subproject called unicodetools. See:
http://www.ibm.com/software/globalization/icu/repository.jsp. Inside unicodetools, look at com/ibm/text. The
main directories of interest are UCD, UCA and utility.
Set up Eclipse to build two projects: ICU4J and UnicodeTools:
Project Name: ICU4J
Directory: C:\ICU4J\icu4j
Default output folder = ICU4J/classes
Project Name: unicodetools
Create project from existing source: C:\ICU4J\unicodetools
Default Output Folder: unicodetools/classes
After Eclipse is set up with these, exclude certain files from unicodetools:
Right-Click UnicodeTools > Properties > Java Build Path > Exclusions
com/ibm/rbm/
com/ibm/text/utility/UnicodeMapInt.java
com/ibm/text/utility/TestUtility.java
com/ibm/text/UCD/GenerateThaiBreaks-old.java
com/ibm/text/UCD/ProcessUnihan.java
com/ibm/text/UCA/WriteHTMLCollation.java
UnicodeTools must also include the ICU4J project, with
Right-Click UnicodeTools > Properties > Java Build Path > Projects
public static final String DATA_DIR = "C:\\DATA\\";
public static final String UCD_DIR = BASE_DIR + "UCD\\";
public static final String BIN_DIR = DATA_DIR + "BIN\\";
public static final String GEN_DIR = DATA_DIR + "GEN\\";
Make sure that each of these directories exists. Also make sure that the following
exist:
<GEN_DIR>/DerivedData
<GEN_DIR>/DerivedData/ExtractedProperties
<UCD_DIR>/EXTRAS-Update
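The directories listed above can also be created from a small program instead of by hand. The following is only a sketch: the class name MakeDataDirs and its methods are hypothetical (they are not part of UnicodeTools), and it assumes that UCD, BIN and GEN all sit directly under one data root such as C:\DATA.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of UnicodeTools.
class MakeDataDirs {

    /** Lists the directories this readme asks for, resolved under dataRoot. */
    static List<Path> requiredDirs(Path dataRoot) {
        Path ucd = dataRoot.resolve("UCD"); // assumed location of UCD_DIR
        Path bin = dataRoot.resolve("BIN"); // BIN_DIR
        Path gen = dataRoot.resolve("GEN"); // GEN_DIR
        List<Path> dirs = new ArrayList<>();
        dirs.add(ucd);
        dirs.add(bin);
        dirs.add(gen);
        dirs.add(gen.resolve("DerivedData"));
        dirs.add(gen.resolve("DerivedData").resolve("ExtractedProperties"));
        dirs.add(ucd.resolve("EXTRAS-Update"));
        return dirs;
    }

    /** Creates every required directory; existing directories are left alone. */
    static void createAll(Path dataRoot) throws IOException {
        for (Path dir : requiredDirs(dataRoot)) {
            Files.createDirectories(dir);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java MakeDataDirs <data root, e.g. C:\\DATA>");
            return;
        }
        Path root = Paths.get(args[0]);
        createAll(root);
        for (Path dir : requiredDirs(root)) {
            System.out.println("ok: " + dir);
        }
    }
}
```

Running it twice is harmless, since Files.createDirectories does nothing for a directory that already exists.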
The folder names must be of the form: "3.2.0-Update", so rename the folders on the
Unicode site to this format. If the
folder contains ucd, then make the contents of that directory be the contents of
the x.x.x-Update directory. That is, each directory will directly contain files
like PropList....txt
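The naming rule can be checked mechanically before running any of the tools. This is a sketch only: the class UpdateFolderNames is hypothetical, and padding a short name such as "4.0-Update" with ".0" to get "4.0.0-Update" is an assumption based on the three-field form shown above. The special EXTRAS-Update folder is deliberately not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of UnicodeTools.
class UpdateFolderNames {

    // The form this readme asks for, e.g. "3.2.0-Update".
    private static final Pattern EXPECTED =
            Pattern.compile("\\d+\\.\\d+\\.\\d+-Update");

    /** True if the folder name already has the three-field form. */
    static boolean isExpectedForm(String folderName) {
        return EXPECTED.matcher(folderName).matches();
    }

    /**
     * Converts names such as "4.0-Update" or "3.2.0" to the three-field form.
     * Missing fields are padded with ".0" (an assumption, see the text above).
     */
    static String toExpectedForm(String name) {
        String suffix = "-Update";
        String version = name.endsWith(suffix)
                ? name.substring(0, name.length() - suffix.length())
                : name;
        if (!version.matches("\\d+(\\.\\d+){0,2}")) {
            throw new IllegalArgumentException("Not a version-like name: " + name);
        }
        StringBuilder padded = new StringBuilder(version);
        for (int fields = version.split("\\.").length; fields < 3; fields++) {
            padded.append(".0");
        }
        return padded + suffix;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"3.2.0-Update", "4.0-Update", "3.1"}) {
            System.out.println(name + " -> " + toExpectedForm(name)
                    + " (already ok: " + isExpectedForm(name) + ")");
        }
    }
}
```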
If you are downloading any "incomplete" release (one that does not contain a complete set of data files for that release), you also need to download the previous complete release. Most of the N.M-Update directories are complete, *except*:
4.0-Update, which does not contain a copy of Unihan.txt and some other files
3.1-Update, which does not contain a copy of BidiMirroring.txt
Also, make the following changes to UnicodeData for 1.1.5:
Delete
3400;HANGUL SYLLABLE KIYEOK A;Lo;0;L;1100 1161;;;;N;;;;;
...
4DFF;HANGUL SYLLABLE MIEUM WEO RIEUL-THIEUTH;Lo;0;L;1106 116F 11B4;;;;N;;;;;
4E00;;Lo;0;L;;;;;N;;;;;
Add:
4E00;;Lo;0;L;;;;;N;;;;;
9FA5; ;Lo;0;L;;;;;N;;;;;
E000; ;Co;0;L;;;;;N;;;;;
F8FF; ;Co;0;L;;;;;N;;;;;
And from a later version of Unicode, add:
F900;CJK COMPATIBILITY IDEOGRAPH-F900;Lo;0;L;8C48;;;;N;;;;;
...
FA2D;CJK COMPATIBILITY IDEOGRAPH-FA2D;Lo;0;L;9DB4;;;;N;;;;;
If you are building any of the UCA tools, you need to get a copy of the UCA data file
from http://www.unicode.org/reports/tr10/#AllKeys. The default location for this is:
BASE_DIR + "Collation\allkeys" + VERSION + ".txt".
If you have it in a different location, change the value of KEYS in UCA.java, and
the value of BASE_DIR.
C://DATA/
    BIN/
    Collation/
        allkeys-3.1.1.txt
    GEN/
        DerivedData/
    UCD/
        3.0.0-Update/
            Unihan-3.2.0.txt
            ...
        3.0.1-Update/
            ...
        3.1.0-Update/
            ...
        3.1.1-Update/
            ...
        3.2.0-Update/
            ...
        4.0.0-Update/
            ArabicShaping-4.0.0d14b.txt
            BidiMirroring-4.0.0d1b.txt
            ...
        EXTRAS-Update/
All of the following have "version X" in the options you give to Java (either on the command line or in the Eclipse 'run' options). If you want a specific version like 3.1.0, then you would write "version 3.1.0". If you want the latest version (4.1.0), you can omit the "version X".
This file drives the production of the derived Unicode files. The first three lines contain parameters that you may want to modify at some times:
Generate: .*script.* // this is a regular expression. Use .* for all files
DeltaVersion: 10 // This gets appended to the file name. Pick 1+ the highest value in Public
CopyrightYear: 2006 // Pick the current year
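To get a feel for what a Generate expression selects, it can be tried against a list of names first. This sketch is not the tool's own code: the class GenerateFilter is hypothetical, the file names are illustrative, and it assumes the expression must match the whole name, ignoring case. How the tool actually applies the expression is not stated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of UnicodeTools.
class GenerateFilter {

    /**
     * Returns the names matched by the Generate expression.
     * Assumes whole-name, case-insensitive matching.
     */
    static List<String> select(String generateRegex, List<String> names) {
        Pattern pattern = Pattern.compile(generateRegex, Pattern.CASE_INSENSITIVE);
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Illustrative names only.
        List<String> names = Arrays.asList(
                "Scripts", "PropList", "PropertyValueAliases", "DerivedAge");
        System.out.println(select(".*script.*", names));
        System.out.println(select(".*", names));
    }
}
```

Note that with case-sensitive matching, .*script.* would not select a name that starts with a capital "S", so it is worth checking which behavior applies before relying on a narrow expression.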
build 5.0.0 MakeUnicodeFiles
-Xms512m -Xmx512m
Writing UCD_Data5.0.0
Data Size: 109,802
Wrote Data 109802
version X build
This builds a compressed format of all the UCD data (except blocks and Unihan) into the BIN directory. Don't worry about the voluminous console messages, unless one says "FAIL".
You have to manually do this if you change any of the data files in that version!
Note: if for any reason you modify the binary format of the BIN files, you also have to bump the value in that file:
static final byte BINARY_FORMAT = 8; // bumped if binary format of UCD changes
Diff_PropList-5.0.0d10.txt.bat
OLDER-Diff_PropList-5.0.0d10.txt.bat
UNCHANGED-Diff_PropertyValueAliases-5.0.0d10.txt.bat
C:\DATA\GEN\UnicodeInvariantResults.txt
The console will also list whether any problems were found. In the following case there was one failure:
ParseErrorCount=0 TestFailureCount=1
# Canonical decompositions (minus exclusions) must be identical across releases
[$Decomposition_Type:Canonical - $Full_Composition_Exclusion] = [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion]
FALSE
**** START Error Info ****
In [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], but not in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :
# Total code points: 0
Not in [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], but in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :
1B06 # Lo BALINESE LETTER AKARA TEDUNG
1B08 # Lo BALINESE LETTER IKARA TEDUNG
1B0A # Lo BALINESE LETTER UKARA TEDUNG
1B0C # Lo BALINESE LETTER RA REPA TEDUNG
1B0E # Lo BALINESE LETTER LA LENGA TEDUNG
1B12 # Lo BALINESE LETTER OKARA TEDUNG
1B3B # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
1B3D # Mc BALINESE VOWEL SIGN LA LENGA TEDUNG
1B40..1B41 # Mc [2] BALINESE VOWEL SIGN TALING TEDUNG..BALINESE VOWEL SIGN TALING REPA TEDUNG
1B43 # Mc BALINESE VOWEL SIGN PEPET TEDUNG
# Total code points: 11
In both [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], and in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :
00C0..00C5 # L& [6] LATIN CAPITAL LETTER A WITH GRAVE..LATIN CAPITAL LETTER A WITH RING ABOVE
00C7..00CF # L& [9] LATIN CAPITAL LETTER C WITH CEDILLA..LATIN CAPITAL LETTER I WITH DIAERESIS
00D1..00D6 # L& [6] LATIN CAPITAL LETTER N WITH TILDE..LATIN CAPITAL LETTER O WITH DIAERESIS
...
30F7..30FA # Lo [4] KATAKANA LETTER VA..KATAKANA LETTER VO
30FE # Lm KATAKANA VOICED ITERATION MARK
AC00..D7A3 # Lo [11172] HANGUL SYLLABLE GA..HANGUL SYLLABLE HIH
# Total code points: 12089
**** END Error Info ****
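If the invariant test is run from a script, the two counters printed on the console can be picked out automatically. The following is a sketch under assumptions: the class InvariantSummary is hypothetical, and it only assumes that the captured console text contains "ParseErrorCount=N" and "TestFailureCount=N" somewhere, as in the sample above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of UnicodeTools.
class InvariantSummary {

    /** Returns the value of "name=N" in the console text, or -1 if absent. */
    static int counter(String consoleText, String name) {
        Matcher m = Pattern.compile(Pattern.quote(name) + "=(\\d+)")
                .matcher(consoleText);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** True only if both counters are present and zero. */
    static boolean passed(String consoleText) {
        return counter(consoleText, "ParseErrorCount") == 0
                && counter(consoleText, "TestFailureCount") == 0;
    }

    public static void main(String[] args) {
        String sample = "ParseErrorCount=0 TestFailureCount=1";
        System.out.println("parse errors:  " + counter(sample, "ParseErrorCount"));
        System.out.println("test failures: " + counter(sample, "TestFailureCount"));
        System.out.println("passed:        " + passed(sample));
    }
}
```

A missing counter is treated as a failure rather than a pass, so truncated output is not mistaken for a clean run.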
-DSHOW_FILES
writeCollationValidityLog
It will create a file:
C:\DATA\GEN\collation\5.0.0\CheckCollationValidity.html
normalizationChart caseChart scriptChart indexChart
They will be built into
C:\DATA\GEN\charts
Once UCA is released, then copy those files up to the right spots in the Unicode site:
ICU
They will be built into:
C:\DATA\GEN\collation\5.0.0
version 4.1.0 ICU
Or whatever the last version was.