scuffed-code/tools/unicodetools
Mark Davis 97863da9df ICU-5149 updated readme
X-SVN-Rev: 19521
2006-04-06 00:09:14 +00:00

<html>

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>UnicodeTools</title>
</head>

<body>

<h1>UnicodeTools</h1>
<p>This file provides instructions for building and running the UnicodeTools, which<br>
can be used to:</p>
<ul>
  <li>build the Derived Unicode files in the UCD (Unicode Character Database),</li>
  <li>build the transformed UCA (Unicode Collation Algorithm) files needed by ICU,</li>
  <li>run consistency checks on beta releases of the UCD and the UCA,</li>
  <li>build four chart folders on the Unicode site.</li>
</ul>
<p><font color="#FF0000"><b>WARNING!!</b></font></p>
<ul>
  <li>This is NOT production level code, and should never be used in programs.</li>
  <li>The API is subject to change without notice, and will not be maintained.</li>
  <li>The source is uncommented, and has many warts; since it is not production code, it has not 
  been worth the time to clean it up.</li>
  <li>It will probably not work on Unix or Mac without changing the file separator.</li>
  <li>Currently it uses hard-coded directory names.</li>
  <li>The contents of multiple versions of the UCD must be copied to a local directory, as described 
  below.</li>
</ul>
<h2>Instructions:</h2>
<h3>0. You will need to get ICU4J on your system, using CVS.</h3>
<p>The rest of this document assumes that you have set up CVS so that the ICU4J project is checked out into 
C:\ICU4J<br>
<br>
You need both the main icu4j and a subproject called unicodetools. See:
<a href="http://ibm.com/software/globalization/icu/repository.jsp">
http://ibm.com/software/globalization/icu/repository.jsp</a>. Inside unicodetools, look at com/ibm/text. The 
main directories of interest are UCD, UCA and utility.</p>
<h4>0a. If you are using Eclipse for your IDE, look at the instructions on
<a href="http://icu.sourceforge.net/docs/eclipse_howto/eclipse_howto.html">
http://oss.software.ibm.com/icu/docs/eclipse_howto/eclipse_howto.html</a> </h4>
<p>Set up Eclipse to build two projects: ICU4J and UnicodeTools:<br>
<br>
<b>Project Name: </b>ICU4J<br>
<b>Directory: </b>C:\ICU4J\icu4j<br>
<b>Default output folder = </b>ICU4J/classes<br>
<br>
<b>Project Name: </b>unicodetools<br>
<b>Create project from existing source: </b>C:\ICU4J\unicodetools<br>
<b>Default Output Folder: </b>unicodetools/classes<br>
<br>
After Eclipse is set up with these, exclude certain files from unicodetools:<br>
<br>
Right-Click UnicodeTools &gt; Properties &gt; Java Build Path &gt; Exclusions<br>
com/ibm/rbm/<br>
com/ibm/text/utility/UnicodeMapInt.java<br>
com/ibm/text/utility/TestUtility.java<br>
com/ibm/text/UCD/GenerateThaiBreaks-old.java/<br>
com/ibm/text/UCD/ProcessUnihan.java/<br>
com/ibm/text/UCA/WriteHTMLCollation.java/<br>
<br>
UnicodeTools must also include the ICU4J project, with<br>
<br>
Right-Click UnicodeTools &gt; Properties &gt; Java Build Path &gt; Projects</p>
<h3>1. In UCD, you must edit UCD_Types.java at the top, to set the directories for the build:</h3>
<p>public static final String DATA_DIR = &quot;C:\\DATA\\&quot;;<br>
public static final String UCD_DIR = BASE_DIR + &quot;UCD\\&quot;;<br>
public static final String BIN_DIR = DATA_DIR + &quot;BIN\\&quot;;<br>
public static final String GEN_DIR = DATA_DIR + &quot;GEN\\&quot;;<br>
<br>
Make sure that each of these directories exists. Also make sure that the following<br>
exist:<br>
<br>
&lt;GEN_DIR&gt;/DerivedData<br>
&lt;GEN_DIR&gt;/DerivedData/ExtractedProperties<br>
&lt;UCD_DIR&gt;/EXTRAS-Update</p>
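<p>Since the tools rely on hard-coded directories, it can help to confirm that they exist before running a build. 
The following is a minimal sketch, not part of the UnicodeTools: the class <code>CheckBuildDirs</code> and its methods are 
hypothetical, and it assumes that BIN, GEN and UCD all sit under one data directory as in the default layout.</p>

```java
import java.io.File;
import java.util.Arrays;

/** Sketch: list the directories the build expects and report any that are absent. */
final class CheckBuildDirs {

    /** Expands the three configured roots into every directory named in step 1. */
    static File[] required(File binDir, File genDir, File ucdDir) {
        File derived = new File(genDir, "DerivedData");
        return new File[] {
            binDir,
            genDir,
            ucdDir,
            derived,
            new File(derived, "ExtractedProperties"),
            new File(ucdDir, "EXTRAS-Update"),
        };
    }

    /** Returns the entries that do not exist as directories. */
    static File[] missing(File[] dirs) {
        return Arrays.stream(dirs)
                .filter(d -> !d.isDirectory())
                .toArray(File[]::new);
    }

    public static void main(String[] args) {
        // Assumption: all three roots live under one data directory (default C:\DATA).
        File data = new File(args.length > 0 ? args[0] : "C:\\DATA");
        File[] absent = missing(required(
                new File(data, "BIN"), new File(data, "GEN"), new File(data, "UCD")));
        for (File d : absent) {
            System.out.println("Missing: " + d.getPath());
        }
        if (absent.length == 0) {
            System.out.println("All directories present.");
        }
    }
}
```

<p>Pass a different data directory as the first argument if your UCD_Types.java points somewhere else.</p>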
<h3>2. Download all of the UnicodeData files for each version into UCD_DIR.</h3>
<p>The folder names must be of the form &quot;3.2.0-Update&quot;, so rename the folders from the<br>
Unicode site to this format. <span style="background-color: #FFFF00">If the 
folder contains a ucd directory, then make the contents of that directory the contents of 
the x.x.x-Update directory. That is, each directory will directly contain files 
like PropList....txt</span></p>
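<p>The naming rule can be checked mechanically. The sketch below is an illustration only: the class 
<code>UpdateFolderNames</code> is hypothetical, it assumes that &quot;of the form 3.2.0-Update&quot; means three dot-separated 
numbers followed by &quot;-Update&quot;, and it assumes that a shorter site name such as 4.0-Update is renamed by padding with &quot;.0&quot;.</p>

```java
import java.util.regex.Pattern;

/** Sketch: validate and normalize x.x.x-Update folder names. */
final class UpdateFolderNames {
    private static final String SUFFIX = "-Update";

    // Assumption: a valid name is exactly three numeric components plus the suffix.
    private static final Pattern VERSION_FOLDER =
            Pattern.compile("\\d+\\.\\d+\\.\\d+-Update");

    static boolean isVersionFolder(String name) {
        return VERSION_FOLDER.matcher(name).matches();
    }

    /** Pads a numeric site name such as "4.0-Update" to "4.0.0-Update"; other names are returned unchanged. */
    static String normalize(String siteName) {
        if (!siteName.endsWith(SUFFIX)) {
            return siteName;
        }
        String version = siteName.substring(0, siteName.length() - SUFFIX.length());
        if (!version.matches("\\d+(\\.\\d+)*")) {
            return siteName; // for example EXTRAS-Update
        }
        StringBuilder padded = new StringBuilder(version);
        // Append ".0" until the version has three components.
        for (int parts = version.split("\\.").length; 3 > parts; parts++) {
            padded.append(".0");
        }
        return padded.append(SUFFIX).toString();
    }

    public static void main(String[] args) {
        String[] names = {"3.2.0-Update", "4.0-Update", "EXTRAS-Update"};
        for (String n : names) {
            System.out.println(n + " -> " + normalize(n)
                    + " (version folder: " + isVersionFolder(normalize(n)) + ")");
        }
    }
}
```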
<h4>2a. Ensure Complete Release</h4>
<p>If you are downloading any &quot;incomplete&quot; release (one that does not contain a complete set of data 
files for that release), you also need to download the previous complete release. Most of the N.M-Update 
directories are complete, *except*:</p>
<p>4.0-Update, which does not contain a copy of Unihan.txt and some other files<br>
3.1-Update, which does not contain a copy of BidiMirroring.txt</p>
<p>Also, make the following changes to UnicodeData for 1.1.5:</p>
<p><b>Delete</b></p>
<pre>3400;HANGUL SYLLABLE KIYEOK A;Lo;0;L;1100 1161;;;;N;;;;;
...
4DFF;HANGUL SYLLABLE MIEUM WEO RIEUL-THIEUTH;Lo;0;L;1106 116F 11B4;;;;N;;;;;
4E00;&lt;CJK IDEOGRAPH REPRESENTATIVE&gt;;Lo;0;L;;;;;N;;;;;</pre>
<p><b>Add:</b></p>
<pre>4E00;&lt;CJK Ideograph, First&gt;;Lo;0;L;;;;;N;;;;;
9FA5;&lt;CJK Ideograph, Last&gt;;Lo;0;L;;;;;N;;;;;
E000;&lt;Private Use, First&gt;;Co;0;L;;;;;N;;;;;
F8FF;&lt;Private Use, Last&gt;;Co;0;L;;;;;N;;;;;</pre>
<p><b>And from a later version of Unicode, add:</b></p>
<pre>F900;CJK COMPATIBILITY IDEOGRAPH-F900;Lo;0;L;8C48;;;;N;;;;;
...
FA2D;CJK COMPATIBILITY IDEOGRAPH-FA2D;Lo;0;L;9DB4;;;;N;;;;;</pre>
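<p>When editing UnicodeData by hand it is easy to drop a semicolon. Each line shown above has 15 semicolon-separated 
fields, starting with the code point, the name and the general category. The following sketch checks that shape; the class 
<code>UnicodeDataLine</code> is hypothetical and is not how the UnicodeTools read the file.</p>

```java
/** Sketch: split one UnicodeData line into its semicolon-separated fields. */
final class UnicodeDataLine {
    final int codePoint;
    final String name;
    final String generalCategory;

    UnicodeDataLine(String line) {
        // The -1 limit keeps trailing empty fields, so a complete line yields 15 entries.
        String[] fields = line.split(";", -1);
        if (fields.length != 15) {
            throw new IllegalArgumentException("Expected 15 fields: " + line);
        }
        codePoint = Integer.parseInt(fields[0], 16);
        name = fields[1];
        generalCategory = fields[2];
    }

    /** True for the first line of a range pair, whose name ends in ", First>". */
    boolean isRangeStart() {
        return name.endsWith(", First>");
    }

    /** True for the second line of a range pair, whose name ends in ", Last>". */
    boolean isRangeEnd() {
        return name.endsWith(", Last>");
    }

    public static void main(String[] args) {
        UnicodeDataLine line = new UnicodeDataLine(
                "F900;CJK COMPATIBILITY IDEOGRAPH-F900;Lo;0;L;8C48;;;;N;;;;;");
        System.out.println(Integer.toHexString(line.codePoint).toUpperCase()
                + " " + line.generalCategory + " " + line.name);
    }
}
```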
<h4>2b. UCA data</h4>
<p>If you are building any of the UCA tools, you need to get a copy of the UCA data file<br>
from http://www.unicode.org/reports/tr10/#AllKeys. The default location for this is:<br>
<br>
BASE_DIR + &quot;Collation\allkeys&quot; + VERSION + &quot;.txt&quot;.<br>
<br>
If you have it in a different location, change the value of KEYS in UCA.java, and <br>
the value of BASE_DIR.</p>
<h4>2c. Here is an example of the default directory structure with files. All of 
the yellow ones should exist</h4>
<pre>C://DATA/

        BIN/
    
<span style="background-color: #FFFF00">        Collation/
            allkeys-3.1.1.txt
</span>        
        GEN/
            DerivedData/
<span style="background-color: #FFFF00">        </span><span style="background-color: #FFFF00">UCD/
            3.0.0-Update/
                Unihan-3.2.0.txt
                ...
            3.0.1-Update/
                ...
            3.1.0-Update/
                ...
            3.1.1-Update/
                ...
            3.2.0-Update/
                ...
            4.0.0-Update/
                ArabicShaping-4.0.0d14b.txt
                BidiMirroring-4.0.0d1b.txt
                ...
            EXTRAS-Update/</span></pre>
<h3>3. Versions</h3>
<p>All of the following accept &quot;version X&quot; in the options you give to Java (either on the 
command line, or in the Eclipse 'run' options). If you want a specific version like 3.1.0, then you 
would write &quot;version 3.1.0&quot;. If you want the latest version (4.1.0), you can omit the &quot;version X&quot;.</p>
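<p>The fallback behavior described here can be pictured as follows. This is only a sketch of the idea: the class 
<code>VersionOption</code> is hypothetical, and the way the real Main classes scan their arguments is not documented in this 
file.</p>

```java
/** Sketch: pick the value after "version" in the arguments, or fall back to the latest version. */
final class VersionOption {

    static String versionFrom(String[] args, String latest) {
        // Stop one short of the end so that a trailing "version" with no value is ignored.
        for (int i = 0; args.length - 1 > i; i++) {
            if (args[i].equals("version")) {
                return args[i + 1];
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // Assumption: 4.1.0 is the latest version, as stated in the text above.
        System.out.println("Using UCD version " + versionFrom(args, "4.1.0"));
    }
}
```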
<h3>4. Building Files</h3>
<ol>
	<li><b>Setup</b><ol>
		<li>In Eclipse, open the Package Explorer (Use Window&gt;Show View if you 
		don't see it)</li>
		<li>Open UnicodeTools<ul>
			<li>com.ibm.text.UCD<ul>
				<li>MakeUnicodeFiles.<span style="background-color: #FFFF00">txt</span><p>This file drives the production of 
				the derived Unicode files. The first three lines contain 
				parameters that you may want to modify from time to time:</p>
				<pre>Generate: <b>.*script.*</b> <i>// this is a regular expression. Use .* for all files</i>
DeltaVersion: <b>10</b> <i>    // This gets appended to the file name. Pick 1+ the highest value in Public</i>
CopyrightYear: <b>2006</b> <i> // Pick the current year</i></pre>
				</li>
			</ul>
			</li>
		</ul>
		</li>
		<li>Open in Package Explorer 
		<ul>
			<li>com.ibm.text.UCD<ul>
				<li>Main</li>
			</ul>
			</li>
		</ul>
		</li>
		<li>Run&gt;Run As...<ol>
			<li>Choose Java Application<ul>
				<li>it will fail, don't worry; you need to set some parameters</li>
			</ul>
			</li>
		</ol>
		</li>
		<li>Run&gt;Run...<ul>
			<li>Select the Arguments tab, and fill in the following<ul>
				<li>Program arguments:<pre>build 5.0 MakeUnicodeFiles</pre>
				</li>
				<li>VM arguments: 
				<pre>-Xms512m -Xmx512m</pre>
				</li>
			</ul>
			</li>
			<li>Close and Save</li>
		</ul>
		</li>
	</ol>
	</li>
	<li><b>Run</b><ol>
		<li>You'll see it build the 5.0 files, with something like the following 
		results:<pre>Writing UCD_Data5.0.0
Data Size: 109,802
Wrote Data 109802</pre>
		</li>
		<li>For each version, the tools build a set of binary data in BIN that 
		contains the information for that release. This is done automatically, or 
		you can do it manually with the Program Arguments<pre>version X build</pre>
		<p>This builds a compressed format of all the UCD data (except blocks 
		and Unihan) into the BIN directory. Don't worry about the voluminous 
		console messages, unless one says &quot;FAIL&quot;.</p>
		<p><font color="#FF0000"><i>You have to manually do this if you change 
		any of the data files in that version!</i></font></p>
		<p>Note: if for any reason you modify the binary format of the BIN files, you also have to bump the 
value in that file:</p>
		<pre>static final byte BINARY_FORMAT = 8; // bumped if binary format of UCD changes</pre>
		</li>
	</ol>
	</li>
	<li>Results in <a href="file:///C:/DATA/GEN/DerivedData">
	C:\DATA\GEN\DerivedData</a><ol>
		<li>The files will be in this directory.</li>
		<li>There are also DIFF folders, that contain BAT files that you can run 
		on Windows with CompareIt. (You can modify the code to build BATs with 
		another Diff program if you want).<ol>
			<li>For any file with a significant difference, it will build two 
			BAT files, such as the first two below.<pre>Diff_PropList-5.0.0d10.txt.bat
OLDER-Diff_PropList-5.0.0d10.txt.bat

UNCHANGED-Diff_PropertyValueAliases-5.0.0d10.txt.bat</pre>
			</li>
		</ol>
		</li>
		<li>Any files without significant changes will have &quot;UNCHANGED&quot; as a 
		prefix: ignore them.&nbsp; The OLDER prefix is the comparison to the 
		last version of Unicode.</li>
		<li>On Windows you can run these BATs to compare files.</li>
	</ol>
	</li>
</ol>
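<p>To make the Generate and DeltaVersion parameters more concrete, here is a small sketch of how a regular expression 
selects files and how an output name such as PropList-5.0.0d10.txt is put together. The class <code>GenerateFilter</code> is 
hypothetical. It assumes a case-insensitive match against the base file name, which this file does not confirm, and the name 
format is inferred from the BAT file names listed above.</p>

```java
import java.util.regex.Pattern;

/** Sketch: apply a Generate pattern and build a versioned output file name. */
final class GenerateFilter {

    /** Assumption: the pattern must match the whole base name, ignoring case. */
    static boolean selected(String generate, String baseName) {
        return Pattern.compile(generate, Pattern.CASE_INSENSITIVE)
                .matcher(baseName)
                .matches();
    }

    /** Builds a name like PropList-5.0.0d10.txt from its parts. */
    static String outputName(String baseName, String version, int deltaVersion) {
        return baseName + "-" + version + "d" + deltaVersion + ".txt";
    }

    public static void main(String[] args) {
        String generate = ".*script.*";
        // Example base names only; the real list comes from MakeUnicodeFiles.txt.
        String[] names = {"Scripts", "PropList", "DerivedAge"};
        for (String name : names) {
            if (selected(generate, name)) {
                System.out.println("generate " + outputName(name, "5.0.0", 10));
            } else {
                System.out.println("skip     " + name);
            }
        }
    }
}
```

<p>With the pattern <code>.*</code> every name is selected, which corresponds to building all files.</p>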
<h3>5. Invariant Checking</h3>
<ol>
	<li>Setup<ol>
		<li>Open in Package Explorer<ul>
			<li>com.ibm.text.UCD<ul>
				<li>TestUnicodeInvariants.java</li>
			</ul>
			</li>
		</ul>
		</li>
		<li>Run&gt;Run As... Java Application<br>
		This creates the following file of results:<pre><a href="file:///C:/DATA/GEN/UnicodeInvariantResults.txt">C:\DATA\GEN\UnicodeInvariantResults.txt</a></pre>
		</li>
		<li>Open that file and search for &quot;**** START Error Info ****&quot;. Each such 
		point provides a dump of comparison information.</li>
	</ol>
	</li>
</ol>
<h3>6. Options</h3>
<ol>
	<li>If you want to see files that are opened while processing, do the 
	following:<ol>
		<li>Run&gt;Run</li>
		<li>Select the Arguments tab, and add the following<ol>
			<li>VM arguments:
			<pre>-DSHOW_FILES</pre>
			</li>
		</ol>
		</li>
	</ol>
	</li>
</ol>
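<p>A VM argument of the form <code>-DNAME</code> defines a Java system property, and a property given without a value is 
set to the empty string rather than left undefined. The sketch below shows one way a program can react to such a flag. The 
class <code>ShowFilesFlag</code> is hypothetical, and how the UnicodeTools actually read SHOW_FILES is an assumption.</p>

```java
/** Sketch: react to a -DSHOW_FILES VM argument. */
final class ShowFilesFlag {

    /** True when the property is defined at all, whatever its value. */
    static boolean enabled() {
        // -DSHOW_FILES without "=value" yields "", so test for non-null rather than for "true".
        return System.getProperty("SHOW_FILES") != null;
    }

    /** Prints the path only when the flag is set. */
    static void noteOpened(String path) {
        if (enabled()) {
            System.out.println("Opening: " + path);
        }
    }

    public static void main(String[] args) {
        // Example path taken from the default directory structure.
        noteOpened("C:\\DATA\\UCD\\4.0.0-Update\\ArabicShaping-4.0.0d14b.txt");
    }
}
```

<p>Run it once with <code>-DSHOW_FILES</code> and once without to see the difference. Note that in this sketch even 
<code>-DSHOW_FILES=false</code> counts as set.</p>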
<h3>7. UCA</h3>
<ol>
	<li>
	<h3>You will use com.ibm.text.UCA.Main as your main class, creating a run configuration along 
	the same lines as above.</h3></li>
	<li>
	<h4>To build all the UCA files used by ICU, use the Program arguments:</h4>
	<pre>Main ICU</pre>
	</li>
	<li>
	<h4>To build all the charts, use the UCA project, with options: </h4>
	<pre>normalizationChart caseChart scriptChart indexChart</pre>
	</li>
</ol>

</body>

</html>