file: testdata/break_rules/readme.txt
Copyright (C) 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html#License

Copyright (c) 2015-2016, International Business Machines Corporation and others. All Rights Reserved.

This directory contains the break iterator reference rule files used by intltest rbbi/RBBIMonkeyTest/testMonkey.
The rules in this directory track the boundary rules from Unicode UAX #14 (line breaking) and
UAX #29 (text segmentation). They are interpreted to provide an expected set of boundary positions
to compare with the results from ICU break iteration.

ICU4J also includes copies of the test reference rules, located in the directory
    main/tests/core/src/com/ibm/icu/dev/test/rbbi/break_rules/
The copies should be kept synchronized; there should be no differences.

Each set of reference break rules lives in a separate file.
The list of rule files to run by default is hard-coded into the test code, in rbbimonkeytest.cpp.

Each test file includes:
  - The type of ICU break iterator to create (word, line, sentence, etc.)
  - The locale to use
  - Character Class definitions
  - Rule definitions

To Do
  - Extend the syntax to support rule tailoring.


Character Class Definition:
    name = set_regular_expression;

Rule Definition:
    rule_regular_expression;

name:
    [A-Za-z_][A-Za-z0-9_]*

set_regular_expression:
    A set expression that is valid both as an ICU regular expression [set] expression
    and as a UnicodeSet pattern. (The two syntaxes are mostly the same.)
    May include previously defined set names, which are logically expanded in-place.

rule_regular_expression:
    An ICU Regular Expression.
    May include set names, which are logically expanded in-place.
    May include a '÷', which defines a boundary position.

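As an illustration of the definitions above, the following Python sketch separates
set definitions from rules and expands set names in place. It is not the parser used
by the monkey test. The function names parse_definitions and expand_names are
hypothetical, statements are split naively at every ';', and the break iterator type
and locale entries, whose syntax is not described here, are not handled.

```python
import re

# A token is a \p{...} or \P{...} property (skipped whole), any other backslash
# escape, or a bare identifier that follows the `name` grammar given above.
_TOKEN = re.compile(
    r"\\[pP]\{[^}]*\}"
    r"|\\."
    r"|(?<![A-Za-z0-9_])([A-Za-z_][A-Za-z0-9_]*)"
)

_DEFINITION = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.+)", re.DOTALL)


def expand_names(expression, sets):
    """Replace each previously defined set name with its definition."""
    def substitute(match):
        name = match.group(1)
        if name is not None and name in sets:
            return sets[name]
        return match.group(0)  # escapes and unknown identifiers stay as written
    return _TOKEN.sub(substitute, expression)


def parse_definitions(text):
    """Return (sets, rules) for text made of ';'-terminated statements.

    Simplification: a ';' inside a set expression would be split incorrectly.
    """
    sets, rules = {}, []
    for statement in text.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        definition = _DEFINITION.fullmatch(statement)
        if definition:
            name, body = definition.groups()
            sets[name] = expand_names(body.strip(), sets)
        else:
            rules.append(expand_names(statement, sets))
    return sets, rules
```

Given the input `CR = [\r]; LF = [\n]; CR LF;`, this sketch records two set
definitions and one rule, and the rule text becomes `[\r] [\n]`.
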
Application of the rules:
    Matching begins at the start of text, or after a previously identified boundary.
    The pseudo-code below finds the next boundary.

        while position < end of text
            for each rule
                if the text at position matches this rule
                    if the rule has a '÷'
                        Boundary is found.
                        return the position of the '÷' within the match.
                    else
                        position = last character of the rule match.
                        break from the inner rule loop, continue the outer loop.

    This differs from the Unicode UAX algorithm in that each position in the text is
    not tested separately. Instead, when a rule match is found, rule application restarts with the last
    character of the preceding rule match. ICU's break rules also operate this way.

    Expressing rules this way simplifies UAX rules that have leading or trailing context; it
    is no longer necessary to write expressions that match the context starting from
    any position within it.

    This rule form differs from ICU rules in that the rules are applied sequentially, as they
    are with the Unicode UAX rules. With the main ICU break rules, all are applied in parallel.

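The rule application loop described above can be sketched in Python as follows. This
is an illustration only, not the code in rbbimonkeytest.cpp. It uses Python's re module
in place of ICU regular expressions, the names compile_rule, next_boundary,
all_boundaries and EXAMPLE_RULES are hypothetical, and the two example rules are
invented for the demonstration rather than taken from a rule file. It assumes each
rule holds at most one '÷', preceded by at least one matched character. The step that
forces position forward after a one-character match is a safeguard added for this sketch.

```python
import re


def compile_rule(rule):
    """Compile one rule; a '÷' becomes an empty group that records the boundary."""
    has_boundary = "÷" in rule
    pattern = rule.replace("÷", "(?P<boundary>)", 1)
    # re.VERBOSE allows spaces between rule elements; re.DOTALL lets '.' match newlines.
    return re.compile(pattern, re.VERBOSE | re.DOTALL), has_boundary


def next_boundary(text, start, rules):
    """Find the next boundary after `start`, trying the rules in order."""
    position = start
    while position < len(text):
        for regex, has_boundary in rules:
            match = regex.match(text, position)
            if match is None:
                continue
            if has_boundary:
                return match.start("boundary")
            # No '÷': restart at the last character of the match.
            # max() is this sketch's guard against not advancing at all.
            position = max(match.end() - 1, position + 1)
            break
        else:
            raise ValueError(f"no rule matches at position {position}")
    return len(text)


def all_boundaries(text, rule_texts):
    """Return every boundary position in `text`, starting with 0."""
    rules = [compile_rule(rule) for rule in rule_texts]
    boundaries = [0]
    while boundaries[-1] < len(text):
        found = next_boundary(text, boundaries[-1], rules)
        if found <= boundaries[-1]:
            raise ValueError("rules did not advance past the previous boundary")
        boundaries.append(found)
    return boundaries


# Invented demonstration rules, applied sequentially (first match wins).
EXAMPLE_RULES = [
    r"[a-z] [a-z]",  # no '÷': never break between two letters
    r". ÷",          # otherwise break after any single character
]
```

With these two rules, all_boundaries("ab cd", EXAMPLE_RULES) yields [0, 2, 3, 5]. The
first rule matches "ab" without a boundary, so matching restarts at "b", where only the
second rule applies and places the boundary after it.
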
Word Dictionaries
    The monkey test does not test dictionary-based breaking. The set named 'dictionary' is special,
    as it is in the main ICU rules. For the monkey test, no characters from the dictionary set are
    included in the randomly generated test data.
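
The exclusion of dictionary characters from the test data can be pictured with the
short Python sketch below. It is an assumption-laden illustration, not the generator
used by the monkey test: the function random_test_text is hypothetical, and both
character sets are represented as plain Python strings rather than UnicodeSets.

```python
import random


def random_test_text(candidates, dictionary, length, seed=0):
    """Build random text from `candidates`, never using a `dictionary` character."""
    allowed = sorted(set(candidates) - set(dictionary))
    if not allowed:
        raise ValueError("no characters left after removing the dictionary set")
    rng = random.Random(seed)  # seeded so that a failing run can be repeated
    return "".join(rng.choice(allowed) for _ in range(length))
```

For example, random_test_text("abc กข", "กข", 20) returns a 20-character string drawn
only from "a", "b", "c" and the space character.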