scuffed-code/icu4c/source/test/testdata/collationtest.txt
Steven R. Loomis 0c5b2b597d ICU-12455 BRS - BOM fix
X-SVN-Rev: 38915
2016-07-01 16:59:16 +00:00

2543 lines
44 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Copyright (C) 2016 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html
# Copyright (c) 2012-2015 International Business Machines
# Corporation and others. All Rights Reserved.
#
# This file should be in UTF-8 with a signature byte sequence ("BOM").
#
# collationtest.txt: Collation test data.
#
# created on: 2012apr13
# created by: Markus W. Scherer
# A line with "** test: description" is used for verbose and error output.
# A collator can be set with "@ root" or "@ locale language-tag",
# for example "@ locale de-u-co-phonebk".
# An old-style locale ID can also be used, for example "@ locale de@collation=phonebook".
# A collator can be built with "@ rules".
# An "@ rules" line is followed by one or more lines with the tailoring rules.
# A collator can be modified with "% attribute=value".
# "* compare" tests the order (= or <) of the following strings.
# The relation can be "=" or "<" (the level of the difference is not specified)
# or "<1", "<2", "<c", "<3", "<4" (indicating the level of the difference).
# Test sections ("* compare") are terminated by
# definitions of new collators, changing attributes, or new test sections.
** test: simple CEs & expansions
# Many types of mappings are tested elsewhere, including via the UCA conformance tests.
# Here we mostly cover a few unusual mappings.
@ rules
&\x01 # most control codes are ignorable
<<<\u0300 # tertiary CE
&9<\x00 # NUL not ignorable
&\uA00A\uA00B=\uA002 # two long-primary CEs
&\uA00A\uA00B\u00050005=\uA003 # three CEs, require 64 bits
* compare
= \x01
= \x02
<3 \u0300
<1 9
<1 \x00
= \x01\x00\x02
<1 a
<3 a\u0300
<2 a\u0308
= ä
<1 b
<1 か # Hiragana Ka (U+304B)
<2 か\u3099 # plus voiced sound mark
= が # Hiragana Ga (U+304C)
<1 \uA00A\uA00B
= \uA002
<1 \uA00A\uA00B\u00050004
<1 \uA00A\uA00B\u00050005
= \uA003
<1 \uA00A\uA00B\u00050006
** test: contractions
# Create some interesting mappings, and map some normalization-inert characters
# (which are not subject to canonical reordering)
# to some of the same CEs to check the sequence of CEs.
@ rules
# Contractions starting with 'a' should not continue with any character < U+0300
# so that we can test a shortcut for that.
&a=ⓐ
&b<bz=ⓑ
&d<dz\u0301=ⓓ # d+z+acute
&z
<a\u0301=Ⓐ # a+acute sorts after z
<a\u0301\u0301=Ⓑ # a+acute+acute
<a\u0301\u0301\u0358=Ⓒ # a+acute+acute+dot above right
<a\u030a=Ⓓ # a+ring
<a\u0323=Ⓔ # a+dot below
<a\u0323\u0358=Ⓕ # a+dot below+dot above right
<a\u0327\u0323\u030a=Ⓖ # a+cedilla+dot below+ring
<a\u0327\u0323bz=Ⓗ # a+cedilla+dot below+b+z
&\U0001D158=⁰ # musical notehead black (has a symbol primary)
<\U0001D158\U0001D165=¼ # musical quarter note
# deliberately missing prefix contractions:
# dz
# a\u0327
# a\u0327\u0323
# a\u0327\u0323b
&\x01
<<<\U0001D165=¹ # musical stem (ccc=216)
<<<\U0001D16D=² # musical augmentation dot (ccc=226)
<<<\U0001D165\U0001D16D=³ # stem+dot (ccc=216 226)
&\u0301=❶ # acute (ccc=230)
&\u030a=❷ # ring (ccc=230)
&\u0308=❸ # diaeresis (ccc=230)
<<\u0308\u0301=❹ # diaeresis+acute (=dialytika tonos) (ccc=230 230)
&\u0327=❺ # cedilla (ccc=202)
&\u0323=❻ # dot below (ccc=220)
&\u0331=❼ # macron below (ccc=220)
<<\u0331\u0358=❽ # macron below+dot above right (ccc=220 232)
&\u0334=❾ # tilde overlay (ccc=1)
&\u0358=❿ # dot above right (ccc=232)
&\u0f71=① # tibetan vowel sign aa
&\u0f72=② # tibetan vowel sign i
# \u0f71\u0f72 # tibetan vowel sign aa + i = ii = U+0F73
&\u0f73=③ # tibetan vowel sign ii (ccc=0 but lccc=129)
** test: simple contractions
# Some strings are chosen to cause incremental contiguous contraction matching to
# go into partial matches for prefixes of contractions
# (where the prefixes are deliberately not also contractions).
# When there is no complete match, then the matching code must back out of those
# so that discontiguous contractions work as specified.
* compare
# contraction starter with no following text, or mismatch, or blocked
<1 a
= ⓐ
<1 aa
= ⓐⓐ
<1 ab
= ⓐb
<1 az
= ⓐz
* compare
<1 a
<2 a\u0308\u030a # ring blocked by diaeresis
= ⓐ❸❷
<2 a\u0327
= ⓐ❺
* compare
<2 \u0308
= ❸
<2 \u0308\u030a\u0301 # acute blocked by ring
= ❸❷❶
* compare
<1 \U0001D158
= ⁰
<1 \U0001D158\U0001D165
= ¼
# no discontiguous contraction because of missing prefix contraction d+z,
# and a starter ('z') after the 'd'
* compare
<1 dz\u0323\u0301
= dz❻❶
# contiguous contractions
* compare
<1 abz
= ⓐⓑ
<1 abzz
= ⓐⓑz
* compare
<1 a
<1 z
<1 a\u0301
= Ⓐ
<1 a\u0301\u0301
= Ⓑ
<1 a\u0301\u0301\u0358
= Ⓒ
<1 a\u030a
= Ⓓ
<1 a\u0323\u0358
= Ⓕ
<1 a\u0327\u0323\u030a # match despite missing prefix
= Ⓖ
<1 a\u0327\u0323bz
= Ⓗ
* compare
<2 \u0308\u0308\u0301 # acute blocked from first diaeresis, contracts with second
= ❸❹
* compare
<1 \U0001D158\U0001D165
= ¼
* compare
<3 \U0001D165\U0001D16D
= ³
** test: discontiguous contractions
* compare
<1 a\u0327\u030a # a+ring skips cedilla
= Ⓓ❺
<2 a\u0327\u0327\u030a # a+ring skips 2 cedillas
= Ⓓ❺❺
<2 a\u0327\u0327\u0327\u030a # a+ring skips 3 cedillas
= Ⓓ❺❺❺
<2 a\u0334\u0327\u0327\u030a # a+ring skips tilde overlay & 2 cedillas
= Ⓓ❾❺❺
<1 a\u0327\u0323 # a+dot below skips cedilla
= Ⓔ❺
<1 a\u0323\u0301\u0358 # a+dot below+dot ab.r.: 2-char match, then skips acute
= Ⓕ❶
<2 a\u0334\u0323\u0358 # a+dot below skips tilde overlay
= Ⓕ❾
* compare
<2 \u0331\u0331\u0358 # macron below+dot ab.r. skips the second macron below
= ❽❼
* compare
<1 a\u0327\u0331\u0323\u030a # a+ring skips cedilla, macron below, dot below (dot blocked by macron)
= Ⓓ❺❼❻
<1 a\u0327\u0323\U0001D16D\u030a # a+dot below skips cedilla
= Ⓔ❺²❷
<2 a\u0327\u0327\u0323\u030a # a+dot below skips 2 cedillas
= Ⓔ❺❺❷
<2 a\u0327\u0323\u0323\u030a # a+dot below skips cedilla
= Ⓔ❺❻❷
<2 a\u0334\u0327\u0323\u030a # a+dot below skips tilde overlay & cedilla
= Ⓔ❾❺❷
* compare
<1 \U0001D158\u0327\U0001D165 # quarter note skips cedilla
= ¼❺
<1 a\U0001D165\u0323 # a+dot below skips stem
= Ⓔ¹
# partial contiguous match, backs up, matches discontiguous contraction
<1 a\u0327\u0323b
= Ⓔ❺b
<1 a\u0327\u0323ba
= Ⓔ❺bⓐ
# a+acute+acute+dot above right skips cedilla, continues matching 2 same-ccc combining marks
* compare
<1 a\u0327\u0301\u0301\u0358
= Ⓒ❺
# FCD but not NFD
* compare
<1 a\u0f73\u0301 # a+acute skips tibetan ii
= Ⓐ③
# FCD but the 0f71 inside the 0f73 must be skipped
# to match the discontiguous contraction of the first 0f71 with the trailing 0f72 inside the 0f73
* compare
<1 \u0f71\u0f73 # == \u0f73\u0f71 == \u0f71\u0f71\u0f72
= ③①
** test: discontiguous contractions with nested contractions
* compare
<1 a\u0323\u0308\u0301\u0358
= Ⓕ❹
<2 a\u0323\u0308\u0301\u0308\u0301\u0358
= Ⓕ❹❹
** test: discontiguous contractions with interleaved contractions
* compare
# a+ring & cedilla & macron below+dot above right
<1 a\u0327\u0331\u030a\u0358
= Ⓓ❺❽
# a+ring & 1x..3x macron below+dot above right
<2 a\u0331\u030a\u0358
= Ⓓ❽
<2 a\u0331\u0331\u030a\u0358\u0358
= Ⓓ❽❽
# also skips acute
<2 a\u0331\u0331\u0331\u030a\u0301\u0358\u0358\u0358
= Ⓓ❽❽❽❶
# a+dot below & stem+augmentation dot, followed by contiguous d+z+acute
<1 a\U0001D165\u0323\U0001D16Ddz\u0301
= Ⓔ³ⓓ
** test: some simple string comparisons
@ root
* compare
# first string compares against ""
= \u0000
< a
<1 b
<3 B
= \u0000B\u0000
** test: compare with strength=primary
% strength=primary
* compare
<1 a
<1 b
= B
** test: compare with strength=secondary
% strength=secondary
* compare
<1 a
<1 b
= B
** test: compare with strength=tertiary
% strength=tertiary
* compare
<1 a
<1 b
<3 B
** test: compare with strength=quaternary
% strength=quaternary
* compare
<1 a
<1 b
<3 B
** test: compare with strength=identical
% strength=identical
* compare
<1 a
<1 b
<3 B
** test: côté with forwards secondary
@ root
* compare
<1 cote
<2 coté
<2 côte
<2 côté
** test: côté with forwards secondary vs. U+FFFE merge separator
# Merged sort keys: On each level, any difference in the first segment
# must trump any further difference.
* compare
<1 cote\uFFFEcôté
<2 coté\uFFFEcôte
<2 côte\uFFFEcoté
<2 côté\uFFFEcote
** test: côté with backwards secondary
% backwards=on
* compare
<1 cote
<2 côte
<2 coté
<2 côté
** test: côté with backwards secondary vs. U+FFFE merge separator
# Merged sort keys: On each level, any difference in the first segment
# must trump any further difference.
* compare
<1 cote\uFFFEcôté
<2 côte\uFFFEcoté
<2 coté\uFFFEcôte
<2 côté\uFFFEcote
** test: U+FFFE on identical level
@ root
% strength=identical
* compare
# All of these control codes are completely-ignorable, so that
# their low code points are compared with the merge separator.
# The merge separator must compare less than any other character.
<1 \uFFFE\u0001\u0002\u0003
<i \u0001\uFFFE\u0002\u0003
<i \u0001\u0002\uFFFE\u0003
<i \u0001\u0002\u0003\uFFFE
* compare
# The merge separator must even compare less than U+0000.
<1 \uFFFE\u0000\u0000
<i \u0000\uFFFE\u0000
<i \u0000\u0000\uFFFE
** test: Hani < surrogates < U+FFFD
# Note: compareUTF8() treats unpaired surrogates like U+FFFD,
# so with that the strings with surrogates will compare equal to each other
# and equal to the string with U+FFFD.
@ root
% strength=identical
* compare
<1 abz
<1 a\u4e00z
<1 a\U00020000z
<1 a\ud800z
<1 a\udbffz
<1 a\udc00z
<1 a\udfffz
<1 a\ufffdz
** test: script reordering
@ root
% reorder Hani Zzzz digit
* compare
<1 ?
<1 +
<1 丂
<1 a
<1 α
<1 5
% reorder default
* compare
<1 ?
<1 +
<1 5
<1 a
<1 α
<1 丂
** test: empty rules
@ rules
* compare
<1 a
<2 ä
<3 Ä
<1 b
** test: very simple rules
@ rules
&a=e<<<<q<<<<r<x<<<X<<y<<<Y;z,Z
% strength=quaternary
* compare
<1 a
= e
<4 q
<4 r
<1 x
<3 X
<2 y
<3 Y
<2 z
<3 Z
** test: tailoring twice before a root position: primary
@ rules
&[before 1]b<p
&[before 1]b<q
* compare
<1 a
<1 p
<1 q
<1 b
** test: tailoring twice before a root position: secondary
@ rules
&[before 2]ſ<<p
&[before 2]ſ<<q
* compare
<1 s
<2 p
<2 q
<2 ſ
# secondary-before common weight
@ rules
&[before 2]b<<p
&[before 2]b<<q
* compare
<1 a
<1 p
<2 q
<2 b
** test: tailoring twice before a root position: tertiary
@ rules
&[before 3]B<<<p
&[before 3]B<<<q
* compare
<1 b
<3 p
<3 q
<3 B
# tertiary-before common weight
@ rules
&[before 3]b<<<p
&[before 3]b<<<q
* compare
<1 a
<1 p
<3 q
<3 b
@ rules
&[before 2]b<<s
&[before 3]s<<<p
&[before 3]s<<<q
* compare
<1 a
<1 p
<3 q
<3 s
<2 b
** test: tailor after completely ignorable
@ rules
&\x00<<<x<<y
* compare
= \x00
= \x1F
<3 x
<2 y
** test: secondary tailoring gaps, ICU ticket 9362
@ rules
&[before 2]s<<'_'
&s<<r # secondary between s and ſ (long s)
&ſ<<*a-q # more than 15 between ſ and secondary CE boundary
&[before 2][first primary ignorable]<<u<<v # between secondary CE boundary & lowest secondary CE
&[last primary ignorable]<<y<<z
* compare
<2 u
<2 v
<2 \u0332 # lowest secondary CE
<2 \u0308
<2 y
<2 z
<1 s_
<2 ss
<2 sr
<2 sſ
<2 sa
<2 sb
<2 sp
<2 sq
<2 sus
<2 svs
<2 rs
** test: tertiary tailoring gaps, ICU ticket 9362
@ rules
&[before 3]t<<<'_'
&t<<<r # tertiary between t and fullwidth t
&ᵀ<<<*a-q # more than 15 between ᵀ (modifier letter T) and tertiary CE boundary
&[before 3][first secondary ignorable]<<<u<<<v # between tertiary CE boundary & lowest tertiary CE
&[last secondary ignorable]<<<y<<<z
* compare
<3 u
<3 v
# Note: The root collator currently does not map any characters to tertiary CEs.
<3 y
<3 z
<1 t_
<3 tt
<3 tr
<3 t
<3 tᵀ
<3 ta
<3 tb
<3 tp
<3 tq
<3 tut
<3 tvt
<3 rt
** test: secondary & tertiary around root character
@ rules
&[before 2]m<<r
&m<<s
&[before 3]m<<<u
&m<<<v
* compare
<1 l
<1 r
<2 u
<3 m
<3 v
<2 s
<1 n
** test: secondary & tertiary around tailored item
@ rules
&m<x
&[before 2]x<<r
&x<<s
&[before 3]x<<<u
&x<<<v
* compare
<1 m
<1 r
<2 u
<3 x
<3 v
<2 s
<1 n
** test: more nesting of secondary & tertiary before
@ rules
&[before 3]m<<<u
&[before 2]m<<r
&[before 3]r<<<q
&m<<<w
&m<<t
&[before 3]w<<<v
&w<<<x
&w<<s
* compare
<1 l
<1 q
<3 r
<2 u
<3 m
<3 v
<3 w
<3 x
<2 s
<2 t
<1 n
** test: case bits
@ rules
&w<x # tailored CE getting case bits
=uv=uV=Uv=UV # 2 chars -> 1 CE
&ae=ch=cH=Ch=CH # 2 chars -> 2 CEs
&rst=yz=yZ=Yz=YZ # 2 chars -> 3 CEs
% caseFirst=lower
* compare
<1 ae
= ch
<3 cH
<3 Ch
<3 CH
<1 rst
= yz
<3 yZ
<3 Yz
<3 YZ
<1 w
<1 x
= uv
<3 uV
= Uv # mixed case on single CE cannot distinguish variations
<3 UV
** test: tertiary CEs, tertiary, caseLevel=off, caseFirst=lower
@ rules
&\u0001<<<t<<<T # tertiary CEs
% caseFirst=lower
* compare
<1 aa
<3 aat
<3 aaT
<3 aA
<3 aAt
<3 ata
<3 aTa
** test: tertiary CEs, tertiary, caseLevel=off, caseFirst=upper
% caseFirst=upper
* compare
<1 aA
<3 aAt
<3 aa
<3 aat
<3 aaT
<3 ata
<3 aTa
** test: reset on expansion, ICU tickets 9415 & 9593
@ rules
&æ<x # tailor the last primary CE so that x sorts between ae and af
&æb=bæ # copy all reset CEs to make bæ sort the same
&각<h # copy/tailor 3 CEs to make h sort before the next Hangul syllable 갂
&⒀<<y # copy/tailor 4 CEs to make y sort with only a secondary difference
&l·=z # handle the pre-context for · when fetching reset CEs
<<u # copy/tailor 2 CEs
* compare
<1 ae
<2 æ
<1 x
<1 af
* compare
<1 aeb
<2 æb
= bæ
* compare
<1 각
<1 h
<1 갂
<1 갃
* compare
<1 · # by itself: primary CE
<1 l
<2 l· # l+middle dot has only a secondary difference from l
= z
<2 u
* compare
<1 (13)
<3 ⒀ # DUCET sets special tertiary weights in all CEs
<2 y
<1 (13[
% alternate=shifted
* compare
<1 (13)
= 13
<3 ⒀
= y # alternate=shifted removes the tailoring difference on the last CE
<1 14
** test: contraction inside extension, ICU ticket 9378
@ rules
&а<<х/й # all letters are Cyrillic
* compare
<1 ай
<2 х
** test: no duplicate tailored CEs for different reset positions with same CEs, ICU ticket 10104
@ rules
&t<x &ᵀ<y # same primary weights
&q<u &[before 1]ꝗ<v # q and ꝗ are primary adjacent
* compare
<1 q
<1 u
<1 v
<1 ꝗ
<1 t
<3 ᵀ
<1 y
<1 x
# Principle: Each rule builds on the state of preceding rules and ignores following rules.
** test: later rule does not affect earlier reset position, ICU ticket 10105
@ rules
&a < u < v < w &ov < x &b < v
* compare
<1 oa
<1 ou
<1 x # CE(o) followed by CE between u and w
<1 ow
<1 ob
<1 ov
** test: later rule does not affect earlier extension (1), ICU ticket 10105
@ rules
&a=x/b &v=b
% strength=secondary
* compare
<1 B
<1 c
<1 v
= b
* compare
<1 AB
= x
<1 ac
<1 av
= ab
** test: later rule does not affect earlier extension (2), ICU ticket 10105
@ rules
&a <<< c / e &g <<< e / l
% strength=secondary
* compare
<1 AE
= c
<2 æ
<1 agl
= ae
** test: later rule does not affect earlier extension (3), ICU ticket 10105
@ rules
&a = b / c &d = c / e
% strength=secondary
* compare
<1 AC # C is still only tertiary different from the original c
= b
<1 ade
= ac
** test: extension contains tailored character, ICU ticket 10105
@ rules
&a=e &b=u/e
* compare
<1 a
= e
<1 ba
= be
= u
** test: add simple mappings for characters with root context
@ rules
&z=· # middle dot has a prefix mapping in the CLDR root
&n=и # и (U+0438) has contractions in the root
* compare
<1 l
<2 l· # root mapping for l|· still works
<1 z
= ·
* compare
<1 n
= и
<1 И
<1 и\u0306 # root mapping for й=и\u0306 still works
= й
<3 Й
** test: add context mappings around characters with root context
@ rules
&z=·h # middle dot has a prefix mapping in the CLDR root
&n=ә|и # и (U+0438) has contractions in the root
* compare
<1 l
<2 l· # root mapping for l|· still works
<1 z
= ·h
* compare
<1 и
<3 И
<1 и\u0306 # root mapping for й=и\u0306 still works
= й
* compare
<1 әn
= әи
<1 әo
** test: many secondary CEs at the top of their range
@ rules
&[last primary ignorable]<<*\u2801-\u28ff
* compare
<2 \u0308
<2 \u2801
<2 \u2802
<2 \u2803
<2 \u2804
<2 \u28fd
<2 \u28fe
<2 \u28ff
<1 \x20
** test: many tertiary CEs at the top of their range
@ rules
&[last secondary ignorable]<<<*a-z
* compare
<3 a
<3 b
<3 c
<3 d
# e..w
<3 x
<3 y
<3 z
<2 \u0308
** test: tailor contraction together with nearly equivalent prefix, ICU ticket 10101
@ rules
&a=p|x &b=px &c=op
* compare
<1 b
= px
<3 B
<1 c
= op
<3 C
* compare
<1 ca
= opx # first contraction op, then prefix p|x
<3 cA
<3 Ca
** test: reset position with prefix (pre-context), ICU ticket 10102
@ rules
&a=p|x &px=y
* compare
<1 pa
= px
= y
<3 pA
<1 q
<1 x
** test: prefix+contraction together (1), ICU ticket 10071
@ rules
&x=a|bc
* compare
<1 ab
<1 Abc
<1 abd
<1 ac
<1 aw
<1 ax
= abc
<3 aX
<3 Ax
<1 b
<1 bb
<1 bc
<3 bC
<3 Bc
<1 bd
** test: prefix+contraction together (2), ICU ticket 10071
@ rules
&w=bc &x=a|b
* compare
<1 w
= bc
<3 W
* compare
<1 aw
<1 ax
= ab
<3 aX
<1 axb
<1 axc
= abc # prefix match a|b takes precedence over contraction match bc
<3 abC
<1 abd
<1 ay
** test: prefix+contraction together (3), ICU ticket 10071
@ rules
&x=a|b &w=bc # reverse order of rules as previous test, order should not matter here
* compare # same "compare" sequences as previous test
<1 w
= bc
<3 W
* compare
<1 aw
<1 ax
= ab
<3 aX
<1 axb
<1 axc
= abc # prefix match a|b takes precedence over contraction match bc
<3 abC
<1 abd
<1 ay
** test: no mapping p|c, falls back to contraction ch, CLDR ticket 5962
@ rules
&d=ch &v=p|ci
* compare
<1 pc
<3 pC
<1 pcH
<1 pcI
<1 pd
= pch # no-prefix contraction ch matches
<3 pD
<1 pv
= pci # prefix+contraction p|ci matches
<3 pV
** test: tailor in & around compact ranges of root primaries
# The Ogham characters U+1681..U+169A are in simple ascending order of primary CEs
# which should be reliably encoded as one range in the root elements data.
@ rules
&[before 1]ᚁ<a
&ᚁ<b
&[before 1]ᚂ<c
&ᚂ<d
&[before 1]ᚚ<y
&ᚚ<z
&[before 2]ᚁ<<r
&ᚁ<<s
&[before 3]ᚚ<<<t
&ᚚ<<<u
* compare
<1 ᣵ # U+18F5 last Canadian Aboriginal
<1 a
<1 r
<2 ᚁ
<2 s
<1 b
<1 c
<1 ᚂ
<1 d
<1 ᚃ
<1 ᚙ
<1 y
<1 t
<3 ᚚ
<3 u
<1 z
<1 ᚠ # U+16A0 first Runic
** test: suppressContractions
@ rules
&z<ch<әж [suppressContractions [·cә]]
* compare
<1 ch
<3 cH # ch was suppressed
<1 l
<1 l· # primary difference, not secondary, because l|· was suppressed
<1 ә
<2 ә\u0308 # secondary difference, not primary, because contractions for ә were suppressed
<1 әж
<3 әЖ
** test: Hangul & Jamo
@ rules
&L=\u1100 # first Jamo L
&V=\u1161 # first Jamo V
&T=\u11A8 # first Jamo T
&\uAC01<<*\u4E00-\u4EFF # first Hangul LVT syllable & lots of secondary diffs
* compare
<1 Lv
<3 LV
= \u1100\u1161
= \uAC00
<1 LVt
<3 LVT
= \u1100\u1161\u11A8
= \uAC00\u11A8
= \uAC01
<2 LVT\u0308
<2 \u4E00
<2 \u4E01
<2 \u4E80
<2 \u4EFF
<2 LV\u0308T
<1 \uAC02
** test: adjust special reset positions according to previous rules, CLDR ticket 6070
@ rules
&[last variable]<x
[maxVariable space] # has effect only after building, no effect on following rules
&[last variable]<y
&[before 1][first regular]<z
* compare
<1 ? # some punctuation
<1 x
<1 y
<1 z
<1 $ # some symbol
@ rules
&[last primary ignorable]<<x<<<y
&[last primary ignorable]<<z
* compare
<2 \u0358
<2 x
<3 y
<2 z
<1 \x20
@ rules
&[last secondary ignorable]<<<x
&[last secondary ignorable]<<<y
* compare
<3 x
<3 y
<2 \u0358
@ rules
&[before 2][first variable]<<z
&[before 2][first variable]<<y
&[before 3][first variable]<<<x
&[before 3][first variable]<<<w
&[before 1][first variable]<v
&[before 2][first variable]<<u
&[before 3][first variable]<<<t
&[before 2]\uFDD1\xA0<<s # FractionalUCA.txt: FDD1 00A0, SPACE first primary
* compare
<2 \u0358
<1 s
<2 \uFDD1\xA0
<1 t
<3 u
<2 v
<1 w
<3 x
<3 y
<2 z
<2 \t
@ rules
&[before 2][first regular]<<z
&[before 3][first regular]<<<y
&[before 1][first regular]<x
&[before 3][first regular]<<<w
&[before 2]\uFDD1\u263A<<v # FractionalUCA.txt: FDD1 263A, SYMBOL first primary
&[before 3][first regular]<<<u
&[before 1][first regular]<p # primary before the boundary: becomes variable
&[before 3][first regular]<<<t # not affected by p
&[last variable]<q # after p!
* compare
<1 ?
<1 p
<1 q
<1 t
<3 u
<3 v
<1 w
<3 x
<1 y
<3 z
<1 $
# check that p & q are indeed variable
% alternate=shifted
* compare
= ?
= p
= q
<1 t
<3 u
<3 v
<1 w
<3 x
<1 y
<3 z
<1 $
@ rules
&[before 2][first trailing]<<z
&[before 1][first trailing]<y
&[before 3][first trailing]<<<x
* compare
<1 \u4E00 # first Han, first implicit
<1 \uFDD1\uFDD0 # FractionalUCA.txt: unassigned first primary
# Note: The root collator currently does not map any characters to the trailing first boundary primary.
<1 x
<3 y
<1 z
<2 \uFFFD # The root collator currently maps U+FFFD to the first real trailing primary.
@ rules
&[before 2][first primary ignorable]<<z
&[before 2][first primary ignorable]<<y
&[before 3][first primary ignorable]<<<x
&[before 3][first primary ignorable]<<<w
* compare
= \x01
<2 w
<3 x
<3 y
<2 z
<2 \u0301
@ rules
&[before 3][first secondary ignorable]<<<y
&[before 3][first secondary ignorable]<<<x
* compare
= \x01
<3 x
<3 y
<2 \u0301
** test: canonical closure
@ rules
&X=A &U=Â
* compare
<1 U
= Â
= A\u0302
<2 Ú # U with acute
= U\u0301
= Ấ # A with circumflex & acute
= Â\u0301
= A\u0302\u0301
<1 X
= A
<2 X\u030A # with ring above
= Å
= A\u030A
= \u212B # Angstrom sign
@ rules
&x=\u5140\u55C0
* compare
<1 x
= \u5140\u55C0
= \u5140\uFA0D
= \uFA0C\u55C0
= \uFA0C\uFA0D # CJK compatibility characters
<3 X
# canonical closure on prefix rules, ICU ticket 9444
@ rules
&x=ä|ŝ
* compare
<1 äs # not tailored
<1 äx
= äŝ
= a\u0308s\u0302
= a\u0308ŝ
= äs\u0302
<3 äX
** test: conjoining Jamo map to expansions
@ rules
&gg=\u1101 # Jamo Lead consonant GG
&nj=\u11AC # Jamo Trail consonant NJ
* compare
<1 gg\u1161nj
= \u1101\u1161\u11AC
= \uAE4C\u11AC
= \uAE51
<3 gg\u1161nJ
<1 \u1100\u1100
** test: canonical tail closure, ICU ticket 5913
@ rules
&a<â
* compare
<1 a
<1 â # tailored
= a\u0302
<2 a\u0323\u0302 # discontiguous contraction
= ạ\u0302 # equivalent
= ậ # equivalent
<1 b
@ rules
&a<ạ
* compare
<1 a
<1 ạ # tailored
= a\u0323
<2 a\u0323\u0302 # contiguous contraction plus extra diacritic
= ạ\u0302 # equivalent
= ậ # equivalent
<1 b
# Tail closure should work even if there is a prefix and/or contraction.
@ rules
&a<\u5140|câ
# In order to find discontiguous contractions for \u5140|câ
# there must exist a mapping for \u5140|ca, regardless of what it maps to.
# (This follows from the UCA spec.)
&x=\u5140|ca
* compare
<1 \u5140a
= \uFA0Ca
<1 \u5140câ # tailored
= \uFA0Ccâ
= \u5140ca\u0302
= \uFA0Cca\u0302
<2 \u5140ca\u0323\u0302 # discontiguous contraction
= \uFA0Cca\u0323\u0302
= \u5140cạ\u0302
= \uFA0Ccạ\u0302
= \u5140cậ
= \uFA0Ccậ
<1 \u5140b
= \uFA0Cb
<1 \u5140x
= \u5140ca
# Double-check that without the extra mapping there will be no discontiguous match.
@ rules
&a<\u5140|câ
* compare
<1 \u5140a
= \uFA0Ca
<1 \u5140câ # tailored
= \uFA0Ccâ
= \u5140ca\u0302
= \uFA0Cca\u0302
<1 \u5140b
= \uFA0Cb
<1 \u5140ca\u0323\u0302 # no discontiguous contraction
= \uFA0Cca\u0323\u0302
= \u5140cạ\u0302
= \uFA0Ccạ\u0302
= \u5140cậ
= \uFA0Ccậ
@ rules
&a<cạ
* compare
<1 a
<1 cạ # tailored
= ca\u0323
<2 ca\u0323\u0302 # contiguous contraction plus extra diacritic
= cạ\u0302 # equivalent
= cậ # equivalent
<1 b
# ᾢ = U+1FA2 GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI
# = 03C9 0313 0300 0345
# ccc = 0, 230, 230, 240
@ rules
&δ=αῳ
# In order to find discontiguous contractions for αῳ
# there must exist a mapping for αω, regardless of what it maps to.
# (This follows from the UCA spec.)
&ε=αω
* compare
<1 δ
= αῳ
= αω\u0345
<2 αω\u0313\u0300\u0345 # discontiguous contraction
= αὠ\u0300\u0345
= αὢ\u0345
= αᾢ
<2 αω\u0300\u0313\u0345
= αὼ\u0313\u0345
= αῲ\u0313 # not FCD
<1 ε
= αω
# Double-check that without the extra mapping there will be no discontiguous match.
@ rules
&δ=αῳ
* compare
<1 αω\u0313\u0300\u0345 # no discontiguous contraction
= αὠ\u0300\u0345
= αὢ\u0345
= αᾢ
<2 αω\u0300\u0313\u0345
= αὼ\u0313\u0345
= αῲ\u0313 # not FCD
<1 δ
= αῳ
= αω\u0345
# Add U+0315 COMBINING COMMA ABOVE RIGHT which has ccc=232.
# Tests code paths where the tailored string has a combining mark
# that does not occur in any composite's decomposition.
@ rules
&δ=αὼ\u0315
* compare
<1 αω\u0313\u0300\u0315 # Not tailored: The grave accent blocks the comma above.
= αὠ\u0300\u0315
= αὢ\u0315
<1 δ
= αὼ\u0315
= αω\u0300\u0315
<2 αω\u0300\u0315\u0345
= αὼ\u0315\u0345
= αῲ\u0315 # not FCD
** test: danish a+a vs. a-umlaut, ICU ticket 9319
@ rules
&z<aa
* compare
<1 z
<1 aa
<2 aa\u0308
= aä
** test: Jamo L with and in prefix
# Useful for the Korean "searchjl" tailoring (instead of contractions of pairs of Jamo L).
@ rules
# Jamo Lead consonant G after G or GG
&[last primary ignorable]<<\u1100|\u1100=\u1101|\u1100
# Jamo Lead consonant GG sorts like G+G
&\u1100\u1100=\u1101
# Note: Making G|GG and GG|GG sort the same as G|G+G
# would require the ability to reset on G|G+G,
# or we could make G-after-G equal to some secondary-CE character,
# and reset on a pair of those.
# (It does not matter much if there are at most two G in a row in real text.)
* compare
<1 \u1100
<2 \u1100\u1100 # only one primary from a sequence of G lead consonants
= \u1101
<2 \u1100\u1100\u1100
= \u1101\u1100
# but not = \u1100\u1101, see above
<1 \u1100\u1161
= \uAC00
<2 \u1100\u1100\u1161
= \u1100\uAC00 # prefix match from the L of the LV syllable
= \u1101\u1161
= \uAE4C
** test: proposed Korean "searchjl" tailoring with prefixes, CLDR ticket 6546
@ rules
# Low secondary CEs for Jamo V & T.
# Note: T should sort before V for proper syllable order.
&\u0332 # COMBINING LOW LINE (first primary ignorable)
<<\u1161<<\u1162
# Korean Jamo lead consonant search rules, part 2:
# Make modern compound L jamo primary equivalent to non-compound forms.
# Secondary CEs for Jamo L-after-L, greater than Jamo V & T.
&\u0313 # COMBINING COMMA ABOVE (second primary ignorable)
=\u1100|\u1100
=\u1103|\u1103
=\u1107|\u1107
=\u1109|\u1109
=\u110C|\u110C
# Compound L Jamo map to equivalent expansions of primary+secondary CE.
&\u1100\u0313=\u1101<<<\u3132 # HANGUL CHOSEONG SSANGKIYEOK, HANGUL LETTER SSANGKIYEOK
&\u1103\u0313=\u1104<<<\u3138 # HANGUL CHOSEONG SSANGTIKEUT, HANGUL LETTER SSANGTIKEUT
&\u1107\u0313=\u1108<<<\u3143 # HANGUL CHOSEONG SSANGPIEUP, HANGUL LETTER SSANGPIEUP
&\u1109\u0313=\u110A<<<\u3146 # HANGUL CHOSEONG SSANGSIOS, HANGUL LETTER SSANGSIOS
&\u110C\u0313=\u110D<<<\u3149 # HANGUL CHOSEONG SSANGCIEUC, HANGUL LETTER SSANGCIEUC
* compare
<1 \u1100\u1161
= \uAC00
<2 \u1100\u1162
= \uAC1C
<2 \u1100\u1100\u1161
= \u1100\uAC00
= \u1101\u1161
= \uAE4C
<3 \u3132\u1161
** test: Hangul syllables in prefix & in the interior of a contraction
@ rules
&x=\u1100\u1161|a\u1102\u1162z
* compare
<1 \u1100\u1161x
= \u1100\u1161a\u1102\u1162z
= \u1100\u1161a\uB0B4z
= \uAC00a\u1102\u1162z
= \uAC00a\uB0B4z
** test: digits are unsafe-backwards when numeric=on
@ root
% numeric=on
* compare
# If digits are not unsafe, then numeric collation sees "1"=="01" and "b">"a".
# We need to back up before the identical prefix "1" and compare the full numbers.
<1 11b
<1 101a
** test: simple locale data test
@ locale de
* compare
<1 a
<2 ä
<1 ae
<2 æ
@ locale de-u-co-phonebk
* compare
<1 a
<1 ae
<2 ä
<2 æ
# The following test cases were moved here from ICU 52's DataDrivenCollationTest.txt.
** test: DataDrivenCollationTest/TestMorePinyin
# Testing the primary strength.
@ locale zh
% strength=primary
* compare
< lā
= lĀ
= Lā
= LĀ
< lān
= lĀn
< lē
= lĒ
= Lē
= LĒ
< lēn
= lĒn
** test: DataDrivenCollationTest/TestLithuanian
# Lithuanian sort order.
@ locale lt
* compare
< cz
< č
< d
< iz
< j
< sz
< š
< t
< zz
< ž
** test: DataDrivenCollationTest/TestLatvian
# Latvian sort order.
@ locale lv
* compare
< cz
< č
< d
< gz
< ģ
< h
< iz
< j
< kz
< ķ
< l
< lz
< ļ
< m
< nz
< ņ
< o
< rz
< ŗ
< s
< sz
< š
< t
< zz
< ž
** test: DataDrivenCollationTest/TestEstonian
# Estonian sort order.
@ locale et
* compare
< sy
< š
< šy
< z
< zy
< ž
< v
< va
< w
< õ
< õy
< ä
< äy
< ö
< öy
< ü
< üy
< x
** test: DataDrivenCollationTest/TestAlbanian
# Albanian sort order.
@ locale sq
* compare
< cz
< ç
< d
< dz
< dh
< e
< ez
< ë
< f
< gz
< gj
< h
< lz
< ll
< m
< nz
< nj
< o
< rz
< rr
< s
< sz
< sh
< t
< tz
< th
< u
< xz
< xh
< y
< zz
< zh
** test: DataDrivenCollationTest/TestSimplifiedChineseOrder
# Sorted file has different order.
@ root
# normalization=on turned on & off automatically.
* compare
< \u5F20
< \u5F20\u4E00\u8E3F
** test: DataDrivenCollationTest/TestTibetanNormalizedIterativeCrash
# This pretty much crashes.
@ root
* compare
< \u0f71\u0f72\u0f80\u0f71\u0f72
< \u0f80
** test: DataDrivenCollationTest/TestThaiPartialSortKeyProblems
# These are examples of strings that caused trouble in partial sort key testing.
@ locale th-TH
* compare
< \u0E01\u0E01\u0E38\u0E18\u0E20\u0E31\u0E13\u0E11\u0E4C
< \u0E01\u0E01\u0E38\u0E2A\u0E31\u0E19\u0E42\u0E18
* compare
< \u0E01\u0E07\u0E01\u0E32\u0E23
< \u0E01\u0E07\u0E42\u0E01\u0E49
* compare
< \u0E01\u0E23\u0E19\u0E17\u0E32
< \u0E01\u0E23\u0E19\u0E19\u0E40\u0E0A\u0E49\u0E32
* compare
< \u0E01\u0E23\u0E30\u0E40\u0E08\u0E35\u0E22\u0E27
< \u0E01\u0E23\u0E30\u0E40\u0E08\u0E35\u0E4A\u0E22\u0E27
* compare
< \u0E01\u0E23\u0E23\u0E40\u0E0A\u0E2D
< \u0E01\u0E23\u0E23\u0E40\u0E0A\u0E49\u0E32
** test: DataDrivenCollationTest/TestJavaStyleRule
# java.text allows rules to start as '<<<x<<<y...'
# we emulate this by assuming a &[first tertiary ignorable] in this case.
@ rules
&\u0001=equal<<<z<<x<<<w &[first tertiary ignorable]=a &[first primary ignorable]=b
* compare
= a
= equal
< z
< x
= b # x had become the new first primary ignorable
< w
** test: DataDrivenCollationTest/TestShiftedIgnorable
# The UCA states that primary ignorables should be completely
# ignorable when following a shifted code point.
@ root
% alternate=shifted
% strength=quaternary
* compare
< a\u0020b
= a\u0020\u0300b
= a\u0020\u0301b
< a_b
= a_\u0300b
= a_\u0301b
< A\u0020b
= A\u0020\u0300b
= A\u0020\u0301b
< A_b
= A_\u0300b
= A_\u0301b
< a\u0301b
< A\u0301b
< a\u0300b
< A\u0300b
** test: DataDrivenCollationTest/TestNShiftedIgnorable
# The UCA states that primary ignorables should be completely
# ignorable when following a shifted code point.
@ root
% alternate=non-ignorable
% strength=tertiary
* compare
< a\u0020b
< A\u0020b
< a\u0020\u0301b
< A\u0020\u0301b
< a\u0020\u0300b
< A\u0020\u0300b
< a_b
< A_b
< a_\u0301b
< A_\u0301b
< a_\u0300b
< A_\u0300b
< a\u0301b
< A\u0301b
< a\u0300b
< A\u0300b
** test: DataDrivenCollationTest/TestSafeSurrogates
# It turned out that surrogates were not skipped properly
# when iterating backwards if they were in the middle of a
# contraction. This test assures that this is fixed.
@ rules
&a < x\ud800\udc00b
* compare
< a
< x\ud800\udc00b
** test: DataDrivenCollationTest/da_TestPrimary
# This test goes through primary strength cases
@ locale da
% strength=primary
* compare
< Lvi
< Lwi
* compare
< L\u00e4vi
< L\u00f6wi
* compare
< L\u00fcbeck
= Lybeck
** test: DataDrivenCollationTest/da_TestTertiary
# This test goes through tertiary strength cases
@ locale da
% strength=tertiary
* compare
< Luc
< luck
* compare
< luck
< L\u00fcbeck
* compare
< lybeck
< L\u00fcbeck
* compare
< L\u00e4vi
< L\u00f6we
* compare
< L\u00f6ww
< mast
* compare
< A/S
< ANDRE
< ANDR\u00c9
< ANDREAS
< AS
< CA
< \u00c7A
< CB
< \u00c7C
< D.S.B.
< DA
< \u00d0A
< DB
< \u00d0C
< DSB
< DSC
< EKSTRA_ARBEJDE
< EKSTRABUD0
< H\u00d8ST
< HAAG
< H\u00c5NDBOG
< HAANDV\u00c6RKSBANKEN
< Karl
< karl
< NIELS\u0020J\u00d8RGEN
< NIELS-J\u00d8RGEN
< NIELSEN
< R\u00c9E,\u0020A
< REE,\u0020B
< R\u00c9E,\u0020L
< REE,\u0020V
< SCHYTT,\u0020B
< SCHYTT,\u0020H
< SCH\u00dcTT,\u0020H
< SCHYTT,\u0020L
< SCH\u00dcTT,\u0020M
< SS
< \u00df
< SSA
< STORE\u0020VILDMOSE
< STOREK\u00c6R0
< STORM\u0020PETERSEN
< STORMLY
< THORVALD
< THORVARDUR
< \u00feORVAR\u00d0UR
< THYGESEN
< VESTERG\u00c5RD,\u0020A
< VESTERGAARD,\u0020A
< VESTERG\u00c5RD,\u0020B
< \u00c6BLE
< \u00c4BLE
< \u00d8BERG
< \u00d6BERG
* compare
< andere
< chaque
< chemin
< cote
< cot\u00e9
< c\u00f4te
< c\u00f4t\u00e9
< \u010du\u010d\u0113t
< Czech
< hi\u0161a
< irdisch
< lie
< lire
< llama
< l\u00f5ug
< l\u00f2za
< lu\u010d
< luck
< L\u00fcbeck
< lye
< l\u00e4vi
< L\u00f6wen
< m\u00e0\u0161ta
< m\u00eer
< myndig
< M\u00e4nner
< m\u00f6chten
< pi\u00f1a
< pint
< pylon
< \u0161\u00e0ran
< savoir
< \u0160erb\u016bra
< Sietla
< \u015blub
< subtle
< symbol
< s\u00e4mtlich
< verkehrt
< vox
< v\u00e4ga
< waffle
< wood
< yen
< yuan
< yucca
< \u017eal
< \u017eena
< \u017den\u0113va
< zoo0
< Zviedrija
< Z\u00fcrich
< zysk0
< \u00e4ndere
** test: DataDrivenCollationTest/hi_TestNewRules
# This test goes through new rules and tests against old rules
@ locale hi
* compare
< कॐ
< कं
< कँ
< कः
** test: DataDrivenCollationTest/ro_TestNewRules
# This test goes through new rules and tests against old rules
@ locale ro
* compare
< xAx
< xă
< xĂ
< Xă
< XĂ
< xăx
< xĂx
< xâ
< xÂ
< Xâ
< XÂ
< xâx
< xÂx
< xb
< xIx
< xî
< xÎ
< Xî
< XÎ
< xîx
< xÎx
< xj
< xSx
< xș
= xş
< xȘ
= xŞ
< Xș
= Xş
< XȘ
= XŞ
< xșx
= xşx
< xȘx
= xŞx
< xT
< xTx
< xț
= xţ
< xȚ
= xŢ
< Xț
= Xţ
< XȚ
= XŢ
< xțx
= xţx
< xȚx
= xŢx
< xU
** test: DataDrivenCollationTest/testOffsets
# This tests cases where forwards and backwards iteration get different offsets
@ locale en
% strength=tertiary
* compare
< a\uD800\uDC00\uDC00
< b\uD800\uDC00\uDC00
* compare
< \u0301A\u0301\u0301
< \u0301B\u0301\u0301
* compare
< abcd\r\u0301
< abce\r\u0301
# TODO: test offsets in new CollationTest
# End of test cases moved here from ICU 52's DataDrivenCollationTest.txt.
** test: was ICU 52 cmsccoll/TestRedundantRules
@ rules
& a < b < c < d& [before 1] c < m
* compare
<1 a
<1 b
<1 m
<1 c
<1 d
@ rules
& a < b <<< c << d <<< e& [before 3] e <<< x
* compare
<1 a
<1 b
<3 c
<2 d
<3 x
<3 e
@ rules
& a < b <<< c << d <<< e <<< f < g& [before 1] g < x
* compare
<1 a
<1 b
<3 c
<2 d
<3 e
<3 f
<1 x
<1 g
@ rules
& a <<< b << c < d& a < m
* compare
<1 a
<3 b
<2 c
<1 m
<1 d
@ rules
&a<b<<b\u0301 &z<b
* compare
<1 a
<1 b\u0301
<1 z
<1 b
@ rules
&z<m<<<q<<<m
* compare
<1 z
<1 q
<3 m
@ rules
&z<<<m<q<<<m
* compare
<1 z
<1 q
<3 m
@ rules
& a < b < c < d& r < c
* compare
<1 a
<1 b
<1 d
<1 r
<1 c
@ rules
& a < b < c < d& c < m
* compare
<1 a
<1 b
<1 c
<1 m
<1 d
@ rules
& a < b < c < d& a < m
* compare
<1 a
<1 m
<1 b
<1 c
<1 d
** test: was ICU 52 cmsccoll/TestExpansionSyntax
# The following two rules should sort the particular list of strings the same.
@ rules
&AE <<< a << b <<< c &d <<< f
* compare
<1 AE
<3 a
<2 b
<3 c
<1 d
<3 f
@ rules
&A <<< a / E << b / E <<< c /E &d <<< f
* compare
<1 AE
<3 a
<2 b
<3 c
<1 d
<3 f
# The following two rules should sort the particular list of strings the same.
@ rules
&AE <<< a <<< b << c << d < e < f <<< g
* compare
<1 AE
<3 a
<3 b
<2 c
<2 d
<1 e
<1 f
<3 g
@ rules
&A <<< a / E <<< b / E << c / E << d / E < e < f <<< g
* compare
<1 AE
<3 a
<3 b
<2 c
<2 d
<1 e
<1 f
<3 g
# The following two rules should sort the particular list of strings the same.
@ rules
&AE <<< B <<< C / D <<< F
* compare
<1 AE
<3 B
<3 F
<1 AED
<3 C
@ rules
&A <<< B / E <<< C / ED <<< F / E
* compare
<1 AE
<3 B
<3 F
<1 AED
<3 C
** test: never reorder trailing primaries
@ root
% reorder Zzzz Grek
* compare
<1 L
<1 字
<1 Ω
<1 \uFFFD
<1 \uFFFF
** test: fall back to mappings with shorter prefixes, not immediately to ones with no prefixes
@ rules
&u=ab|cd
&v=b|ce
* compare
<1 abc
<1 abcc
<1 abcf
<1 abcd
= abu
<1 abce
= abv
# With the following rules, there is only one prefix per composite ĉ or ç,
# but both prefixes apply to just c in NFD form.
# We would get different results for composed vs. NFD input
# if we fell back directly from longest-prefix mappings to no-prefix mappings.
@ rules
&x=op|ĉ
&y=p|ç
* compare
<1 opc
<2 opć
<1 opcz
<1 opd
<1 opĉ
= opc\u0302
= opx
<1 opç
= opc\u0327
= opy
# The mapping is used which has the longest matching prefix for which
# there is also a suffix match, with the longest suffix match among several for that prefix.
@ rules
&❶=d
&❷=de
&❸=def
&①=c|d
&②=c|de
&③=c|def
&④=bc|d
&⑤=bc|de
&⑥=bc|def
&⑦=abc|d
&⑧=abc|de
&⑨=abc|def
* compare
<1 9aadzz
= 9aa❶zz
<1 9aadez
= 9aa❷z
<1 9aadef
= 9aa❸
<1 9acdzz
= 9ac①zz
<1 9acdez
= 9ac②z
<1 9acdef
= 9ac③
<1 9bcdzz
= 9bc④zz
<1 9bcdez
= 9bc⑤z
<1 9bcdef
= 9bc⑥
<1 abcdzz
= abc⑦zz
<1 abcdez
= abc⑧z
<1 abcdef
= abc⑨
** test: prefix + discontiguous contraction with missing prefix contraction
# Unfortunate terminology: The first "prefix" here is the pre-context,
# the second "prefix" refers to the contraction/relation string that is
# one shorter than the one being tested.
@ rules
&x=p|e
&y=p|ê
&z=op|ê
# No mapping for op|e:
# Discontiguous contraction matching should not match op|ê in opệ
# because it would have to skip the dot below and extend a match on op|e by the circumflex,
# but there is no match on op|e.
* compare
<1 oPe
<1 ope
= opx
<1 opệ
= opy\u0323 # y not z
<1 opê
= opz
# We cannot test for fallback by whether the contraction default CE32
# is for another contraction. With the following rules, there is no mapping for op|e,
# and the fallback to prefix p has no contractions.
@ rules
&x=p|e
&z=op|ê
* compare
<1 oPe
<1 ope
= opx
<2 opệ
= opx\u0323\u0302 # x not z
<1 opê
= opz
# One more variation: Fallback to the simple code point, no shorter non-empty prefix.
@ rules
&x=e
&z=op|ê
* compare
<1 ope
= opx
<3 oPe
= oPx
<2 opệ
= opx\u0323\u0302 # x not z
<1 opê
= opz
** test: maxVariable via rules
@ rules
[maxVariable space][alternate shifted]
* compare
= \u0020
= \u000A
<1 .
<1 ° # degree sign
<1 $
<1 0
** test: maxVariable via setting
@ root
% maxVariable=currency
% alternate=shifted
* compare
= \u0020
= \u000A
= .
= ° # degree sign
= $
<1 0
** test: ICU4J CollationMiscTest/TestContractionClosure (ää)
# This tests canonical closure, but it also tests that CollationFastLatin
# bails out properly for contractions with combining marks.
# For that we need pairs of strings that remain in the Latin fastpath
# long enough, hence the extra "= b" lines.
@ rules
&b=\u00e4\u00e4
* compare
<1 b
= \u00e4\u00e4
= b
= a\u0308a\u0308
= b
= \u00e4a\u0308
= b
= a\u0308\u00e4
** test: ICU4J CollationMiscTest/TestContractionClosure (Å)
@ rules
&b=\u00C5
* compare
<1 b
= \u00C5
= b
= A\u030A
= b
= \u212B
** test: reset-before on already-tailored characters, ICU ticket 10108
@ rules
&a<w<<x &[before 2]x<<y
* compare
<1 a
<1 w
<2 y
<2 x
@ rules
&a<<w<<<x &[before 2]x<<y
* compare
<1 a
<2 y
<2 w
<3 x
@ rules
&a<w<x &[before 2]x<<y
* compare
<1 a
<1 w
<1 y
<2 x
@ rules
&a<w<<<x &[before 2]x<<y
* compare
<1 a
<1 y
<2 w
<3 x
** test: numeric collation with other settings, ICU ticket 9092
@ root
% strength=identical
% caseFirst=upper
% numeric=on
* compare
<1 100\u0020a
<1 101
** test: collation type fallback from unsupported type, ICU ticket 10149
@ locale fr-CA-u-co-phonebk
# Expect the same result as with fr-CA, using backwards-secondary order.
# That is, we should fall back from the unsupported collation type
# to the locale's default collation type.
* compare
<1 cote
<2 côte
<2 coté
<2 côté
** test: @ is equivalent to [backwards 2], ICU ticket 9956
@ rules
&b<a @ &v<<w
* compare
<1 b
<1 a
<1 cote
<2 côte
<2 coté
<2 côté
<1 v
<2 w
<1 x
** test: shifted+reordering, ICU ticket 9507
@ root
% reorder Grek punct space
% alternate=shifted
% strength=quaternary
# Which primaries are "variable" should be determined without script reordering,
# and then primaries should be reordered whether they are shifted to quaternary or not.
* compare
<4 ( # punctuation
<4 )
<4 \u0020 # space
<1 ` # symbol
<1 ^
<1 $ # currency symbol
<1 €
<1 0 # numbers
<1 ε # Greek
<1 e # Latin
<1 e(e
<4 e)e
<4 e\u0020e
<4 ee
<3 e(E
<4 e)E
<4 e\u0020E
<4 eE
** test: "uppercase first" could sort a string before its prefix, ICU ticket 9351
@ rules
&\u0001<<<b<<<B
% caseFirst=upper
* compare
<1 aaa
<3 aaaB
** test: secondary+case ignores secondary ignorables, ICU ticket 9355
@ rules
&\u0001<<<b<<<B
% strength=secondary
% caseLevel=on
* compare
<1 a
= ab
= aB
** test: custom collation rules involving tail of a contraction in Malayalam, ICU ticket 6328
@ rules
&[before 2] ൌ << ൗ # U+0D57 << U+0D4C == 0D46+0D57
* compare
<1 ൗx
<2 ൌx
<1 ൗy
<2 ൌy
** test: quoted apostrophe in compact syntax, ICU ticket 8204
@ rules
&q<<*a''c
* compare
<1 d
<1 p
<1 q
<2 a
<2 \u0027
<2 c
<1 r
# ICU ticket #8260 "Support all collation-related keywords in Collator.getInstance()"
** test: locale -u- with collation keywords, ICU ticket 8260
@ locale de-u-kv-sPace-ka-shifTed-kn-kk-falsE-kf-Upper-kc-tRue-ks-leVel4
* compare
<4 \u0020 # space is shifted, strength=quaternary
<1 ! # punctuation is regular
<1 2
<1 12 # numeric sorting
<1 B
<c b # uppercase first on case level
<1 x\u0301\u0308
<2 x\u0308\u0301 # normalization off
** test: locale @ with collation keywords, ICU ticket 8260
@ locale fr@colbAckwards=yes;ColStrength=Quaternary;kv=currencY;colalternate=shifted
* compare
<4 $ # currency symbols are shifted, strength=quaternary
<1 àla
<2 alà # backwards secondary level
** test: locale -u- with script reordering, ICU ticket 8260
@ locale el-u-kr-kana-SYMBOL-Grek-hani-cyrl-latn-digit-armn-deva-ethi-thai
* compare
<1 \u0020
<1 あ
<1 ☂
<1 Ω
<1 丂
<1 ж
<1 L
<1 4
<1 Ձ
<1 अ
<1 ሄ
<1 ฉ
** test: locale @collation=type should be case-insensitive
@ locale de@coLLation=PhoneBook
* compare
<1 ae
<2 ä
<3 Ä
** test: import root search rules plus German phonebook rules, ICU ticket 8962
@ locale de-u-co-search
* compare
<1 =
<1 ≠
<1 a
<1 ae
<2 ä
# Once more, but with runtime builder.
@ rules
[import und-u-co-search][import de-u-co-phonebk]
* compare
<1 =
<1 ≠
<1 a
<1 ae
<2 ä
# Once again, with import from "root" not "und" (as in a proper language tag).
@ rules
[import root-u-co-search][import de-u-co-phonebk]
* compare
<1 =
<1 ≠
<1 a
<1 ae
<2 ä
** test: import rules from a language with non-Latin native script, and reset the reordering, ICU ticket 10998
# Greek should sort Greek first.
@ rules
[import el]
* compare
<1 4
<1 Ω
<1 L
# Import Greek, and then reset the reordering.
@ rules
[import el][reorder Zzzz]
* compare
<1 4
<1 L
<1 Ω
# "others" is a synonym for Zzzz.
@ rules
[import el][reorder others]
* compare
<1 4
<1 L
<1 Ω
** test: regression test for CollationFastLatinBuilder, ICU ticket 11388
@ rules
&x<<aa<<<Aa<<<AA
% strength=secondary
* compare
<1 AA
<2 Aẩ
<2 aą
* compare
<1 AA
<2 aą
** test: tailor tertiary-after a common tertiary where there is a lower one
# Assume that Hiragana small A has a below-common tertiary, and Hiragana A has a common one.
# See ICU ticket 11448 & CLDR ticket 7222.
@ rules
&あ<<<x<<<y<<<z
* compare
<1 ぁ
<3 あ
<3 x
<3 y
<3 z
<3 ァ
<1 い
** test: tailor tertiary-after a below-common tertiary
@ rules
&ぁ<<<x<<<y<<<z
* compare
<1 ぁ
<3 x
<3 y
<3 z
<3 あ
<3 ァ
<1 い
** test: tailor tertiary-before a common tertiary where there is a lower one
@ rules
&[before 3]あ<<<x<<<y<<<z
* compare
<1 ぁ
<3 x
<3 y
<3 z
<3 あ
<3 ァ
<1 い
** test: tailor tertiary-before a below-common tertiary
@ rules
&[before 3]ぁ<<<x<<<y<<<z
* compare
<1 x
<3 y
<3 z
<3 ぁ
<3 あ
<3 ァ
<1 い
** test: reorder single scripts not groups, ICU ticket 11449
@ root
% reorder Goth Latn
* compare
<1 4
<1 𐌰 # Gothic
<1 L
<1 Ω
# Before ICU 55, the following reordered together with Gothic.
<1 𐌈 # Old Italic
<1 𐑐 # Shavian