This eliminates the need for a scratch buffer in Locale::toLanguageTag()
and also the need for counting bytes required in uloc_toLanguageTag(),
something that ByteSink will now handle correctly and thereby
eliminating the bug where too few bytes required was returned.
uloc_forLanguageTag has a few mapping tables to map grandfathered
language tags and deprecated language subtags to their preferred or
modern values.
Update them based on the latest version of the IANA
language subtag registry. [1]
Five grandfathered tags without a preferred value are still mapped to
what ICU has mapped them to for backward compatibility until the
wisdom of continuing to do so is reviewed.
In addition, map redundant language tags to their preferred values
regardless of whether they're followed by other subtags or not. (e.g.
zh-yue vs zh-yue-u-co-pinyin) .
Similary, ja-latn-hepburn-heploc is mapped to ja-latn-alaic97 (the
variant subtag 'hepburn-helploc' with the prefix 'ja-latn' has the
preferred value, 'alaic97') .
Update the mapping for deprecated language subtags (e.g. 'jw' to
'jv' and a bunch of 3-letter language codes).
Add a new table for deprecated region subtags to map them to their
modern values. (e.g. 'DD' to 'DE').
Add a new test case for deprecated language and region mapping and
a few more cases for updated grandfathered and redundant tag mapping.
[1]
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
This gets rid of those fixed buffers that caused ICU-13417 to be filed
in the first place, those that prevent handling language tags with very
large amounts of keywords.
A number of fixed buffers will still remain in uloc_tag.cpp (and
elsewhere in the locale handling code) for the time being, but this
change is a necessary first step in cleaning up this code and will
alleviate the most pressing problem encountered by ICU4C users.
An off-by-one error in _getKeywords() caused uloc_canonicalize() to not
write out the final keyword value in case the result would fill up the
buffer exactly, resulting in U_STRING_NOT_TERMINATED_WARNING.
* ICU-20140 Allow duplicated keys in U-extension per RFC 6067
RFC 6067 [1] does allow duplicate keywords, but ICU4C's
uloc_forLanguageCode rejects it as invalid.
Change it to accept duplicate keywords and honor only the
1st one while ignoring subsequent ones per RFC 6067.
[1] Unicode extension to BCP 47:
https://tools.ietf.org/html/rfc6067
* ICU-20140 Add ICU4J test and tweak ICU4C test
ICU4J test diverges from ICU4C tests:
1. Handling of duplicate variants in ICU4J seem to be wrong:
https://unicode-org.atlassian.net/browse/ICU-20148
2. ULocale.forLanguageTag only throws NullPointException so
that ICU4C's test for duplicate attributes cannot be ported.