34 Commits

Author SHA1 Message Date
Carlos O'Donell
7cd7d36f1f Keep expected behaviour for [a-z] and [A-Z] (Bug 23393).
In commit 9479b6d5e0 we updated all of
the collation data to harmonize with the new version of ISO 14651,
which is derived from Unicode 9.0.0.  This collation update brought
with it some changes to locales which were not desirable to some
users.  In particular, it altered the meaning of the locale-dependent
range expressions [a-z] and [A-Z], and for en_US it caused uppercase
letters to be matched by [a-z] for the first time.  The matching of
uppercase letters by [a-z] is already familiar to users of other
locales which have this property, but this change could cause
significant problems for en_US and other similar locales that had
never had this behaviour before.
Whether this behaviour is desirable or not is contentious.  GNU Awk
has this to say on the topic:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
The POSIX rationale, under "RE Bracket Expression", adds:
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
"The current standard leaves unspecified the behavior of a range
expression outside the POSIX locale. ... As noted above, efforts were
made to resolve the differences, but no solution has been found that
would be specific enough to allow for portable software while not
invalidating existing implementations."
In glibc we implement the requirement of ISO POSIX-2:1993 and use
collation element order (CEO) to construct the range expression; the
internal API is __collseq_table_lookup().  The fact that we use CEO
and also have 4-level weights on each collation rule means that we can
in practice reorder the collation rules in iso14651_t1_common (the new
data) to provide consistent range expression resolution *and* the
weights should maintain the expected total order.  Therefore this
patch does three things:

* Reorder the collation rules for the LATIN script in
  iso14651_t1_common to deinterlace uppercase and lowercase letters in
  the collation element orders.

* Add new test data en_US.UTF-8.in for sort-test.sh which exercises
  strcoll* and strxfrm* and ensures the ISO 14651 collation remains.

* Add back tests to tst-fnmatch.input and tst-regexloc.c which
  exercise that [a-z] does not match A or Z.
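
As a rough illustration of why deinterlacing matters, the sketch below
models a range expression as "every character whose position in the
collation element order lies between the two endpoints".  The function
name in_range and both toy orders are assumptions made for this
example; they are not the real iso14651_t1_common data and not the
glibc implementation.

```python
# Toy model of range resolution by collation element order (CEO).
# Both orders below are illustrative assumptions, not real locale data.
import string


def in_range(order, lo, hi, ch):
    """Return True if ch lies between lo and hi in the given element order."""
    pos = {c: i for i, c in enumerate(order)}
    return pos[lo] <= pos[ch] <= pos[hi]


# Interlaced order: a A b B ... z Z (lowercase and uppercase alternate).
interlaced = [
    c
    for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
    for c in pair
]

# Deinterlaced order: a b ... z followed by A B ... Z.
deinterlaced = list(string.ascii_lowercase) + list(string.ascii_uppercase)

print(in_range(interlaced, "a", "z", "A"))    # A sits between a and z
print(in_range(deinterlaced, "a", "z", "A"))  # A sits after z
```

In the interlaced toy order [a-z] picks up A and [A-Z] picks up z,
which mirrors the two fnmatch cases that the added tests expect to
return FNM_NOMATCH.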

The reordering of the ISO 14651 data is done in an entirely mechanical
fashion using the following program attached to the bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c28

It is up for discussion whether the iso14651_t1_common data should be
refined further to have 3 very tight collation element ranges that
include only a-z, A-Z, and 0-9, which would implement the solution
sought after in:
https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c12
and implemented here:
https://www.sourceware.org/ml/libc-alpha/2018-07/msg00854.html

No regressions on x86_64.
Verified that removal of the iso14651_t1_common change causes tst-fnmatch
to regress with:
422: fnmatch ("[a-z]", "A", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
...
425: fnmatch ("[A-Z]", "z", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
2018-07-25 17:00:45 -04:00
Mike FABIAN
1597385481 Adapt collation in several locales to the new iso14651_t1_common file
[BZ #22550] - es_ES locale (and other es_* locales): collation should
treat ñ as a primary different character, sync the collation
for Spanish with CLDR
[BZ #21547] - Tibetan script collation broken (Dzongkha and Tibetan)

	* localedata/Makefile: Add new test files.
	* localedata/lv_LV.UTF-8.in: Adapt test file to new collation order.
	* localedata/sv_SE.ISO-8859-1.in: Adapt test file to new collation order.
	* localedata/uk_UA.UTF-8.in: Adapt test file to new collation order.
	* localedata/am_ET.UTF-8.in: New test file.
	* localedata/az_AZ.UTF-8.in: Likewise.
	* localedata/be_BY.UTF-8.in: Likewise.
	* localedata/ber_DZ.UTF-8.in: Likewise.
	* localedata/ber_MA.UTF-8.in: Likewise.
	* localedata/bg_BG.UTF-8.in: Likewise.
	* localedata/br_FR.UTF-8.in: Likewise.
	* localedata/cmn_TW.UTF-8.in: Likewise.
	* localedata/crh_UA.UTF-8.in: Likewise.
	* localedata/csb_PL.UTF-8.in: Likewise.
	* localedata/cv_RU.UTF-8.in: Likewise.
	* localedata/cy_GB.UTF-8.in: Likewise.
	* localedata/dz_BT.UTF-8.in: Likewise.
	* localedata/eo.UTF-8.in: Likewise.
	* localedata/es_ES.UTF-8.in: Likewise.
	* localedata/fa_IR.UTF-8.in: Likewise.
	* localedata/fi_FI.UTF-8.in: Likewise.
	* localedata/fil_PH.UTF-8.in: Likewise.
	* localedata/fur_IT.UTF-8.in: Likewise.
	* localedata/gez_ER.UTF-8@abegede.in: Likewise.
	* localedata/ha_NG.UTF-8.in: Likewise.
	* localedata/ig_NG.UTF-8.in: Likewise.
	* localedata/ik_CA.UTF-8.in: Likewise.
	* localedata/kk_KZ.UTF-8.in: Likewise.
	* localedata/ku_TR.UTF-8.in: Likewise.
	* localedata/ky_KG.UTF-8.in: Likewise.
	* localedata/ln_CD.UTF-8.in: Likewise.
	* localedata/mi_NZ.UTF-8.in: Likewise.
	* localedata/ml_IN.UTF-8.in: Likewise.
	* localedata/mn_MN.UTF-8.in: Likewise.
	* localedata/mr_IN.UTF-8.in: Likewise.
	* localedata/mt_MT.UTF-8.in: Likewise.
	* localedata/nb_NO.UTF-8.in: Likewise.
	* localedata/om_KE.UTF-8.in: Likewise.
	* localedata/os_RU.UTF-8.in: Likewise.
	* localedata/ps_AF.UTF-8.in: Likewise.
	* localedata/ro_RO.UTF-8.in: Likewise.
	* localedata/ru_RU.UTF-8.in: Likewise.
	* localedata/sc_IT.UTF-8.in: Likewise.
	* localedata/se_NO.UTF-8.in: Likewise.
	* localedata/sq_AL.UTF-8.in: Likewise.
	* localedata/sv_SE.UTF-8.in: Likewise.
	* localedata/szl_PL.UTF-8.in: Likewise.
	* localedata/tg_TJ.UTF-8.in: Likewise.
	* localedata/tk_TM.UTF-8.in: Likewise.
	* localedata/tt_RU.UTF-8.in: Likewise.
	* localedata/tt_RU.UTF-8@iqtelif.in: Likewise.
	* localedata/ug_CN.UTF-8.in: Likewise.
	* localedata/uz_UZ.UTF-8.in: Likewise.
	* localedata/vi_VN.UTF-8.in: Likewise.
	* localedata/yi_US.UTF-8.in: Likewise.
	* localedata/yo_NG.UTF-8.in: Likewise.
	* localedata/zh_CN.UTF-8.in: Likewise.
	* localedata/locales/am_ET: Adapt collation rules to new iso14651_t1_common
        file and fix bugs in the collation.
	* localedata/locales/az_AZ: Likewise.
	* localedata/locales/be_BY: Likewise.
	* localedata/locales/ber_DZ: Likewise.
	* localedata/locales/ber_MA: Likewise.
	* localedata/locales/bg_BG: Likewise.
	* localedata/locales/br_FR: Likewise.
	* localedata/locales/br_FR@euro: Likewise.
	* localedata/locales/ca_ES: Likewise.
	* localedata/locales/cns11643_stroke: Likewise.
	* localedata/locales/crh_UA: Likewise.
	* localedata/locales/cs_CZ: Likewise.
	* localedata/locales/csb_PL: Likewise.
	* localedata/locales/cv_RU: Likewise.
	* localedata/locales/cy_GB: Likewise.
	* localedata/locales/da_DK: Likewise.
	* localedata/locales/dz_BT: Likewise.
	* localedata/locales/en_CA: Likewise.
	* localedata/locales/eo: Likewise.
	* localedata/locales/es_CU: Likewise.
	* localedata/locales/es_EC: Likewise.
	* localedata/locales/es_ES: Likewise.
	* localedata/locales/es_US: Likewise.
	* localedata/locales/et_EE: Likewise.
	* localedata/locales/fa_IR: Likewise.
	* localedata/locales/fi_FI: Likewise.
	* localedata/locales/fil_PH: Likewise.
	* localedata/locales/fur_IT: Likewise.
	* localedata/locales/gez_ER@abegede: Likewise.
	* localedata/locales/ha_NG: Likewise.
	* localedata/locales/hr_HR: Likewise.
	* localedata/locales/hsb_DE: Likewise.
	* localedata/locales/hu_HU: Likewise.
	* localedata/locales/ig_NG: Likewise.
	* localedata/locales/ik_CA: Likewise.
	* localedata/locales/is_IS: Likewise.
	* localedata/locales/iso14651_t1_pinyin: Likewise.
	* localedata/locales/kk_KZ: Likewise.
	* localedata/locales/ku_TR: Likewise.
	* localedata/locales/ky_KG: Likewise.
	* localedata/locales/ln_CD: Likewise.
	* localedata/locales/lt_LT: Likewise.
	* localedata/locales/lv_LV: Likewise.
	* localedata/locales/mi_NZ: Likewise.
	* localedata/locales/ml_IN: Likewise.
	* localedata/locales/mn_MN: Likewise.
	* localedata/locales/mr_IN: Likewise.
	* localedata/locales/mt_MT: Likewise.
	* localedata/locales/nb_NO: Likewise.
	* localedata/locales/om_KE: Likewise.
	* localedata/locales/os_RU: Likewise.
	* localedata/locales/pl_PL: Likewise.
	* localedata/locales/ps_AF: Likewise.
	* localedata/locales/ro_RO: Likewise.
	* localedata/locales/ru_RU: Likewise.
	* localedata/locales/ru_UA: Likewise.
	* localedata/locales/sc_IT: Likewise.
	* localedata/locales/se_NO: Likewise.
	* localedata/locales/si_LK: Likewise.
	* localedata/locales/sq_AL: Likewise.
	* localedata/locales/sv_FI: Likewise.
	* localedata/locales/sv_FI@euro: Likewise.
	* localedata/locales/sv_SE: Likewise.
	* localedata/locales/szl_PL: Likewise.
	* localedata/locales/tg_TJ: Likewise.
	* localedata/locales/ti_ER: Likewise.
	* localedata/locales/tk_TM: Likewise.
	* localedata/locales/tl_PH: Likewise.
	* localedata/locales/tr_TR: Likewise.
	* localedata/locales/tt_RU: Likewise.
	* localedata/locales/tt_RU@iqtelif: Likewise.
	* localedata/locales/ug_CN: Likewise.
	* localedata/locales/uk_UA: Likewise.
	* localedata/locales/uz_UZ: Likewise.
	* localedata/locales/uz_UZ@cyrillic: Likewise.
	* localedata/locales/vi_VN: Likewise.
	* localedata/locales/yi_US: Likewise.
	* localedata/locales/yo_NG: Likewise.
2018-02-27 17:47:50 +01:00
Mike FABIAN
df74ef786f Add sections for various scripts to the iso14651_t1_common file
* localedata/locales/iso14651_t1_common: Add sections for various
	scripts to the iso14651_t1_common file.
2018-02-27 16:52:54 +01:00
Mike FABIAN
d5adfbadd4 iso14651_t1_common: make the fourth level the codepoint for characters which are ignorable on all 4 levels
Entries for characters which have “IGNORE” on all 4 levels like:

 <U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in ISO 6429)

are changed into:

 <U0001> IGNORE;IGNORE;IGNORE;<U0001> % START OF HEADING (in ISO 6429)

i.e. putting the code point of the character into the fourth level
instead of “IGNORE”. Without that change, all such characters
would compare equal which would make a wcscoll test case fail.
It is better to have a clearly defined sort order even for characters
like this so it is good to use the code point as a tie-break.
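
A mechanical rewrite of this kind could be sketched as follows.  This
is only an illustration under the assumption that each entry sits on
one line in the form shown above; the names ALL_IGNORE and
use_codepoint_as_fourth_level are hypothetical and this is not the
script that was actually used.

```python
import re

# Matches an entry whose four levels are all IGNORE, for example:
#   <U0001> IGNORE;IGNORE;IGNORE;IGNORE % comment
ALL_IGNORE = re.compile(
    r"^(\s*)(<U[0-9A-F]+>)(\s+)IGNORE;IGNORE;IGNORE;IGNORE\b"
)


def use_codepoint_as_fourth_level(line):
    """Replace the fourth IGNORE with the entry's own code point symbol."""
    return ALL_IGNORE.sub(r"\g<1>\g<2>\g<3>IGNORE;IGNORE;IGNORE;\g<2>", line)


before = " <U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in ISO 6429)"
print(use_codepoint_as_fourth_level(before))
```

Lines that carry a real weight on any level do not match the pattern
and pass through unchanged.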

	* localedata/locales/iso14651_t1_common: Use the code point of a
        character in the fourth collation level instead of IGNORE for all
        entries which have IGNORE on all 4 levels.
2018-02-27 16:50:30 +01:00
Mike FABIAN
5f5a961091 Add convenience symbols like <AFTER-A>, <BEFORE-A> to iso14651_t1_common
* localedata/locales/iso14651_t1_common: Add some convenient collation
	symbols like <AFTER-A>, <BEFORE-A> to make tailoring easier using
	rules similar to those in CLDR.
2018-02-27 16:47:22 +01:00
Mike FABIAN
8a97e9002f Fixing syntax errors after updating the iso14651_t1_common file
* localedata/locales/iso14651_t1_common: The new version of this
	file downloaded from ISO contained several syntax errors which
	are fixed by this patch.
2018-02-27 16:45:30 +01:00
Mike FABIAN
bbdd2fba7d iso14651_t1_common: <U\([0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F]\)> → <U000\1>
* localedata/locales/iso14651_t1_common: replace all <U.....>
	with <U000.....> because glibc understands only 4-digit or 8-digit
	code point symbols.
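
The substitution in the commit title is written in sed-style syntax.
An equivalent sketch in Python, with the hypothetical helper name
pad_codepoints, might look like this:

```python
import re

# Exactly five hex digits between "<U" and ">"; 4-digit and 8-digit
# symbols do not match and are left alone.
FIVE_DIGIT = re.compile(r"<U([0-9A-F]{5})>")


def pad_codepoints(text):
    """Pad 5-digit <Uxxxxx> symbols to the 8-digit <U000xxxxx> form."""
    return FIVE_DIGIT.sub(r"<U000\1>", text)


print(pad_codepoints("<U1D400> <U0041> <U0001D401>"))
```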
2018-02-27 16:44:03 +01:00
Mike FABIAN
1569e551af Necessary changes after updating the iso14651_t1_common file
* localedata/locales/iso14651_t1_common: Necessary changes
	to make the file downloaded from ISO usable by glibc.
2018-02-27 16:42:14 +01:00
Mike FABIAN
9479b6d5e0 Update iso14651_t1_common file to ISO14651_2016_TABLE1_en.txt [BZ #14095]
[BZ #14095] - Review / update collation data from Unicode / ISO 14651

File downloaded from:
http://standards.iso.org/iso-iec/14651/ed-4/ISO14651_2016_TABLE1_en.txt

Updating this file alone is not enough: there are problems in the new
file which need to be fixed, and the collation rules for many locales
need to be adapted. This is done by the following patches.

This update also fixes the problem that many characters are treated as
identical when sorting because they were not yet in the old
iso14651_t1_common file, see:

https://bugzilla.redhat.com/show_bug.cgi?id=1336308
- Infinite (∞) and empty set (∅) are treated as if they were the same character by sort and uniq

	[BZ #14095]
	* localedata/locales/iso14651_t1_common: Update file to
	latest version from ISO (ISO14651_2016_TABLE1_en.txt).
2018-02-27 16:36:31 +01:00
Alexandre Oliva
8da25eec0a Collation fix: make forward accent sorting the default [BZ #17750]
[BZ #17750]
	* Makefile: add fr_CA.UTF-8 to test-input and LOCALES.
	* localedata/fr_CA.UTF-8.in: New file with test data for backward
	accents sorting.
	* localedata/fr_FR.UTF-8.in: Fix test data for forward accents
	sorting.
	* localedata/locales/cs_CZ (LC_COLLATE): Remove “define DIACRIT_FORWARD”
	* localedata/locales/de_DE (LC_COLLATE): Likewise.
	* localedata/locales/hu_HU (LC_COLLATE): Likewise.
	* localedata/locales/lb_LU (LC_COLLATE): Likewise.
	* localedata/locales/yuw_PG (LC_COLLATE): Likewise.
	* localedata/locales/fr_CA (LC_COLLATE): Add “define DIACRIT_BACKWARD”
	* localedata/locales/iso14651_t1_common: Use “ifdef DIACRIT_FORWARD”
	instead of “ifdef DIACRIT_BACKWARD”.

The only locale which currently needs backward accents sorting is fr_CA.
Therefore, forward accents sorting should be the default.

Before this patch, backwards accent sorting was the default and all
locales except fr_CA had to use

    define DIACRIT_FORWARD

before

    copy "iso14651_t1"

Most locales didn’t do that and thus got the inappropriate backwards accents sorting
by accident. Now only the fr_CA locale needs to use

    define DIACRIT_BACKWARD

before

    copy "iso14651_t1"
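
To make the difference between the two modes concrete, here is a toy
two-level sort key.  It is a simplified model, not glibc's collation
algorithm: the function name sort_key is an assumption, accents are
taken from the Unicode NFD decomposition, and real locale weights are
ignored.

```python
import unicodedata


def sort_key(word, backward=False):
    """Two-level toy key: base letters first, accents as a tie-break."""
    chars = unicodedata.normalize("NFC", word)
    decomposed = [unicodedata.normalize("NFD", ch) for ch in chars]
    primary = [d[0] for d in decomposed]     # base letter of each character
    secondary = [d[1:] for d in decomposed]  # combining marks, "" if none
    if backward:
        # Backward mode compares accents starting from the end of the word.
        secondary.reverse()
    return (primary, secondary)


words = ["côté", "cote", "coté", "côte"]
print(sorted(words, key=sort_key))
print(sorted(words, key=lambda w: sort_key(w, backward=True)))
```

In this model the forward key orders the words as cote, coté, côte,
côté, while the backward key orders them as cote, côte, coté, côté.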

Original patch slightly modified by: Mike FABIAN <mfabian@redhat.com>
2017-11-29 11:56:46 +01:00
Santhosh Thottingal
b05eca0e1d Correct collation rules for Malayalam.
[BZ #19922]
	* locales/iso14651_t1_common: Add collation rules for U+0D7A to U+0D7F.

	[BZ #19919]
	* locales/iso14651_t1_common: Correct collation of U+0D36 and U+0D37.
2017-06-11 10:08:37 -04:00
Mike Frysinger
a4cea54b12 localedata: standardize copyright/license information [BZ #11213]
Use the language from the FSF in all locale files to disclaim any
license/copyright on locale data.

See https://sourceware.org/ml/libc-locales/2013-q1/msg00048.html
2016-03-21 02:29:56 -04:00
Ulrich Drepper
b426c80f5f Fix whitespaces 2011-05-15 11:37:52 -04:00
Ulrich Drepper
08ba84136f Move Dzongkha collation rules to common collation rules file 2011-05-15 11:36:07 -04:00
Pravin Satpute
1e5e9ec825 Fix sorting of Malayalam letter 'na'. 2010-02-03 03:50:01 -08:00
Ulrich Drepper
6b4f51823c Fix whitespaces. 2010-02-03 03:36:52 -08:00
Pravin Satpute
3e8a75d1b9 Move Tamil collation data to common source file. 2010-02-03 03:32:06 -08:00
Keith Stribley
3c2c4bf6f7 Implement Burmese language locale for Myanmar. 2009-10-30 08:14:02 -07:00
Ulrich Drepper
115a532734 * localedata/locales/bn_BD: Remove comment about missing collation
rules.
	* localedata/locales/iso14651_t1_common: Add Bengali collation rules.
	Patch by Pravin Satpute <psatpute@redhat.com>.
2009-05-04 21:20:20 +00:00
Ulrich Drepper
eee6b14327 [BZ #9759]
* dirent/dirent.h: Adjust prototypes of scandir, scandir64, alphasort,
	alphasort64, versionsort, and versionsort64 to POSIX 2008.
	* dirent/alphasort.c: Adjust implementation to type change.
	* dirent/alphasort64.c: Likewise.
	* dirent/scandir.c: Likewise.
	* dirent/versionsort.c: Likewise.
	* dirent/versionsort64.c: Likewise.
	* sysdeps/wordsize-64/alphasort.c: Add hack to hide alphasort64
	declaration.
	* sysdeps/wordsize-64/versionsort.c: Add hack to hide versionsort64
	declaration.
2009-03-15 21:33:19 +00:00
Ulrich Drepper
638633961d * locales/iso14651_t1_common: Add rules for sorting Malayalam.
Patch by Santhosh Thottingal <santhosh.thottingal@gmail.com>.
2009-02-11 15:42:53 +00:00
Ulrich Drepper
06057297c4 * locales/iso14651_t1_common: Fix sorting of U+0AB3.
Patch by Pravin Satpute <psatpute@redhat.com>.
2008-12-31 14:58:14 +00:00
Ulrich Drepper
6daf1a2fb1 [BZ #6867]
* sysdeps/powerpc/elf/rtld-global-offsets.sym: Fix typo.
2008-10-31 19:03:31 +00:00
Ulrich Drepper
46026b5589 * locales/iso14651_t1_common: Add Kannada collation support.
Patch by Pravin Satpute <psatpute@redhat.com>.
2008-07-11 17:05:42 +00:00
Ulrich Drepper
99ae13c825 * locales/iso14651_t1_common: Add support for Gurumukhi script.
Patch by Pravin Satpute <psatpute@redhat.com>.
2008-06-24 16:59:47 +00:00
Ulrich Drepper
e564d29d8e Remove U0C0D entry added for Telugu. 2008-05-21 15:13:02 +00:00
Ulrich Drepper
74e1338588 * string/strcasestr.c (CMP_FUNC): Use __strncasecmp, not strncasecmp. 2008-05-16 18:19:18 +00:00
Ulrich Drepper
2f9a1be867 [BZ #6442]
* string/endian.h: Add macros for fixed-size endian conversion.
	* bits/byteswap.h: Allow inclusion from <endian.h>.
	* sysdeps/i386/bits/byteswap.h: Likewise.
	* sysdeps/ia64/bits/byteswap.h: Likewise.
	* sysdeps/s390/bits/byteswap.h: Likewise.
	* sysdeps/x86_64/bits/byteswap.h: Likewise.
	* string/Makefile (tests): Add tst-endian.
	* string/tst-endian.c: New file.
2008-05-15 02:54:33 +00:00
Ulrich Drepper
23c37224d3 Fix first weight for U+1E60, U+1E62, U+1E64, U+1E66, and U+1E68. 2008-04-07 23:53:20 +00:00
Ulrich Drepper
4e0b2dbe54 * locales/iso14651_t1_common: Add support for Gujarati script.
Patch by Pravin Satpute <psatpute@redhat.com>.
2008-03-31 14:15:28 +00:00
Ulrich Drepper
85ac24138b * locales/iso14651_t1_common: Add support for Devanagari script.
* locales/mr_IN: Adjust Devanagari sorting for mr_IN.
	Patch by Pravin Satpute <psatpute@redhat.com>.
2008-03-24 05:08:33 +00:00
Ulrich Drepper
3a054d7ab0 * locale/programs/locfile-token.h: Remove tok_elif, add tok_elifdef
and tok_elifndef.
	* locale/programs/locfile-kw.gperf: Likewise.
	* locale/programs/ld-collate.c: Implement primitive preprocessor.
2007-10-11 02:36:04 +00:00
Ulrich Drepper
592a95ee7c * po/pt_BR.po: Fix typo. 2007-09-30 16:57:15 +00:00
Ulrich Drepper
762422d1bd * locale/programs/ld-collate.c (collate_read): Allow order_start
after copy.
2007-04-28 06:51:26 +00:00