Unicode 16.0.0 Support: Character encoding, character type info, and
transliteration tables are all updated to Unicode 16.0.0, using
the generator scripts contributed by Mike FABIAN (Red Hat).
Changes in CHARMAP and WIDTH:
Total added characters in newly generated CHARMAP: 5185
Total removed characters in newly generated WIDTH: 1
Total added characters in newly generated WIDTH: 170
The removed character from WIDTH is U+1171E AHOM CONSONANT SIGN MEDIAL RA.
It changed like this:
UnicodeData.txt 15.1.0: 1171E;AHOM CONSONANT SIGN MEDIAL RA;Mn;0;NSM;;;;;N;;;;;
UnicodeData.txt 16.0.0: 1171E;AHOM CONSONANT SIGN MEDIAL RA;Mc;0;L;;;;;N;;;;;
EastAsianWidth.txt 15.1.0: 1171D..1171F ; N # Mn [3] AHOM CONSONANT SIGN MEDIAL LA..AHOM CONSONANT SIGN MEDIAL LIGATING RA
EastAsianWidth.txt 16.0.0: 1171E ; N # Mc AHOM CONSONANT SIGN MEDIAL RA
I.e it changed from Mn (Mark Nonspacing) to Mc (Mark Spacing
combining). So it should now have width 1 instead of 0, therefore it
is OK that it was removed from WIDTH, characters not in WIDTH get
width 1 by default.
Nothing suspicious when browsing the list of the 170 added characters.
Changes in ctype:
alpha: Added 4452 characters in new ctype which were not in old ctype
combining: Added 51 characters in new ctype which were not in old ctype
combining_level3: Added 43 characters in new ctype which were not in old ctype
graph: Added 5185 characters in new ctype which were not in old ctype
lower: Added 25 characters in new ctype which were not in old ctype
print: Added 5185 characters in new ctype which were not in old ctype
punct: Missing 33 characters of old ctype in new ctype
punct: Added 766 characters in new ctype which were not in old ctype
tolower: Added 27 characters in new ctype which were not in old ctype
totitle: Added 27 characters in new ctype which were not in old ctype
toupper: Added 27 characters in new ctype which were not in old ctype
upper: Added 27 characters in new ctype which were not in old ctype
Nothing suspicous in the additions.
About the 33 characters removed from `punct`:
U+0363 - U+036F are identical in UnicodeData.txt. Difference in DerivedCoreProperties.txt:
DerivedCoreProperties.txt 15.1.0: not there.
DerivedCoreProperties.txt 16.0.0: 0363..036F ; Alphabetic # Mn [13] COMBINING LATIN SMALL LETTER A..COMBINING LATIN SMALL LETTER X
So that’s the reason why they are added to `alpha` and removed from `punct`.
Same for U+1DD3 - U+1DE6, they are identical in UnicodeData.txt but there is a difference in DerivedCoreProperties.txt:
DerivedCoreProperties.txt 15.1.0: 1DE7..1DF4 ; Alphabetic # Mn [14] COMBINING LATIN SMALL LETTER ALPHA..COMBINING LATIN SMALL LETTER U WITH DIAERESIS
DerivedCoreProperties.txt 16.0.0: 1DD3..1DF4 ; Alphabetic # Mn [34] COMBINING LATIN SMALL LETTER FLATTENED OPEN A ABOVE..COMBINING LATIN SMALL LETTER U WITH DIAERESIS
So they became `Alphabetic` and were thus added to `alpha` and removed from `punct`.
Resolves: BZ #32168
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Macros will automatically use the correct types, without
having to fiddle with internal glibc macros. It's also
impossible to get the types wrong due to aliasing because
support_check_stat_fd and support_check_stat_path do not
depend on the struct stat* types.
The changes reveal some inconsistencies in tests.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add a FAIL test failure helper analogous to FAIL_RET, that does not
cause the current function to return, providing a standardized way to
report a test failure with a message supplied while permitting the
caller to continue executing, for further reporting, cleaning up, etc.
Update existing test cases that provide a conflicting definition of FAIL
by removing the local FAIL definition and then as follows:
- tst-fortify-syslog: provide a meaningful message in addition to the
file name already added by <support/check.h>; 'support_record_failure'
is already called by 'support_print_failure_impl' invoked by the new
FAIL test failure helper.
- tst-ctype: no update to FAIL calls required, with the name of the file
and the line number within of the failure site additionally included
by the new FAIL test failure helper, and error counting plus count
reporting upon test program termination also already provided by
'support_record_failure' and 'support_report_failure' respectively,
called by 'support_print_failure_impl' and 'adjust_exit_status' also
respectively. However in a number of places 'printf' is called and
the error count adjusted by hand, so update these places to make use
of FAIL instead. And last but not least adjust the final summary just
to report completion, with any error count following as reported by
the test driver.
- test-tgmath2: no update to FAIL calls required, with the name of the
file of the failure site additionally included by the new FAIL test
failure helper. Also there is no need to track the return status by
hand as any call to FAIL will eventually cause the test case to return
an unsuccesful exit status regardless of the return status from the
test function, via a call to 'adjust_exit_status' made by the test
driver.
Reviewed-by: DJ Delorie <dj@redhat.com>
With gcc 14, I get this warning/werror when building the localedata tests:
tests-mbwc/tsp_common.c: In function ‘result.constprop.isra’:
tests-mbwc/tsp_common.c:55:43: error: ‘%s’ directive writing up to 92 bytes into a region of size between 0 and 114 [-Werror=format-overflow=]
55 | sprintf (result_rec, "%s:%s:%d:%d:%d:%c:%s\n", func, loc, rec_no, seq_no,
| ^~
In file included from ../include/bits/stdio2.h:1,
from ../libio/stdio.h:980,
from ../include/stdio.h:14,
from tests-mbwc/tsp_common.c:10:
In function ‘sprintf’,
inlined from ‘result.constprop.isra’ at tests-mbwc/tsp_common.c:55:3:
../libio/bits/stdio2.h:30:10: note: ‘__builtin___sprintf_chk’ output between 20 and 234 bytes into a destination of size 132
30 | return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
31 | __glibc_objsize (__s), __fmt,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
32 | __va_arg_pack ());
| ~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
This patch now gets rid of using sprintf and the result_rec buffer and just
prints to fp directly.
= `Default_Ignorable_Code_Point`s should have width 0 =
Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters
with the `Default_Ignorable_Code_Point` property
> should be rendered as completely invisible (and non advancing, i.e. “zero width”),
if not explicitly supported in rendering.
Hence, `wcwidth()` should give them all a width of 0, with two exceptions:
- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
- U+115F HANGUL CHOSEONG FILLER needs a carveout
due to the unique behavior of the conjoining Korean jamo characters.
One composed Hangul "syllable block" like 퓛
is made up of two to three individual component characters, or "jamo".
These are all assigned an `East_Asian_Width` of `Wide`
by Unicode, which would normally mean they would all be assigned
width 2 by glibc; a combination of (leading choseong jamo) +
(medial jungseong jamo) + (trailing jongseong jamo) would then have width 2 + 2 + 2 = 6.
However, glibc (and other wcwidth implementations) special-cases jungseong and jongseong,
assigning them all width 0,
to ensure that the complete block has width 2 + 0 + 0 = 2 as it should.
U+115F is meant for use in syllable blocks
that are intentionally missing a leading jamo;
it must be assigned a width of 2 even though it has no visible display
to ensure that the complete block has width 2.
However, `wcwidth()` currently (before this patch)
incorrectly assigns non-zero width to
U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER;
this commit fixes that.
Unicode spec references:
- Hangul: §3.12 https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646 and
§18.6 https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028
- `Default_Ignorable_Code_Point`: §5.21 https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095.
= Non-`Default_Ignorable_Code_Point` format controls should be visible =
The Unicode Standard, §5.21 - Characters Ignored for Display
(https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095)
says the following:
> A small number of format characters (General_Category = Cf )
> are also not given the Default_Ignorable_Code_Point property.
> This may surprise implementers, who often assume
> that all format characters are generally ignored in fallback display.
> The exact list of these exceptional format characters
> can be found in the Unicode Character Database.
> There are, however, three important sets of such format characters to note:
>
> - prepended concatenation marks
> - interlinear annotation characters
> - Egyptian hieroglyph format controls
>
> The prepended concatenation marks always have a visible display.
> See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls)
> for more discussion of the use and display of these signs.
>
> The other two notable sets of format characters that exceptionally are not ignored
> in fallback display consist of the interlinear annotation characters,
> U+FFF9 INTERLINEAR ANNOTATION ANCHOR through
> U+FFFB INTERLINEAR ANNOTATION TERMINATOR,
> and the Egyptian hieroglyph format controls,
> U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through
> U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE.
> These characters should have a visible glyph display for fallback rendering,
> because if they are not displayed,
> it is too easy to misread the resulting displayed text.
> See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials),
> as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs)
> for more discussion of the use and display of these characters.
glibc currently correctly assigns non-zero width to the prepended concatenation marks,
but it incorrectly gives zero width to the interlinear annotation characters
(which a generic terminal cannot interpret)
and the Egyptian hieroglyph format controls
(which are not widely supported in rendering implementations at present).
This commit fixes both these issues as well.
= Derive Hangul syllable type from Unicode data =
Previosuly, the jungseong and jongseong jamo ranges
were hard-coded into the script. With this commit, they are instead parsed
from the HangulSyllableType.txt data file published by Unicode.
This does not affect the end result.
Signed-off-by: Jules Bertholet <julesbertholet@quoi.xyz>
Delay setting file->decided until the data has been successfully loaded
by _nl_load_locale(). If the function fails to load the data then we
must return and error and leave decided untouched to allow the caller to
attempt to load the data again at a later time. We should not set
decided to 1 early in the function since doing so may prevent attempting
to load it again. We want to try loading it again because that allows an
open to fail and set errno correctly.
On the other side of this problem is that if we are called again with
the same inputs we will fetch the cached version of the object and carry
out no open syscalls and that fails to set errno so we must set errno to
ENOENT in that case. There is a second code path that has to be handled
where the name of the locale matches but the codeset doesn't match.
These changes ensure that errno is correctly set on failure in all the
return paths in _nl_find_locale().
Adds tst-locale-loadlocale to cover the bug.
No regressions on x86_64.
Co-authored-by: Jeff Law <law@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Resolves: BZ # 31239
The correct abbreviated month names were apparently given in the comment above `abmon`.
But the value of `abmon` was apparently just copied from the value of `mon` and this
mistake was hard to see because code point notation <Uxxxx> was used. After converting
to UTF-8 it was obvious that there was apparently a copy and paste mistake.
Resolves: BZ # 31221
glibc can recognise its code, but does not have its data.
This patch remedies that.
Signed-off-by: Janet Blackquill <uhhadd@gmail.com>