glibc/SHARED-FILES
Jules Bertholet 25c9c3789e localedata: Fix several issues with the set of characters considered 0-width [BZ #31370]
= `Default_Ignorable_Code_Point`s should have width 0 =

Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters
with the `Default_Ignorable_Code_Point` property

> should be rendered as completely invisible (and non advancing, i.e. “zero width”),
> if not explicitly supported in rendering.

Hence, `wcwidth()` should give them all a width of 0, with two exceptions:

- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
- U+115F HANGUL CHOSEONG FILLER needs a carveout
  due to the unique behavior of the conjoining Korean jamo characters.
  One composed Hangul "syllable block" like 퓛
  is made up of two to three individual component characters, or "jamo".
  These are all assigned an `East_Asian_Width` of `Wide`
  by Unicode, which would normally mean they would all be assigned
  width 2 by glibc; a combination of (leading choseong jamo) +
  (medial jungseong jamo) + (trailing jongseong jamo) would then have width 2 + 2 + 2 = 6.
  However, glibc (and other wcwidth implementations) special-cases jungseong and jongseong,
  assigning them all width 0,
  to ensure that the complete block has width 2 + 0 + 0 = 2 as it should.
  U+115F is meant for use in syllable blocks
  that are intentionally missing a leading jamo;
  it must be assigned a width of 2 even though it has no visible display
  to ensure that the complete block has width 2.

However, `wcwidth()` currently (before this patch)
incorrectly assigns non-zero width to
U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER;
this commit fixes that.
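
The width arithmetic above can be sketched in a few lines of Python. This is a toy model, not glibc's generated width tables: the names `toy_width` and `toy_string_width` are hypothetical, and the jamo ranges are assumed to be those of the Hangul Jamo block (U+1100..U+115F leading, U+1160..U+11FF medial and trailing). It only covers the characters discussed in this section.

```python
import unicodedata

SOFT_HYPHEN = 0x00AD
# HANGUL FILLER and HALFWIDTH HANGUL FILLER: width 0 after this patch.
ZERO_WIDTH_FILLERS = {0x3164, 0xFFA0}


def toy_width(cp):
    """Width of one code point under the rules described above.

    Hypothetical helper: anything outside the discussed characters raises.
    """
    if cp == SOFT_HYPHEN:
        return 1  # longstanding precedent
    if cp in ZERO_WIDTH_FILLERS:
        return 0  # Default_Ignorable_Code_Point, no carve-out
    if 0x1100 <= cp <= 0x115F:
        return 2  # leading choseong jamo, including the U+115F filler
    if 0x1160 <= cp <= 0x11FF:
        return 0  # medial jungseong and trailing jongseong jamo
    raise ValueError(f"U+{cp:04X} is outside this toy model")


def toy_string_width(text):
    """Sum the per-code-point widths of a string."""
    return sum(toy_width(ord(ch)) for ch in text)


# Decompose the precomposed syllable U+D4DB into its conjoining jamo.
jamo = unicodedata.normalize("NFD", "\uD4DB")
print(len(jamo), toy_string_width(jamo))

# A block with no leading consonant: filler + medial + trailing jamo.
print(toy_string_width("\u115F\u1161\u11A8"))
```

Both blocks come out at width 2, which is why U+115F keeps its width while the other two fillers drop to 0.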

Unicode spec references:
- Hangul:  §3.12 https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646 and
  §18.6 https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028
- `Default_Ignorable_Code_Point`: §5.21 https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095.

= Non-`Default_Ignorable_Code_Point` format controls should be visible =

The Unicode Standard, §5.21 - Characters Ignored for Display
(https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095)
says the following:

> A small number of format characters (General_Category = Cf )
> are also not given the Default_Ignorable_Code_Point property.
> This may surprise implementers, who often assume
> that all format characters are generally ignored in fallback display.
> The exact list of these exceptional format characters
> can be found in the Unicode Character Database.
> There are, however, three important sets of such format characters to note:
>
> - prepended concatenation marks
> - interlinear annotation characters
> - Egyptian hieroglyph format controls
>
> The prepended concatenation marks always have a visible display.
> See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls)
> for more discussion of the use and display of these signs.
>
> The other two notable sets of format characters that exceptionally are not ignored
> in fallback display consist of the interlinear annotation characters,
> U+FFF9 INTERLINEAR ANNOTATION ANCHOR through
> U+FFFB INTERLINEAR ANNOTATION TERMINATOR,
> and the Egyptian hieroglyph format controls,
> U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through
> U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE.
> These characters should have a visible glyph display for fallback rendering,
> because if they are not displayed,
> it is too easy to misread the resulting displayed text.
> See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials),
> as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs)
> for more discussion of the use and display of these characters.

glibc currently correctly assigns non-zero width to the prepended concatenation marks,
but it incorrectly gives zero width to the interlinear annotation characters
(which a generic terminal cannot interpret)
and the Egyptian hieroglyph format controls
(which are not widely supported in rendering implementations at present).
This commit fixes both these issues as well.
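
As a rough illustration, the two ranges named in the quoted text can be written as a lookup. The names below are hypothetical and this is not how glibc's generator script is structured. The prepended concatenation marks are left out because the quote gives no code point ranges for them.

```python
# Ranges taken from the Unicode Standard text quoted above.
VISIBLE_FORMAT_CONTROL_RANGES = [
    (0xFFF9, 0xFFFB),    # interlinear annotation characters
    (0x13430, 0x1343F),  # Egyptian hieroglyph format controls
]


def is_visible_format_control(cp):
    """True if cp is a Cf character that should not be given width 0."""
    return any(lo <= cp <= hi for lo, hi in VISIBLE_FORMAT_CONTROL_RANGES)
```

A width generator could consult such a predicate before applying a blanket "format characters are zero width" rule.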

= Derive Hangul syllable type from Unicode data =

Previously, the jungseong and jongseong jamo ranges
were hard-coded into the script. With this commit, they are instead parsed
from the HangulSyllableType.txt data file published by Unicode.
This does not affect the end result.
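
A minimal sketch of reading such a file, assuming the usual Unicode Character Database layout of `start..end ; value # comment` (or a single code point in place of the range). The function name `parse_hangul_syllable_types` and the abbreviated sample are illustrative only; the script in this commit may parse the file differently.

```python
# Abbreviated sample in the UCD layout; the real file has more entries
# and longer trailing comments.
SAMPLE = """\
# HangulSyllableType (sample)
1100..115F    ; L # leading jamo
1160..11A7    ; V # vowel jamo
11A8..11FF    ; T # trailing jamo
"""


def parse_hangul_syllable_types(lines):
    """Map each syllable type to a list of inclusive (first, last) ranges."""
    ranges = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        span, syllable_type = (field.strip() for field in data.split(";"))
        start, _, end = span.partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first  # single code point entry
        ranges.setdefault(syllable_type, []).append((first, last))
    return ranges


types = parse_hangul_syllable_types(SAMPLE.splitlines())
# V is the jungseong type and T the jongseong type.
zero_width_jamo = types["V"] + types["T"]
print(zero_width_jamo)
```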

Signed-off-by: Jules Bertholet <julesbertholet@quoi.xyz>
2024-05-15 14:31:06 +02:00

# Files shared with other projects. Pass a file path to the
# get_glibc_shared_code() function in the python library
# scripts/glibc_shared_code.py to get a dict object with this information. See
# the library sources for more information.
# The headers on most of these files indicate that glibc is the canonical
# source for these files, although in many cases there seem to be useful
# changes in the gnulib versions that could be merged back in. Not all gnulib
# files contain such a header and it is not always consistent in its format, so
# it would be useful to make sure that all gnulib files that are using glibc as
# upstream have a greppable header.
#
# These files are quite hard to find without a header to grep for and each file
# has to be compared manually so this list is likely incomplete or may contain
# errors.
gnulib:
argp/argp-ba.c
argp/argp-eexst.c
argp/argp-fmtstream.c
argp/argp-fmtstream.h
argp/argp-fs-xinl.c
argp/argp-help.c
argp/argp-namefrob.h
argp/argp-parse.c
argp/argp-pv.c
argp/argp-pvh.c
argp/argp-xinl.c
argp/argp.h
dirent/alphasort.c
dirent/scandir.c
# Merged from gnulib 2021-09-21
include/intprops.h
# Merged from gnulib 2021-09-21
include/regex.h
locale/programs/3level.h
# Merged from gnulib 2014-6-23
malloc/obstack.c
# Merged from gnulib 2014-6-23
malloc/obstack.h
# Merged from gnulib 2014-07-10
misc/error.c
misc/error.h
misc/getpass.c
misc/mkdtemp.c
# Merged from gnulib 2021-09-21
misc/sys/cdefs.h
posix/fnmatch_loop.c
# Intended to be the same. Gnulib copy contains glibc changes.
posix/getopt.c
# Intended to be the same. Gnulib copy contains glibc changes.
posix/getopt1.c
# Intended to be the same. Gnulib copy contains glibc changes.
posix/getopt_int.h
posix/glob.c
# Merged from gnulib 2021-09-21
posix/regcomp.c
# Merged from gnulib 2021-09-21
posix/regex.c
# Merged from gnulib 2021-09-21
posix/regex.h
# Merged from gnulib 2021-09-21
posix/regex_internal.c
# Merged from gnulib 2021-09-21
posix/regex_internal.h
# Merged from gnulib 2021-09-21
posix/regexec.c
posix/spawn.c
posix/spawn_faction_addclose.c
posix/spawn_faction_adddup2.c
posix/spawn_faction_addopen.c
posix/spawn_faction_destroy.c
posix/spawn_faction_init.c
posix/spawn_int.h
posix/spawnattr_destroy.c
posix/spawnattr_getdefault.c
posix/spawnattr_getflags.c
posix/spawnattr_getpgroup.c
posix/spawnattr_getschedparam.c
posix/spawnattr_getschedpolicy.c
posix/spawnattr_getsigmask.c
posix/spawnattr_init.c
posix/spawnattr_setdefault.c
posix/spawnattr_setflags.c
posix/spawnattr_setpgroup.c
posix/spawnattr_setschedparam.c
posix/spawnattr_setschedpolicy.c
posix/spawnattr_setsigmask.c
posix/spawnp.c
stdlib/atoll.c
stdlib/getsubopt.c
stdlib/setenv.c
stdlib/strtoll.c
stdlib/strtoul.c
# Merged from gnulib 2014-6-26, needs merge back
string/memchr.c
string/memcmp.c
string/memmem.c
string/mempcpy.c
string/memrchr.c
string/rawmemchr.c
string/stpcpy.c
string/stpncpy.c
string/str-two-way.h
string/strcasestr.c
string/strcspn.c
string/strdup.c
string/strndup.c
string/strpbrk.c
string/strsignal.c
string/strstr.c
string/strtok_r.c
string/strverscmp.c
# Merged from gnulib 2024-04-08 (gnulib commit 3238349628)
stdio-common/tmpdir.c
stdio-common/tmpdir.h
sysdeps/generic/pty-private.h
sysdeps/generic/siglist.h
sysdeps/posix/euidaccess.c
sysdeps/posix/gai_strerror.c
sysdeps/posix/getcwd.c
sysdeps/posix/pwrite.c
sysdeps/posix/spawni.c
# Merged from gnulib 2024-04-08 (gnulib commit 3238349628)
sysdeps/posix/tempname.c
# Merged from gnulib 2014-6-27
time/mktime.c
time/mktime-internal.h
time/strptime.c
time/timegm.c
# The last merge was 2014-12-11 and merged gettext 0.19.3 into glibc with a
# patch submitted to the gettext mailing list for changes that could be merged
# back.
#
# This commit was omitted from the merge as it does not appear to be compatible
# with how glibc expects things to work:
#
# commit 279b57fc367251666f00e8e2b599b83703451afb
# Author: Bruno Haible <bruno@clisp.org>
# Date: Fri Jun 14 12:03:49 2002 +0000
#
# Make absolute pathnames inside $LANGUAGE work.
gettext:
intl/bindtextdom.c
intl/dcgettext.c
intl/dcigettext.c
intl/dcngettext.c
intl/dgettext.c
intl/dngettext.c
intl/explodename.c
intl/finddomain.c
intl/gettext.c
intl/gettextP.h
intl/gmo.h
intl/hash-string.c
intl/hash-string.h
intl/l10nflist.c
intl/loadinfo.h
intl/loadmsgcat.c
intl/locale.alias
intl/localealias.c
intl/ngettext.c
intl/plural-exp.c
intl/plural-exp.h
intl/plural.y
intl/textdomain.c
# The following files are shared with the upstream Unicode project and must be
# updated regularly to stay in sync with the upstream Unicode releases.
#
# Merged from Unicode 15.1.0 release.
unicode:
localedata/unicode-gen/UnicodeData.txt
localedata/unicode-gen/unicode-license.txt
localedata/unicode-gen/DerivedCoreProperties.txt
localedata/unicode-gen/EastAsianWidth.txt
localedata/unicode-gen/HangulSyllableType.txt
# The following files are shared with the upstream tzcode project and must be
# updated regularly to stay in sync with the upstream releases.
#
# Currently synced to TZDB 2024a, announced and distributed here:
# https://mm.icann.org/pipermail/tz-announce/2024-February/000081.html
# https://data.iana.org/time-zones/releases/tzdb-2024a.tar.lz
tzcode:
timezone/private.h
timezone/tzfile.h
timezone/tzselect.ksh
timezone/version
timezone/zdump.c
timezone/zic.c
# The following files are shared with the upstream tzdata project but are not
# synchronized regularly. The data files themselves are used only for testing
# purposes and their data is never used to generate any output. We synchronize
# them only to stay on top of newer data that might help with testing.
#
# Currently synced to tzdata 2009i, announced and distributed here:
# https://mm.icann.org/pipermail/tz/2009-June/040697.html
# https://data.iana.org/time-zones/releases/tzdata2009i.tar.gz
tzdata:
timezone/africa
timezone/antarctica
timezone/asia
timezone/australasia
timezone/europe
timezone/northamerica
timezone/southamerica
timezone/pacificnew
timezone/etcetera
timezone/factory
timezone/backward
timezone/systemv
timezone/solar87
timezone/solar88
timezone/solar89
timezone/iso3166.tab
timezone/zone.tab
timezone/leapseconds
# This is yearistype.sh in the parent project
timezone/yearistype
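
The layout of this file (comment lines starting with `#`, project names ending in `:`, and one path per line) is simple enough to read with a few lines of Python. The following is a minimal sketch under that assumption, not the code in scripts/glibc_shared_code.py; the function name `parse_shared_files` is hypothetical.

```python
def parse_shared_files(text):
    """Return a dict mapping each project name to its list of file paths."""
    projects = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.endswith(":"):
            current = line[:-1]  # a new project section begins
            projects[current] = []
        elif current is not None:
            projects[current].append(line)
    return projects


sample = """\
# Merged from Unicode 15.1.0 release.
unicode:
localedata/unicode-gen/UnicodeData.txt
localedata/unicode-gen/HangulSyllableType.txt
tzcode:
timezone/zic.c
"""
shared = parse_shared_files(sample)
print(shared["unicode"])
```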