mirror of
https://sourceware.org/git/glibc.git
synced 2024-11-23 13:30:06 +00:00
25c9c3789e
= `Default_Ignorable_Code_Point`s should have width 0 = Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters with the `Default_Ignorable_Code_Point` property > should be rendered as completely invisible (and non advancing, i.e. “zero width”), if not explicitly supported in rendering. Hence, `wcwidth()` should give them all a width of 0, with two exceptions: - the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent - U+115F HANGUL CHOSEONG FILLER needs a carveout due to the unique behavior of the conjoining Korean jamo characters. One composed Hangul "syllable block" like 퓛 is made up of two to three individual component characters, or "jamo". These are all assigned an `East_Asian_Width` of `Wide` by Unicode, which would normally mean they would all be assigned width 2 by glibc; a combination of (leading choseong jamo) + (medial jungseong jamo) + (trailing jongseong jamo) would then have width 2 + 2 + 2 = 6. However, glibc (and other wcwidth implementations) special-cases jungseong and jongseong, assigning them all width 0, to ensure that the complete block has width 2 + 0 + 0 = 2 as it should. U+115F is meant for use in syllable blocks that are intentionally missing a leading jamo; it must be assigned a width of 2 even though it has no visible display to ensure that the complete block has width 2. However, `wcwidth()` currently (before this patch) incorrectly assigns non-zero width to U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER; this commit fixes that. Unicode spec references: - Hangul: §3.12 https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646 and §18.6 https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028 - `Default_Ignorable_Code_Point`: §5.21 https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095. = Non-`Default_Ignorable_Code_Point` format controls should be visible = The Unicode Standard, §5.21 - Characters Ignored for Display (https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095) says the following: > A small number of format characters (General_Category = Cf ) > are also not given the Default_Ignorable_Code_Point property. > This may surprise implementers, who often assume > that all format characters are generally ignored in fallback display. > The exact list of these exceptional format characters > can be found in the Unicode Character Database. > There are, however, three important sets of such format characters to note: > > - prepended concatenation marks > - interlinear annotation characters > - Egyptian hieroglyph format controls > > The prepended concatenation marks always have a visible display. > See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls) > for more discussion of the use and display of these signs. > > The other two notable sets of format characters that exceptionally are not ignored > in fallback display consist of the interlinear annotation characters, > U+FFF9 INTERLINEAR ANNOTATION ANCHOR through > U+FFFB INTERLINEAR ANNOTATION TERMINATOR, > and the Egyptian hieroglyph format controls, > U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through > U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE. > These characters should have a visible glyph display for fallback rendering, > because if they are not displayed, > it is too easy to misread the resulting displayed text. > See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials), > as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs) > for more discussion of the use and display of these characters. glibc currently correctly assigns non-zero width to the prepended concatenation marks, but it incorrectly gives zero width to the interlinear annotation characters (which a generic terminal cannot interpret) and the Egyptian hieroglyph format controls (which are not widely supported in rendering implementations at present). This commit fixes both these issues as well. = Derive Hangul syllable type from Unicode data = Previosuly, the jungseong and jongseong jamo ranges were hard-coded into the script. With this commit, they are instead parsed from the HangulSyllableType.txt data file published by Unicode. This does not affect the end result. Signed-off-by: Jules Bertholet <julesbertholet@quoi.xyz> |
||
---|---|---|
.. | ||
ctype_compatibility_test_cases.py | ||
ctype_compatibility.py | ||
DerivedCoreProperties.txt | ||
EastAsianWidth.txt | ||
gen_translit_circle.py | ||
gen_translit_cjk_compat.py | ||
gen_translit_combining.py | ||
gen_translit_compat.py | ||
gen_translit_font.py | ||
gen_translit_fraction.py | ||
gen_unicode_ctype.py | ||
HangulSyllableType.txt | ||
Makefile | ||
unicode_utils.py | ||
unicode-license.txt | ||
UnicodeData.txt | ||
utf8_compatibility.py | ||
utf8_gen.py |