mirror of
https://sourceware.org/git/glibc.git
synced 2025-01-13 12:40:08 +00:00
25c9c3789e
= `Default_Ignorable_Code_Point`s should have width 0 =

Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters with the `Default_Ignorable_Code_Point` property

> should be rendered as completely invisible (and non advancing, i.e. “zero width”), if not explicitly supported in rendering.

Hence, `wcwidth()` should give them all a width of 0, with two exceptions:

- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
- U+115F HANGUL CHOSEONG FILLER needs a carveout due to the unique behavior of the conjoining Korean jamo characters. One composed Hangul "syllable block" like 퓛 is made up of two to three individual component characters, or "jamo". These are all assigned an `East_Asian_Width` of `Wide` by Unicode, which would normally mean they would all be assigned width 2 by glibc; a combination of (leading choseong jamo) + (medial jungseong jamo) + (trailing jongseong jamo) would then have width 2 + 2 + 2 = 6. However, glibc (and other wcwidth implementations) special-cases jungseong and jongseong, assigning them all width 0, to ensure that the complete block has width 2 + 0 + 0 = 2 as it should. U+115F is meant for use in syllable blocks that are intentionally missing a leading jamo; it must be assigned a width of 2 even though it has no visible display to ensure that the complete block has width 2.

However, `wcwidth()` currently (before this patch) incorrectly assigns non-zero width to U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER; this commit fixes that.

Unicode spec references:

- Hangul: §3.12 https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646 and §18.6 https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028
- `Default_Ignorable_Code_Point`: §5.21 https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095
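The width arithmetic for a syllable block can be illustrated with a minimal Python sketch. It is not glibc's implementation: the function names `jamo_width` and `block_width` are hypothetical, and the ranges are hard-coded here only for illustration (they cover the Hangul Jamo block at U+1100..U+11FF and the two fillers named above).

```python
import unicodedata

# Illustrative ranges only; not taken from glibc's generated tables.
LEADING = range(0x1100, 0x1160)    # choseong, includes U+115F CHOSEONG FILLER
MEDIAL = range(0x1160, 0x11A8)     # jungseong
TRAILING = range(0x11A8, 0x1200)   # jongseong
IGNORABLE_FILLERS = {0x3164, 0xFFA0}  # HANGUL FILLER, HALFWIDTH HANGUL FILLER


def jamo_width(cp: int) -> int:
    """Hypothetical helper: width of one code point under the rules described above."""
    if cp in MEDIAL or cp in TRAILING:
        return 0  # special-cased so a full block is not counted as 6 columns
    if cp in LEADING:
        return 2  # East_Asian_Width=Wide; U+115F keeps width 2 as the carveout
    if cp in IGNORABLE_FILLERS:
        return 0  # Default_Ignorable_Code_Point, no carveout
    raise ValueError(f"U+{cp:04X} is outside this sketch")


def block_width(text: str) -> int:
    """Sum the per-code-point widths of a decomposed syllable block."""
    return sum(jamo_width(ord(ch)) for ch in text)


# Canonical decomposition splits the composed syllable into its three jamo.
jamo = unicodedata.normalize("NFD", "\uD4DB")  # the example syllable 퓛
full_block = block_width(jamo)                      # 2 + 0 + 0
filler_block = block_width("\u115F\u1171\u11B6")    # filler + medial + trailing
```

Both `full_block` and `filler_block` come out as 2, which is why U+115F cannot simply follow the zero-width rule for default-ignorable characters.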
= Non-`Default_Ignorable_Code_Point` format controls should be visible =

The Unicode Standard, §5.21 - Characters Ignored for Display (https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095) says the following:

> A small number of format characters (General_Category = Cf)
> are also not given the Default_Ignorable_Code_Point property.
> This may surprise implementers, who often assume
> that all format characters are generally ignored in fallback display.
> The exact list of these exceptional format characters
> can be found in the Unicode Character Database.
> There are, however, three important sets of such format characters to note:
>
> - prepended concatenation marks
> - interlinear annotation characters
> - Egyptian hieroglyph format controls
>
> The prepended concatenation marks always have a visible display.
> See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls)
> for more discussion of the use and display of these signs.
>
> The other two notable sets of format characters that exceptionally are not ignored
> in fallback display consist of the interlinear annotation characters,
> U+FFF9 INTERLINEAR ANNOTATION ANCHOR through
> U+FFFB INTERLINEAR ANNOTATION TERMINATOR,
> and the Egyptian hieroglyph format controls,
> U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through
> U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE.
> These characters should have a visible glyph display for fallback rendering,
> because if they are not displayed,
> it is too easy to misread the resulting displayed text.
> See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials),
> as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs)
> for more discussion of the use and display of these characters.

glibc currently correctly assigns non-zero width to the prepended concatenation marks, but it incorrectly gives zero width to the interlinear annotation characters (which a generic terminal cannot interpret) and the Egyptian hieroglyph format controls (which are not widely supported in rendering implementations at present). This commit fixes both these issues as well.

= Derive Hangul syllable type from Unicode data =

Previously, the jungseong and jongseong jamo ranges were hard-coded into the script. With this commit, they are instead parsed from the HangulSyllableType.txt data file published by Unicode. This does not affect the end result.

Signed-off-by: Jules Bertholet <julesbertholet@quoi.xyz>
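As a rough illustration of deriving the jungseong and jongseong ranges from HangulSyllableType.txt instead of hard-coding them, the following Python sketch parses a few lines written in that file's `range ; type # comment` format. It is not the script changed by this commit: `SAMPLE`, `parse_syllable_types`, and `zero_width_jamo_ranges` are hypothetical names, and the sample is a short excerpt embedded for self-containment, not the full data file.

```python
# A few lines in the format of HangulSyllableType.txt (excerpt for illustration).
SAMPLE = """\
# comment lines and blank lines are skipped

1100..115F    ; L # Lo  [96] HANGUL CHOSEONG KIYEOK..HANGUL CHOSEONG FILLER
1160..11A7    ; V # Lo  [72] HANGUL JUNGSEONG FILLER..HANGUL JUNGSEONG O-YAE
11A8..11FF    ; T # Lo  [88] HANGUL JONGSEONG KIYEOK..HANGUL JONGSEONG SSANGNIEUN
AC00          ; LV # Lo       HANGUL SYLLABLE GA
"""


def parse_syllable_types(text):
    """Map each Hangul_Syllable_Type value to a list of (first, last) ranges."""
    ranges = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        span, kind = (field.strip() for field in data.split(";"))
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point entries
        ranges.setdefault(kind, []).append((start, end))
    return ranges


def zero_width_jamo_ranges(ranges):
    """Medial (V) and trailing (T) jamo are the ones special-cased to width 0."""
    return ranges.get("V", []) + ranges.get("T", [])


parsed = parse_syllable_types(SAMPLE)
zero_width = zero_width_jamo_ranges(parsed)
```

For the excerpt above, `zero_width` holds the ranges U+1160..U+11A7 and U+11A8..U+11FF. Reading the types from the data file means new jamo ranges are picked up when the Unicode data is updated, without editing the script.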
||
---|---|---|
.. | ||
ANSI_X3.4-1968 | ||
ANSI_X3.110-1983 | ||
ARMSCII-8 | ||
ASMO_449 | ||
BIG5 | ||
BIG5-HKSCS | ||
BRF | ||
BS_4730 | ||
BS_VIEWDATA | ||
CP737 | ||
CP770 | ||
CP771 | ||
CP772 | ||
CP773 | ||
CP774 | ||
CP775 | ||
CP949 | ||
CP1125 | ||
CP1250 | ||
CP1251 | ||
CP1252 | ||
CP1253 | ||
CP1254 | ||
CP1255 | ||
CP1256 | ||
CP1257 | ||
CP1258 | ||
CP10007 | ||
CSA_Z243.4-1985-1 | ||
CSA_Z243.4-1985-2 | ||
CSA_Z243.4-1985-GR | ||
CSN_369103 | ||
CWI | ||
DEC-MCS | ||
DIN_66003 | ||
DS_2089 | ||
EBCDIC-AT-DE | ||
EBCDIC-AT-DE-A | ||
EBCDIC-CA-FR | ||
EBCDIC-DK-NO | ||
EBCDIC-DK-NO-A | ||
EBCDIC-ES | ||
EBCDIC-ES-A | ||
EBCDIC-ES-S | ||
EBCDIC-FI-SE | ||
EBCDIC-FI-SE-A | ||
EBCDIC-FR | ||
EBCDIC-IS-FRISS | ||
EBCDIC-IT | ||
EBCDIC-PT | ||
EBCDIC-UK | ||
EBCDIC-US | ||
ECMA-CYRILLIC | ||
ES | ||
ES2 | ||
EUC-JISX0213 | ||
EUC-JP | ||
EUC-JP-MS | ||
EUC-KR | ||
EUC-TW | ||
GB2312 | ||
GB18030 | ||
GB_1988-80 | ||
GBK | ||
GEORGIAN-ACADEMY | ||
GEORGIAN-PS | ||
GOST_19768-74 | ||
GREEK7 | ||
GREEK7-OLD | ||
GREEK-CCITT | ||
HP-GREEK8 | ||
HP-ROMAN8 | ||
HP-ROMAN9 | ||
HP-THAI8 | ||
HP-TURKISH8 | ||
IBM037 | ||
IBM038 | ||
IBM256 | ||
IBM273 | ||
IBM274 | ||
IBM275 | ||
IBM277 | ||
IBM278 | ||
IBM280 | ||
IBM281 | ||
IBM284 | ||
IBM285 | ||
IBM290 | ||
IBM297 | ||
IBM420 | ||
IBM423 | ||
IBM424 | ||
IBM437 | ||
IBM500 | ||
IBM850 | ||
IBM851 | ||
IBM852 | ||
IBM855 | ||
IBM856 | ||
IBM857 | ||
IBM858 | ||
IBM860 | ||
IBM861 | ||
IBM862 | ||
IBM863 | ||
IBM864 | ||
IBM865 | ||
IBM866 | ||
IBM866NAV | ||
IBM868 | ||
IBM869 | ||
IBM870 | ||
IBM871 | ||
IBM874 | ||
IBM875 | ||
IBM880 | ||
IBM891 | ||
IBM903 | ||
IBM904 | ||
IBM905 | ||
IBM918 | ||
IBM922 | ||
IBM1004 | ||
IBM1026 | ||
IBM1047 | ||
IBM1124 | ||
IBM1129 | ||
IBM1132 | ||
IBM1133 | ||
IBM1160 | ||
IBM1161 | ||
IBM1162 | ||
IBM1163 | ||
IBM1164 | ||
IEC_P27-1 | ||
INIS | ||
INIS-8 | ||
INIS-CYRILLIC | ||
INVARIANT | ||
ISIRI-3342 | ||
ISO_646.BASIC | ||
ISO_646.IRV | ||
ISO_2033-1983 | ||
ISO_5427 | ||
ISO_5427-EXT | ||
ISO_5428 | ||
ISO_6937 | ||
ISO_6937-2-25 | ||
ISO_6937-2-ADD | ||
ISO_8859-1,GL | ||
ISO_8859-SUPP | ||
ISO_10367-BOX | ||
ISO_10646 | ||
ISO_11548-1 | ||
ISO-8859-1 | ||
ISO-8859-2 | ||
ISO-8859-3 | ||
ISO-8859-4 | ||
ISO-8859-5 | ||
ISO-8859-6 | ||
ISO-8859-7 | ||
ISO-8859-8 | ||
ISO-8859-9 | ||
ISO-8859-9E | ||
ISO-8859-10 | ||
ISO-8859-11 | ||
ISO-8859-13 | ||
ISO-8859-14 | ||
ISO-8859-15 | ||
ISO-8859-16 | ||
ISO-IR-90 | ||
ISO-IR-197 | ||
ISO-IR-209 | ||
IT | ||
JIS_C6220-1969-JP | ||
JIS_C6220-1969-RO | ||
JIS_C6229-1984-A | ||
JIS_C6229-1984-B | ||
JIS_C6229-1984-B-ADD | ||
JIS_C6229-1984-HAND | ||
JIS_C6229-1984-HAND-ADD | ||
JIS_C6229-1984-KANA | ||
JIS_X0201 | ||
JOHAB | ||
JUS_I.B1.002 | ||
JUS_I.B1.003-MAC | ||
JUS_I.B1.003-SERB | ||
KOI8-R | ||
KOI8-RU | ||
KOI8-T | ||
KOI8-U | ||
KOI-8 | ||
KSC5636 | ||
LATIN-GREEK | ||
LATIN-GREEK-1 | ||
MAC-CENTRALEUROPE | ||
MAC-CYRILLIC | ||
MAC-IS | ||
MAC-SAMI | ||
MAC-UK | ||
MACINTOSH | ||
MIK | ||
MSZ_7795.3 | ||
NATS-DANO | ||
NATS-DANO-ADD | ||
NATS-SEFI | ||
NATS-SEFI-ADD | ||
NC_NC00-10 | ||
NEXTSTEP | ||
NF_Z_62-010 | ||
NF_Z_62-010_1973 | ||
NS_4551-1 | ||
NS_4551-2 | ||
PT | ||
PT2 | ||
PT154 | ||
RK1048 | ||
SAMI | ||
SAMI-WS2 | ||
SEN_850200_B | ||
SEN_850200_C | ||
SHIFT_JIS | ||
SHIFT_JISX0213 | ||
T.61-7BIT | ||
T.61-8BIT | ||
T.101-G2 | ||
TCVN5712-1 | ||
TIS-620 | ||
TSCII | ||
UTF-8 | ||
VIDEOTEX-SUPPL | ||
VISCII | ||
WINDOWS-31J |