IDN Character Categorization

%date%, MED

This page lists all Unicode characters relevant to IDN in a chart, broken down by category. Characters are grouped first by script, and then by subcategory.

The "output" IDN characters are those that can result from nameprep, while the "input" characters are those that are allowed in input but are transformed (remapped or deleted). Tool-tips provide the character code and name (in enabled browsers). The following table describes the subcategories. Within each subcategory, characters are sorted according to the default UCA order.

Key
Type | Subcategory | Description
Output | Atomic | Characters that don't fall into any of the following subcategories.
Output | Atomic-no-uppercase | For bicameral scripts, Atomic characters without an uppercase. These need to be examined to see which are used in modern languages.
Output | Pattern_Syntax | Characters recommended as a basis for use in pattern syntax. Excludes the additional word characters.
Output | Non-XID | Characters not recommended as a basis for identifiers, excluding Pattern_Syntax and additional word characters.
Output | NFD-Decomposable | Characters with NFD (canonical) decompositions. These are broken out separately because certain spoofing techniques are applied to them via their decompositions.
Input | IDN-Remapped-Case-Atomic | Atomic characters remapped by IDN due to case folding [StringPrep Section 3.2].
Input | IDN-Remapped-Case-NFD-Decomposable | Characters that are NFD (canonical) decomposable and that are remapped by IDN due to case folding [StringPrep Section 3.2].
Input | IDN-Remapped | Characters remapped by IDN due to compatibility (NFKD) mapping [StringPrep Section 4].
Input | IDN-Deleted | Characters deleted by IDN, that is, mapped to nothing [StringPrep Section 3.1].
Prohibited | IDN-Prohibited | Characters prohibited in IDN [StringPrep Section 5]. (Note: most of these are due to IDN's using an old version of Unicode. IDN does treat unassigned characters differently than explicitly prohibited characters, but for our purposes this distinction doesn't matter.)
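As a rough illustration of how the Input and Prohibited subcategories relate to the StringPrep tables, the following sketch uses Python's standard stringprep and unicodedata modules. It is not the procedure used to build the chart: the function name idn_subcategory, the order in which the checks are applied, and the catch-all "Output" label are assumptions made for this example, and the Atomic, Pattern_Syntax, and Non-XID distinctions are not attempted.

```python
import stringprep
import unicodedata

# Unicode 3.2 data, the version that nameprep is based on.
UCD32 = unicodedata.ucd_3_2_0

# Prohibited-output tables referenced by nameprep (RFC 3491, Section 5).
_PROHIBITED_TABLES = (
    stringprep.in_table_c12, stringprep.in_table_c22, stringprep.in_table_c3,
    stringprep.in_table_c4, stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8, stringprep.in_table_c9,
)

def idn_subcategory(ch):
    """Approximate subcategory for one character (hypothetical helper)."""
    # Table B.1: characters that are mapped to nothing.
    if stringprep.in_table_b1(ch):
        return "IDN-Deleted"
    # Unassigned code points (Table A.1) are lumped in with prohibited ones,
    # as the key above does.
    if stringprep.in_table_a1(ch) or any(t(ch) for t in _PROHIBITED_TABLES):
        return "IDN-Prohibited"
    nfd_decomposable = UCD32.normalize("NFD", ch) != ch
    # Table B.3: case folding without normalization.
    if stringprep.map_table_b3(ch) != ch:
        if nfd_decomposable:
            return "IDN-Remapped-Case-NFD-Decomposable"
        return "IDN-Remapped-Case-Atomic"
    # Anything else changed by NFKC is treated as a compatibility remapping.
    if UCD32.normalize("NFKC", ch) != ch:
        return "IDN-Remapped"
    if nfd_decomposable:
        return "NFD-Decomposable"
    return "Output"  # Atomic / Pattern_Syntax / Non-XID are not distinguished here

if __name__ == "__main__":
    for ch in "A\u00c9\u00e9\u2460\ufe00\ue000\u20ac":
        print("U+%04X" % ord(ch), idn_subcategory(ch))
```

Characters that fall into more than one table (for example, some that are both deleted and prohibited) are reported under whichever check comes first, so a different ordering would give different answers for those cases.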

Characters that are normally invisible are represented in the chart by their Unicode number, such as "U+FE00".

At the end of this document, there is an additional section that lists all visible non-spacing marks. These are sorted first by combining character class (modified), then by script, then by code point.
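The three-level sort for non-spacing marks can be sketched as follows. This is only an approximation under stated assumptions: it uses the unmodified canonical combining class (the modification applied in the chart is not described here), and because Python's standard library does not expose the Script property, it guesses the script from the first word of the character name. The helper names mark_sort_key and sorted_nonspacing_marks are hypothetical.

```python
import unicodedata

def mark_sort_key(ch):
    """Sort key: (combining class, guessed script, code point)."""
    name = unicodedata.name(ch, "")
    # Assumption: the first word of the name stands in for the script,
    # e.g. "HEBREW" or "COMBINING" (the latter for generic marks).
    script_guess = name.split(" ", 1)[0]
    return (unicodedata.combining(ch), script_guess, ord(ch))

def sorted_nonspacing_marks(chars):
    """Keep only non-spacing marks (general category Mn) and sort them."""
    marks = [c for c in chars if unicodedata.category(c) == "Mn"]
    return sorted(marks, key=mark_sort_key)

if __name__ == "__main__":
    for ch in sorted_nonspacing_marks("a\u0301\u0323\u05b0\u0300"):
        print("U+%04X" % ord(ch), unicodedata.combining(ch), unicodedata.name(ch))
```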

For comparison of Indic characters, see indic-trans.html.

Additional Word Characters

This is a draft list of characters based on Section 4, Word Boundaries, of UAX #29: specifically, on the Word_Break property and the notes at the end of that section. Although these characters are not currently among those recommended for programming identifiers (XID_Continue), they have been identified as necessary for more "natural language" identifiers, since some words in some modern languages cannot be constructed without them. See also UAX #31: Identifier and Pattern Syntax. These characters are listed in the plain-text file, as described below.

Plain-Text Version

The information in the categorization is also available in a plain-text file, at idn-chars.txt. It can be viewed as is, or loaded into a spreadsheet for sorting and filtering to view the data in different ways. The format is:

code ; script ; subcategory # general-category (character) character-name

Examples:

0061          ; LATIN ; Atomic # ; L& (a) LATIN SMALL LETTER A
026B          ; LATIN ; Atomic-no-uppercase # L& (ɫ) LATIN SMALL LETTER L WITH MIDDLE TILDE
2015          ; COMMON ; Pattern_Syntax # Pd (―) HORIZONTAL BAR
058A          ; ARMENIAN ; Atomic-no-uppercase # ; Pd (֊) ARMENIAN HYPHEN
20AC          ; COMMON ; Non-XID # ; Sc (€) EURO SIGN
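For readers who would rather process the file with a script than a spreadsheet, here is a minimal parsing sketch. It assumes only the line format shown above and tolerates the optional ";" that follows "#" in some of the examples. The names IdnCharRecord and parse_line are invented for this illustration, and the sketch has not been checked against the full idn-chars.txt file.

```python
import re
from typing import NamedTuple, Optional

class IdnCharRecord(NamedTuple):
    code_point: int
    script: str
    subcategory: str
    general_category: str
    name: str

# code ; script ; subcategory # general-category (character) character-name
_LINE = re.compile(
    r"^\s*(?P<code>[0-9A-Fa-f]{4,6})\s*;"
    r"\s*(?P<script>[^;#]+?)\s*;"
    r"\s*(?P<subcat>[^;#]+?)\s*#"
    r"\s*(?:;\s*)?"                      # some lines carry an extra ';' here
    r"(?P<gc>\S+)\s+\((?P<char>.*?)\)\s+(?P<name>.+?)\s*$"
)

def parse_line(line: str) -> Optional[IdnCharRecord]:
    """Parse one data line; return None for blank or non-data lines."""
    m = _LINE.match(line)
    if m is None:
        return None
    return IdnCharRecord(
        code_point=int(m.group("code"), 16),
        script=m.group("script"),
        subcategory=m.group("subcat"),
        general_category=m.group("gc"),
        name=m.group("name"),
    )

if __name__ == "__main__":
    sample = [
        "0061          ; LATIN ; Atomic # ; L& (a) LATIN SMALL LETTER A",
        "2015          ; COMMON ; Pattern_Syntax # Pd (\u2015) HORIZONTAL BAR",
    ]
    for text in sample:
        print(parse_line(text))
```

The literal character inside the parentheses is matched but not stored, since it can be recovered from the code point.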

At the end of idn-chars.txt is a section called ADDITIONAL WORD CHARACTERS, defined as described above. Below that is a section of FOR REVIEW characters, sorted by Unicode general category (with an additional category, XX, for the odd characters whose names include MUSICAL SYMBOL, DINGBAT, or RADICAL). That list needs review to identify characters that are needed for words in modern languages and so should be moved up into the ADDITIONAL WORD CHARACTERS list. A character is included in the FOR REVIEW list because it either:

  1. would not otherwise count as part of an XID, or
  2. is part of a bicameral script and doesn't have an uppercase (e.g., the situation for U+026B above)

In either case there is prima facie reason for some level of scrutiny, if the goal is to be initially conservative in repertoire.
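The two criteria can be approximated with Python's built-in string methods, as in the sketch below. It rests on assumptions: str.isidentifier is used as a stand-in for the XID_Continue test, "lowercase with no distinct uppercase" is used as a stand-in for "bicameral script without an uppercase", and the function name review_reasons is hypothetical. Results depend on the Unicode version built into the Python runtime, so they can differ from the chart for characters whose properties or case mappings have changed since this list was drafted.

```python
def review_reasons(ch):
    """Return the FOR REVIEW criteria (if any) that a character meets."""
    reasons = []
    # Criterion 1: "a" + ch is a valid identifier only if ch is allowed
    # as an identifier continuation character.
    if not ("a" + ch).isidentifier():
        reasons.append("non-XID")
    # Criterion 2: a lowercase character whose uppercase mapping is itself.
    if ch.islower() and ch.upper() == ch:
        reasons.append("no-uppercase")
    return reasons

if __name__ == "__main__":
    # U+0138 LATIN SMALL LETTER KRA is a lowercase letter with no uppercase.
    for ch in "a\u20ac\u0138\u058a":
        print("U+%04X" % ord(ch), review_reasons(ch))
```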

Categorization