<html>

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta name="GENERATOR" content="Microsoft FrontPage 5.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>IDN Characters</title>
<link rel="stylesheet" type="text/css" href="idn-chars.css">
</head>

<body>

<h1>IDN Character Categorization</h1>
<p><i>%date%, MED</i></p>
<p>This page lists all Unicode characters relevant to IDN in a <a href="#Categorization">chart</a>,
broken down by category. Characters are grouped first by script, and then by subcategory.</p>
<p>The "output" IDN characters are ones that can result from nameprep, while the "input" characters
are those that are allowed in input, but are transformed (remapped or deleted). Tool-tips provide the
character code and name (in enabled browsers). The following table describes the subcategories.
Within each subcategory, characters are sorted according to the default
<a href="http://www.unicode.org/reports/tr10/">UCA</a> order.</p>
<blockquote>
<table border="1" cellpadding="2" cellspacing="0">
<caption><b><font size="4">Key</font></b></caption>
<tr>
<th>Type</th>
<th>Subcategory</th>
<th>Description</th>
</tr>
<tr>
<th rowspan="5">Output</th>
<td class="Atomic"><a name="Atomic">Atomic</a></td>
<td>Characters that don't fall into any of the following subcategories</td>
</tr>
<tr>
<td class="Atomic-no-uppercase"><a name="Atomic-no-uppercase">Atomic-no-uppercase</a></td>
<td>For bicameral scripts, Atomic characters without an uppercase. These need to be examined
to see which are used in modern languages.</td>
</tr>
<tr>
<td class="Pattern_Syntax"><a name="Pattern_Syntax">Pattern_Syntax</a></td>
<td>Characters recommended as a basis for use in pattern syntax. Excludes the
<a href="#Word_Characters">additional word characters</a>.</td>
</tr>
<tr>
<td class="Non-XID"><a name="Non-XID">Non-XID</a></td>
<td>Characters not recommended as a basis for identifiers, excluding Pattern_Syntax and
<a href="#Word_Characters">additional word characters</a>.</td>
</tr>
<tr>
<td class="NFD-Decomposable"><a name="NFD-Decomposable">NFD-Decomposable</a></td>
<td>Characters with NFD (canonical) decompositions. These are broken out separately because
certain spoofing techniques are applied to them <i>via their decompositions.</i></td>
</tr>
<tr>
<th rowspan="4">Input</th>
<td class="IDN-Remapped-Case-Atomic"><a name="IDN-Remapped-Case-Atomic">
IDN-Remapped-Case-Atomic</a></td>
<td>Atomic characters remapped by IDN due to case folding [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
Section 3.2].</td>
</tr>
<tr>
<td class="IDN-Remapped-Case-NFD-Decomposable"><a name="IDN-Remapped-Case-NFD-Decomposable">
IDN-Remapped-Case-NFD-Decomposable</a></td>
<td>Characters that are NFD (canonical) decomposable and that are remapped by IDN due to case
folding [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a> Section 3.2].</td>
</tr>
<tr>
<td class="IDN-Remapped-Compat"><a name="IDN-Remapped-Compat">IDN-Remapped</a></td>
<td>Characters remapped by IDN due to compatibility (NFKD) mapping [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
Section 4].</td>
</tr>
<tr>
<td class="IDN-Deleted"><a name="IDN-Deleted">IDN-Deleted</a></td>
<td>Characters deleted by IDN, that is, mapped to nothing [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
Section 3.1].</td>
</tr>
<tr>
<th>Prohibited</th>
<td class="IDN-Prohibited"><a name="IDN-Prohibited">IDN-Prohibited</a></td>
<td>Characters prohibited in IDN [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
Section 5]. (Note: most of these result from IDN being based on an old version of Unicode. IDN does
treat unassigned characters differently from explicitly prohibited characters, but for our
purposes this distinction doesn't matter.)</td>
</tr>
</table>
</blockquote>
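<p>As a rough illustration of the NFD-Decomposable subcategory, the following Python sketch tests whether a
character changes under NFD normalization. The function name <code>is_nfd_decomposable</code> is hypothetical and is
not part of the tooling that produced this chart. Python's <code>unicodedata</code> module follows the Unicode version
bundled with the interpreter, which may differ from the version IDN is based on.</p>

```python
import unicodedata


def is_nfd_decomposable(ch: str) -> bool:
    """Return True if the single character ch changes under NFD normalization."""
    return unicodedata.normalize("NFD", ch) != ch


# U+00E9 decomposes canonically to U+0065 U+0301, so the precomposed and
# decomposed spellings render alike but are different code point sequences.
precomposed = "\u00e9"
decomposed = "e\u0301"

print(is_nfd_decomposable(precomposed))                         # True
print(is_nfd_decomposable("a"))                                 # False
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

<p>The last two lines hint at why such characters are listed separately: two strings can look identical while
comparing unequal until both are normalized.</p>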
<p>Characters that are normally invisible are represented in the chart by their Unicode number, such as "U+FE00".</p>
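<p>A minimal Python sketch of this labeling convention follows. The test for "normally invisible" is an assumption
made for illustration (general categories beginning with C or Z, plus the variation selector ranges); the chart's
actual rule is not stated here, and <code>chart_label</code> is a hypothetical name.</p>

```python
import unicodedata


def chart_label(ch: str) -> str:
    """Return ch itself, or its U+XXXX code if it is assumed to be invisible."""
    cp = ord(ch)
    category = unicodedata.category(ch)
    # Assumption: variation selectors (general category Mn) render as nothing.
    is_variation_selector = (
        cp in range(0xFE00, 0xFE10) or cp in range(0xE0100, 0xE01F0)
    )
    # Assumption: control, format, unassigned, and separator characters
    # (categories starting with C or Z) are also treated as invisible.
    if category[0] in "CZ" or is_variation_selector:
        return f"U+{cp:04X}"
    return ch


print(chart_label("\ufe00"))  # U+FE00
print(chart_label("\u200d"))  # U+200D
print(chart_label("a"))       # a
```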
<p>At the end of this document, there is an additional section that lists all <a href='#Visible_Combining_Marks_0'>visible non-spacing marks</a>.
These are sorted first by combining character class (modified), then by script, then by code point.</p>
<p>For comparison of Indic characters, see <a href='indic-trans.html'>indic-trans.html</a>.</p>
<h3>Additional <a name="Word_Characters">Word Characters</a></h3>
<p>This is a draft list of characters based on <i>Section 4 Word Boundaries</i> of
<a href="http://www.unicode.org/reports/tr29/tr29-9.html#Word_Boundaries">UAX #29</a>, specifically the
Word_Break property and the notes at the end of that section. While not currently a part of the
recommended characters for programming identifiers (XID_Continue), these characters have been
identified as being necessary for more "natural language" identifiers, since some words in some
modern languages could not be constructed without them. See also
<a href="http://www.unicode.org/reports/tr31/tr31-5.html">UAX #31: Identifier and Pattern Syntax</a>.
These characters are listed in the plain text file, as described below.</p>
<h2>Plain-Text Version</h2>
<p>The information in the categorization is also available in a plain-text file, at
<a href="idn-chars.txt">idn-chars.txt</a>. It can be viewed as is, or loaded into a spreadsheet for
sorting and filtering to view the data in different ways. The format is:</p>
<blockquote>
<p>code ; script ; subcategory # general-category (character) character-name</p>
</blockquote>
<p><i>Examples:</i></p>
<pre>0061 ; LATIN ; Atomic # ; L& (a) LATIN SMALL LETTER A
026B ; LATIN ; Atomic-no-uppercase # L& (ɫ) LATIN SMALL LETTER L WITH MIDDLE TILDE
2015 ; COMMON ; Pattern_Syntax # Pd (―) HORIZONTAL BAR
058A ; ARMENIAN ; Atomic-no-uppercase # ; Pd (֊) ARMENIAN HYPHEN
20AC ; COMMON ; Non-XID # ; Sc (€) EURO SIGN</pre>
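<p>For readers who would rather process the file with a script than a spreadsheet, here is a minimal Python sketch of
a parser for one data line in the format given above. The names <code>Entry</code> and <code>parse_line</code> are
hypothetical. Because some of the examples carry a ";" directly after the "#" and others do not, the sketch accepts
both forms; it does not handle section headings or blank lines in the file.</p>

```python
from typing import NamedTuple


class Entry(NamedTuple):
    code_point: int
    script: str
    subcategory: str
    general_category: str
    character: str
    name: str


def parse_line(line: str) -> Entry:
    """Parse: code ; script ; subcategory # general-category (character) character-name"""
    fields, _, comment = line.partition("#")
    code, script, subcategory = (part.strip() for part in fields.split(";"))
    # Tolerate an optional ";" directly after the "#".
    comment = comment.strip().lstrip(";").strip()
    general_category, _, rest = comment.partition(" ")
    # The character sits between "(" and the next ")" at least one position
    # later, so a literal ")" character is still picked up correctly.
    open_paren = rest.index("(")
    close_paren = rest.index(")", open_paren + 2)
    character = rest[open_paren + 1 : close_paren]
    name = rest[close_paren + 1 :].strip()
    return Entry(int(code, 16), script, subcategory, general_category, character, name)


entry = parse_line("20AC ; COMMON ; Non-XID # ; Sc (€) EURO SIGN")
print(f"U+{entry.code_point:04X}", entry.subcategory, entry.name)
# prints: U+20AC Non-XID EURO SIGN
```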
<p>At the end of <a href="idn-chars.txt">idn-chars.txt</a> is a section called ADDITIONAL WORD
CHARACTERS, defined as described above. Below that is a section of FOR REVIEW characters,
sorted by Unicode general category (an additional category of XX is added for the odd characters
whose names include <span style="font-variant: small-caps">MUSICAL SYMBOL, DINGBAT, or RADICAL</span>).
We need review of that list to check for characters that are needed for words in modern languages,
that is, that should be moved up into the ADDITIONAL WORD CHARACTERS list. Each character in the FOR
REVIEW list is collected because it either:</p>
<ol>
<li>would not otherwise count as part of an XID, or</li>
<li>is part of a bicameral script and doesn't have an uppercase (e.g., the situation for U+026B
above).</li>
</ol>
<p>In either case there is prima facie reason for some level of scrutiny, if the goal is to be
initially conservative in repertoire.</p>
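<p>The two collection criteria can be approximated in Python as sketched below. This is an illustration under stated
assumptions, not the procedure used to build the FOR REVIEW list: Python's <code>str.isidentifier()</code> stands in
for the XID_Continue test, a lowercase letter whose uppercase mapping is itself stands in for "bicameral script
without an uppercase" (the standard library has no script property), and all function names are hypothetical. Results
depend on the Unicode version bundled with the interpreter, so they may not match the lists in idn-chars.txt.</p>

```python
def is_xid_continue(ch: str) -> bool:
    """Approximate XID_Continue via Python's own identifier rules (assumption)."""
    return ("a" + ch).isidentifier()


def lacks_uppercase(ch: str) -> bool:
    """True for a lowercase character whose uppercase mapping is itself."""
    return ch.islower() and ch.upper() == ch


def needs_review(ch: str) -> bool:
    """Apply the two criteria: not usable in an XID, or no uppercase form."""
    return not is_xid_continue(ch) or lacks_uppercase(ch)


# U+20AC EURO SIGN is not an identifier character; U+0138 LATIN SMALL LETTER
# KRA (chosen here for illustration) is a lowercase letter with no uppercase.
print(needs_review("\u20ac"))  # True
print(needs_review("\u0138"))  # True
print(needs_review("a"))       # False
```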
<h2><a name="Categorization">Categorization</a></h2>