mirror of
https://sourceware.org/git/glibc.git
synced 2024-11-22 13:00:06 +00:00
Update.
1999-01-12 Ulrich Drepper <drepper@cygnus.com>

* manual/charset.texi: Add many corrections.
Patch by Benjamin Kosnik <bkoz@cygnus.com>.
parent c1b2d47280
commit d731df03bd
@@ -1,3 +1,8 @@
+1999-01-12 Ulrich Drepper <drepper@cygnus.com>
+
+* manual/charset.texi: Add many corrections.
+Patch by Benjamin Kosnik <bkoz@cygnus.com>.
+
 1999-01-12 Andreas Jaeger <aj@arthur.rhein-neckar.de>

 * manual/charset.texi: Fix some typos.
@@ -8,13 +8,14 @@
 @end macro
 @end ifnottex

-Character sets used in the early days of computers had only six, seven,
-or eight bits for each character. In no case more bits than would fit
-into one byte which nowadays is almost exclusively @w{8 bits} wide.
-This of course leads to several problems once not all characters needed
-at one time can be represented by the up to 256 available characters.
-This chapter shows the functionality which was added to the C library to
-overcome this problem.
+Character sets used in the early days of computing had only six, seven,
+or eight bits for each character: there was never a case where more than
+eight bits (one byte) were used to represent a single character. The
+limitations of this approach became more apparent as more people
+grappled with non-Roman character sets, where not all the characters
+that make up a language's character set can be represented by @math{2^8}
+choices. This chapter shows the functionality which was added to the C
+library to correctly support multiple character sets.

 @menu
 * Extended Char Intro:: Introduction to Extended Characters.
@@ -30,18 +31,20 @@ overcome this problem.
 @node Extended Char Intro
 @section Introduction to Extended Characters

-To overcome the limitations of character sets with a 1:1 relation
-between bytes and characters people came up with a variety of solutions.
-The remainder of this section gives a few examples to help understanding
-the design decision made while developing the functionality of the @w{C
-library} to support them.
+A variety of solutions to overcome the differences between
+character sets with a 1:1 relation between bytes and characters and
+character sets with ratios of 2:1 or 4:1 exist. The remainder of this
+section gives a few examples to help understand the design decisions
+made while developing the functionality of the @w{C library}.

 @cindex internal representation
 A distinction we have to make right away is between internal and
 external representation. @dfn{Internal representation} means the
 representation used by a program while keeping the text in memory.
 External representations are used when text is stored or transmitted
-through whatever communication channel.
+through whatever communication channel. Examples of external
+representations include files lying in a directory that are going to be
+read and parsed.

 Traditionally there was no difference between the two representations.
 It was equally comfortable and useful to use the same one-byte
@@ -49,24 +52,24 @@ representation internally and externally. This changes with more and
 larger character sets.

 One of the problems to overcome with the internal representation is
-handling text which were externally encoded using different character
+handling text which is externally encoded using different character
 sets. Assume a program which reads two texts and compares them using
 some metric. The comparison can be usefully done only if the texts are
 internally kept in a common format.

 @cindex wide character
 For such a common format (@math{=} character set) eight bits are certainly
-not enough anymore. So the smallest entity will have to grow: @dfn{wide
-characters} will be used. Here instead of one byte one uses two or four
-(three are not good to address in memory and more than four bytes seem
-not to be necessary).
+no longer enough. So the smallest entity will have to grow: @dfn{wide
+characters} will now be used. Instead of one byte, two or four will
+be used instead. (Three are not good to address in memory and more
+than four bytes seem not to be necessary).

 @cindex Unicode
 @cindex ISO 10646
-As shown in some other part of this manual
+As shown in some other part of this manual,
 @c !!! Ahem, wide char string functions are not yet covered -- drepper
 there exists a completely new family of functions which can handle texts
-of this kinds in memory. The most commonly used character set for such
+of this kind in memory. The most commonly used character set for such
 internal wide character representations are Unicode and @w{ISO 10646}.
 The former is a subset of the later and used when wide characters are
 chosen to by 2 bytes (@math{= 16} bits) wide. The standard names of the
@@ -75,11 +78,11 @@ chosen to by 2 bytes (@math{= 16} bits) wide. The standard names of the
 encodings used in these cases are UCS2 (@math{= 16} bits) and UCS4
 (@math{= 32} bits).

-To represent wide characters the @code{char} type is certainly not
-suitable. For this reason the @w{ISO C} standard introduces a new type
-which is designed to keep one character of a wide character string. To
-maintain the similarity there is also a type corresponding to @code{int}
-for those functions which take a single wide character.
+To represent wide characters the @code{char} type is not suitable. For
+this reason the @w{ISO C} standard introduces a new type which is
+designed to keep one character of a wide character string. To maintain
+the similarity there is also a type corresponding to @code{int} for
+those functions which take a single wide character.

 @comment stddef.h
 @comment ISO
@@ -98,7 +101,7 @@ But for GNU systems this type is always 32 bits wide. It is therefore
 capable to represent all UCS4 value therefore covering all of @w{ISO
 10646}. Some Unix systems define @code{wchar_t} as a 16 bit type and
 thereby follow Unicode very strictly. This is perfectly fine with the
-standard but it also means that to represent all characters fro Unicode
+standard but it also means that to represent all characters from Unicode
 and @w{ISO 10646} one has to use surrogate character which is in fact a
 multi-wide-character encoding. But this contradicts the purpose of the
 @code{wchar_t} type.
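Note on the hunk above: the "surrogate character" mechanism it mentions can be illustrated with the UTF-16 arithmetic. This is a minimal sketch and not part of the patch; the function name `to_utf16` is hypothetical. It shows why a 16-bit `wchar_t` needs two units for values above 0xFFFF, which is the multi-wide-character encoding the text objects to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a code point into UTF-16 units.  Values up to 0xFFFF fit in
   one unit; larger values need a high/low surrogate pair.  Returns
   the number of 16-bit units written to OUT (1 or 2).  */
int
to_utf16 (uint32_t cp, uint16_t out[2])
{
  assert (out != NULL && cp <= 0x10FFFF);
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;
      return 1;
    }
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```

With a 32-bit `wchar_t`, as on GNU systems, no such split is needed: every value fits in one wide character.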
@@ -183,26 +186,30 @@ defined in @file{wchar.h}.


 These internal representations present problems when it comes to storing
-and transmitting them. Since a single wide character consists of more
+and transmittal, since a single wide character consists of more
 than one byte they are effected by byte-ordering. I.e., machines with
 different endianesses would see different value accessing the same data.
 This also applies for communication protocols which are all byte-based
 and therefore the sender has to decide about splitting the wide
-character in bytes. A last but not least important point is that wide
+character in bytes. A last (but not least important) point is that wide
 characters often require more storage space than an customized byte
 oriented character set.

 @cindex multibyte character
-This is why most of the time an external encoding which is different
-from the internal encoding is used if the later is UCS2 or UCS4. The
-external encoding is byte-based and can be chosen appropriately for the
-environment and for the texts to be handled. There exists a variety of
-different character sets which can be used which is too much to be
-handled completely here. We restrict ourself here to a description of
-the major groups. All of the ASCII-based character sets fulfill one
-requirement: they are ``filesystem safe''. This means that the
-character @code{'/'} is used in the encoding @emph{only} to represent
-itself. Things are a bit different for character like EBCDIC but if the
+@cindex EBCDIC
+For all the above reasons, an external encoding which is different
+from the internal encoding is often used if the later is UCS2 or UCS4.
+The external encoding is byte-based and can be chosen appropriately for
+the environment and for the texts to be handled. There exist a variety
+of different character sets which can be used for this external
+encoding. Information which will not be exhaustively presented
+here--instead, a description of the major groups will suffice. All of
+the ASCII-based character sets [_bkoz_: do you mean Roman character
+sets? If not, what do you mean here?] fulfill one requirement: they are
+"filesystem safe". This means that the character @code{'/'} is used in
+the encoding @emph{only} to represent itself. Things are a bit
+different for character sets like EBCDIC (Extended Binary Coded Decimal
+Interchange Code, a character set family used by IBM) but if the
 operation system does not understand EBCDIC directly the parameters to
 system calls have to be converted first anyhow.

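Note on the hunk above: the byte-ordering problem can be made concrete by writing out one wide character value in both byte orders. This is a sketch only and not part of the patch; the helper names `put_ucs4_be` and `put_ucs4_le` are hypothetical. A sender on a byte-based channel has to pick one of these layouts, and the receiver has to know which.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit wide character value most significant byte first.  */
void
put_ucs4_be (uint32_t wc, unsigned char out[4])
{
  assert (out != NULL);
  out[0] = (unsigned char) ((wc >> 24) & 0xff);
  out[1] = (unsigned char) ((wc >> 16) & 0xff);
  out[2] = (unsigned char) ((wc >> 8) & 0xff);
  out[3] = (unsigned char) (wc & 0xff);
}

/* Store the same value least significant byte first.  */
void
put_ucs4_le (uint32_t wc, unsigned char out[4])
{
  assert (out != NULL);
  out[0] = (unsigned char) (wc & 0xff);
  out[1] = (unsigned char) ((wc >> 8) & 0xff);
  out[2] = (unsigned char) ((wc >> 16) & 0xff);
  out[3] = (unsigned char) ((wc >> 24) & 0xff);
}
```

For the value 0x20AC the first function yields the bytes 00 00 20 AC and the second AC 20 00 00. Either way four bytes are spent on one character, which is the storage cost the paragraph mentions.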
@@ -212,7 +219,7 @@ The simplest character sets are one-byte character sets. There can be
 only up to 256 characters (for @w{8 bit} character sets) which is not
 sufficient to cover all languages but might be sufficient to handle a
 specific text. Another reason to choose this is because of constraints
-from interaction with other programs.
+from interaction with other programs (which might not be 8-bit clean).

 @cindex ISO 2022
 @item
@@ -243,12 +250,12 @@ Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
 @cindex ISO 6937
 Early attempts to fix 8 bit character sets for other languages using the
 Roman alphabet lead to character sets like @w{ISO 6937}. Here bytes
-representing characters like the acute accent do not produce output on
-there on. One has to combine them with other characters. E.g., the
-byte sequence @code{0xc2 0x61} (non-spacing acute accent, following by
-lower-case `a') to get the ``small a with acute'' character. To get the
-acute accent character on its on one has to write @code{0xc2 0x20} (the
-non-spacing acute followed by a space).
+representing characters like the acute accent do not produce output
+themselves: one has to combine them with other characters to get the
+desired result. E.g., the byte sequence @code{0xc2 0x61} (non-spacing
+acute accent, following by lower-case `a') to get the ``small a with
+acute'' character. To get the acute accent character on its on one has
+to write @code{0xc2 0x20} (the non-spacing acute followed by a space).

 This type of characters sets is quite frequently used in embedded
 systems such as video text.
@@ -265,29 +272,29 @@ encoding: UTF-8. This encoding is able to represent all of @w{ISO
 There were a few other attempts to encode @w{ISO 10646} such as UTF-7
 but UTF-8 is today the only encoding which should be used. In fact,
 UTF-8 will hopefully soon be the only external which has to be
-supported. It proofs to be universally usable and the only disadvantage
-is that it favor Latin languages very much by making the byte string
+supported. It proves to be universally usable and the only disadvantage
+is that it favor Roman languages very much by making the byte string
 representation of other scripts (Cyrillic, Greek, Asian scripts) longer
-than necessary if using a specific character set for these scripts. But
-with methods like the Unicode compression scheme one can overcome these
-problems and the ever growing memory and storage capacities do the rest.
+than necessary if using a specific character set for these scripts.
+Methods like the Unicode compression scheme can alleviate these
+problems.
 @end itemize

-The question remaining now is: how to select the character set or
-encoding to use. The answer is mostly: you cannot decide about it
-yourself, it is decided by the developers of the system or the majority
-of the users. Since the goal is interoperability one has to use
-whatever the other people one works with use. If there are no
-constraints the selection is based on the requirements the expected
-circle of users will have. I.e., if a project is expected to only be
-used in, say, Russia it is fine to use KOI8-R or a similar character
-set. But if at the same time people from, say, Greek are participating
-one should use a character set which allows all people to collaborate.
+The question remaining is: how to select the character set or encoding
+to use. The answer: you cannot decide about it yourself, it is decided
+by the developers of the system or the majority of the users. Since the
+goal is interoperability one has to use whatever the other people one
+works with use. If there are no constraints the selection is based on
+the requirements the expected circle of users will have. I.e., if a
+project is expected to only be used in, say, Russia it is fine to use
+KOI8-R or a similar character set. But if at the same time people from,
+say, Greek are participating one should use a character set which allows
+all people to collaborate.

-A general advice here could be: go with the most general character set,
-namely @w{ISO 10646}. Use UTF-8 as the external encoding and problems
-about users not being able to use their own language adequately are a
-thing of the past.
+The most widely useful solution seems to be: go with the most general
+character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding
+and problems about users not being able to use their own language
+adequately are a thing of the past.

 One final comment about the choice of the wide character representation
 is necessary at this point. We have said above that the natural choice
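Note on the hunk above: the claim that UTF-8 favors Roman-alphabet text can be checked with a small encoder. This is a sketch and not part of the patch; `utf8_encode` is a hypothetical name. It assumes the four-byte limit of the current UTF-8 definition, whereas the encoding as specified for @w{ISO 10646} in 1999 allowed longer sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8.  Returns the number of bytes
   written to OUT (1 to 4), or 0 if CP is a surrogate or lies above
   0x10FFFF.  */
size_t
utf8_encode (uint32_t cp, unsigned char out[4])
{
  assert (out != NULL);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp < 0x110000)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

An ASCII letter takes one byte, a Greek or Cyrillic letter two, and a CJK ideograph three. A dedicated one-byte set such as KOI8-R stores the same Cyrillic letter in a single byte, which is the size penalty the text describes.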
@@ -314,7 +321,7 @@ standard, is unfortunately the least useful one. In fact, these
 functions should be avoided whenever possible, especially when
 developing libraries (as opposed to applications).

-The second family o functions got introduced in the early Unix standards
+The second family of functions got introduced in the early Unix standards
 (XPG2) and is still part of the latest and greatest Unix standard:
 @w{Unix 98}. It is also the most powerful and useful set of functions.
 But we will start with the functions defined in the second amendment to
@@ -370,8 +377,7 @@ We already said above that the currently selected locale for the
 by the functions we are about to describe. Each locale uses its own
 character set (given as an argument to @code{localedef}) and this is the
 one assumed as the external multibyte encoding. The wide character
-character set always is UCS4. So we can see here already where the
-limitations of these conversion functions are.
+character set always is UCS4.

 A characteristic of each multibyte character set is the maximum number
 of bytes which can be necessary to represent one character. This
@@ -425,8 +431,8 @@ The code in the inner loop is expected to have always enough bytes in
 the array @var{buf} to convert one multibyte character. The array
 @var{buf} has to be sized statically since many compilers do not allow a
 variable size. The @code{fread} call makes sure that always
-@code{MB_CUR_MAX} bytes are available in @var{buf}. Note that it is no
-problem if @code{MB_CUR_MAX} is not a compile-time constant.
+@code{MB_CUR_MAX} bytes are available in @var{buf}. Note that it isn't
+a problem if @code{MB_CUR_MAX} is not a compile-time constant.


 @node Keeping the state
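Note on the hunk above: the distinction between the compile-time constant `MB_LEN_MAX` and the locale-dependent `MB_CUR_MAX` can be sketched as follows. This is not part of the patch; the function name `current_mb_max` is hypothetical. The buffer is sized statically with `MB_LEN_MAX`, while the number of bytes that actually matter is queried at run time.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Select LOCALE_NAME for LC_CTYPE and return the MB_CUR_MAX value
   in effect afterwards, or 0 if the locale cannot be selected.  */
size_t
current_mb_max (const char *locale_name)
{
  /* Sized with the compile-time upper bound; MB_CUR_MAX itself may
     not be usable as an array size.  */
  char buf[MB_LEN_MAX];

  /* A null pointer would only query the locale, not select it.  */
  assert (locale_name != NULL);
  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return 0;
  assert (MB_CUR_MAX <= sizeof buf);
  return MB_CUR_MAX;
}
```

Calling `current_mb_max ("C")` returns a small value, while a UTF-8 locale (if one is installed) returns a larger one. In both cases the value never exceeds `MB_LEN_MAX`.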
@@ -546,7 +552,7 @@ is declared in @file{wchar.h}.

 Despite the limitation that the single byte value always is interpreted
 in the initial state this function is actually useful most of the time.
-Most character are either entirely single-byte character sets or they
+Most characters are either entirely single-byte character sets or they
 are extension to ASCII. But then it is possible to write code like this
 (not that this specific example is useful):

@@ -563,19 +569,18 @@ itow (unsigned long int val)
       val /= 10;
     @}
   if (wcp == &buf[29])
-    *--wcp = btowc ('0');
+    *--wcp = L'0';
   return wcp;
 @}
 @end smallexample

-The question is why is it necessary to use such a complicated
-implementation and not simply cast L'0' to a wide character. The answer
-is that there is no guarantee that the compiler knows about the wide
-character set used at runtime. Even if the wide character equivalent of
-a given single-byte character is simply the equivalent to casting a
-single-byte character to @code{wchar_t} this is no guarantee that this
-is the case everywhere.
+Why is it necessary to use such a complicated implementation and not
+simply cast @code{'0' + val %10} to a wide character? The answer is
+that there is no guarantee that one can perform this kind of arithmetic
+on the character of the character set used for @code{wchar_t}
+representation.

+@noindent
 There also is a function for the conversion in the other direction.

 @comment wchar.h
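Note on the hunk above: only the tail of the `itow` example is visible in this diff. The following is a sketch of a complete function in the same spirit, not a copy of the manual's text; the buffer declaration and the loop body are assumptions reconstructed from the visible lines. Each digit goes through `btowc` rather than through arithmetic on wide character values, which is the point the new paragraph makes.

```c
#include <assert.h>
#include <wchar.h>

/* Convert VAL to a wide character decimal string.  The result lives
   in a static buffer and is overwritten by the next call.  */
wchar_t *
itow (unsigned long int val)
{
  static wchar_t buf[30];          /* assumed size, as in "&buf[29]" */
  wchar_t *wcp = &buf[29];

  *wcp = L'\0';
  while (val != 0)
    {
      /* Arithmetic happens on the single-byte digit; btowc maps the
         result into the wide character set.  */
      wint_t digit = btowc ('0' + (int) (val % 10));
      assert (digit != WEOF);
      *--wcp = (wchar_t) digit;
      val /= 10;
    }
  if (wcp == &buf[29])
    *--wcp = L'0';
  return wcp;
}
```

Because the buffer is static, the function is not reentrant; a caller that needs to keep the result has to copy it.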
@@ -897,7 +902,7 @@ the buffer size. Please note the @code{NULL} argument for the
 destination buffer in the new @code{wcrtomb} call; since we are not
 interested in the result at this point this is a nice way to express
 this. The most unusual thing about this piece of code certainly is the
-duplication of the conversion state object. But think about it: if a
+duplication of the conversion state object. But think about this: if a
 change of the state is necessary to emit the next multibyte character we
 want to have the same shift state change performed in the real
 conversion. Therefore we have to preserve the initial shift state
@@ -912,8 +917,8 @@ This example is only meant for educational purposes.
 The functions described in the previous section only convert a single
 character at a time. Most operations to be performed in real-world
 programs include strings and therefore the @w{ISO C} standard also
-defines conversions on entire strings. The defined set of functions is
-quite limited, though. Therefore contains the GNU C library a few
+defines conversions on entire strings. However, the defined set of
+functions is quite limited, thus the GNU C library contains a few
 extensions which are necessary in some important situations.

 @comment wchar.h
@@ -986,19 +991,18 @@ the newline in the original text could be something different than the
 initial shift state and therefore the first character of the next line
 is encoded using this state. But the state in question is never
 accessible to the user since the conversion stops after the NUL byte.
-Fortunately most stateful character sets in use today require that the
-shift state after a newline is the initial state but this is no
+Most stateful character sets in use today require that the shift state
+after a newline is the initial state--but this is not a strict
 guarantee. Therefore simply NUL terminating a piece of a running text
-is not always the adequate solution.
+is not always an adequate solution.

-The generic conversion
-@comment XXX reference to iconv
-interface does not have this limitation (it simply works on buffers, not
-strings) but there is another way. The GNU C library contains a set of
-functions why take additional parameters specifying maximal number of
-bytes which are consumed from the input string. This way the problem of
-above's example could be solved by determining the line length and
-passing this length to the function.
+The generic conversion interface (see @xref{Generic Charset Conversion})
+does not have this limitation (it simply works on buffers, not
+strings),and the GNU C library contains a set of functions which take
+additional parameters specifying the maximal number of bytes which are
+consumed from the input string. This way the problem of
+@code{mbsrtowcs}'s example above could be solved by determining the line
+length and passing this length to the function.

 @comment wchar.h
 @comment ISO
@@ -1065,7 +1069,7 @@ inserting NUL bytes and the effect of NUL bytes on the conversion state.
 @end deftypefun

 A function to convert a multibyte string into a wide character string
-and display it could be written like this (this is no really useful
+and display it could be written like this (this is not a really useful
 example):

 @smallexample
@@ -1092,11 +1096,10 @@ showmbs (const char *src, FILE *fp)
 @}
 @end smallexample

-There is no more problem with the state after a call to
-@code{mbsnrtowcs}. Since we don't insert characters in the strings
-which were not in there right from the beginning and we use @var{state}
-only for the conversion of the given buffer there is no problem with
-mixing the state up.
+There is no problem with the state after a call to @code{mbsnrtowcs}.
+Since we don't insert characters in the strings which were not in there
+right from the beginning and we use @var{state} only for the conversion
+of the given buffer there is no problem with altering the state.

 @comment wchar.h
 @comment GNU
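Note on the hunk above: the idea of passing the line length instead of inserting a NUL byte can be sketched with `mbsnrtowcs`. This is not the manual's `showmbs` example; the wrapper name `convert_line` and its parameters are hypothetical. The input is left untouched and the state object carries over to the next call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Convert at most NBYTES bytes of SRC into DST, which has room for
   DSTLEN wide characters including the terminator.  STATE is updated
   so that a later call can continue with the following bytes.
   Returns the number of wide characters stored, or (size_t) -1 if
   the input contains an invalid sequence.  */
size_t
convert_line (wchar_t *dst, size_t dstlen, const char *src,
              size_t nbytes, mbstate_t *state)
{
  const char *p = src;
  size_t n;

  assert (dst != NULL && dstlen > 0 && state != NULL);
  n = mbsnrtowcs (dst, &p, nbytes, dstlen - 1, state);
  if (n == (size_t) -1)
    return n;
  dst[n] = L'\0';       /* the input had no NUL within NBYTES */
  return n;
}
```

A caller would locate the newline (for instance with `memchr`), pass the distance to it as `nbytes`, and then continue with the next line using the same `state`.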
@@ -1120,7 +1123,7 @@ helps in situations where no NUL terminated input strings are available.
 @subsection A Complete Multibyte Conversion Example

 The example programs given in the last sections are only brief and do
-not contain all the error checking etc. Therefore here comes a complete
+not contain all the error checking etc. Presented here is a complete
 and documented example. It features the @code{mbrtowc} function but it
 should be easy to derive versions using the other functions.

@@ -1216,19 +1219,19 @@ are not described in the first place is that they are almost entirely
 useless.

 The problem is that all the functions for conversion defined in @w{ISO
-C89} use a local state. This does not only mean that multiple
-conversions at the same time (not only when using threads) cannot be
-done. It also means that you cannot first convert single characters and
-the strings since you cannot say the conversion functions which state to
-use.
+C89} use a local state. This implies that multiple conversions at the
+same time (not only when using threads) cannot be done, and that you
+cannot first convert single characters and then strings since you cannot
+tell the conversion functions which state to use.

 These functions are therefore usable only in a very limited set of
-situation. One most complete converting the entire string before
+situations. One most complete converting the entire string before
 starting a new one and each string/text must be converted with the same
 function (there is no problem with the library itself; it is guaranteed
 that no library function changes the state of any of these functions).
-For these reasons it is @emph{highly} requested to use the functions
-from the last section.
+@strong{For the above reasons it is highly requested that the functions
+from the last section are used in place of non-reentrant conversion
+functions.}

 @menu
 * Non-reentrant Character Conversion:: Non-reentrant Conversion of Single
@@ -1456,13 +1459,13 @@ scan_string (char *s)
 @{
   int length = strlen (s);

-  /* @r{Initialize shift state.} */
+  /* @r{Initialize shift state.} */
   mblen (NULL, 0);

   while (1)
     @{
       int thischar = mblen (s, length);
-      /* @r{Deal with end of string and invalid characters.} */
+      /* @r{Deal with end of string and invalid characters.} */
       if (thischar == 0)
         break;
       if (thischar == -1)
@@ -1470,7 +1473,7 @@ scan_string (char *s)
           error ("invalid multibyte character");
           break;
         @}
-      /* @r{Advance past this character.} */
+      /* @r{Advance past this character.} */
       s += thischar;
       length -= thischar;
     @}
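Note on the two hunks above: `scan_string` relies on the hidden state inside `mblen`. A restartable counterpart with an explicit state object might look like the following sketch, which is not part of the patch; the name `count_chars` is hypothetical. It follows the recommendation earlier in this diff to prefer the reentrant functions.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Count the multibyte characters in S, keeping the shift state in a
   local object instead of inside the library.  Returns (size_t) -1
   if S contains an invalid or incomplete sequence.  */
size_t
count_chars (const char *s)
{
  mbstate_t state;
  size_t remaining;
  size_t count = 0;

  assert (s != NULL);
  remaining = strlen (s);
  memset (&state, '\0', sizeof state);   /* initial shift state */

  while (remaining > 0)
    {
      size_t len = mbrlen (s, remaining, &state);
      if (len == (size_t) -1 || len == (size_t) -2)
        return (size_t) -1;
      if (len == 0)
        break;                           /* reached a NUL character */
      s += len;
      remaining -= len;
      ++count;
    }
  return count;
}
```

Since the state is local, two such scans can run at the same time, in one thread or in several, without disturbing each other.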
@@ -1491,7 +1494,7 @@ common that they operate on character sets which are not directly
 specified by the functions. The multibyte encoding used is specified by
 the currently selected locale for the @code{LC_CTYPE} category. The
 wide character set is fixed by the implementation (in the case of GNU C
-library it always is @w{ISO 10646}.
+library it always is UCS4 encoded @w{ISO 10646}.

 This has of course several problems when it comes to general character
 conversion:
@@ -1533,12 +1536,12 @@ source and destination. Only the set of available conversions is
 limiting them. The standard does not specify that any conversion at all
 must be available. It is a measure of the quality of the implementation.

-In the following text first the interface will be described. It is here
-shortly named @code{iconv}-interface after the name of the conversion
-function. Then the implementation is described as far as interesting to
-the advanced user who wants to extend the conversion capabilities.
-Comparisons with other implementations will show what trapfalls lie on
-the way of portable applications.
+In the following text first the interface to @code{iconv}, the
+conversion function, will be described. Comparisons with other
+implementations will show what pitfalls lie on the way of portable
+applications. At last, the implementation is described as far as
+interesting to the advanced user who wants to extend the conversion
+capabilities.

 @menu
 * Generic Conversion Interface:: Generic Character Set Conversion Interface.
@@ -1603,8 +1606,7 @@ The conversion from @var{fromcode} to @var{tocode} is not supported.
 It is not possible to use the same descriptor in different threads to
 perform independent conversions. Within the data structures associated
 with the descriptor there is information about the conversion state.
-This must of course not be messed up by using it in different
-conversions.
+This must not be messed up by using it in different conversions.

 An @code{iconv} descriptor is like a file descriptor as for every use a
 new descriptor must be created. The descriptor does not stand for all
@@ -1631,8 +1633,8 @@ effect.
 @pindex iconv.h
 This function got introduced early in the X/Open Portability Guide,
 @w{version 2}. It is supported by all commercial Unices as it is
-required for the Unix branding. The quality and completeness of the
-implementation varies widely, though. The function is declared in
+required for the Unix branding. However, the quality and completeness
+of the implementation varies widely. The function is declared in
 @file{iconv.h}.
 @end deftypefun

@@ -1759,11 +1761,11 @@ This function was introduced in the XPG2 standard and is declared in the
 The definition of the @code{iconv} function is quite good overall. It
 provides quite flexible functionality. The only problems lie in the
 boundary cases which are incomplete byte sequences at the end of the
-input buffer and invalid input. A third problem, which is not really a
-design problem, is the way conversions are selected. The standard does
-not say anything about the legitimate names, a minimal set of available
-conversions. We will see how this has negative impacts in the
-discussion of other implementations further down.
+input buffer and invalid input. A third problem, which is not really
+a design problem, is the way conversions are selected. The standard
+does not say anything about the legitimate names, a minimal set of
+available conversions. We will see how this negatively impacts other
+implementations, as is demonstrated below.


 @node iconv Examples
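Note on the hunk above: a minimal use of the `iconv` interface, covering the open, convert and close steps, could look like this sketch. It is not part of the patch. The wrapper name `utf8_to_ucs4le` is hypothetical, and the encoding names `"UTF-8"` and `"UCS-4LE"` are an assumption: as the text points out, the standard does not fix which names an implementation accepts.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Convert INLEN bytes of UTF-8 text at IN into little-endian UCS4 at
   OUT (capacity OUTLEN bytes).  Returns the number of bytes written,
   or (size_t) -1 if the conversion is unavailable or fails.  */
size_t
utf8_to_ucs4le (const char *in, size_t inlen, char *out, size_t outlen)
{
  iconv_t cd;
  char *inp = (char *) in;      /* iconv takes a non-const pointer */
  char *outp = out;
  size_t outleft = outlen;
  size_t rc;

  assert (in != NULL && out != NULL);
  cd = iconv_open ("UCS-4LE", "UTF-8");   /* note: tocode comes first */
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &inp, &inlen, &outp, &outleft);
  iconv_close (cd);

  /* On failure errno tells the boundary cases apart: EILSEQ for
     invalid input, EINVAL for an incomplete sequence at the end of
     the input, E2BIG for a full output buffer.  */
  if (rc == (size_t) -1)
    return (size_t) -1;
  return outlen - outleft;
}
```

Each call opens its own descriptor, so the conversion state is never shared between independent conversions. A program converting a long stream would instead keep one descriptor open and feed it buffer after buffer.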
@@ -1904,8 +1906,8 @@ of the @code{iconv} functions can lead to portability issues.
 The first thing to notice is that due to the large number of character
 sets in use it is certainly not practical to encode the conversions
 directly in the C library. Therefore the conversion information must
-come from files outside the C library. This is usually in one or both
-of the following ways:
+come from files outside the C library. This is usually done in one or
+both of the following ways:

 @itemize @bullet
 @item
@@ -1913,9 +1915,9 @@ The C library contains a set of generic conversion functions which can
 read the needed conversion tables and other information from data files.
 These files get loaded when necessary.

-This solution is problematic as it is only with very much effort
-applicable to all character set (maybe it is even impossible). The
-differences in structure of the different character sets is so large
+This solution is problematic as it requires a great deal of effort to
+apply to all character sets (potentially an infinite set). The
+differences in the structure of the different character sets is so large
 that many different variants of the table processing functions must be
 developed. On top of this the generic nature of these functions make
 them slower than specifically implemented functions.
@ -1933,27 +1935,27 @@ dynamic loading must be available.
|
||||
@end itemize
|
||||
|
||||
Some implementations in commercial Unices implement a mixture of these
|
||||
possibilities, the majority only the second solution. This often leads
|
||||
to problems, though. Since the modules with the conversion modules must
|
||||
be dynamically loaded the system must have this possibility for all
|
||||
programs. But this is not the case. At least some platforms (if not
|
||||
all) are not able to dynamically load objects if the program is linked
|
||||
statically. This is often solved by outlawing static linking entirely
|
||||
but sure it is a weak solution. The GNU C library does not have this
|
||||
restriction though it also uses dynamic loading. The danger is that one
|
||||
get acquainted with this and forgets about the restriction on other
|
||||
systems.
|
||||
these possibilities, the majority only the second solution. Using
|
||||
loadable modules moves the code out of the library itself and keeps the
|
||||
door open for extensions and improvements. But this design is also
|
||||
limiting on some platforms since not many platforms support dynamic
|
||||
loading in statically linked programs.  On platforms without this
|
||||
capability it is therefore not possible to use this interface in
|
||||
statically linked programs. The GNU C library has on ELF platforms no
|
||||
problems with dynamic loading in these situations and therefore this
|
||||
point is moot.  The danger is that one gets acquainted with this and
|
||||
forgets about the restrictions on other systems.
|
||||
|
||||
A second thing to know about other @code{iconv} implementations is that
|
||||
the number of available conversions is often very limited. Some
|
||||
implementations provide in the standard release (not the special
|
||||
international release, if something exists) at most 100 to 200
|
||||
conversion possibilities. This does not mean 200 different character
|
||||
sets are supported. E.g., conversions from one character set to a set
|
||||
of, say, 10 others counts as 10 conversion. Together with the other
|
||||
direction this makes already 20. One can imagine the thin coverage
|
||||
these platform provide. Some Unix vendors even provide only a handful
|
||||
of conversions which renders them useless for almost all uses.
|
||||
implementations provide in the standard release (not special
|
||||
international or developer releases) at most 100 to 200 conversion
|
||||
possibilities. This does not mean 200 different character sets are
|
||||
supported. E.g., conversions from one character set to a set of, say,
|
||||
10 others count as 10 conversions.  Together with the other direction
|
||||
this already makes 20.  One can imagine the thin coverage these platforms
|
||||
provide. Some Unix vendors even provide only a handful of conversions
|
||||
which renders them useless for almost all uses.
|
||||
|
||||
This directly leads to a third and probably the most problematic point.
|
||||
The way the @code{iconv} conversion functions are implemented on all
|
||||
@ -1976,10 +1978,10 @@ does fail according to the assumption above. But what does the program
|
||||
do now? The conversion is really necessary and therefore simply giving
|
||||
up is no possibility.
|
||||
|
||||
First this is of course a nuisance. The @code{iconv} function should
|
||||
take care of this. But second, how should the program proceed from here
|
||||
on? If it would try to convert to character set @math{@cal{B}} first
|
||||
the two @code{iconv_open} calls
|
||||
This is a nuisance. The @code{iconv} function should take care of this.
|
||||
But how should the program proceed from here on? If it would try to
|
||||
convert to character set @math{@cal{B}} first the two @code{iconv_open}
|
||||
calls
|
||||
|
||||
@smallexample
|
||||
cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}");
|
||||
@ -1995,10 +1997,10 @@ cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}");
|
||||
@noindent
|
||||
will succeed but how to find @math{@cal{B}}?
|
||||
|
||||
The answer is unfortunately: there is no general solution. On some
|
||||
Unfortunately, the answer is: there is no general solution. On some
|
||||
systems guessing might help. On those systems most character sets can
|
||||
convert to and from UTF8 encoded @w{ISO 10646} or Unicode text. Beside
|
||||
this only some very system-specific methods can help. Since the
|
||||
convert to and from UTF8 encoded @w{ISO 10646} or Unicode text.
|
||||
Beside this only some very system-specific methods can help. Since the
|
||||
conversion functions come from loadable modules and these modules must
|
||||
be stored somewhere in the filesystem, one @emph{could} try to find them
|
||||
and determine from the available file which conversions are available
|
||||
@ -2016,12 +2018,12 @@ routes.
|
||||
@subsection The @code{iconv} Implementation in the GNU C library
|
||||
|
||||
After reading about the problems of @code{iconv} implementations in the
|
||||
last section it is certainly good to read here that the implementation
|
||||
in the GNU C library has none of the problems mentioned above. But step
|
||||
by step now. We will now address the points raised above. The
|
||||
last section it is certainly good to note that the implementation in
|
||||
the GNU C library has none of the problems mentioned above. What
|
||||
follows is a step-by-step analysis of the points raised above. The
|
||||
evaluation is based on the current state of the development (as of
|
||||
January 1999). The development of the @code{iconv} functions is not
|
||||
entirely finished by now but things can only get better.
|
||||
complete, but basic functionality has solidified.
|
||||
|
||||
The GNU C library's @code{iconv} implementation uses shared loadable
|
||||
modules to implement the conversions. A very small number of
|
||||
@ -2029,48 +2031,50 @@ conversions are built into the library itself but these are only rather
|
||||
trivial conversions.
|
||||
|
||||
All the benefits of loadable modules are available in the GNU C library
|
||||
implementation. This is especially interesting since the interface is
|
||||
implementation. This is especially appealing since the interface is
|
||||
well documented (see below) and it therefore is easy to write new
|
||||
conversion modules. The drawback of using loadable object is not a
|
||||
conversion modules. The drawback of using loadable objects is not a
|
||||
problem in the GNU C library, at least on ELF systems. Since the
|
||||
library is able to load shared objects even in statically linked
|
||||
binaries this means that static linking needs not to be forbidden in case
|
||||
one wants to use @code{iconv}.
|
||||
binaries this means that static linking needs not to be forbidden in
|
||||
case one wants to use @code{iconv}.
|
||||
|
||||
The second mentioned problems is the number of supported conversions.
|
||||
First, the GNU C library supports more than 150 character sets. And the
|
||||
The second mentioned problem is the number of supported conversions.
|
||||
Currently, the GNU C library supports more than 150 character sets. The
|
||||
way the implementation is designed the number of supported conversions
|
||||
is greater than 22350 (@math{150} times @math{149}). If any conversion
|
||||
from or to a character set is missing it can easily be added.
|
||||
|
||||
This high number is due to the fact that the GNU C library
|
||||
implementation of @code{iconv} does not have the third problem mentioned
|
||||
above. I.e., whenever there is a conversion from a character set
|
||||
@math{@cal{A}} to @math{@cal{B}} and from @math{@cal{B}} to
|
||||
@math{@cal{C}} it is always possible to convert from @math{@cal{A}} to
|
||||
@math{@cal{C}} directly. If the @code{iconv_open} returns an error and
|
||||
sets @code{errno} to @code{EINVAL} this really means there is no known
|
||||
way, directly or indirectly, to perform the wanted conversion.
|
||||
Particularly impressive as it may be, this high number is due to the
|
||||
fact that the GNU C library implementation of @code{iconv} does not have
|
||||
the third problem mentioned above. I.e., whenever there is a conversion
|
||||
from a character set @math{@cal{A}} to @math{@cal{B}} and from
|
||||
@math{@cal{B}} to @math{@cal{C}} it is always possible to convert from
|
||||
@math{@cal{A}} to @math{@cal{C}} directly. If the @code{iconv_open}
|
||||
returns an error and sets @code{errno} to @code{EINVAL} this really
|
||||
means there is no known way, directly or indirectly, to perform the
|
||||
wanted conversion.
|
||||
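As an illustration of the @code{EINVAL} behaviour described above, the
following is a minimal sketch of probing whether a conversion exists.
The helper name @code{conversion_available} is hypothetical and not part
of any library; only @code{iconv_open} and @code{iconv_close} are real
interfaces, and the sketch assumes a system that provides
@file{iconv.h}.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Returns 1 if a conversion from FROM to TO can be opened,
   0 if iconv_open reports EINVAL (no known conversion),
   and -1 for any other failure (e.g. resource exhaustion).  */
int
conversion_available (const char *to, const char *from)
{
  assert (to != NULL && from != NULL);

  /* Note the argument order: destination first, then source.  */
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return errno == EINVAL ? 0 : -1;

  /* The descriptor was only needed for the probe.  */
  iconv_close (cd);
  return 1;
}
```

A caller might use @code{conversion_available ("EUC-JP", "ISO-2022-JP")}
before committing to a conversion; whether that particular pair succeeds
depends on the conversion modules installed on the system.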
|
||||
@cindex triangulation
|
||||
This is achieved by providing for each character set a conversion from
|
||||
and to UCS4 encoded @w{ISO 10646}. Using @w{ISO 10646} as an
|
||||
intermediate representation it is possible to ``triangulate''.
|
||||
intermediate representation it is possible to @dfn{triangulate}, i.e.,
|
||||
converting with an intermediate representation.
|
||||
|
||||
There is no inherent requirement to provide a conversion to @w{ISO
|
||||
10646} for a new character set and it is also possible to provide other
|
||||
conversions where neither source not destination character set is @w{ISO
|
||||
conversions where neither source nor destination character set is @w{ISO
|
||||
10646}. The currently existing set of conversions is simply meant to
|
||||
convert all conversions which might be of interest. What could be done
|
||||
in future is improving the speed of certain conversions.
|
||||
cover all conversions which might be of interest.
|
||||
|
||||
@cindex ISO-2022-JP
|
||||
@cindex EUC-JP
|
||||
Since all currently available conversions use the triangulation methods
|
||||
often used conversion run unnecessarily slow. If, e.g., somebody often
|
||||
needs the conversion from ISO-2022-JP to EUC-JP it is not the best way
|
||||
to convert the input to @w{ISO 10646} first. The two character sets of
|
||||
interest are much more similar to each other than to @w{ISO 10646}.
|
||||
All currently available conversions use the triangulation method above,
|
||||
making conversion run unnecessarily slow. If, e.g., somebody often
|
||||
needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution
|
||||
would involve direct conversion between the two character sets, skipping
|
||||
the input to @w{ISO 10646} first. The two character sets of interest
|
||||
are much more similar to each other than to @w{ISO 10646}.
|
||||
|
||||
In such a situation one can easily write a new conversion and provide it
|
||||
as a better alternative. The GNU C library @code{iconv} implementation
|
||||
@ -2124,7 +2128,7 @@ relative values of the sums of costs for all possible conversion paths.
|
||||
Below is a more precise description of the use of the cost value.
|
||||
@end itemize
|
||||
|
||||
Coming back to the example where one has written a module to directly
|
||||
Returning to the example above where one has written a module to directly
|
||||
convert from ISO-2022-JP to EUC-JP and back. All what has to be done is
|
||||
to put the new module, be its name ISO2022JP-EUCJP.so, in a directory
|
||||
and add a file @file{gconv-modules} with the following content in the
|
||||
@ -2135,8 +2139,8 @@ module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1
|
||||
module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1
|
||||
@end smallexample
|
||||
|
||||
To see why this is enough it is necessary to understand how the
|
||||
conversion used by @code{iconv} and described in the descriptor is
|
||||
To see why this is sufficient, it is necessary to understand how the
|
||||
conversion used by @code{iconv} (and described in the descriptor) is
|
||||
selected. The approach to this problem is quite simple.
|
||||
|
||||
At the first call of the @code{iconv_open} function the program reads
|
||||
@ -2148,30 +2152,33 @@ them.
|
||||
@subsubsection Finding the conversion path in @code{iconv}
|
||||
|
||||
The set of available conversions form a directed graph with weighted
|
||||
edges. The weights on the edges are of course the costs specified in
|
||||
the @file{gconv-modules} files. The @code{iconv_open} function
|
||||
therefore uses an algorithm suitable to search for the best path in such
|
||||
a graph and so constructs a list of conversions which must be performed
|
||||
in succession to get the transformation from the source to the
|
||||
destination character set.
|
||||
edges. The weights on the edges are the costs specified in the
|
||||
@file{gconv-modules} files. The @code{iconv_open} function uses an
|
||||
algorithm suitable for searching for the best path in such a graph and so
|
||||
constructs a list of conversions which must be performed in succession
|
||||
to get the transformation from the source to the destination character
|
||||
set.
|
||||
|
||||
Now it can be easily seen why the above @file{gconv-modules} files
|
||||
allows the @code{iconv} implementation to pick up the specific
|
||||
ISO-2022-JP to EUC-JP conversion module instead of the conversion coming
|
||||
with the library itself. Since the later conversion takes two steps
|
||||
(from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to
|
||||
Explaining why the above @file{gconv-modules} files allows the
|
||||
@code{iconv} implementation to resolve the specific ISO-2022-JP to
|
||||
EUC-JP conversion module instead of the conversion coming with the
|
||||
library itself is straightforward.  Since the latter conversion takes two
|
||||
steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to
|
||||
EUC-JP) the cost is @math{1+1 = 2}. But the above @file{gconv-modules}
|
||||
file specifies that the new conversion modules can perform this
|
||||
conversion with only the cost of @math{1}.
|
||||
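The cost comparison above can be sketched as a search over a small
weighted directed graph.  This is only an illustration under stated
assumptions: the text does not say which search algorithm the library
uses, so a plain Bellman-Ford relaxation is substituted here for
brevity, and all names (@code{struct edge}, @code{cheapest_cost}, the
sample tables and the numeric character set ids) are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_SETS 8   /* Arbitrary limit for this sketch.  */

/* Hypothetical ids for the three character sets in the example.  */
enum { ISO2022JP = 0, ISO10646 = 1, EUCJP = 2 };

/* One conversion step with its cost, as a gconv-modules line would give.  */
struct edge { int from, to, cost; };

/* Only the conversions via ISO 10646, each with cost 1.  */
const struct edge triangulated_only[] = {
  { ISO2022JP, ISO10646, 1 }, { ISO10646, ISO2022JP, 1 },
  { ISO10646, EUCJP, 1 },     { EUCJP, ISO10646, 1 },
};

/* The same table plus the direct module in both directions.  */
const struct edge with_direct_module[] = {
  { ISO2022JP, ISO10646, 1 }, { ISO10646, ISO2022JP, 1 },
  { ISO10646, EUCJP, 1 },     { EUCJP, ISO10646, 1 },
  { ISO2022JP, EUCJP, 1 },    { EUCJP, ISO2022JP, 1 },
};

/* Return the smallest summed cost of any path from SRC to DST,
   or -1 if DST cannot be reached.  Costs must be non-negative.  */
int
cheapest_cost (const struct edge *edges, size_t nedges,
               int nsets, int src, int dst)
{
  int dist[MAX_SETS];

  assert (nsets > 0 && nsets <= MAX_SETS);
  for (int i = 0; i < nsets; ++i)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* Bellman-Ford: nsets - 1 rounds of relaxing every edge.  */
  for (int round = 0; round < nsets - 1; ++round)
    for (size_t e = 0; e < nedges; ++e)
      if (dist[edges[e].from] != INT_MAX
          && dist[edges[e].from] + edges[e].cost < dist[edges[e].to])
        dist[edges[e].to] = dist[edges[e].from] + edges[e].cost;

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

With the first table the cheapest route from ISO-2022-JP to EUC-JP
costs 2 (two steps via @w{ISO 10646}); with the second table the direct
module wins with a cost of 1, which is why adding the two
@code{module} lines is enough for it to be preferred.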
|
||||
A bit mysterious about the @file{gconv-modules} file above (and also the
|
||||
file coming with the GNU C library) are the names of the character sets
|
||||
specified in the @code{module} lines. Why do almost all the names end
|
||||
in @code{//}? And this is not all: the names can actually be regular
|
||||
expressions. At this point of time this mystery should not be revealed.
|
||||
Sorry! @strong{The part of the implementation where this is used is not
|
||||
yet finished. For now please simply follow the existing examples.
|
||||
It'll become clearer once it is. --drepper}
|
||||
A mysterious piece about the @file{gconv-modules} file above (and also
|
||||
the file coming with the GNU C library) are the names of the character
|
||||
sets specified in the @code{module} lines. Why do almost all the names
|
||||
end in @code{//}? And this is not all: the names can actually be
|
||||
regular expressions. At this point of time this mystery should not be
|
||||
revealed, unless you have the relevant spell-casting materials: ashes
|
||||
from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix
|
||||
blessed by St.@: Emacs, assorted herbal roots from Central America, sand
|
||||
from Cebu, etc. Sorry! @strong{The part of the implementation where
|
||||
this is used is not yet finished. For now please simply follow the
|
||||
existing examples. It'll become clearer once it is. --drepper}
|
||||
|
||||
A last remark about the @file{gconv-modules} is about the names not
|
||||
ending with @code{//}. There often is a character set named
|
||||
@ -2588,10 +2595,10 @@ gconv_end (struct gconv_step *data)
|
||||
@end smallexample
|
||||
@end deftypevr
|
||||
|
||||
The most important function of course is the conversion function itself.
|
||||
It can get quite complicated for complex character sets. But since this
|
||||
is not of interest here we will only describe a possible skeleton for
|
||||
the conversion function.
|
||||
The most important function is the conversion function itself. It can
|
||||
get quite complicated for complex character sets. But since this is not
|
||||
of interest here we will only describe a possible skeleton for the
|
||||
conversion function.
|
||||
|
||||
@comment gconv.h
|
||||
@comment GNU
|
||||
|
@ -1858,7 +1858,7 @@ treated with care.
|
||||
This structure is used to control and monitor kernel time in a greater
|
||||
level of detail. It contains the following members:
|
||||
@table @code
|
||||
@item unsigned int mode
|
||||
@item unsigned int modes
|
||||
This variable controls whether and which values are set. Several
|
||||
symbolic constants have to be combined with @emph{binary or} to specify
|
||||
the effective mode. These constants start with @code{MOD_}.
|
||||
@ -1866,7 +1866,7 @@ the effective mode. These constants start with @code{MOD_}.
|
||||
@item long int offset
|
||||
This value indicates the current offset of the local clock from the true
|
||||
time. The value is given in microseconds. If bit @code{MOD_OFFSET} is
|
||||
set in @code{mode}, the offset (and possibly other dependent values) can
|
||||
set in @code{modes}, the offset (and possibly other dependent values) can
|
||||
be set. The offset's absolute value must not exceed @code{MAXPHASE}.
|
||||
|
||||
@item long int frequency
|
||||
@ -1946,7 +1946,9 @@ exceeded the threshold.
|
||||
@deftypefun int ntp_adjtime (struct timex *@var{tptr})
|
||||
The @code{ntp_adjtime} function sets the structure specified by
|
||||
@var{tptr} to current values. In addition, values passed in @var{tptr}
|
||||
can be used to replace existing settings.
|
||||
can be used to replace existing settings. To do this the @code{modes}
|
||||
element of the @code{struct timex} must be set appropriately. Setting
|
||||
it to zero selects reading the current state.
|
||||
|
||||
The return value is @code{0} on success and other values on failure. The
|
||||
following @code{errno} error conditions are defined for this function:
|
||||
|