mirror of
https://sourceware.org/git/glibc.git
synced 2024-11-08 22:30:07 +00:00
manual: fix typos in character set handling
parent f0f308c104
commit d987d2196b
@ -1,3 +1,7 @@
|
||||
2016-06-16 Rical Jasan <ricaljasan@pacific.net>
|
||||
|
||||
* manual/charset.texi: Fix typos & grammar errors.
|
||||
|
||||
2016-06-16 Rical Jasan <ricaljasan@pacific.net>
|
||||
|
||||
* manual/string.texi: Fix typos.
|
||||
|
@ -31,7 +31,7 @@ library to support multiple character sets.
|
||||
@node Extended Char Intro
|
||||
@section Introduction to Extended Characters
|
||||
|
||||
A variety of solutions is available to overcome the differences between
|
||||
A variety of solutions are available to overcome the differences between
|
||||
character sets with a 1:1 relation between bytes and characters and
|
||||
character sets with ratios of 2:1 or 4:1. The remainder of this
|
||||
section gives a few examples to help understand the design decisions
|
||||
@ -202,7 +202,7 @@ defined in @file{wchar.h}.
|
||||
@end deftypevr
|
||||
|
||||
|
||||
These internal representations present problems when it comes to storing
|
||||
These internal representations present problems when it comes to storage
|
||||
and transmittal. Because each single wide character consists of more
|
||||
than one byte, they are affected by byte-ordering. Thus, machines with
|
||||
different endianesses would see different values when accessing the same
|
||||
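Illustration (not part of the patch): the byte-ordering problem described in this hunk can be sidestepped by fixing the byte order when a wide character value is stored or transmitted. The sketch below assumes 32-bit character values; `store_be32` and `load_be32` are hypothetical helper names, not functions of the C library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: write VALUE into OUT in big-endian order,
   regardless of the byte order of the host.  */
static void
store_be32 (unsigned char out[4], uint32_t value)
{
  out[0] = (unsigned char) (value >> 24);
  out[1] = (unsigned char) (value >> 16);
  out[2] = (unsigned char) (value >> 8);
  out[3] = (unsigned char) value;
}

/* Hypothetical helper: read back a value written by store_be32.  */
static uint32_t
load_be32 (const unsigned char in[4])
{
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

Because the shifts operate on values rather than on memory, both helpers behave identically on little-endian and big-endian machines.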
@ -389,7 +389,7 @@ the conversion is necessary take a look at the @code{iconv} functions
|
||||
@subsection Selecting the conversion and its properties
|
||||
|
||||
We already said above that the currently selected locale for the
|
||||
@code{LC_CTYPE} category decides about the conversion that is performed
|
||||
@code{LC_CTYPE} category decides the conversion that is performed
|
||||
by the functions we are about to describe. Each locale uses its own
|
||||
character set (given as an argument to @code{localedef}) and this is the
|
||||
one assumed as the external multibyte encoding. The wide character
|
||||
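Illustration (not part of the patch): since the `LC_CTYPE` category selects the external multibyte encoding, a program can switch the category and then ask how long a multibyte character can get. This is a minimal sketch; `max_bytes_in_locale` is a hypothetical name, and which locale names are installed depends on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: select the LC_CTYPE locale NAME and return the
   maximum number of bytes in one multibyte character of that locale,
   or 0 if the locale cannot be selected.  */
static size_t
max_bytes_in_locale (const char *name)
{
  if (setlocale (LC_CTYPE, name) == NULL)
    return 0;
  return MB_CUR_MAX;
}
```

Passing `""` as the name would pick the locale from the environment, which is what most programs want at startup.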
@ -549,7 +549,7 @@ necessary output code (@pxref{Converting Strings}). Please note that with
|
||||
@theglibc{} it is not necessary to perform this extra action for the
|
||||
conversion from multibyte text to wide character text since the wide
|
||||
character encoding is not stateful. But there is nothing mentioned in
|
||||
any standard that prohibits making @code{wchar_t} using a stateful
|
||||
any standard that prohibits making @code{wchar_t} use a stateful
|
||||
encoding.
|
||||
|
||||
@node Converting a Character
|
||||
@ -559,7 +559,7 @@ The most fundamental of the conversion functions are those dealing with
|
||||
single characters. Please note that this does not always mean single
|
||||
bytes. But since there is very often a subset of the multibyte
|
||||
character set that consists of single byte sequences, there are
|
||||
functions to help with converting bytes. Frequently, ASCII is a subpart
|
||||
functions to help with converting bytes. Frequently, ASCII is a subset
|
||||
of the multibyte character set. In such a scenario, each ASCII character
|
||||
stands for itself, and all other characters have at least a first byte
|
||||
that is beyond the range @math{0} to @math{127}.
|
||||
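Illustration (not part of the patch): the claim that each ASCII character stands for itself can be checked with `btowc`. The sketch assumes that `wchar_t` represents ASCII characters by their ASCII values, as is the case with UCS-4; `byte_stands_for_itself` is a hypothetical name.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: return nonzero if the byte C, interpreted in the
   initial shift state, is a complete character whose wide character
   value equals the byte value.  */
static int
byte_stands_for_itself (unsigned char c)
{
  wint_t wc = btowc (c);
  return wc != WEOF && wc == (wint_t) c;
}
```

For bytes above 127 the result depends on the selected locale, so only the ASCII range can be relied on here.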
@ -596,7 +596,7 @@ and is declared in @file{wchar.h}.
|
||||
Despite the limitation that the single byte value is always interpreted
|
||||
in the initial state, this function is actually useful most of the time.
|
||||
Most characters are either entirely single-byte character sets or they
|
||||
are extension to ASCII. But then it is possible to write code like this
|
||||
are extensions to ASCII. But then it is possible to write code like this
|
||||
(not that this specific example is very useful):
|
||||
|
||||
@smallexample
|
||||
@ -643,7 +643,7 @@ value of this function is this character. Otherwise the return value is
|
||||
is declared in @file{wchar.h}.
|
||||
@end deftypefun
|
||||
|
||||
There are more general functions to convert single character from
|
||||
There are more general functions to convert single characters from
|
||||
multibyte representation to wide characters and vice versa. These
|
||||
functions pose no limit on the length of the multibyte representation
|
||||
and they also do not require it to be in the initial state.
|
||||
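Illustration (not part of the patch): a minimal sketch of one of the more general functions, `mbrtowc`, converting the first character of a multibyte string starting from the initial state. `first_wide_char` is a hypothetical wrapper name.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical wrapper: convert the first multibyte character found in
   the N bytes at S, starting in the initial state, and store it in *WC.
   The return value is that of mbrtowc: the number of bytes consumed,
   0 for the NUL character, or (size_t) -1 / (size_t) -2 on error or
   incomplete input.  */
static size_t
first_wide_char (const char *s, size_t n, wchar_t *wc)
{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));   /* initial state */
  return mbrtowc (wc, s, n, &state);
}
```

A real caller would keep the `mbstate_t` object across calls so that characters split over several reads can be completed.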
@ -731,7 +731,7 @@ bytes is adjusted.
|
||||
|
||||
The only non-obvious thing about @code{mbrtowc} might be the way memory
|
||||
is allocated for the result. The above code uses the fact that there
|
||||
can never be more wide characters in the converted results than there are
|
||||
can never be more wide characters in the converted result than there are
|
||||
bytes in the multibyte input string. This method yields a pessimistic
|
||||
guess about the size of the result, and if many wide character strings
|
||||
have to be constructed this way or if the strings are long, the extra
|
||||
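Illustration (not part of the patch): the pessimistic size guess described in this hunk, one wide character per input byte, can be sketched as follows. Unlike the manual's example, this sketch uses `mbsrtowcs` rather than a loop over `mbrtowc`; `mbs_to_wcs_dup` is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return a newly allocated wide character copy of
   the multibyte string S, or NULL on allocation or conversion failure.
   The caller frees the result.  */
static wchar_t *
mbs_to_wcs_dup (const char *s)
{
  /* Pessimistic bound: never more wide characters than bytes,
     plus one for the terminating NUL wide character.  */
  size_t bound = strlen (s) + 1;
  wchar_t *result = malloc (bound * sizeof (wchar_t));
  const char *src = s;
  mbstate_t state;

  if (result == NULL)
    return NULL;
  memset (&state, '\0', sizeof (state));
  if (mbsrtowcs (result, &src, bound, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

For text in which most characters occupy several bytes, the buffer is larger than needed, which is the cost the paragraph above refers to.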
@ -813,7 +813,7 @@ Therefore, the @code{mbrlen} function will never read invalid memory.
|
||||
|
||||
Now that this function is available (just to make this clear, this
|
||||
function is @emph{not} part of @theglibc{}) we can compute the
|
||||
number of wide character required to store the converted multibyte
|
||||
number of wide characters required to store the converted multibyte
|
||||
character string @var{s} using
|
||||
|
||||
@smallexample
|
||||
@ -879,7 +879,7 @@ multibyte'') converts a single wide character into a multibyte string
|
||||
corresponding to that wide character.
|
||||
|
||||
If @var{s} is a null pointer, the function resets the state stored in
|
||||
the objects pointed to by @var{ps} (or the internal @code{mbstate_t}
|
||||
the object pointed to by @var{ps} (or the internal @code{mbstate_t}
|
||||
object) to the initial state. This can also be achieved by a call like
|
||||
this:
|
||||
|
||||
@ -1020,7 +1020,7 @@ extensions that can help in some important situations.
|
||||
@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
|
||||
@safety{@prelim{}@mtunsafe{@mtasurace{:mbsrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
|
||||
The @code{mbsrtowcs} function (``multibyte string restartable to wide
|
||||
character string'') converts a NUL-terminated multibyte character
|
||||
character string'') converts the NUL-terminated multibyte character
|
||||
string at @code{*@var{src}} into an equivalent wide character string,
|
||||
including the NUL wide character at the end. The conversion is started
|
||||
using the state information from the object pointed to by @var{ps} or
|
||||
@ -1061,7 +1061,7 @@ declared in @file{wchar.h}.
|
||||
The definition of the @code{mbsrtowcs} function has one important
|
||||
limitation. The requirement that @var{dst} has to be a NUL-terminated
|
||||
string provides problems if one wants to convert buffers with text. A
|
||||
buffer is normally no collection of NUL-terminated strings but instead a
|
||||
buffer is not normally a collection of NUL-terminated strings but instead a
|
||||
continuous collection of lines, separated by newline characters. Now
|
||||
assume that a function to convert one line from a buffer is needed. Since
|
||||
the line is not NUL-terminated, the source pointer cannot directly point
|
||||
@ -1078,7 +1078,7 @@ guess.
|
||||
@cindex stateful
|
||||
There is still a problem with the method of NUL-terminating a line right
|
||||
after the newline character, which could lead to very strange results.
|
||||
As said in the description of the @code{mbsrtowcs} function above the
|
||||
As said in the description of the @code{mbsrtowcs} function above, the
|
||||
conversion state is guaranteed to be in the initial shift state after
|
||||
processing the NUL byte at the end of the input string. But this NUL
|
||||
byte is not really part of the text (i.e., the conversion state after
|
||||
@ -1110,7 +1110,7 @@ multibyte string'') converts the NUL-terminated wide character string at
|
||||
stores the result in the array pointed to by @var{dst}. The NUL wide
|
||||
character is also converted. The conversion starts in the state
|
||||
described in the object pointed to by @var{ps} or by a state object
|
||||
locally to @code{wcsrtombs} in case @var{ps} is a null pointer. If
|
||||
local to @code{wcsrtombs} in case @var{ps} is a null pointer. If
|
||||
@var{dst} is a null pointer, the conversion is performed as usual but the
|
||||
result is not available. If all characters of the input string were
|
||||
successfully converted and if @var{dst} is not a null pointer, the
|
||||
@ -1123,13 +1123,13 @@ variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.
|
||||
Another reason for a premature stop is if @var{dst} is not a null
|
||||
pointer and the next converted character would require more than
|
||||
@var{len} bytes in total to the array @var{dst}. In this case (and if
|
||||
@var{dest} is not a null pointer) the pointer pointed to by @var{src} is
|
||||
@var{dst} is not a null pointer) the pointer pointed to by @var{src} is
|
||||
assigned a value pointing to the wide character right after the last one
|
||||
successfully converted.
|
||||
|
||||
Except in the case of an encoding error the return value of the
|
||||
@code{wcsrtombs} function is the number of bytes in all the multibyte
|
||||
character sequences stored in @var{dst}. Before returning the state in
|
||||
character sequences stored in @var{dst}. Before returning, the state in
|
||||
the object pointed to by @var{ps} (or the internal object in case
|
||||
@var{ps} is a null pointer) is updated to reflect the state after the
|
||||
last conversion. The state is the initial shift state in case the
|
||||
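Illustration (not part of the patch): when `dst` is a null pointer, `wcsrtombs` performs the conversion without storing anything, which makes it usable for sizing an output buffer. A minimal sketch, with `multibyte_length` as a hypothetical name:

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return the number of bytes the multibyte form of
   WS occupies, not counting the terminating NUL byte, or (size_t) -1
   if WS contains a character that cannot be converted.  */
static size_t
multibyte_length (const wchar_t *ws)
{
  const wchar_t *src = ws;
  mbstate_t state;

  memset (&state, '\0', sizeof (state));
  /* With a null destination the length argument is ignored.  */
  return wcsrtombs (NULL, &src, 0, &state);
}
```

A caller would allocate one byte more than the returned length and then call `wcsrtombs` again with a real destination.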
@ -1158,11 +1158,11 @@ This new parameter specifies how many bytes at most can be used from the
|
||||
multibyte character string. In other words, the multibyte character
|
||||
string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte
|
||||
is found within the @var{nmc} first bytes of the string, the conversion
|
||||
stops here.
|
||||
stops there.
|
||||
|
||||
This function is a GNU extension. It is meant to work around the
|
||||
problems mentioned above. Now it is possible to convert a buffer with
|
||||
multibyte character text piece for piece without having to care about
|
||||
multibyte character text piece by piece without having to care about
|
||||
inserting NUL bytes and the effect of NUL bytes on the conversion state.
|
||||
@end deftypefun
|
||||
|
||||
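Illustration (not part of the patch): a sketch of converting a buffer line by line with `mbsnrtowcs`, without inserting NUL bytes. `convert_line` is a hypothetical name, and the buffer is assumed to hold newline-separated text.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE   /* mbsnrtowcs is not part of ISO C */
#endif
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert one line, including its newline, from
   the text between *SRC and END into at most DSTLEN wide characters at
   DST.  *SRC is advanced past the converted bytes and STATE carries the
   conversion state to the next call.  Returns what mbsnrtowcs returns.  */
static size_t
convert_line (const char **src, const char *end, wchar_t *dst,
              size_t dstlen, mbstate_t *state)
{
  size_t avail = (size_t) (end - *src);
  const char *nl = memchr (*src, '\n', avail);
  size_t nbytes = nl != NULL ? (size_t) (nl + 1 - *src) : avail;

  return mbsnrtowcs (dst, src, nbytes, dstlen, state);
}
```

Because no NUL byte is processed, the conversion state is not forced back to the initial shift state between lines.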
@ -1603,7 +1603,7 @@ common that they operate on character sets that are not directly
|
||||
specified by the functions. The multibyte encoding used is specified by
|
||||
the currently selected locale for the @code{LC_CTYPE} category. The
|
||||
wide character set is fixed by the implementation (in the case of @theglibc{}
|
||||
it is always UCS-4 encoded @w{ISO 10646}.
|
||||
it is always UCS-4 encoded @w{ISO 10646}).
|
||||
|
||||
This has of course several problems when it comes to general character
|
||||
conversion:
|
||||
@ -1681,7 +1681,7 @@ This data type is an abstract type defined in @file{iconv.h}. The user
|
||||
must not assume anything about the definition of this type; it must be
|
||||
completely opaque.
|
||||
|
||||
Objects of this type can get assigned handles for the conversions using
|
||||
Objects of this type can be assigned handles for the conversions using
|
||||
the @code{iconv} functions. The objects themselves need not be freed, but
|
||||
the conversions for which the handles stand for have to.
|
||||
@end deftp
|
||||
@ -1716,7 +1716,7 @@ returns @code{(iconv_t) -1}. In this case the global variable
|
||||
@item EMFILE
|
||||
The process already has @code{OPEN_MAX} file descriptors open.
|
||||
@item ENFILE
|
||||
The system limit of open file is reached.
|
||||
The system limit of open files is reached.
|
||||
@item ENOMEM
|
||||
Not enough memory to carry out the operation.
|
||||
@item EINVAL
|
||||
@ -1778,7 +1778,7 @@ the @code{iconv_open} function.
|
||||
|
||||
If the function call was successful the return value is @math{0}.
|
||||
Otherwise it is @math{-1} and @code{errno} is set appropriately.
|
||||
Defined error are:
|
||||
Defined errors are:
|
||||
|
||||
@table @code
|
||||
@item EBADF
|
||||
@ -1847,7 +1847,7 @@ stop is that the output buffer is full. And the third reason is that
|
||||
the input contains invalid characters.
|
||||
|
||||
In all of these cases the buffer pointers after the last successful
|
||||
conversion, for input and output buffer, are stored in @var{inbuf} and
|
||||
conversion, for the input and output buffers, are stored in @var{inbuf} and
|
||||
@var{outbuf}, and the available room in each buffer is stored in
|
||||
@var{inbytesleft} and @var{outbytesleft}.
|
||||
|
||||
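Illustration (not part of the patch): how `iconv` updates the buffer pointers and the remaining byte counts. The sketch assumes the conversion names `"UTF-8"` and `"UCS-4LE"` are available on the system; `utf8_to_ucs4le` is a hypothetical name.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes of UTF-8 text at IN into
   UCS-4LE at OUT, which has room for OUTCAP bytes.  Returns the number
   of bytes written, or (size_t) -1 on any failure.  */
static size_t
utf8_to_ucs4le (char *in, size_t inlen, char *out, size_t outcap)
{
  iconv_t cd = iconv_open ("UCS-4LE", "UTF-8");   /* to, from */
  char *inptr = in;
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outcap;
  size_t res;

  if (cd == (iconv_t) -1)
    return (size_t) -1;
  res = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  iconv_close (cd);
  if (res == (size_t) -1)
    return (size_t) -1;
  /* iconv advanced OUTPTR by exactly the amount OUTLEFT shrank.  */
  assert ((size_t) (outptr - out) == outcap - outleft);
  return outcap - outleft;
}
```

On failure a fuller version would inspect `errno` and the updated pointers to decide whether to enlarge the output buffer or skip invalid input.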
@ -2087,7 +2087,7 @@ possibilities. This does not mean 200 different character sets are
|
||||
supported; for example, conversions from one character set to a set of 10
|
||||
others might count as 10 conversions. Together with the other direction
|
||||
this makes 20 conversion possibilities used up by one character set. One
|
||||
can imagine the thin coverage these platform provide. Some Unix vendors
|
||||
can imagine the thin coverage these platforms provide. Some Unix vendors
|
||||
even provide only a handful of conversions, which renders them useless for
|
||||
almost all uses.
|
||||
|
||||
@ -2133,7 +2133,7 @@ will succeed, but how to find @math{@cal{B}}?
|
||||
|
||||
Unfortunately, the answer is: there is no general solution. On some
|
||||
systems guessing might help. On those systems most character sets can
|
||||
convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Beside
|
||||
convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides
|
||||
this only some very system-specific methods can help. Since the
|
||||
conversion functions come from loadable modules and these modules must
|
||||
be stored somewhere in the filesystem, one @emph{could} try to find them
|
||||
@ -2143,7 +2143,7 @@ and whether there is an indirect route from @math{@cal{A}} to
|
||||
|
||||
This example shows one of the design errors of @code{iconv} mentioned
|
||||
above. It should at least be possible to determine the list of available
|
||||
conversion programmatically so that if @code{iconv_open} says there is no
|
||||
conversions programmatically so that if @code{iconv_open} says there is no
|
||||
such conversion, one could make sure this also is true for indirect
|
||||
routes.
|
||||
|
||||
@ -2235,7 +2235,7 @@ achieve the same result as when using the real character set name.
|
||||
|
||||
This is quite important as a character set has often many different
|
||||
names. There is normally an official name but this need not correspond to
|
||||
the most popular name. Beside this many character sets have special
|
||||
the most popular name. Besides this many character sets have special
|
||||
names that are somehow constructed. For example, all character sets
|
||||
specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}}
|
||||
where @var{nnn} is the registration number. This allows programs that
|
||||
@ -2371,7 +2371,7 @@ itself).
|
||||
@itemx const char *__modname
|
||||
@itemx int __counter
|
||||
All these elements of the structure are used internally in the C library
|
||||
to coordinate loading and unloading the shared. One must not expect any
|
||||
to coordinate loading and unloading the shared object. One must not expect any
|
||||
of the other elements to be available or initialized.
|
||||
|
||||
@item const char *__from_name
|
||||
@ -2438,7 +2438,7 @@ These elements specify the output buffer for the conversion step. The
|
||||
@code{__outbuf} element points to the beginning of the buffer, and
|
||||
@code{__outbufend} points to the byte following the last byte in the
|
||||
buffer. The conversion function must not assume anything about the size
|
||||
of the buffer but it can be safely assumed the there is room for at
|
||||
of the buffer but it can be safely assumed there is room for at
|
||||
least one complete character in the output buffer.
|
||||
|
||||
Once the conversion is finished, if the conversion is the last step, the
|
||||
@ -2673,7 +2673,7 @@ Next, a data structure, which contains the necessary information about
|
||||
which conversion is selected, is allocated. The data structure
|
||||
@code{struct iso2022jp_data} is locally defined since, outside the
|
||||
module, this data is not used at all. Please note that if all four
|
||||
conversions this modules supports are requested there are four data
|
||||
conversions this module supports are requested there are four data
|
||||
blocks.
|
||||
|
||||
One interesting thing is the initialization of the @code{__min_} and
|
||||
@ -2686,7 +2686,7 @@ the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
|
||||
account that escape sequences might be necessary to switch the character
|
||||
sets. Therefore the @code{__max_needed_to} element for this direction
|
||||
gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the
|
||||
two bytes needed for the escape sequences to single the switching. The
|
||||
two bytes needed for the escape sequences to signal the switching. The
|
||||
asymmetry in the maximum values for the two directions can be explained
|
||||
easily: when reading ISO-2022-JP text, escape sequences can be handled
|
||||
alone (i.e., it is not necessary to process a real character since the
|
||||
@ -2694,7 +2694,7 @@ effect of the escape sequence can be recorded in the state information).
|
||||
The situation is different for the other direction. Since it is in
|
||||
general not known which character comes next, one cannot emit escape
|
||||
sequences to change the state in advance. This means the escape
|
||||
sequences that have to be emitted together with the next character.
|
||||
sequences have to be emitted together with the next character.
|
||||
Therefore one needs more room than only for the character itself.
|
||||
|
||||
The possible return values of the initialization function are:
|
||||
@ -2740,7 +2740,7 @@ conversion function.
|
||||
@comment gconv.h
|
||||
@comment GNU
|
||||
@deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)
|
||||
The conversion function can be called for two basic reason: to convert
|
||||
The conversion function can be called for two basic reasons: to convert
|
||||
text or to reset the state. From the description of the @code{iconv}
|
||||
function it can be seen why the flushing mode is necessary. What mode
|
||||
is selected is determined by the sixth argument, an integer. This
|
||||
@ -2817,7 +2817,7 @@ therefore will look similar to this:
|
||||
But this is not yet all. Once the function call returns the conversion
|
||||
function might have some more to do. If the return value of the function
|
||||
is @code{__GCONV_EMPTY_INPUT}, more room is available in the output
|
||||
buffer. Unless the input buffer is empty the conversion, functions start
|
||||
buffer. Unless the input buffer is empty, the conversion functions start
|
||||
all over again and process the rest of the input buffer. If the return
|
||||
value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have
|
||||
to recover from this.
|
||||
|