manual: fix typos in character set handling

Rical Jasan 2016-05-06 00:54:34 -07:00 committed by Mike Frysinger
parent f0f308c104
commit d987d2196b
2 changed files with 38 additions and 34 deletions


@@ -1,3 +1,7 @@
+2016-06-16 Rical Jasan <ricaljasan@pacific.net>
+* manual/charset.texi: Fix typos & grammar errors.
2016-06-16 Rical Jasan <ricaljasan@pacific.net>
* manual/string.texi: Fix typos.


@@ -31,7 +31,7 @@ library to support multiple character sets.
@node Extended Char Intro
@section Introduction to Extended Characters
-A variety of solutions is available to overcome the differences between
+A variety of solutions are available to overcome the differences between
character sets with a 1:1 relation between bytes and characters and
character sets with ratios of 2:1 or 4:1. The remainder of this
section gives a few examples to help understand the design decisions
@@ -202,7 +202,7 @@ defined in @file{wchar.h}.
@end deftypevr
-These internal representations present problems when it comes to storing
+These internal representations present problems when it comes to storage
and transmittal. Because each single wide character consists of more
than one byte, they are affected by byte-ordering. Thus, machines with
different endianesses would see different values when accessing the same
@@ -389,7 +389,7 @@ the conversion is necessary take a look at the @code{iconv} functions
@subsection Selecting the conversion and its properties
We already said above that the currently selected locale for the
-@code{LC_CTYPE} category decides about the conversion that is performed
+@code{LC_CTYPE} category decides the conversion that is performed
by the functions we are about to describe. Each locale uses its own
character set (given as an argument to @code{localedef}) and this is the
one assumed as the external multibyte encoding. The wide character
@@ -549,7 +549,7 @@ necessary output code (@pxref{Converting Strings}). Please note that with
@theglibc{} it is not necessary to perform this extra action for the
conversion from multibyte text to wide character text since the wide
character encoding is not stateful. But there is nothing mentioned in
-any standard that prohibits making @code{wchar_t} using a stateful
+any standard that prohibits making @code{wchar_t} use a stateful
encoding.
@node Converting a Character
@@ -559,7 +559,7 @@ The most fundamental of the conversion functions are those dealing with
single characters. Please note that this does not always mean single
bytes. But since there is very often a subset of the multibyte
character set that consists of single byte sequences, there are
-functions to help with converting bytes. Frequently, ASCII is a subpart
+functions to help with converting bytes. Frequently, ASCII is a subset
of the multibyte character set. In such a scenario, each ASCII character
stands for itself, and all other characters have at least a first byte
that is beyond the range @math{0} to @math{127}.
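As an illustration of the single-byte case described in this context, the sketch below uses the standard `btowc` function to test whether one byte is a complete character in the initial shift state. The helper name `byte_is_complete_char` is hypothetical and exists only for this note; the result for bytes above 127 depends on the selected `LC_CTYPE` locale.

```c
#include <wchar.h>

/* Hypothetical helper, not part of the manual: returns 1 if the single
   byte C is a complete character in the initial shift state of the
   current LC_CTYPE locale and stores its wide form in *OUT.  Returns 0
   and leaves *OUT untouched otherwise.  */
static int
byte_is_complete_char (unsigned char c, wchar_t *out)
{
  wint_t wc = btowc (c);

  if (wc == WEOF)
    return 0;
  *out = (wchar_t) wc;
  return 1;
}
```

In the default "C" locale an ASCII byte such as `'A'` is expected to map to the wide character `L'A'`.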
@@ -596,7 +596,7 @@ and is declared in @file{wchar.h}.
Despite the limitation that the single byte value is always interpreted
in the initial state, this function is actually useful most of the time.
Most characters are either entirely single-byte character sets or they
-are extension to ASCII. But then it is possible to write code like this
+are extensions to ASCII. But then it is possible to write code like this
(not that this specific example is very useful):
@smallexample
@@ -643,7 +643,7 @@ value of this function is this character. Otherwise the return value is
is declared in @file{wchar.h}.
@end deftypefun
-There are more general functions to convert single character from
+There are more general functions to convert single characters from
multibyte representation to wide characters and vice versa. These
functions pose no limit on the length of the multibyte representation
and they also do not require it to be in the initial state.
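A minimal sketch of the more general single-character functions mentioned here, using the standard `mbrtowc` and `wcrtomb`. The helper names `first_char` and `encode_char` are hypothetical, and both helpers start from a freshly zeroed `mbstate_t`, which is an assumption that only suits input beginning in the initial shift state.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the first character of the multibyte
   text S (at most N bytes) into *WC.  The return value is passed
   through from mbrtowc: the number of bytes consumed, 0 for the NUL
   character, (size_t) -1 for invalid input, or (size_t) -2 for an
   incomplete character.  */
static size_t
first_char (const char *s, size_t n, wchar_t *wc)
{
  mbstate_t state;

  memset (&state, '\0', sizeof state);   /* initial shift state */
  return mbrtowc (wc, s, n, &state);
}

/* Hypothetical helper: encode WC into BUF, which must have room for
   MB_LEN_MAX bytes.  Returns the byte count or (size_t) -1.  */
static size_t
encode_char (wchar_t wc, char *buf)
{
  mbstate_t state;

  memset (&state, '\0', sizeof state);
  return wcrtomb (buf, wc, &state);
}
```

A caller converting a longer text would keep one `mbstate_t` object alive across calls instead of resetting it each time.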
@@ -731,7 +731,7 @@ bytes is adjusted.
The only non-obvious thing about @code{mbrtowc} might be the way memory
is allocated for the result. The above code uses the fact that there
-can never be more wide characters in the converted results than there are
+can never be more wide characters in the converted result than there are
bytes in the multibyte input string. This method yields a pessimistic
guess about the size of the result, and if many wide character strings
have to be constructed this way or if the strings are long, the extra
@@ -813,7 +813,7 @@ Therefore, the @code{mbrlen} function will never read invalid memory.
Now that this function is available (just to make this clear, this
function is @emph{not} part of @theglibc{}) we can compute the
-number of wide character required to store the converted multibyte
+number of wide characters required to store the converted multibyte
character string @var{s} using
@smallexample
@@ -879,7 +879,7 @@ multibyte'') converts a single wide character into a multibyte string
corresponding to that wide character.
If @var{s} is a null pointer, the function resets the state stored in
-the objects pointed to by @var{ps} (or the internal @code{mbstate_t}
+the object pointed to by @var{ps} (or the internal @code{mbstate_t}
object) to the initial state. This can also be achieved by a call like
this:
@@ -1020,7 +1020,7 @@ extensions that can help in some important situations.
@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
@safety{@prelim{}@mtunsafe{@mtasurace{:mbsrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
The @code{mbsrtowcs} function (``multibyte string restartable to wide
-character string'') converts a NUL-terminated multibyte character
+character string'') converts the NUL-terminated multibyte character
string at @code{*@var{src}} into an equivalent wide character string,
including the NUL wide character at the end. The conversion is started
using the state information from the object pointed to by @var{ps} or
@@ -1061,7 +1061,7 @@ declared in @file{wchar.h}.
The definition of the @code{mbsrtowcs} function has one important
limitation. The requirement that @var{dst} has to be a NUL-terminated
string provides problems if one wants to convert buffers with text. A
-buffer is normally no collection of NUL-terminated strings but instead a
+buffer is not normally a collection of NUL-terminated strings but instead a
continuous collection of lines, separated by newline characters. Now
assume that a function to convert one line from a buffer is needed. Since
the line is not NUL-terminated, the source pointer cannot directly point
@@ -1078,7 +1078,7 @@ guess.
@cindex stateful
There is still a problem with the method of NUL-terminating a line right
after the newline character, which could lead to very strange results.
-As said in the description of the @code{mbsrtowcs} function above the
+As said in the description of the @code{mbsrtowcs} function above, the
conversion state is guaranteed to be in the initial shift state after
processing the NUL byte at the end of the input string. But this NUL
byte is not really part of the text (i.e., the conversion state after
@@ -1110,7 +1110,7 @@ multibyte string'') converts the NUL-terminated wide character string at
stores the result in the array pointed to by @var{dst}. The NUL wide
character is also converted. The conversion starts in the state
described in the object pointed to by @var{ps} or by a state object
-locally to @code{wcsrtombs} in case @var{ps} is a null pointer. If
+local to @code{wcsrtombs} in case @var{ps} is a null pointer. If
@var{dst} is a null pointer, the conversion is performed as usual but the
result is not available. If all characters of the input string were
successfully converted and if @var{dst} is not a null pointer, the
@@ -1123,13 +1123,13 @@ variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.
Another reason for a premature stop is if @var{dst} is not a null
pointer and the next converted character would require more than
@var{len} bytes in total to the array @var{dst}. In this case (and if
-@var{dest} is not a null pointer) the pointer pointed to by @var{src} is
+@var{dst} is not a null pointer) the pointer pointed to by @var{src} is
assigned a value pointing to the wide character right after the last one
successfully converted.
Except in the case of an encoding error the return value of the
@code{wcsrtombs} function is the number of bytes in all the multibyte
-character sequences stored in @var{dst}. Before returning the state in
+character sequences stored in @var{dst}. Before returning, the state in
the object pointed to by @var{ps} (or the internal object in case
@var{ps} is a null pointer) is updated to reflect the state after the
last conversion. The state is the initial shift state in case the
@@ -1158,11 +1158,11 @@ This new parameter specifies how many bytes at most can be used from the
multibyte character string. In other words, the multibyte character
string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte
is found within the @var{nmc} first bytes of the string, the conversion
-stops here.
+stops there.
This function is a GNU extension. It is meant to work around the
problems mentioned above. Now it is possible to convert a buffer with
-multibyte character text piece for piece without having to care about
+multibyte character text piece by piece without having to care about
inserting NUL bytes and the effect of NUL bytes on the conversion state.
@end deftypefun
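The piece-by-piece conversion described for `mbsnrtowcs` can be sketched as follows. The helper `convert_line` and its parameters are hypothetical names for this note. The sketch assumes that the newline byte never occurs inside a multibyte character (true for UTF-8 and for the "C" locale, not for every encoding) and that the caller keeps one `mbstate_t` object across calls.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1   /* mbsnrtowcs is described above as a GNU extension */
#endif
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the first line of BUF (LEN bytes, not
   NUL-terminated) into DST, which has room for DSTLEN wide characters.
   STATE carries the shift state from one call to the next.  Returns
   the number of wide characters stored, or (size_t) -1 on an encoding
   error; on success *USED receives the number of bytes consumed.  */
static size_t
convert_line (wchar_t *dst, size_t dstlen, const char *buf, size_t len,
              mbstate_t *state, size_t *used)
{
  /* Limit the conversion to the bytes up to and including the newline.  */
  const char *nl = memchr (buf, '\n', len);
  size_t nbytes = nl != NULL ? (size_t) (nl - buf) + 1 : len;
  const char *src = buf;
  size_t n = mbsnrtowcs (dst, &src, nbytes, dstlen, state);

  if (n == (size_t) -1)
    return n;

  /* A NUL byte inside the range ends the conversion and sets SRC to a
     null pointer; this sketch then simply treats the whole range as
     consumed.  */
  *used = src != NULL ? (size_t) (src - buf) : nbytes;
  return n;
}
```

No NUL byte has to be written into the buffer, so the conversion state after the newline is left as the text itself dictates.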
@@ -1603,7 +1603,7 @@ common that they operate on character sets that are not directly
specified by the functions. The multibyte encoding used is specified by
the currently selected locale for the @code{LC_CTYPE} category. The
wide character set is fixed by the implementation (in the case of @theglibc{}
-it is always UCS-4 encoded @w{ISO 10646}.
+it is always UCS-4 encoded @w{ISO 10646}).
This has of course several problems when it comes to general character
conversion:
@@ -1681,7 +1681,7 @@ This data type is an abstract type defined in @file{iconv.h}. The user
must not assume anything about the definition of this type; it must be
completely opaque.
-Objects of this type can get assigned handles for the conversions using
+Objects of this type can be assigned handles for the conversions using
the @code{iconv} functions. The objects themselves need not be freed, but
the conversions for which the handles stand for have to.
@end deftp
@@ -1716,7 +1716,7 @@ returns @code{(iconv_t) -1}. In this case the global variable
@item EMFILE
The process already has @code{OPEN_MAX} file descriptors open.
@item ENFILE
-The system limit of open file is reached.
+The system limit of open files is reached.
@item ENOMEM
Not enough memory to carry out the operation.
@item EINVAL
@@ -1778,7 +1778,7 @@ the @code{iconv_open} function.
If the function call was successful the return value is @math{0}.
Otherwise it is @math{-1} and @code{errno} is set appropriately.
-Defined error are:
+Defined errors are:
@table @code
@item EBADF
@@ -1847,7 +1847,7 @@ stop is that the output buffer is full. And the third reason is that
the input contains invalid characters.
In all of these cases the buffer pointers after the last successful
-conversion, for input and output buffer, are stored in @var{inbuf} and
+conversion, for the input and output buffers, are stored in @var{inbuf} and
@var{outbuf}, and the available room in each buffer is stored in
@var{inbytesleft} and @var{outbytesleft}.
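The bookkeeping described here can be illustrated with a small wrapper around a single `iconv` call. This is only a sketch: the helper name `convert_chunk` and its parameters are hypothetical, and it assumes the descriptor was opened successfully beforehand with `iconv_open`.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes at IN using the open
   descriptor CD into OUT (OUTLEN bytes).  Returns 0 if the whole input
   was converted, otherwise the errno value that explains the stop,
   such as E2BIG (output buffer full) or EILSEQ (invalid input).
   *CONSUMED and *PRODUCED report how far iconv advanced each buffer.  */
static int
convert_chunk (iconv_t cd, char *in, size_t inlen, char *out, size_t outlen,
               size_t *consumed, size_t *produced)
{
  char *inptr = in;
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outlen;
  int err = 0;

  if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
    err = errno;

  /* Whether or not the call stopped early, the counters describe the
     position after the last successful conversion.  */
  *consumed = inlen - inleft;
  *produced = outlen - outleft;
  return err;
}
```

On `E2BIG` a caller would typically drain the output buffer and call the helper again starting at `in + *consumed`.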
@@ -2087,7 +2087,7 @@ possibilities. This does not mean 200 different character sets are
supported; for example, conversions from one character set to a set of 10
others might count as 10 conversions. Together with the other direction
this makes 20 conversion possibilities used up by one character set. One
-can imagine the thin coverage these platform provide. Some Unix vendors
+can imagine the thin coverage these platforms provide. Some Unix vendors
even provide only a handful of conversions, which renders them useless for
almost all uses.
@@ -2133,7 +2133,7 @@ will succeed, but how to find @math{@cal{B}}?
Unfortunately, the answer is: there is no general solution. On some
systems guessing might help. On those systems most character sets can
-convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Beside
+convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides
this only some very system-specific methods can help. Since the
conversion functions come from loadable modules and these modules must
be stored somewhere in the filesystem, one @emph{could} try to find them
@@ -2143,7 +2143,7 @@ and whether there is an indirect route from @math{@cal{A}} to
This example shows one of the design errors of @code{iconv} mentioned
above. It should at least be possible to determine the list of available
-conversion programmatically so that if @code{iconv_open} says there is no
+conversions programmatically so that if @code{iconv_open} says there is no
such conversion, one could make sure this also is true for indirect
routes.
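Since the interface offers no way to list conversions, about the only portable probe is to try `iconv_open` and see whether it succeeds. The sketch below does just that; the helper name `conversion_available` is hypothetical, and a negative answer says nothing about indirect routes on systems that do not search for them.

```c
#include <iconv.h>

/* Hypothetical helper: returns 1 if iconv_open accepts a conversion
   from the character set FROM to the character set TO on this system,
   0 otherwise.  */
static int
conversion_available (const char *from, const char *to)
{
  /* Note the argument order of iconv_open: target character set first.  */
  iconv_t cd = iconv_open (to, from);

  if (cd == (iconv_t) -1)
    return 0;
  iconv_close (cd);
  return 1;
}
```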
@@ -2235,7 +2235,7 @@ achieve the same result as when using the real character set name.
This is quite important as a character set has often many different
names. There is normally an official name but this need not correspond to
-the most popular name. Beside this many character sets have special
+the most popular name. Besides this many character sets have special
names that are somehow constructed. For example, all character sets
specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}}
where @var{nnn} is the registration number. This allows programs that
@@ -2371,7 +2371,7 @@ itself).
@itemx const char *__modname
@itemx int __counter
All these elements of the structure are used internally in the C library
-to coordinate loading and unloading the shared. One must not expect any
+to coordinate loading and unloading the shared object. One must not expect any
of the other elements to be available or initialized.
@item const char *__from_name
@@ -2438,7 +2438,7 @@ These elements specify the output buffer for the conversion step. The
@code{__outbuf} element points to the beginning of the buffer, and
@code{__outbufend} points to the byte following the last byte in the
buffer. The conversion function must not assume anything about the size
-of the buffer but it can be safely assumed the there is room for at
+of the buffer but it can be safely assumed there is room for at
least one complete character in the output buffer.
Once the conversion is finished, if the conversion is the last step, the
@@ -2673,7 +2673,7 @@ Next, a data structure, which contains the necessary information about
which conversion is selected, is allocated. The data structure
@code{struct iso2022jp_data} is locally defined since, outside the
module, this data is not used at all. Please note that if all four
-conversions this modules supports are requested there are four data
+conversions this module supports are requested there are four data
blocks.
One interesting thing is the initialization of the @code{__min_} and
@@ -2686,7 +2686,7 @@ the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
account that escape sequences might be necessary to switch the character
sets. Therefore the @code{__max_needed_to} element for this direction
gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the
-two bytes needed for the escape sequences to single the switching. The
+two bytes needed for the escape sequences to signal the switching. The
asymmetry in the maximum values for the two directions can be explained
easily: when reading ISO-2022-JP text, escape sequences can be handled
alone (i.e., it is not necessary to process a real character since the
@@ -2694,7 +2694,7 @@ effect of the escape sequence can be recorded in the state information).
The situation is different for the other direction. Since it is in
general not known which character comes next, one cannot emit escape
sequences to change the state in advance. This means the escape
-sequences that have to be emitted together with the next character.
+sequences have to be emitted together with the next character.
Therefore one needs more room than only for the character itself.
The possible return values of the initialization function are:
@@ -2740,7 +2740,7 @@ conversion function.
@comment gconv.h
@comment GNU
@deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)
-The conversion function can be called for two basic reason: to convert
+The conversion function can be called for two basic reasons: to convert
text or to reset the state. From the description of the @code{iconv}
function it can be seen why the flushing mode is necessary. What mode
is selected is determined by the sixth argument, an integer. This
@@ -2817,7 +2817,7 @@ therefore will look similar to this:
But this is not yet all. Once the function call returns the conversion
function might have some more to do. If the return value of the function
is @code{__GCONV_EMPTY_INPUT}, more room is available in the output
-buffer. Unless the input buffer is empty the conversion, functions start
+buffer. Unless the input buffer is empty, the conversion functions start
all over again and process the rest of the input buffer. If the return
value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have
to recover from this.