178 lines
7.1 KiB
TeX
178 lines
7.1 KiB
TeX
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||
|
%% Name: tmbconv.tex
|
||
|
%% Purpose: Overview of the wxMBConv classes in wxWindows
|
||
|
%% Author: Ove Kaaven
|
||
|
%% Modified by:
|
||
|
%% Created: 25.03.00
|
||
|
%% RCS-ID: $Id$
|
||
|
%% Copyright: (c) 2000 Ove Kaaven
|
||
|
%% Licence: wxWindows license
|
||
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||
|
|
||
|
\section{wxMBConv classes overview}\label{mbconvclasses}
|
||
|
|
||
|
Classes: \helpref{wxMBConv}{wxmbconv}, \helpref{wxMBConvFile}{wxmbconvfile},
|
||
|
\helpref{wxMBConvUTF7}{wxmbconvutf7}, \helpref{wxMBConvUTF8}{wxmbconvutf8},
|
||
|
\helpref{wxCSConv}{wxcsconv}
|
||
|
|
||
|
The wxMBConv classes in wxWindows enables an Unicode-aware application to
|
||
|
easily convert between Unicode and the variety of 8-bit encoding systems still
|
||
|
in use.
|
||
|
|
||
|
\subsection{Background: The need for conversion}
|
||
|
|
||
|
As programs are becoming more and more globalized, and users exchange documents
|
||
|
across country boundaries as never before, applications increasingly need to
|
||
|
take into account all the different character sets in use around the world. It
|
||
|
is no longer enough to just depend on the default byte-sized character set that
|
||
|
computers have traditionally used.
|
||
|
|
||
|
A few years ago, a solution was proposed: the Unicode standard. Able to contain
|
||
|
the complete set of characters in use in one unified global coding system,
|
||
|
it would resolve the character set problems once and for all.
|
||
|
|
||
|
But it hasn't happened yet, and the migration towards Unicode has created new
|
||
|
challenges, resulting in "compatibility encodings" such as UTF-8. A large
|
||
|
number of systems out there still depends on the old 8-bit encodings, hampered
|
||
|
by the huge amounts of legacy code still widely deployed. Even sending
|
||
|
Unicode data from one Unicode-aware system to another may need encoding to an
|
||
|
8-bit multibyte encoding (UTF-7 or UTF-8 is typically used for this purpose), to
|
||
|
pass unhindered through any traditional transport channels.
|
||
|
|
||
|
\subsection{Background: The wxString class}
|
||
|
|
||
|
If you have compiled wxWindows in Unicode mode, the wxChar type will become
|
||
|
identical to wchar\_t rather than char, and a wxString stores wxChars. Hence,
|
||
|
all wxString manipulation in your application will then operate on Unicode
|
||
|
strings, and almost as easily as working with ordinary char strings (you
|
||
|
just need to remember to use the wxT() macro to encapsulate any string
|
||
|
literals).
|
||
|
|
||
|
But often, your environment doesn't want Unicode strings. You could be sending
|
||
|
data over a network, or processing a text file for some other application. You
|
||
|
need a way to quickly convert your easily-handled Unicode data to and from a
|
||
|
traditional 8-bit-encoding. And this is what the wxMBConv classes do.
|
||
|
|
||
|
\subsection{wxMBConv classes}
|
||
|
|
||
|
The base class for all these conversions is the wxMBConv class (which itself
|
||
|
implements standard libc locale conversion). Derived classes include
|
||
|
wxMBConvFile, wxMBConvUTF7, wxMBConvUTF8, and wxCSConv, which implement
|
||
|
different kinds of conversions. You can also derive your own class for your
|
||
|
own custom encoding and use it, should you need it. All you need to do is
|
||
|
override the MB2WC and WC2MB methods.
|
||
|
|
||
|
\subsection{wxMBConv objects}
|
||
|
|
||
|
In C++, for a class to be useful and possible to pass around, it needs to be
|
||
|
instantiated. All of the wxWindows-provided wxMBConv classes have predefined
|
||
|
instances (wxConvLibc, wxConvFile, wxConvUTF7, wxConvUTF8, wxConvLocal).
|
||
|
You can use these predefined objects directly, or you can instantiate your own
|
||
|
objects.
|
||
|
|
||
|
A variable, wxConvCurrent, points to the conversion object that the user interface
|
||
|
is supposed to use, in the case that the user interface is not Unicode-based (like
|
||
|
with GTK+ 1.2). By default, it points to wxConvLibc or wxConvLocal, depending on
|
||
|
which works best on the current platform.
|
||
|
|
||
|
\subsection{wxCSConv}
|
||
|
|
||
|
The wxCSConv class is special because when it is instantiated, you can tell it
|
||
|
which character set it should use, which makes it meaningful to keep many
|
||
|
instances of them around, each with a different character set (or you can
|
||
|
create a wxCSConv instance on the fly).
|
||
|
|
||
|
The predefined wxCSConv instance, wxConvLocal, is preset to use the
|
||
|
default user character set, but you should rarely need to use it directly,
|
||
|
it is better to go through wxConvCurrent.
|
||
|
|
||
|
\subsection{Converting strings}
|
||
|
|
||
|
Once you have chosen which object you want to use to convert your text,
|
||
|
here is how you would use them with wxString. These examples all assume
|
||
|
that you are using a Unicode build of wxWindows, although they will still
|
||
|
compile in a non-Unicode build (they just won't convert anything).
|
||
|
|
||
|
Example 1: Constructing a wxString from input in current encoding.
|
||
|
|
||
|
\begin{verbatim}
|
||
|
wxString str(input_data, *wxConvCurrent);
|
||
|
\end{verbatim}
|
||
|
|
||
|
Example 2: Input in UTF-8 encoding.
|
||
|
|
||
|
\begin{verbatim}
|
||
|
wxString str(input_data, wxConvUTF8);
|
||
|
\end{verbatim}
|
||
|
|
||
|
Example 3: Input in KOI8-R. Construction of wxCSConv instance on the fly.
|
||
|
|
||
|
\begin{verbatim}
|
||
|
wxString str(input_data, wxCSConv(wxT("koi8-r")));
|
||
|
\end{verbatim}
|
||
|
|
||
|
Example 4: Printing a wxString to stdout in UTF-8 encoding.
|
||
|
|
||
|
\begin{verbatim}
|
||
|
puts(str.mb_str(wxConvUTF8));
|
||
|
\end{verbatim}
|
||
|
|
||
|
Example 5: Printing a wxString to stdout in custom encoding.
|
||
|
Using preconstructed wxCSConv instance.
|
||
|
|
||
|
\begin{verbatim}
|
||
|
wxCSConv cust(user_encoding);
|
||
|
printf("Data: %s\n", (const char*) str.mb_str(cust));
|
||
|
\end{verbatim}
|
||
|
|
||
|
Note: Since mb_str() returns a temporary wxCharBuffer to hold the result
|
||
|
of the conversion, you need to explicitly cast it to const char* if you use
|
||
|
it in a vararg context (like with printf).
|
||
|
|
||
|
\subsection{Converting buffers}
|
||
|
|
||
|
If you have specialized needs, or just don't want to use wxString, you
|
||
|
can also use the conversion methods of the conversion objects directly.
|
||
|
This can even be useful if you need to do conversion in a non-Unicode
|
||
|
build of wxWindows; converting a string from UTF-8 to the current
|
||
|
encoding should be possible by doing this:
|
||
|
|
||
|
\begin{verbatim}
|
||
|
wxString str(wxConvUTF8.cMB2WC(input_data), *wxConvCurrent);
|
||
|
\end{verbatim}
|
||
|
|
||
|
Here, cMB2WC of the UTF8 object returns a wxWCharBuffer containing a Unicode
|
||
|
string. The wxString constructor then converts it back to an 8-bit character
|
||
|
set using the passed conversion object, *wxConvCurrent. (In a Unicode build
|
||
|
of wxWindows, the constructor ignores the passed conversion object and
|
||
|
retains the Unicode data.)
|
||
|
|
||
|
This could also be done by first making a wxString of the original data:
|
||
|
|
||
|
\begin{verbatim}
|
||
|
wxString input_str(input_data);
|
||
|
wxString str(input_str.wc_str(wxConvUTF8), *wxConvCurrent);
|
||
|
\end{verbatim}
|
||
|
|
||
|
To print a wxChar buffer to a non-Unicode stdout:
|
||
|
|
||
|
\begin{verbatim}
|
||
|
printf("Data: %s\n", (const char*) wxConvCurrent->cWX2MB(unicode_data));
|
||
|
\end{verbatim}
|
||
|
|
||
|
If you need to do more complex processing on the converted data, you
|
||
|
may want to store the temporary buffer in a local variable:
|
||
|
|
||
|
\begin{verbatim}
|
||
|
const wxWX2MBbuf tmp_buf = wxConvCurrent->cWX2MB(unicode_data);
|
||
|
const char *tmp_str = (const char*) tmp_buf;
|
||
|
printf("Data: %s\n", tmp_str);
|
||
|
process_data(tmp_str);
|
||
|
\end{verbatim}
|
||
|
|
||
|
If a conversion had taken place in cWX2MB (i.e. in a Unicode build),
|
||
|
the buffer will be deallocated as soon as tmp_buf goes out of scope.
|
||
|
(The macro wxWX2MBbuf reflects the correct return value of cWX2MB
|
||
|
(either char* or wxCharBuffer), except for the const.)
|
||
|
|