\section{wxString overview}\label{wxstringoverview}

Classes: \helpref{wxString}{wxstring}, \helpref{wxArrayString}{wxarraystring}, \helpref{wxStringTokenizer}{wxstringtokenizer}

\subsection{Introduction}

wxString is a class which represents a character string of arbitrary length (limited by
{\it INT\_MAX}, which is usually 2147483647 on 32 bit machines) and containing
arbitrary characters. The ASCII NUL character is allowed, although care should be
taken when passing strings containing it to other functions.
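
As a minimal sketch of why this care is needed, the following uses std::string as a
stand-in for wxString (an assumption made so that the example compiles without
wxWindows; both helper functions are hypothetical). The string object knows its
real length, but a C function which only receives the pointer stops at the first NUL:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: builds a 5 character string with a NUL in the middle.
// The explicit length is required, otherwise the literal would be cut at the NUL.
std::string MakeEmbeddedNul()
{
    return std::string("ab\0cd", 5);
}

// Length as seen by a C function which only receives the char pointer.
std::size_t LengthSeenByC(const std::string& s)
{
    return std::strlen(s.c_str());
}
```

Here the string reports a length of 5 while the C view of the same data is only
2 characters long.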

wxString only works with ASCII (8 bit characters) strings as of this release,
but support for UNICODE (16 bit characters) is planned for the next one.

This class has all the standard operations you can expect to find in a string class:
dynamic memory management (string extends to accommodate new characters),
construction from other strings, C strings and characters, assignment operators,
access to individual characters, string concatenation and comparison, substring
extraction, case conversion, trimming and padding (with spaces), searching and
replacing and both C-like \helpref{Printf()}{wxstringprintf} and stream-like
insertion functions as well as much more - see \helpref{wxString}{wxstring}
for a list of all functions.

\subsection{Comparison of wxString to other string classes}

The advantages of using a special string class instead of working directly with
C strings are so obvious that there is a huge number of such classes available.
The most important advantage is that there is no need to
remember to allocate and free memory for C strings; working with fixed size buffers almost
inevitably leads to buffer overflows. At last, C++ has a standard string class
(std::string). So why the need for wxString?

There are several advantages:

\begin{enumerate}\itemsep=0pt
\item {\bf Efficiency} This class was made to be as efficient as possible: both
in terms of size (each wxString object takes exactly the same space as a {\it
char *} pointer, using \helpref{reference counting}{wxstringrefcount}) and speed.
It also provides performance \helpref{statistics gathering code}{wxstringtuning}
which may be enabled to fine tune the memory allocation strategy for your
particular application - and the gain might be quite big.
\item {\bf Compatibility} This class tries to combine almost full compatibility
with the old wxWindows 1.xx wxString class, some resemblance to the MFC CString
class and 90\% of the functionality of the std::string class.
\item {\bf Rich set of functions} Some of the functions present in wxString are
very useful but don't exist in most other string classes: for example,
\helpref{AfterFirst}{wxstringafterfirst},
\helpref{BeforeLast}{wxstringbeforelast}, \helpref{operator<<}{wxstringoperatorout}
or \helpref{Printf}{wxstringprintf}. Of course, all the standard string
operations are supported as well.
\item {\bf UNICODE} In this release, wxString only supports {\it construction} from
a UNICODE string, but in the next one it will be capable of also storing its
internal data in either ASCII or UNICODE format.
\item {\bf Used by wxWindows} And, of course, this class is used everywhere
inside wxWindows so there is no performance loss which would result from
conversions of objects of any other string class (including std::string) to
wxString internally by wxWindows.
\end{enumerate}

However, there are several problems as well. The most important one is probably
that there are often several functions to do exactly the same thing: for
example, to get the length of the string any one of
length(), \helpref{Len()}{wxstringlen} or
\helpref{Length()}{wxstringlength} may be used. The first function, like almost
all the other functions in lowercase, is std::string compatible. The second one
is the "native" wxString version and the last one is the wxWindows 1.xx way. So the
question is: which one is better to use? And the answer is that:

{\bf The usage of std::string compatible functions is strongly advised!} It will
make your code more familiar to other C++ programmers (who are supposed to
have knowledge of std::string but not of wxString), let you reuse the same code
in both wxWindows and other programs (by just typedefing wxString as std::string
when used outside wxWindows) and keep it compatible with future versions of
wxWindows, which will probably start using std::string sooner or later too.
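
The typedef approach mentioned above might look like the following sketch.
{\tt AppString}, {\tt AfterFirstSep} and the {\tt HYPOTHETICAL\_USE\_WX} macro are
names invented for this illustration; the function body uses only std::string
compatible methods, on the assumption that these behave the same in both classes:

```cpp
#include <cassert>
#include <string>

// HYPOTHETICAL_USE_WX is an invented macro: define it when building with wxWindows.
#ifdef HYPOTHETICAL_USE_WX
    #include <wx/string.h>
    typedef wxString AppString;
#else
    typedef std::string AppString;
#endif

// Returns the part of 's' following the first occurrence of 'sep',
// or an empty string if 'sep' does not occur at all.
AppString AfterFirstSep(const AppString& s, char sep)
{
    AppString::size_type pos = s.find(sep);
    if ( pos == AppString::npos )
        return AppString();

    return s.substr(pos + 1);
}
```

Only the typedef changes between the two builds; the code using the string class
stays the same.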

In the situations where there is no corresponding std::string function, please
try to use the new wxString methods and not the old wxWindows 1.xx variants
which are deprecated and may disappear in future versions.

\subsection{Some advice about using wxString}\label{wxstringadvices}

Probably the main trap with using this class is the implicit conversion operator to
{\it const char *}. It is advised that you use \helpref{c\_str()}{wxstringcstr}
instead to clearly indicate when the conversion is done. Specifically, the
danger of this implicit conversion may be seen in the following code fragment:

\begin{verbatim}
// this function converts the input string to uppercase, outputs it to the screen
// and returns the result
const char *SayHELLO(const wxString& input)
{
    wxString output = input.Upper();

    printf("Hello, %s!\n", output);

    return output;
}
\end{verbatim}

There are two nasty bugs in these three lines. The first of them is in the call to the
{\it printf()} function. Although the implicit conversion to C strings is applied
automatically by the compiler in the case of

\begin{verbatim}
    puts(output);
\end{verbatim}

because the argument of {\it puts()} is known to be of the type {\it const char *},
this is {\bf not} done for {\it printf()} which is a function with a variable
number of arguments (and whose arguments are of unknown types). So this call may
do anything at all (including displaying the correct string on screen), although
the most likely result is a program crash. The solution is to use
\helpref{c\_str()}{wxstringcstr}: just replace this line with

\begin{verbatim}
    printf("Hello, %s!\n", output.c_str());
\end{verbatim}

The second bug is that returning {\it output} doesn't work. The implicit cast is
used again, so the code compiles, but as it returns a pointer to a buffer
belonging to a local variable which is deleted as soon as the function exits,
its contents are totally arbitrary. The solution to this problem is also easy:
just make the function return wxString instead of a C string.

This leads us to the following general advice: all functions taking string
arguments should take {\it const wxString\&} (this makes assignment to the
strings inside the function faster because of
\helpref{reference counting}{wxstringrefcount}) and all functions returning
strings should return {\it wxString} - this makes it safe to return local
variables.
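
Applied to the example above, both fixes together might look like the following
sketch. It uses std::string as a stand-in for wxString (an assumption made so that
it compiles without wxWindows), so the uppercase conversion is written out by hand
instead of calling {\it Upper()}:

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <string>

// Takes the argument by const reference and returns the result by value,
// so the caller receives its own string rather than a dangling pointer.
std::string SayHELLO(const std::string& input)
{
    std::string output = input;
    for ( std::string::size_type n = 0; n < output.length(); n++ )
    {
        output[n] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(output[n])));
    }

    // explicit c_str(): printf() cannot convert a string object by itself
    std::printf("Hello, %s!\n", output.c_str());

    return output;
}
```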

\subsection{Other string related functions and classes}

As most programs use character strings, the standard C library provides quite a
few functions to work with them. Unfortunately, some of them have rather
counter-intuitive behaviour (like strncpy() which doesn't always terminate the resulting
string with a NUL character) and are in general not very safe (passing NULL to them will
probably lead to a program crash). Moreover, some very useful functions are not
standard at all. This is why in addition to all wxString functions, there are
also a few global string functions which try to correct these problems:
\helpref{IsEmpty()}{IsEmpty} verifies whether the string is empty (returning
TRUE for NULL pointers), \helpref{Strlen()}{Strlen} also handles NULLs correctly
and returns 0 for them and \helpref{Stricmp()}{Stricmp} is just a
platform-independent version of the case-insensitive string comparison function
known either as stricmp() or strcasecmp() on different platforms.
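
The behaviour described above can be sketched as follows. These are hypothetical
helpers with invented names ({\tt SketchIsEmpty}, {\tt SketchStrlen},
{\tt SketchStricmp}) and are not the wxWindows implementations; they only
illustrate the NULL handling and the portable comparison:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>

// Hypothetical NULL-tolerant helpers, not the wxWindows implementations.
bool SketchIsEmpty(const char *p)
{
    return p == NULL || *p == '\0';
}

std::size_t SketchStrlen(const char *p)
{
    return p == NULL ? 0 : std::strlen(p);
}

// Case-insensitive comparison returning a negative, zero or positive value
// like strcmp(). Both arguments are assumed to be non-NULL here.
int SketchStricmp(const char *a, const char *b)
{
    for ( ;; a++, b++ )
    {
        int ca = std::tolower(static_cast<unsigned char>(*a));
        int cb = std::tolower(static_cast<unsigned char>(*b));
        if ( ca != cb || ca == 0 )
            return ca - cb;
    }
}
```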

The {\tt <wx/string.h>} header also defines the \helpref{wxSnprintf}{wxsnprintf}
and \helpref{wxVsnprintf}{wxvsnprintf} functions which should be used instead
of the inherently dangerous standard {\tt sprintf()}. They use {\tt
snprintf()}, which does buffer size checks, whenever possible. Of
course, you may also use \helpref{wxString::Printf}{wxstringprintf} which is
also safe.
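
The difference a buffer size check makes can be seen in the following sketch. It
calls the standard {\tt snprintf()} directly as a stand-in (the exact return value
conventions of the wx wrappers are not assumed here), and {\tt FormatGreeting} is a
hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Formats a greeting into a caller-supplied buffer of 'size' bytes.
// Returns true if the whole text fitted, false if it had to be truncated.
// Unlike sprintf(), nothing is ever written past the end of the buffer.
bool FormatGreeting(char *buf, std::size_t size, const char *name)
{
    int n = std::snprintf(buf, size, "Hello, %s!", name);

    // snprintf() returns the length the complete output would have had
    return n >= 0 && static_cast<std::size_t>(n) < size;
}
```

With an 8 byte buffer and a long name the function reports truncation and leaves a
properly terminated 7 character string in the buffer, where {\tt sprintf()} would
have overflowed it.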

There is another class which might be useful when working with wxString:
\helpref{wxStringTokenizer}{wxstringtokenizer}. It is helpful when a string must
be broken into tokens and replaces the standard C library {\it
strtok()} function.
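
To illustrate what breaking a string into tokens involves, here is a minimal
sketch written with std::string and std::vector. {\tt Tokenize} is a hypothetical
function and does not reproduce the wxStringTokenizer interface; like
{\it strtok()}, it skips runs of delimiters, but it does not modify its input:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits 's' at any of the characters in 'delims', skipping empty tokens.
std::vector<std::string> Tokenize(const std::string& s, const std::string& delims)
{
    std::vector<std::string> tokens;

    std::string::size_type start = s.find_first_not_of(delims);
    while ( start != std::string::npos )
    {
        // the token extends up to the next delimiter or the end of the string
        std::string::size_type end = s.find_first_of(delims, start);
        if ( end == std::string::npos )
        {
            tokens.push_back(s.substr(start));
            break;
        }

        tokens.push_back(s.substr(start, end - start));
        start = s.find_first_not_of(delims, end);
    }

    return tokens;
}
```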

And the very last string-related class is \helpref{wxArrayString}{wxarraystring}: it
is just a version of the "template" dynamic array class which is specialized to work
with strings. Please note that this class is specially optimized (using its
knowledge of the internal structure of wxString) for storing strings and so it is
vastly better from a performance point of view than a wxObjectArray of wxStrings.

\subsection{Reference counting and why you shouldn't care about it}\label{wxstringrefcount}

wxString objects use a technique known as {\it copy on write} (COW). This means
that when a string is assigned to another, no copying really takes place: only
the reference count on the shared string data is incremented and both strings
share the same data.

But as soon as one of the two (or more) strings is modified, the data has to be
copied because the changes to one of the strings shouldn't be seen in the
others. As data copying only happens when the string is written to, this is
known as COW.

What is important to understand is that all this happens absolutely
transparently to the class users and that whether a string is shared or not is
not seen from the outside of the class - in any case, the result of any
operation on it is the same.
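
The mechanism can be sketched with a toy class. {\tt CowString} is hypothetical and
is not how wxString is actually implemented; it delegates the reference counting
to std::shared\_ptr purely to keep the illustration short:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Toy copy-on-write string (hypothetical, not wxString's implementation).
class CowString
{
public:
    explicit CowString(const char *s)
        : m_data(std::make_shared<std::string>(s)) {}

    // Copying a CowString copies only the pointer and bumps the count;
    // the compiler-generated copy operations do exactly that.

    // Read-only access never copies the data.
    char GetChar(std::size_t n) const { return (*m_data)[n]; }

    // Write access unshares the data first if anybody else uses it.
    void SetChar(std::size_t n, char c)
    {
        if ( m_data.use_count() > 1 )
            m_data = std::make_shared<std::string>(*m_data);
        (*m_data)[n] = c;
    }

    bool SharesDataWith(const CowString& other) const
        { return m_data == other.m_data; }

    const std::string& Str() const { return *m_data; }

private:
    std::shared_ptr<std::string> m_data;
};
```

After {\tt CowString b = a;} both objects share one buffer; the first
{\tt b.SetChar()} gives {\tt b} its own copy and leaves {\tt a} untouched.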

Probably the only case when you might want to think about reference
counting is when a character is read from a string which is not a
constant (or a constant reference). In this case, due to C++ rules, the
"read-only" {\it operator[]} (which is the same as
\helpref{GetChar()}{wxstringgetchar}) cannot be chosen and the "read/write"
{\it operator[]} (the same as
\helpref{GetWritableChar()}{wxstringgetwritablechar}) is used instead. As the
call to this operator may modify the string, its data is unshared (COW is done)
and so if the string was really shared there is some performance loss (both in
terms of speed and memory consumption). In the rare cases when this may be
important, you might prefer using \helpref{GetChar()}{wxstringgetchar} instead
of the array subscript operator for this reason. Please note that the
\helpref{at()}{wxstringat} method has the same problem as the subscript operator in
this situation and so using it is not really better. Also note that if all
string arguments to your functions are passed as {\it const wxString\&} (see the
section \helpref{Some advice}{wxstringadvices}) this situation will almost
never arise because for constant references the correct operator is called automatically.
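
The C++ overload selection rule referred to here can be observed with a small
sketch. {\tt Probe} is a hypothetical class which merely counts which of its two
subscript operators gets called; it says nothing about wxString internals:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical class counting calls to each operator[] overload.
class Probe
{
public:
    Probe() : m_text("abc"), m_readCalls(0), m_writeCalls(0) {}

    // "read-only" version: selected only for const objects and references
    char operator[](std::size_t n) const { ++m_readCalls; return m_text[n]; }

    // "read/write" version: selected for any non-const object,
    // even if the caller only reads the returned character
    char& operator[](std::size_t n) { ++m_writeCalls; return m_text[n]; }

    int ReadCalls() const { return m_readCalls; }
    int WriteCalls() const { return m_writeCalls; }

private:
    std::string m_text;
    mutable int m_readCalls;
    int m_writeCalls;
};
```

Reading {\tt p[0]} from a non-const {\tt Probe p} increments the write counter,
while the same read through a {\tt const Probe\&} increments the read counter.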

\subsection{Tuning wxString for your application}\label{wxstringtuning}

\normalbox{{\bf Note:} this section is strictly about performance issues and is
absolutely not necessary to read for using the wxString class. Please skip it unless
you feel familiar with profilers and related tools. If you do read it, please
also read the preceding section about
\helpref{reference counting}{wxstringrefcount}.}

For performance reasons wxString doesn't allocate exactly the amount of
memory needed for each string. Instead, it adds a small amount of space to each
allocated block, which allows it to avoid reallocating memory (a relatively
expensive operation) too often when, for example, a string is constructed by
successively adding one character at a time to it, as in:

\begin{verbatim}
// delete all vowels from the string
wxString DeleteAllVowels(const wxString& original)
{
    wxString result;

    size_t len = original.length();
    for ( size_t n = 0; n < len; n++ )
    {
        if ( strchr("aeuio", tolower(original[n])) == NULL )
            result += original[n];
    }

    return result;
}
\end{verbatim}

This is quite a common situation and not allocating extra memory at all would
lead to very bad performance in this case because there would be as many memory
(re)allocations as there are consonants in the original string. Allocating a lot
of extra memory would improve the speed in this situation, but because of
the great number of wxString objects typically used in a program it would also
increase the memory consumption too much.

The very best solution in precisely this case would be to use the
\helpref{Alloc()}{wxstringalloc} function to preallocate, for example, len bytes
from the beginning - this will lead to exactly one memory allocation being
performed (because the result is at most as long as the original string).
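
A preallocating variant of the function might look like the following sketch. It
is written for std::string, whose {\tt reserve()} plays the role assumed here for
{\it Alloc()}; with wxString the corresponding line would call {\it Alloc(len)}
instead:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// delete all vowels from the string, preallocating the result once
std::string DeleteAllVowels(const std::string& original)
{
    const std::string vowels("aeiou");

    std::string result;
    std::string::size_type len = original.length();

    // the result can never be longer than the original, so one
    // allocation of len characters is always sufficient
    result.reserve(len);

    for ( std::string::size_type n = 0; n < len; n++ )
    {
        char lower = static_cast<char>(
            std::tolower(static_cast<unsigned char>(original[n])));
        if ( vowels.find(lower) == std::string::npos )
            result += original[n];
    }

    return result;
}
```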

However, using Alloc() is tedious and so wxString tries to do its best. The
default algorithm assumes that memory allocation is done in granularity of at
least 16 bytes (which is the case on almost all widespread platforms) and so
nothing is lost if the amount of memory to allocate is rounded up to the next
multiple of 16. This way, no memory is lost and 15 out of 16 iterations in the
example above won't allocate memory but use the already allocated pool.
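
The effect of the rounding can be sketched numerically. The two functions below
are hypothetical helpers written for this illustration, not code taken from
string.cpp, and they ignore details such as the terminating NUL and the string
header:

```cpp
#include <cassert>
#include <cstddef>

// Rounds n up to the next multiple of 'granularity' (16 by default).
std::size_t RoundUpToGranularity(std::size_t n, std::size_t granularity = 16)
{
    return (n + granularity - 1) / granularity * granularity;
}

// Counts how many (re)allocations happen when a string grows from empty
// to nChars characters, one character at a time.
std::size_t CountAllocations(std::size_t nChars, std::size_t granularity = 16)
{
    std::size_t capacity = 0;
    std::size_t allocations = 0;

    for ( std::size_t len = 1; len <= nChars; len++ )
    {
        if ( len > capacity )
        {
            // out of room: allocate a new, rounded up block
            capacity = RoundUpToGranularity(len, granularity);
            allocations++;
        }
    }

    return allocations;
}
```

Under these assumptions, appending 32 characters one by one costs 2 allocations
with a granularity of 16, compared with 32 allocations if exactly the needed
amount were allocated each time.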

The default approach is quite conservative. Allocating more memory may bring
important performance benefits for programs using (relatively) few very long
strings. The amount of memory allocated is configured by the setting of {\it
EXTRA\_ALLOC} in the file string.cpp during compilation (be sure to understand
why its default value is what it is before modifying it!). You may try setting
it to a greater amount (say twice nLen) or to 0 (to see the performance degradation
which will follow) and analyse the impact of it on your program. If you do it,
you will probably find it helpful to also define the WXSTRING\_STATISTICS symbol
which tells the wxString class to collect performance statistics and to show
them on stderr on program termination. This will show you the average length of
the strings your program manipulates, their average initial length and also the
percentage of times when memory wasn't reallocated when string concatenation was
done but the already preallocated memory was used (this value should be about
98\% for the default allocation policy; if it is less than 90\% you should
really consider fine tuning wxString for your application).

It goes without saying that a profiler should be used to measure the precise
difference the change to EXTRA\_ALLOC makes to your program.