Document QString's UTF-8 conversion behaviors

QString has not handled Unicode non-characters specially since Qt 5.2
(commit 9327bc87c3), so this part of the documentation was stale.

Since Qt 5.3 (commit 8dd47e34b9), QString inserts one replacement
character for each byte that can't be decoded properly.

[ChangeLog][Important Behavior Changes][UTF-8 decoding] The QString
UTF-8 decoder changed behavior slightly: when it encounters invalid
sequences, it will insert one replacement character per byte that is
invalid, instead of one replacement character for the whole invalid
length.
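
To illustrate the per-byte replacement described above, here is a minimal
standalone sketch. It is not Qt's decoder: the function name
decodeUtf8PerByteReplacement and its use of std::u32string are assumptions
made for the example. It only mirrors the documented behavior, emitting
U+FFFD once for every byte that cannot be decoded.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Qt): decodes UTF-8 into UTF-32 and
// emits one U+FFFD replacement character per undecodable byte.
std::u32string decodeUtf8PerByteReplacement(const std::string &in)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;

    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        if (lead < 0x80) {                  // plain ASCII
            out.push_back(lead);
            ++i;
            continue;
        }

        std::size_t extra = 0;              // continuation bytes expected
        char32_t cp = 0;                    // code point being assembled
        char32_t min = 0;                   // smallest non-overlong value
        if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(replacement);
            ++i;
            continue;
        }

        // The sequence must fit in the input and consist of continuations.
        bool ok = i + extra < in.size();
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                ok = false;
            else
                cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates and values past U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) {
            out.push_back(cp);
            i += extra + 1;
        } else {
            // Replace only the lead byte; the bytes after it are examined
            // again on their own, so each invalid byte yields one U+FFFD.
            out.push_back(replacement);
            ++i;
        }
    }
    return out;
}
```

With this sketch, the input bytes 'a' 0xE2 0x82 'z' (a three-byte sequence
cut short) decode to 'a', U+FFFD, U+FFFD, 'z': two invalid bytes give two
replacement characters rather than one for the whole invalid run.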

Change-Id: Ia4ec78afded9445bbe937311d6be80f71bd1a55f
Reviewed-by: Richard J. Moore <rich@kde.org>
Reviewed-by: Olivier Goffart <ogoffart@woboq.com>
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
Thiago Macieira 2014-04-04 10:34:15 -07:00 committed by The Qt Project
parent f56ef579ba
commit bbf37b61d0


@@ -4331,14 +4331,6 @@ QByteArray QString::toLocal8Bit_helper(const QChar *data, int size)
UTF-8 is a Unicode codec and can represent all characters in a Unicode
string like QString.
However, in the Unicode range, there are certain codepoints that are not
considered characters. The Unicode standard reserves the last two
codepoints in each Unicode Plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
U+2FFFE, etc.), as well as 32 codepoints in the range U+FDD0..U+FDEF,
inclusive, as non-characters. If any of those appear in the string, they
may be discarded and will not appear in the UTF-8 representation, or they
may be replaced by one or more replacement characters.
\sa fromUtf8(), toLatin1(), toLocal8Bit(), QTextCodec
*/
@@ -4493,10 +4485,10 @@ QString QString::fromLocal8Bit_helper(const char *str, int size)
sequences, non-characters, overlong sequences or surrogate codepoints
encoded into UTF-8.
Non-characters are codepoints that the Unicode standard reserves and must
not be used in text interchange. They are the last two codepoints in each
Unicode Plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, etc.), as well
as 32 codepoints in the range U+FDD0..U+FDEF, inclusive.
This function can be used to process incoming data incrementally as long as
all UTF-8 characters are terminated within the incoming data. Any
unterminated characters at the end of the string will be replaced or
suppressed. In order to do stateful decoding, please use \l QTextDecoder.
\sa toUtf8(), fromLatin1(), fromLocal8Bit()
*/
@@ -9517,14 +9509,6 @@ QByteArray QStringRef::toLocal8Bit() const
UTF-8 is a Unicode codec and can represent all characters in a Unicode
string like QString.
However, in the Unicode range, there are certain codepoints that are not
considered characters. The Unicode standard reserves the last two
codepoints in each Unicode Plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
U+2FFFE, etc.), as well as 16 codepoints in the range U+FDD0..U+FDDF,
inclusive, as non-characters. If any of those appear in the string, they
may be discarded and will not appear in the UTF-8 representation, or they
may be replaced by one or more replacement characters.
\sa toLatin1(), toLocal8Bit(), QTextCodec
*/
QByteArray QStringRef::toUtf8() const