Changed the processing of non-character code handling in the UTF8 codec.
Non-character codes are now accepted in QStrings, QUrls and QJson strings.
Unit tests were adapted accordingly.
For more info about non-character codes,
see: http://www.unicode.org/versions/corrigendum9.html
[ChangeLog][QtCore][QUtf8]
UTF-8 now accepts non-character unicode points; these are not replaced
by the replacement character anymore
[ChangeLog][QtCore][QUrl]
QUrl now fully accepts non-character unicode points; they are encoded as
percent characters; they can also be pretty decoded
[ChangeLog][QtCore][QJson]
The Writer and the Parser now fully accept non-character unicode points.
Change-Id: I77cf4f0e6210741eac8082912a0b6118eced4f77
Task-number: QTBUG-33229
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
The old code smply copied 100 shorts from the pointer passed into the
ushortarray ctor, regardless of the actual bounds of the original array.
Fix by making the ctor take the array by deference, deducing the size
as a template parameter, and only copying that much.
Fixes asan trace:
==18660==ERROR: AddressSanitizer: memcpy-param-overlap: memory ranges [0x7fff3c56de00,0x7fff3c56dec8) and [0x7fff3c56dd60, 0x7fff3c56de28) overlap
#0 0x457161 in memcpy asan_interceptors.cc:330
#1 0x4c40fe in ushortarray::ushortarray(unsigned short*) qtbase/tests/auto/corelib/io/qurlinternal/tst_qurlinternal.cpp:62
#2 0x4b0437 in ushortarray::ushortarray(unsigned short*) qtbase/tests/auto/corelib/io/qurlinternal/tst_qurlinternal.cpp:63
#3 0x47b643 in tst_QUrlInternal::idna_testsuite_data() qtbase/tests/auto/corelib/io/qurlinternal/tst_qurlinternal.cpp:119
...
Change-Id: Ie497bc8d337bc680a562482ca71ace535797ffb3
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
It's a good practice to always replace bad UTF-8 sequences with the
replacement character. It could be considered a security issue too.
Change-Id: I9e7d72e4c4102cdb8334449b5e7f882228a9048f
Reviewed-by: David Faure (KDE) <faure@kde.org>
So far, this function hasn't been used for input coming in from the
user, so it wasn't necessary. But we may want to do it, or we may
already be doing it accidentally somewhere that isn't triggering the
failed assertions during unit testing.
So let's be on the safe side and allow it. And test it too.
Change-Id: Ib63addd8da468ad6908278d07a4829f1bdc26a07
Reviewed-by: David Faure (KDE) <faure@kde.org>
Edge case: a > that should have been >=. Without it, we never ran the
rest of the IDN nameprepping.
Change-Id: I2276d660de3a70d0c561bb18816820d9a0f47e77
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
RFC 3454 says about prohibited characters (section 2, "Preparation
Overview"):
3) Prohibit -- Check for any characters that are not allowed in the
output. If any are found, return an error. This is described in
section 5.
In other words, we mustn't simply strip the output of prohibited
characters. We must generate an error if they are present. We do that by
clearing the data.
We already had tests for prohibited output, but they were
indistinguishable from being stripped. So instead add some extra
characters so that we can tell whether the label was cleared.
Change-Id: I2d95217c27be5e2d54deed0036cb009e3b7f4886
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Most notably, .com and .net now may contain non-ASCII characters.
list has been generated from
http://www.mozilla.org/projects/security/tld-idn-policy-list.html
Change-Id: Idc3191dc782bc4173ccb19b4bc81f4f061ca7999
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Change copyrights and license headers from Nokia to Digia
Change-Id: If1cc974286d29fd01ec6c19dd4719a67f4c3f00e
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
Reviewed-by: Sergio Ahumada <sergio.ahumada@digia.com>
Changing it outside of the test function definition to avoid running
empty/inapplicable test functions.
Change-Id: I713560cde7f715696984ed082d682900f5f1bcdd
Reviewed-by: Qt Doc Bot <qt_docbot@qt-project.org>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Caroline Chao <caroline.chao@nokia.com>
The asymmetry is intentional: the getters can use toLatin1() because the
called functions, with a QUrl::FullyEncoded parameter, return ASCII
only. This gives a small performance improvement over the need to run
the UTF-8 encoder.
However, the data passed to setters could contain non-ASCII binary data,
in addition to the percent-encoded data. We can't use fromUtf8 because
it's binary and we can't use toPercentEncoded because it already encoded.
Change-Id: I5ecdb49be5af51ac86fd9764eb3a6aa96385f512
Reviewed-by: David Faure <faure@kde.org>
Just in case someone (like me) changes the function signatures or adds
new functions.
Change-Id: I1025fea012d95ffe89acaf799aa58fd2b0babc80
Reviewed-by: David Faure <faure@kde.org>
Qt 5.0 beta requires changing the default to the 5.0 API, disabling
the deprecated code. However, tests should test (and often do) the
compatibility API too, so turn it back on.
Task-number: QTBUG-25053
Change-Id: I8129c3ef3cb58541c95a32d083850d9e7f768927
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
Reviewed-by: Olivier Goffart <ogoffart@woboq.com>
This allows the QUrl component getters to return fully decoded data,
like they did in Qt 4. This is necessary for some use-cases where the
component like the user name, password or path are used outside the
context of a URL. In those contexts, the percent-encoded data makes no
sense, and the loss of data of what could be represented in a URL is
acceptable.
Also take the opportunity to expand the documentation of those getter
methods, explaining what the options argument does.
Discussed-on: http://lists.qt-project.org/pipermail/development/2012-May/003811.html
Change-Id: I89f743cde78c02f169c88314bff0768714341419
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
Reviewed-by: David Faure <faure@kde.org>
Reviewed-by: Shane Kearns <shane.kearns@accenture.com>
Since we're about to introduce QUrl::FullyDecoded, this
QUrl::MostDecoded value would be confusing. Replace its uses with what
was intended at the point in question.
Change-Id: Iefd87bc33d37bace507c5cb0f206fa902e08e2df
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
Reviewed-by: David Faure <faure@kde.org>
Reviewed-by: Shane Kearns <shane.kearns@accenture.com>
This was trying all the possibilities by brute force, but it turns out
that some combinations are not valid so they should not be
tested. What's more, it was using old values of the flags, so this was
actually testing nothing.
Change-Id: I6c2f5230d240fc23418df2d3a1ca905dbc47dd10
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
Reviewed-by: David Faure <faure@kde.org>
Reviewed-by: Shane Kearns <shane.kearns@accenture.com>
By having the default value equal to zero, we follow the principle of
least surprise. For example, if we had
url.path()
and we refactored to
url.path(QUrl::DecodeSpaces)
Then instead of ensuring spaces are decoded, we make spaces the only
thing encoded (unicode, delimiters and reserved characters are
encoded).
Besides, modifying the default can only be used to encode something
that wasn't encoded previously, so having the enums as Encode makes
more sense.
As a side-effect, toEncoded() does not support any extra encoding
options.
Change-Id: I2624ec446e65c2d979e9ca2f81bd3db22b00bb13
Reviewed-by: Shane Kearns <shane.kearns@accenture.com>
I don't know if the bug is in moc or in qmake. But it bails out trying
to parse the .cpp file after the
tst_QUrlInternal::nameprep_testsuite_data function. If the #include is
placed above, it works. If it's placed below, it doesn't.
Change-Id: Ide554aa5aa3f1999e29604ba6d25ccdb09f6ef28
Reviewed-by: Marius Storm-Olsen <marius.storm-olsen@nokia.com>
Reviewed-by: Oswald Buddenhagen <oswald.buddenhagen@nokia.com>
Change it to operate on QChar pointers, which gains a little in
performance. This also avoids unnecessary detaching in the QString
source.
In addition, make the output be appended to an existing QString. This
will be useful later when we're reconstructing a URL from its
components.
Change-Id: I7e2f64028277637bd329af5f98001ace253a50c7
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
The reason for this change is that the strict parser made little sense
to exist. What would the recoder do if it was passed an invalid
string?
I believe that the tolerant recoder is more efficient than the
correcting code followed by the strict recoder. This makes the recoder
more complex and probably a little less efficient, but it's better in
the common case (tolerant that doesn't need fixes) and in the worst
case (needs fixes).
Change-Id: I68a0c9fda6765de05914cbd6ba7d3cea560a7cd6
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
This one function is an all-in-one:
- UTF-8 encoder
- UTF-8 decoder
- percent encoder
- percent decoder
The next step is add the ability to modify the behaviour, by telling
the function what else it must encode or decode and what it should
leave untouched.
Change-Id: I997eccfd2f9ad8487305670b18d6c806f4cf6717
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
These functions are now aliases to {to,from}Ace, which are usually
what you want. The original functions from Qt 4.0 had the wrong
semantics and wrong name. The new ones from Qt 4.2 execute the ACE
processing from IDNA (specifically, the ToASCII and ToUnicode
operations described in the RFC).
But so as not to be without tests, export the tests in unit testing
environment and test the punycode roundtrip. Note that the
tst_QUrl::idna_test_suite test tests *only* the Punycode roundtrip,
not the nameprepping.
Change-Id: I9b95b4bd07b4425344a5c6ef5cce7cfcb9846d3e
Reviewed-by: João Abecasis <joao.abecasis@nokia.com>
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
Reviewed-by: David Faure <faure@kde.org>