UAX #29 in Unicode 11 changed the EGC algorithm to its current form.
Although Qt has upgraded the Unicode tables all the way up to
Unicode 13, the algorithm has never been adapted; in other words,
it has been working by chance for years. Luckily, MOST
of the cases were dealt with correctly, but emoji handling
actually manages to break it.
This commit:
* Adds parsing of emoji-data.txt into the unicode table generator.
That is necessary to extract the Extended_Pictographic property,
which is used by the EGC algorithm.
* Regenerates the tables.
* Removes some obsolete grapheme cluster break properties, and
adds the ones introduced in the meantime.
* Rewrites the EGC algorithm according to Unicode 13. This is
done by greatly simplifying the lookup table. Some rules (GB11,
GB12, GB13) can't be expressed by the table alone, so some
hand-rolled code is necessary for them.
* Thanks to these fixes, the complete upstream GraphemeBreakTest
now passes. Remove the "edited" version that ignored some rows
(because they were failing).
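To illustrate why GB12 and GB13 resist a pure table lookup: whether two adjacent Regional_Indicator characters may be split depends on how many Regional_Indicators precede them, which a table indexed only by the two neighbouring classes cannot know. The following is a minimal sketch of that counting step, not the code from this commit; isRegionalIndicator() and breakAllowedBetweenRIs() are hypothetical names, and every other UAX #29 rule is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Regional_Indicator code points are U+1F1E6..U+1F1FF.
inline bool isRegionalIndicator(char32_t c)
{
    return c >= 0x1F1E6 && c <= 0x1F1FF;
}

// Hypothetical helper: given that text[pos - 1] and text[pos] are both
// Regional_Indicators, decide whether a grapheme boundary is allowed
// between them. GB12/GB13 forbid the break when an odd number of RIs
// ends at pos - 1, because text[pos] then completes a flag pair.
inline bool breakAllowedBetweenRIs(const std::u32string &text, std::size_t pos)
{
    assert(pos > 0 && pos < text.size());
    assert(isRegionalIndicator(text[pos - 1]) && isRegionalIndicator(text[pos]));

    std::size_t run = 0; // length of the RI run ending at pos - 1
    for (std::size_t i = pos; i > 0 && isRegionalIndicator(text[i - 1]); --i)
        ++run;
    return run % 2 == 0;
}
```

For a string of four Regional_Indicators (two flags), the only allowed boundary is in the middle, at index 2.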
Change-Id: Iaa07cb2e6d0ab9deac28397f46d9af189d2edf8b
Pick-to: 6.1 6.0 5.15
Fixes: QTBUG-92822
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
If one "accidentally" uses a release build of the unicode tool,
the asserts within it won't fire. Enable them in all cases.
Change-Id: I9d63641dc6d6d2e5805b61b36f8c28e624b25e12
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
The Unicode table code can only be safely called on valid code-points.
So code that calls it must only pass it valid Unicode data. The string
iterator's Unchecked methods only provide this guarantee
when the string being iterated is guaranteed to be valid UTF-16; while
client code should only use QString, QStringView and friends on valid
UTF-16 data, we have no way to be sure they have respected that.
So take the few extra cycles to actually check validity in the course
of iterating strings, when the resulting code-points are to be passed
to the Unicode table look-ups. Add tests that case mapping doesn't
access Unicode tables out of range (it'll trigger the new assertion).
Added some comments to qchar.h that helped me understand surrogates.
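A rough sketch of what checking validity while iterating can look like, written against plain std::u16string rather than the real string iterator. The function name decodeChecked() and the choice of U+FFFD as the substitute for lone surrogates are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool isHighSurrogate(char16_t u) { return (u & 0xFC00) == 0xD800; }
inline bool isLowSurrogate(char16_t u) { return (u & 0xFC00) == 0xDC00; }

// Hypothetical checked decoder: every value it returns is a valid code
// point, so it is safe to use for a Unicode table look-up. Lone
// surrogates are replaced by U+FFFD (an assumption of this sketch).
inline std::u32string decodeChecked(const std::u16string &in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        if (isHighSurrogate(u) && i + 1 < in.size() && isLowSurrogate(in[i + 1])) {
            const char32_t high = u;
            const char32_t low = in[i + 1];
            // Combine the pair; the constant folds both surrogate offsets
            // and the 0x10000 base into a single subtraction.
            out.push_back((high << 10) + low - 0x35FDC00);
            ++i;
        } else if (isHighSurrogate(u) || isLowSurrogate(u)) {
            out.push_back(0xFFFD); // unpaired surrogate
        } else {
            out.push_back(u);
        }
    }
    return out;
}
```

The extra comparisons per code unit are the "few extra cycles" traded for never handing a surrogate or out-of-range value to the tables.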
Change-Id: Iec2c3106bf1a875bdaa1d622f6cf94d7007e281e
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Make clear why we don't need to assert against out-of-bounds accesses
in the generated code, provided the code point is within its bound.
(Using one table's early entries as indices into later in the same
table at which to look up indices into another table made it a little
hard to work out what was going on, especially as nothing told me
about the early / late distinction. Record what I discovered, to save
the next person to stumble into this some confusion.)
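For the next reader, a toy model of the early / late layout may help. This is only a sketch with made-up sizes and values (16 code points, blocks of four); the names in namespace toy are hypothetical and the real generated tables are far larger.

```cpp
#include <cassert>
#include <cstdint>

namespace toy {
constexpr unsigned Shift = 2;                // 4 code points per block
constexpr unsigned Mask = (1u << Shift) - 1;
constexpr char32_t LastCodePoint = 15;

// Entries 0..3 (the "early" part) hold the offset, within this same
// array, of the block for code points 4*i .. 4*i+3. The later entries
// are the blocks themselves; each value indexes into `properties`.
constexpr std::uint8_t trie[] = {
    4, 8, 4, 8,     // early entries: identical blocks are shared
    0, 0, 0, 0,     // block at offset 4
    1, 1, 2, 2,     // block at offset 8
};

constexpr int properties[] = { 10, 20, 30 }; // stand-ins for property records

inline int property(char32_t cp)
{
    assert(cp <= LastCodePoint); // the only check that is needed
    // cp >> Shift is below 4, so the first read stays in the early part;
    // the generator ensures each early entry plus Mask is still in range.
    return properties[trie[trie[cp >> Shift] + (cp & Mask)]];
}
} // namespace toy
```

Once the code point itself is bounded, both reads are in range by construction, which is why no further assertion is needed inside the look-up.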
Change-Id: I8e5771a7f3d70c1911aeae1b0cabe5c47bc7e9c7
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
They were only used by one function each, in unicodetables.cpp, so
don't need to be macros.
Change-Id: I3e7f9f661568862d0a0d265bb8f657a8e0782b13
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Eliminate some needless parentheses, tidy up some spacing and
indentation and split some long lines. Change first += after
declaration to initializer.
Change-Id: I05ff2a6337b7ed14e0a2dc9c03fc784c92b63515
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Qt6 makes sizes qsizetype; and one of these was already sizeof()-sized.
While qsizetype might not be ssize_t, it's at least no bigger, so we
can safely use its format specifier, with a suitable cast.
Change-Id: I433f654f6b139d74b4d5358b804b44ab1f0ada15
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Removed three warnings, rather than fixing them, as Konstantin Ritt
tells me they've been redundant since Unicode 6 or so.
Change-Id: I4507e852bceb08a0252c77a8b383aceac212aad9
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Don't include a QString::number() in a sum of QByteArray and C strings.
Change-Id: I7544e835fcf5625b1fe1ee2055a48600200daafd
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
* Assume UTF-8 on all Unix like systems
* Export some functions to be able to compile QTextCodec once
moved to Qt5Compat.
Task-number: QTBUG-75665
Change-Id: I52ec47a848bc0ba72e9c7689668b1bcc5d736c29
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
There is no reason to keep using our macro now that we have C++17.
The macro itself is left in for the time being, as well as its
detection logic, because it's needed for C code (not everything
supports C11 yet). A few more cleanups will arrive in the next few
patches.
Note that this is a mere search/replace; some places were using
double braces to work around the presence of commas in a macro, and
no attempt has been made to fix those.
tst_qglobal had just some minor changes to keep testing the macro.
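A small sketch of the comma problem mentioned above. MY_STATIC_ASSERT is a hypothetical stand-in for an assertion macro, and the sketch uses an extra pair of parentheses as the workaround, which is assumed to be the kind of doubling the note refers to.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a macro-based static assertion.
#define MY_STATIC_ASSERT(Condition) static_assert(bool(Condition), #Condition)

// The comma inside the template argument list would be read as a macro
// argument separator, so the condition needs an extra pair of parentheses.
MY_STATIC_ASSERT((std::is_same<std::pair<int, int>::first_type, int>::value));

// The keyword parses a full expression, so no workaround is needed.
static_assert(std::is_same<std::pair<int, int>::first_type, int>::value,
              "first_type of pair<int, int> is int");

constexpr bool pairFirstIsInt =
    std::is_same<std::pair<int, int>::first_type, int>::value;
```

After a plain search/replace the doubled delimiters remain valid but redundant, which is why they could be left alone.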
Change-Id: I1c1c397d9f3e63db3338842bf350c9069ea57639
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
There are (at least) two implementations of the low-level case-folding
algorithm, one of which (for QChar::toLower()) seems to be wrong (it
doesn't deal with special cases which expand to more than one code
point).
The algorithm is hidden in QString and entangled with the QString
detaching code, which makes reusing it much harder.
At the same time, the dependency of the algorithm on the unicode
tables makes exposing a non-allocating result type in the public API
hard. std::u16string would be an alternative if we can assure that all
implementations use SSO with at least four characters.
So, for the time being, leave this as internal API for use in an
upcoming QStringView::toLower() as well as case-insensitive hashing.
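To make the "non-allocating result type" idea concrete, here is a minimal sketch. CaseResult and toyLower() are hypothetical names, the capacity of four code units is an assumption taken from the SSO remark above, and the mapping covers only ASCII plus one special case instead of consulting the Unicode tables.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity result: no heap allocation, room for the
// longest expansion this sketch expects (four UTF-16 code units).
struct CaseResult
{
    char16_t chars[4];
    std::size_t size;
};

// Toy lower-casing covering ASCII plus one special case; a real
// implementation would consult the generated Unicode tables instead.
inline CaseResult toyLower(char16_t c)
{
    if (c >= u'A' && c <= u'Z')
        return { { char16_t(c + 0x20) }, 1 };
    if (c == 0x0130) // LATIN CAPITAL LETTER I WITH DOT ABOVE
        return { { u'i', char16_t(0x0307) }, 2 }; // "i" + COMBINING DOT ABOVE
    return { { c }, 1 };
}
```

A function returning a single character cannot represent the second case, which is the kind of expansion the per-character implementation is said to miss.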
Change-Id: Iabb2611846f6176776aa20e634f44d8464f3305c
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
This makes existing calls passing uint or ushort ambiguous, so
fix all the callers. There do not appear to be callers outside
QtBase. In fact, the ...BreakClass() functions appear to be
utterly unused.
Change-Id: I1c2251920beba48d4909650bc1d501375c6a3ecf
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Now that the standard gives us proper types for UTF-16 and UTF-32
characters, use them. Will eventually make the code much easier to
read than today, where uint could be an index as well as a char32_t.
It also ensures that the result of e.g. QChar::highSurrogate() can
still be implicitly converted to a QChar now that the
QChar(non-character-integral-types) ctors are being made explicit.
[ChangeLog][QtCore][QChar] All low-level functions
(e.g. highSurrogate()) now take and return char16_t instead of ushort
and char32_t instead of uint.
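As a sketch of how the typed signatures read, here are stand-alone versions of the surrogate arithmetic. They are free functions written for illustration, not the QChar members themselves; the types alone show which values are code points and which are UTF-16 code units.

```cpp
#include <cassert>

// char32_t in: a code point. char16_t out: a UTF-16 code unit.
constexpr bool requiresSurrogates(char32_t ucs4) { return ucs4 >= 0x10000; }

constexpr char16_t highSurrogate(char32_t ucs4)
{
    return char16_t((ucs4 >> 10) + 0xD7C0);
}

constexpr char16_t lowSurrogate(char32_t ucs4)
{
    return char16_t(ucs4 % 0x400 + 0xDC00);
}

static_assert(highSurrogate(0x1F600) == 0xD83D, "high half of U+1F600");
static_assert(lowSurrogate(0x1F600) == 0xDE00, "low half of U+1F600");
```

With uint and ushort the same signatures would not distinguish a code point from an index or a length.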
Change-Id: I9cd8ebf6fb998fe1075dae96c7c4484a057f0b91
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Include WordBreakTest.html, since a test uses sample strings from it,
albeit without actually reading the file.
Had to comment out more of the new tests, as at Revision 24, pending
an update to harfbuzz and the text boundary detection code.
Task-number: QTBUG-79631
Task-number: QTBUG-79418
Task-number: QTBUG-82747
Change-Id: I0082294b09d67ffdc6a9b5c15acf77ad3b86f65f
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Put blank lines before the final Num*Classes entries in enums, to set
them off visibly from the "real" members. Moved some oddly placed
commas to the ends of preceding lines, so that later additions can
just add lines (with comma on end) without having to modify the
preceding line while doing so.
Change-Id: I5188dc25af9e4c17a1882fd9dab070e88013060b
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Also remove two stray commas pointed out in code-review and some
others noticed on checking for similar.
This amends commit c3eb521a0f.
Change-Id: If20c5146b740defe8d25ff61d399031b5c66ded1
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Had to teach the update program to accept category Lm as for
Joining_Transparent, for the sake of a new ArabicShaping.txt entry.
Added three new Unicode versions, several new scripts and a new
word-break class.
Updated UCD's test data for tst_QTextBoundaryFinder. This left 57
tests failing; I have commented out the data rows for those tests,
pending someone with more knowledge addressing this.
Task-number: QTBUG-79631
Task-number: QTBUG-79418
Change-Id: Ic33d3b3551195d47a84d98e84020f57a68f0b201
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
The Unicode data tables moved with QString and friends.
So did the locale data generated from CLDR.
This amends commit a9aa206b7b.
Change-Id: If12f0420b559dcb78993adc00e9f39751bca684a
Reviewed-by: Volker Hilsheimer <volker.hilsheimer@qt.io>
Instead of four pairs of :1 :15 bit fields, use an array of four :1,
:15 structs. This allows replacing the case folding traits classes
with a simple enum that indexes into said array.
I don't know what the WASM #ifdef'ed code is supposed to effect (a :0
bit-field is only useful to separate adjacent bit-fields into separate
memory locations for multi-threading), but I thought it safer to leave
it in, and that means the array must be a 64-bit block of its own, so
I had to move two fields around.
Saves ~4.5KiB in text size on optimized GCC 10 LTO Linux AMD64 builds.
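A reduced sketch of the layout described above, assuming field names (special, diff) and an enum Case that are illustrative only; ToyProperties and convertCase() are hypothetical, and a set special bit is simply treated as "leave unchanged" here.

```cpp
#include <cassert>

enum Case { LowerCase, UpperCase, TitleCase, CaseFold, NumCases };

struct ToyProperties
{
    struct {
        unsigned short special : 1; // set when the mapping is not a plain offset
        signed short diff : 15;     // offset to add to the code point
    } cases[NumCases];
};

// One function serves all four mappings; the enum picks the array slot,
// so no per-mapping traits class is needed.
inline char32_t convertCase(char32_t cp, const ToyProperties &p, Case which)
{
    const auto &info = p.cases[which];
    return info.special ? cp : char32_t(int(cp) + info.diff);
}
```

Because the mapping is selected by a run-time index instead of a template parameter, the function is instantiated once rather than once per traits class.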
Change-Id: Ib52cd7706342d5227b50b57545d073829c45da9a
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
GCC doesn't like the sequence
: 5
: 5
: 8
: 6
: 8
and inserts a :6 padding between the :5 and the :8 and a :2 padding
between the :6 and the :8, growing the bitfield by 8 bits of embedded
padding and another byte to bring the struct back to sizeof % 2 == 0.
Fix by reshuffling the elements and adding a static_assert for the
next round.
Saves ~5KiB in QtCore executable size.
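The effect can be reproduced in miniature. The sketch below assumes unsigned short as the underlying type and uses placeholder field names a to e; bit-field layout is implementation-defined, so the sizes hold on the common GCC and Clang ABIs but are not guaranteed everywhere.

```cpp
#include <cassert>

// Field order from the message: a 16-bit unit cannot hold 5 + 5 + 8
// bits, nor 8 + 6 + 8, so padding is inserted twice.
struct Padded
{
    unsigned short a : 5;
    unsigned short b : 5;
    unsigned short c : 8;
    unsigned short d : 6;
    unsigned short e : 8;
};

// Reordered so the fields fill two 16-bit units exactly: 5 + 5 + 6, 8 + 8.
struct Packed
{
    unsigned short a : 5;
    unsigned short b : 5;
    unsigned short d : 6;
    unsigned short c : 8;
    unsigned short e : 8;
};

// Guards the layout at compile time, so a later reshuffle that
// reintroduces padding is caught immediately.
static_assert(sizeof(Packed) == 4, "Packed should fit in 32 bits");
```

Both structs carry the same 32 bits of payload; only the order decides whether they fit in 32 bits of storage.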
Change-Id: I4758a6f48ba389abc2aee92f60997d42ebb0e5b8
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
The first replacement had missed Objective-C++ code and some places
outside the src dir.
In C-files Q_DECL_NOTHROW is replaced with Q_DECL_NOEXCEPT as we still
need to turn it off when compiled in C mode, but can get rid of the old
NOTHROW moniker.
Change-Id: I6370f57066679c5120d0265a69e7e378e09d4759
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
The current state produces uncompilable code.
Change-Id: I9a68b61866a4a416335ed4d7204c58122803fb1c
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
This workaround was added directly to the generated files
at some point, and never to the generator, it seems. So, to avoid
losing the workaround the next time we regenerate, we need to
add it to the generator.
Task-number: QTBUG-74511
Change-Id: Ided1bd949234ba82df61c55891646823e7f72e80
Reviewed-by: Morten Johan Sørvig <morten.sorvig@qt.io>
Document how to do an update, fix the bit-rot that had crept into
main.cpp since last it was compiled, correct the qt_attribution.json
to use the actual version number of UCD (its Revision number) instead
of the (admittedly correlated) Unicode release number. Updated to
Release 22 (which came with Unicode 11.0.0) in the process; but this
doesn't change our actual qunicodetables.cpp (so is incidental).
Task-number: QTBUG-71281
Change-Id: Ieb7a6e1a4d49f639993f76ff82c8f12a572db3c3
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
This is the squashed diff from wip/webassembly to dev.
Done-with: Peng Wu <peng.wu@intopalo.com>
Done-with: Sami Enne <sami.enne@intopalo.com>
Done-with: Morten Johan Sørvig <morten.sorvig@qt.io>
Started-by: Andrew Knight <andrew.knight@intopalo.com>
Change-Id: I6562433c0a38d6ec49ab675e0f104f2665f3392d
Reviewed-by: Lorn Potter <lorn.potter@gmail.com>
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
Also adjusted the text segmentation and line break algorithms
so that they can handle the new data, and pass the test suite.
Change-Id: Ib727fd80003e34e96458d7a681996de3fa3691e7
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
The text segmentation data is not being updated in this change,
as it requires additional code changes. Updating those will
come in a follow-up commit.
Change-Id: I5d6b6bc96044e8dd0c25cf6f79756e7f68bf6e7c
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
The code that was supposed to initialize the joining property of
characters to the correct defaults was actually applied after reading
in the data from the Unicode file, and was in a couple of cases
overwriting explicitly specified data in ArabicShaping.txt
Task-number: QTBUG-63191
Change-Id: Ie35261039b2211a827322ca11afacd9555ccefc7
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
From Qt 5.7 -> tools & applications are licensed under GPL v3 with some
exceptions, see
http://blog.qt.io/blog/2016/01/13/new-agreement-with-the-kde-free-qt-foundation/
Updated license headers to use new GPL-EXCEPT header instead of LGPL21 one
(in those files which will be under GPL 3 with exceptions)
Change-Id: I42a473ddc97101492a60b9287d90979d9eb35ae1
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
and implement TitlecaseTraits as well,
to be re-used in the following commit.
Change-Id: I5c0bff42fe3b9bc9c2454cc16916cc2be87f604f
Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
Use static_cast instead of C-style cast;
ushort -> unsigned short;
Avoid an lvalue in qGetProp();
Minor styling fixes.
Change-Id: Ic6412e28ac9316a587d32d0e674d21f672967fd6
Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
* Georgian lari currency symbol
* A large collection of CJK unified ideographs
* Emoji symbols and symbol modifiers
* Letters to support the Ik language in Uganda, Kulango in
the Côte d’Ivoire, and other languages of Africa
* A set of lowercase Cherokee syllables, forming case pairs
with the existing Cherokee characters
* The Ahom script for support of the Tai Ahom language in India
* Arabic letters to support Arwi—the Tamil language written in the Arabic script
For more details, see http://www.unicode.org/versions/Unicode8.0.0/
[ChangeLog][QtCore] Unicode data updated to v.8.0
Change-Id: If255f95c9c45655b721369a116299da3cabbba0a
Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
An executable .sh file should have its #! be the first two bytes.
Change-Id: I22c9eee6d349df743b02996bef0e093df3f42eb5
Reviewed-by: Frederik Gladhorn <frederik.gladhorn@theqtcompany.com>
Reviewed-by: Jędrzej Nowacki <jedrzej.nowacki@theqtcompany.com>
* Two newly adopted currency symbols:
the Azerbaijan manat and the Russia ruble
* Pictographic symbols (including many emoji), geometric symbols,
arrows, and ornaments originating from the Wingdings and Webdings sets
* Twenty-three new lesser-used and historic scripts
extending support for written languages of North America, China, India,
other Asian countries, and Africa
* Letters used in Teuthonista and other transcriptional systems,
and a new notational set, Duployan
For more details, see http://www.unicode.org/versions/Unicode7.0.0/
The Properties struct's .*Diff members were narrowed down
to signed 15 bits and the unicodeVersion has been expanded to 8 bits.
[ChangeLog][QtCore] Unicode data updated to v.7.0
Change-Id: I93ab6f79fa3b05f61abc7279f1d046834c1c1a0b
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
Qt copyrights are now in The Qt Company, so we could update the source
code headers accordingly. In the same go we should also fix the links to
point to qt.io.
Outdated header.LGPL removed (use header.LGPL21 instead)
Old header.LGPL3 renamed to header.LGPL3-COMM to match actual licensing
combination. New header.LGPL-COMM taken into use in the file which was
using the old header.LGPL3 (src/plugins/platforms/android/extract.cpp)
Added new header.LGPL3 containing Commercial + LGPLv3 + GPLv2 license
combination
Change-Id: I6f49b819a8a20cc4f88b794a8f6726d975e8ffbe
Reviewed-by: Matti Paaso <matti.paaso@theqtcompany.com>
Yes, this is necessary. The noexcept operator looks for noexcept tagging,
not at the contents of the function to determine whether to return true.
The more conditionally-noexcept functions are used, the more important it
becomes that low-level functions are correctly marked noexcept. In that, it
is like constexpr.
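A short demonstration of the point, using made-up function names. Both low-level functions have identical bodies that cannot throw, yet only the tagged one lets a conditionally-noexcept caller become noexcept.

```cpp
#include <cassert>

inline int plainLookup(int x) { return x + 1; }           // cannot throw, not tagged
inline int taggedLookup(int x) noexcept { return x + 1; } // same body, tagged

// The operator inspects the declaration, not the function body.
static_assert(!noexcept(plainLookup(0)), "untagged functions count as throwing");
static_assert(noexcept(taggedLookup(0)), "tagged functions count as non-throwing");

// Conditionally-noexcept wrappers inherit whatever the callee declares.
inline int viaPlain(int x) noexcept(noexcept(plainLookup(x))) { return plainLookup(x); }
inline int viaTagged(int x) noexcept(noexcept(taggedLookup(x))) { return taggedLookup(x); }

static_assert(!noexcept(viaPlain(0)), "the missing tag propagates upwards");
static_assert(noexcept(viaTagged(0)), "the tag propagates upwards");
```

The missing tag on one leaf function is enough to make every conditional wrapper above it report that it may throw.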
Change-Id: I4bca178444d1fd7caf3a92f996b1536eebdb5014
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
The GCC documentation says that a const function is not allowed to read
global memory. This needs to be clarified: it's not allowed to read RW
global memory. It's fine to read read-only memory, as that is equivalent
to just pure code.
The QChar static out-of-line functions only look up a property of the
given Unicode character and always return the same value.
The only exception is the decomposition() function, which returns a
QString and is therefore not allowed to be marked const.
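A minimal sketch of the distinction, assuming a GCC-compatible compiler. TOY_CONST_FUNCTION, digitValues and toyDigitValue() are hypothetical names, not the macro or functions touched by this change.

```cpp
#include <cassert>

#if defined(__GNUC__)
#  define TOY_CONST_FUNCTION __attribute__((const))
#else
#  define TOY_CONST_FUNCTION
#endif

// Read-only data: it never changes, so reading it behaves like code.
static const unsigned char digitValues[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

// The result depends only on the argument and on constant data, so the
// compiler may merge repeated calls with the same argument.
TOY_CONST_FUNCTION inline int toyDigitValue(char32_t c)
{
    return (c >= U'0' && c <= U'9') ? digitValues[c - U'0'] : -1;
}
```

A function that builds and returns a heap-allocated string has effects beyond computing a value from its arguments, which is why such a function is left unmarked.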
Change-Id: Id36b2f84a1b8ff9db5acf1d4e59e8b3811068cff
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Giuseppe D'Angelo <giuseppe.dangelo@kdab.com>
This aims to distinguish joining types "L", "T", and "U" from just "U".
Unicode 6.3.0 has introduced a character with joining type "L" and
Unicode 7.0 will add a few more characters of joining type "L", so
we'll have to deal with it anyway.
[ChangeLog][QtCore][QChar] Added JoiningType enum and joiningType()
method that deprecates the old QChar::Joining enum and joining() method.
Change-Id: I4be3a3f745d944e689feb9b62d4ca86d1cf371b0
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Mongolian and Phags-pa characters have been given a Joining_Type
classification for contextual shaping. As a part of these additions,
one Phags-pa character has the Joining_Type value of L (Left Joining),
which no character had been assigned before.
* The unassigned code points in the Currency Symbols block have been
given the Bidi_Class property value ET and the Line_Break property
value PR, to help implementations support new currency symbols,
when they are encoded.
* Hebrew letters and basic punctuation marks have been assigned
the newly introduced Word_Break property values Hebrew_Letter,
Single_Quote, and Double_Quote.
* The Bidi_Class property has been extended with four new values
for directional isolates.
For more details, see http://www.unicode.org/versions/Unicode6.3.0/
Change-Id: Iad62d02edc58a8497898dcd6d6c70d5aece317ea
Reviewed-by: Lars Knoll <lars.knoll@digia.com>