UAX #29 in Unicode 11 changed the EGC algorithm to its current form.
Although Qt has upgraded the Unicode tables all the way up to
Unicode 13, the algorithm has never been adapted; in other words,
it has been working by chance for years. Luckily, MOST
of the cases were dealt with correctly, but emoji handling
actually manages to break it.
This commit:
* Adds parsing of emoji-data.txt into the unicode table generator.
That is necessary to extract the Extended_Pictographic property,
which is used by the EGC algorithm.
* Regenerates the tables.
* Removes some obsoleted grapheme cluster break properties, and
adds the ones added in the meanwhile.
* Rewrites the EGC algorithm according to Unicode 13. This is
done by simplifying a lot the lookup table. Some rules (GB11,
GB12, GB13) can't be done by the table alone so some hand-rolled
code is necessary in that case.
* Thanks to these fixes, the complete upstream GraphemeBreakTest
now passes. Remove the "edited" version that ignored some rows
(because they were failing).
Change-Id: Iaa07cb2e6d0ab9deac28397f46d9af189d2edf8b
Pick-to: 6.1 6.0 5.15
Fixes: QTBUG-92822
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Include WordBreakTest.html, since a test uses sample strings from it,
albeit without actually reading the file.
Had to comment out more of the new tests, as at Revision 24, pending
an update to harfbuzz and the text boundary detection code.
Task-number: QTBUG-79631
Task-number: QTBUG-79418
Task-number: QTBUG-82747
Change-Id: I0082294b09d67ffdc6a9b5c15acf77ad3b86f65f
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Had to teach the update program to accept category Lm as for
Joining_Transparent, for the sake of a new ArabicShaping.txt entry.
Added three new Unicode versions, several new scripts and a new
word-break class.
Updated UCD's test data for tst_QTextBoundaryFinder. This left 57
tests failing; I have commented out the data rows for those tests,
pending someone with more knowledge addressing this.
Task-number: QTBUG-79631
Task-number: QTBUG-79418
Change-Id: Ic33d3b3551195d47a84d98e84020f57a68f0b201
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
Also adjusted the text segmentation and line break algorithms
so that they can handle the new data, and pass the test suite.
Change-Id: Ib727fd80003e34e96458d7a681996de3fa3691e7
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
The text segmentation data is not being updated in this change,
as it requires additional code changes. Updating those will
come in a follow-up commit.
Change-Id: I5d6b6bc96044e8dd0c25cf6f79756e7f68bf6e7c
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
Version 6.2 of the Unicode Standard is a special release
dedicated to the early publication of the newly encoded Turkish lira sign.
In addition, there are some significant changes to the Unicode algorithms
for text segmentation and line breaking to improve breaking for emoji symbols.
For more details, see http://www.unicode.org/versions/Unicode6.2.0/
Change-Id: I21cfd4f307e41b41a19d36cce87f7a44c2661bc2
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
Instead of expanding the scripts table with script values for the code points
>= 0x10000, it has been merged with the properties table in order to
increase perfomance of the script itemization code (not affected yet).
(Stats: the properties table grew up in 97428-89800 = 7628 bytes;
the old scripts table was of size 7680 bytes)
The outdated ScriptsInitial.txt and ScriptsCorrections.txt file has been removed
(they were just empty, the "corrigendum" script corrections should be applied
to Scripts.txt directly, *no customization allowed*!).
More script testcases has been added - at least one per supported script.
Task-number: QTBUG-6530
Change-Id: I40a9e76f681e2dd552fd4c61af0808d043962e79
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
http://unicode.org/versions/corrigendum6.html:
> in Unicode 5.0, the list of characters with the Bidi_Mirrored property
> was made consistent for brackets and quotation marks, in preparation for
> new constraints on bidi mirroring. However, after publication of
> Unicode 5.0.0 it was discovered that this change adversely affected
> several quotation mark characters in deployed data.
Task-number: QTBUG-25169
Change-Id: Id49caf401af2d5a1e6dbcc32b2f350aa20b7f901
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
This is the beginning of revision history for this module. If you
want to look at revision history older than this, please refer to the
Qt Git wiki for how to use Git history grafting. At the time of
writing, this wiki is located here:
http://qt.gitorious.org/qt/pages/GitIntroductionWithQt
If you have already performed the grafting and you don't see any
history beyond this commit, try running "git log" with the "--follow"
argument.
Branched from the monolithic repo, Qt master branch, at commit
896db169ea224deb96c59ce8af800d019de63f12