Commit Graph

14 Commits

Author SHA1 Message Date
Ievgenii Meshcheriakov
2afe1a3c19 unicode: Generate tables for IDNA/UTS #46
Update the Unicode data processing tool to generate properties
and mapping tables needed to implement UTS #46
(https://unicode.org/reports/tr46/). The implementation extends
the standard to allow usage of underscores in URLs. This is done
for compatibility with DNS-SD and SMB protocols.

The data file needed to generate the new properties was taken from
https://www.unicode.org/Public/idna/13.0.0/IdnaMappingTable.txt

Task-number: QTBUG-85323
Change-Id: I2c303bf8a08aefb18a7491fb9b55385563bfa219
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2021-08-26 16:55:05 +02:00
Giuseppe D'Angelo
a794c5e287 Unicode: fix the extended grapheme cluster algorithm
UAX #29 in Unicode 11 changed the EGC algorithm to its current form.
Although Qt has upgraded the Unicode tables all the way up to
Unicode 13, the algorithm has never been adapted; in other words,
it has been working by chance for years. Luckily, MOST
of the cases were dealt with correctly, but emoji handling
actually manages to break it.

This commit:

* Adds parsing of emoji-data.txt into the unicode table generator.
  That is necessary to extract the Extended_Pictographic property,
  which is used by the EGC algorithm.

* Regenerates the tables.

* Removes some obsoleted grapheme cluster break properties, and
  adds the ones added in the meanwhile.

* Rewrites the EGC algorithm according to Unicode 13. This is
  done by simplifying a lot the lookup table. Some rules (GB11,
  GB12, GB13) can't be done by the table alone so some hand-rolled
  code is necessary in that case.

* Thanks to these fixes, the complete upstream GraphemeBreakTest
  now passes. Remove the "edited" version that ignored some rows
  (because they were failing).

Change-Id: Iaa07cb2e6d0ab9deac28397f46d9af189d2edf8b
Pick-to: 6.1 6.0 5.15
Fixes: QTBUG-92822
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2021-04-16 20:31:39 +02:00
Edward Welbourne
54f8be6cc0 Update UCD to Revision 26
Include WordBreakTest.html, since a test uses sample strings from it,
albeit without actually reading the file.

Had to comment out more of the new tests, as at Revision 24, pending
an update to harfbuzz and the text boundary detection code.

Task-number: QTBUG-79631
Task-number: QTBUG-79418
Task-number: QTBUG-82747
Change-Id: I0082294b09d67ffdc6a9b5c15acf77ad3b86f65f
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
2020-03-14 11:26:59 +01:00
Edward Welbourne
c3eb521a0f Update UCD data to Unicode 12.1.0's Revision 24
Had to teach the update program to accept category Lm as for
Joining_Transparent, for the sake of a new ArabicShaping.txt entry.
Added three new Unicode versions, several new scripts and a new
word-break class.

Updated UCD's test data for tst_QTextBoundaryFinder.  This left 57
tests failing; I have commented out the data rows for those tests,
pending someone with more knowledge addressing this.

Task-number: QTBUG-79631
Task-number: QTBUG-79418
Change-Id: Ic33d3b3551195d47a84d98e84020f57a68f0b201
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
2019-10-30 17:38:02 +01:00
Lars Knoll
41b4e154d6 Update Text segmentation and line break data to Unicode 10.0
Also adjusted the text segmentation and line break algorithms
so that they can handle the new data, and pass the test suite.

Change-Id: Ib727fd80003e34e96458d7a681996de3fa3691e7
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
2018-01-03 07:47:26 +00:00
Lars Knoll
8bfabb34de Update most Unicode data to version 10.0
The text segmentation data is not being updated in this change,
as it requires additional code changes. Updating those will
come in a follow-up commit.

Change-Id: I5d6b6bc96044e8dd0c25cf6f79756e7f68bf6e7c
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
2018-01-03 07:46:31 +00:00
Konstantin Ritt
a98b541f26 Update Unicode data files to v8.0
Change-Id: I0aa368cb07353924031a9af4f0bdc33692eb1053
Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
2015-11-05 08:24:58 +00:00
Konstantin Ritt
ecdd5648bd Update UCD source files to v7.0
Change-Id: I47277963c926128ad0c4ac5141835e767bb440a7
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
2015-03-27 16:39:53 +00:00
Konstantin Ritt
a6046be428 Update UCD source files up to Unicode 6.3.0
Change-Id: I9ab58a659af1e758b172a24aa95bce1fea89c33d
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
2014-01-14 15:38:43 +01:00
Konstantin Ritt
2672c4fa91 Update the Unicode Data and Algorithms up to Unicode 6.2
Version 6.2 of the Unicode Standard is a special release
dedicated to the early publication of the newly encoded Turkish lira sign.
In addition, there are some significant changes to the Unicode algorithms
for text segmentation and line breaking to improve breaking for emoji symbols.

For more details, see http://www.unicode.org/versions/Unicode6.2.0/

Change-Id: I21cfd4f307e41b41a19d36cce87f7a44c2661bc2
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
2012-10-09 03:04:41 +02:00
Konstantin Ritt
d8c04d60db Make QUnicodeTables::script() support SMP code points
Instead of expanding the scripts table with script values for the code points
>= 0x10000, it has been merged with the properties table in order to
increase perfomance of the script itemization code (not affected yet).
(Stats: the properties table grew up in 97428-89800 = 7628 bytes;
        the old scripts table was of size 7680 bytes)

The outdated ScriptsInitial.txt and ScriptsCorrections.txt file has been removed
(they were just empty, the "corrigendum" script corrections should be applied
to Scripts.txt directly, *no customization allowed*!).

More script testcases has been added - at least one per supported script.

Task-number: QTBUG-6530

Change-Id: I40a9e76f681e2dd552fd4c61af0808d043962e79
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
2012-06-14 05:22:13 +02:00
Konstantin Ritt
c9100bcce7 Update the Unicode data files up to v6.1.0
Change-Id: I20b94634b1f4ebff10757c2348cfdbbd906e8797
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
2012-06-10 15:57:54 +02:00
Konstantin Ritt
73b24486ed UCD-5.0: apply Corrigendum #6
http://unicode.org/versions/corrigendum6.html:
> in Unicode 5.0, the list of characters with the Bidi_Mirrored property
> was made consistent for brackets and quotation marks, in preparation for
> new constraints on bidi mirroring. However, after publication of
> Unicode 5.0.0 it was discovered that this change adversely affected
> several quotation mark characters in deployed data.

Task-number: QTBUG-25169
Change-Id: Id49caf401af2d5a1e6dbcc32b2f350aa20b7f901
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
2012-04-15 03:53:07 +02:00
Qt by Nokia
38be0d1383 Initial import from the monolithic Qt.
This is the beginning of revision history for this module. If you
want to look at revision history older than this, please refer to the
Qt Git wiki for how to use Git history grafting. At the time of
writing, this wiki is located here:

http://qt.gitorious.org/qt/pages/GitIntroductionWithQt

If you have already performed the grafting and you don't see any
history beyond this commit, try running "git log" with the "--follow"
argument.

Branched from the monolithic repo, Qt master branch, at commit
896db169ea224deb96c59ce8af800d019de63f12
2011-04-27 12:05:43 +02:00