qt5base-lts/util/unicode/README
Giuseppe D'Angelo a794c5e287 Unicode: fix the extended grapheme cluster algorithm
UAX #29 in Unicode 11 changed the EGC algorithm to its current form.
Although Qt has upgraded the Unicode tables all the way up to
Unicode 13, the algorithm has never been adapted; in other words,
it has been working by chance for years. Luckily, MOST
of the cases were dealt with correctly, but emoji handling
actually manages to break it.

This commit:

* Adds parsing of emoji-data.txt into the unicode table generator.
  That is necessary to extract the Extended_Pictographic property,
  which is used by the EGC algorithm.

* Regenerates the tables.

* Removes some obsolete grapheme cluster break properties, and
  adds the ones introduced in the meantime.

* Rewrites the EGC algorithm according to Unicode 13. This is
  done by greatly simplifying the lookup table. Some rules (GB11,
  GB12, GB13) can't be expressed by the table alone, so some
  hand-rolled code is necessary for them.

* Thanks to these fixes, the complete upstream GraphemeBreakTest
  now passes. Remove the "edited" version that ignored some rows
  (because they were failing).

Change-Id: Iaa07cb2e6d0ab9deac28397f46d9af189d2edf8b
Pick-to: 6.1 6.0 5.15
Fixes: QTBUG-92822
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2021-04-16 20:31:39 +02:00
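
As an illustration of why GB11, GB12 and GB13 need more than a pairwise
lookup table, here is a minimal Python sketch of the extra context those
rules require, following the rule text in UAX #29. It is not Qt's
implementation; the class labels and the function name
context_forbids_break are made up for this example.

```python
# Hypothetical class labels for illustration; not the generator's enum names.
RI, EXT_PICT, EXTEND, ZWJ, OTHER = "RI", "ExtPict", "Extend", "ZWJ", "Other"


def context_forbids_break(classes, i):
    """Return True if GB11 or GB12/GB13 forbid a break before classes[i].

    `classes` is a list of per-code-point class labels and 0 < i < len(classes).
    Only these three rules are modelled; every other rule is ignored here.
    """
    prev, cur = classes[i - 1], classes[i]

    # GB12/GB13: regional indicators pair up, so a break between two of
    # them is forbidden only when an odd number of RIs precedes position i.
    if prev == RI and cur == RI:
        run = 0
        j = i - 1
        while j >= 0 and classes[j] == RI:
            run += 1
            j -= 1
        return run % 2 == 1

    # GB11: ExtPict Extend* ZWJ x ExtPict. Look back past the ZWJ and any
    # Extend characters to see whether the sequence started with ExtPict.
    if prev == ZWJ and cur == EXT_PICT:
        j = i - 2
        while j >= 0 and classes[j] == EXTEND:
            j -= 1
        return j >= 0 and classes[j] == EXT_PICT

    return False
```

A pairwise table only sees "RI followed by RI" or "ZWJ followed by
ExtPict", so it cannot tell these cases apart without the look-back
shown above.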

The unicode tool in this directory generates the Unicode data in
src/corelib/text/.
To update:
* Find the data (UAX #44, UCD; not the XML version) at
ftp://www.unicode.org/Public/zipped/$Version/
* Unpack the zip file and replace each file in data/ with its new
version; the *BreakProperty.txt files are in auxiliary/ and
emoji-data.txt is in emoji/.
* In tst_QTextBoundaryFinder's data/ sub-directory, update its files
from the auxiliary/ sub-directory of the UCD data.
* If needed, add an entry to enum QChar::UnicodeVersion for the new
Unicode version.
* In that case, also update main.cpp's initAgeMap() and DATA_VERSION_S*
to match.
* Build this project. Its binary, unicode, ignores command-line
options and assumes it is being run from this directory. When run,
it produces lots of output. If it gets as far as updating
qunicodetables.cpp the output hopefully doesn't matter.
* It'll end prematurely with a qFatal() message if it needs updates,
either in main.cpp or in QChar:
  * "unassigned or unhandled age value:" update initAgeMap() and
    QChar::UnicodeVersion;
  * "Unhandled script property value:" update initScriptMap(),
    QChar::Script, qharfbuzzng.cpp's _qtscript_to_hbscript[] array and
    qfontconfigdatabase.cpp's specialLanguages;
  * "unassigned word break class:" update enum WordBreakClass,
    word_break_class_string and initWordBreak().
* Assertions or other qFatal()s may trigger: if so, study the code to
understand what's more complicated about this update; talk to the
people named in the git logs, maybe push a WIP to gerrit to solicit
advice. Some bit-field may need to be expanded, for example. In some
cases QChar may need additions to some of its enums.
* Build with the modified code, fix any compilation issues, and run
make check in suitable directories, including tst_QTextBoundaryFinder.
* That may have updated qtbase/src/corelib/text/qunicodetables.cpp; if
so, the update matters: be sure to commit the changes to data/ at the
same time and update text/qt_attribution.json to match. Use the UCD
Revision number, rather than the Unicode standard number, as the
Version there, even though qunicodetables.cpp uses the latter (see the
'UAX #44, UCD' page linked from https://www.unicode.org/ucd/ for the
table relating the two).
* If there are enum additions in qchar.h (public API), be sure to also
update the documentation in qchar.cpp for each affected enum,
respecting the existing ordering.
* If you don't normally build in the source tree, remember to delete
qtbase/.qmake.stash while you're cleaning up.
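
The file-replacement step above can be sketched as a small Python helper
that pairs each file in data/ with its counterpart in the unpacked
archive. This is only a sketch, assuming the archive keeps the same file
names and that the relevant files sit at its top level, in auxiliary/ or
in emoji/, as the list describes; plan_updates is a hypothetical name,
not an existing script in this directory.

```python
from pathlib import Path


def plan_updates(data_dir, ucd_dir):
    """Map each file name in data_dir to its replacement in the unpacked UCD.

    Files with no counterpart are mapped to None so they can be checked
    by hand instead of being silently skipped.
    """
    ucd = Path(ucd_dir)
    # Search order follows the README: top level, auxiliary/, then emoji/.
    search_dirs = [ucd, ucd / "auxiliary", ucd / "emoji"]
    plan = {}
    for old in sorted(Path(data_dir).iterdir()):
        if not old.is_file():
            continue
        candidates = (d / old.name for d in search_dirs)
        plan[old.name] = next((c for c in candidates if c.is_file()), None)
    return plan
```

Printing the returned mapping before copying anything gives a chance to
spot files that were renamed or moved between Unicode versions.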

The script writingSystems.sh generates a list of writing systems,
ostensibly as the basis for updating the QFontDatabase::WritingSystem
enum; however, its Release 20 output contains many more writing
systems than are present in that enum, suggesting it has not been run
in a very long time. Further research needed.
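
One way to start that research is to diff the script's output against
the enum. The Python sketch below assumes, without having checked, that
writingSystems.sh prints one writing system name per line and that the
enumerator names have been copied out of the header by hand;
missing_from_enum is a hypothetical helper.

```python
def missing_from_enum(script_output, enum_names):
    """Return generated names absent from the enum, sorted.

    Comparison ignores case; blank lines in the script output are skipped.
    """
    known = {name.casefold() for name in enum_names}
    generated = {line.strip() for line in script_output.splitlines()
                 if line.strip()}
    return sorted(name for name in generated if name.casefold() not in known)
```

The length of the returned list gives a rough measure of how far the
enum has drifted from the generated list.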