Commit Graph

62 Commits

Author SHA1 Message Date
Marc Mutz
19e7c0d2b5 QChar/QString: centralize case folding in qchar.cpp
There are (at least) two implementations of the low-level case-folding
algorithm, one of which (for QChar::toLower()) seems to be wrong (it
doesn't deal with special cases which expand to more than one code
point).

The algoithm hidden in QString and entangled with the QString
detaching code makes reusing the code much harder.

At the same time, the dependency of the algorithm on the unicode
tables makes exposing a non-allocating result type in the public API
hard. std::u16string would be an alternative if we can assure that all
implementations use SSO with at least four characters.

So, for the time being, leave this as internal API for use in an
upcoming QStringView::toLower() as well as case-insensitive hashing.

Change-Id: Iabb2611846f6176776aa20e634f44d8464f3305c
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2020-05-09 06:25:05 +00:00
Marc Mutz
7b04e0012b QUnicodeTables: port to charNN_t
This makes existing calls passing uint or ushort ambiguous, so
fix all the callers. There do not appear to be callers outside
QtBase. In fact, the ...BreakClass() functions appear to be
utterly unused.

Change-Id: I1c2251920beba48d4909650bc1d501375c6a3ecf
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2020-04-27 13:08:41 +02:00
Marc Mutz
20cdf807b1 QChar: port low-level functions from uint/ushort to char32/16_t
Now that the standard gives us proper types for UTF-16 and UTF-32
characters, use them. Will eventually make the code much easier to
read than today, where uint could be an index as well as a char32_t.

It also ensures that the result of e.g. QChar::highSurrogate() can
still be implicitly converted to a QChar now that the
QChar(non-characater-integral-types) ctors are being made explicit.

[ChangeLog][QtCore][QChar] All low-level functions
(e.g. highSurrogate()) now take and return char16_t instead of ushort
and char32_t instead of uint.

Change-Id: I9cd8ebf6fb998fe1075dae96c7c4484a057f0b91
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2020-04-24 12:45:53 +02:00
Edward Welbourne
54f8be6cc0 Update UCD to Revision 26
Include WordBreakTest.html, since a test uses sample strings from it,
albeit without actually reading the file.

Had to comment out more of the new tests, as at Revision 24, pending
an update to harfbuzz and the text boundary detection code.

Task-number: QTBUG-79631
Task-number: QTBUG-79418
Task-number: QTBUG-82747
Change-Id: I0082294b09d67ffdc6a9b5c15acf77ad3b86f65f
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
2020-03-14 11:26:59 +01:00
Edward Welbourne
65ea4948dc Unicode tables: minor prettification
Put blank lines before the final Num*Classes entries in enums, to set
them off visibly from the "real" members. Moved some oddly placed
commas to the ends of preceding lines, so that later additions can
just add lines (with comma on end) without having to modify the
preceding line while doing so.

Change-Id: I5188dc25af9e4c17a1882fd9dab070e88013060b
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2019-11-28 11:04:39 +01:00
Edward Welbourne
1a1718b342 Add missing docs for UCD additions at 5.15
Also remove two stray commas pointed out in code-review and some
others noticed on checking for similar.
This amends commit c3eb521a0f.

Change-Id: If20c5146b740defe8d25ff61d399031b5c66ded1
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
2019-11-28 11:04:27 +01:00
Edward Welbourne
c3eb521a0f Update UCD data to Unicode 12.1.0's Revision 24
Had to teach the update program to accept category Lm as for
Joining_Transparent, for the sake of a new ArabicShaping.txt entry.
Added three new Unicode versions, several new scripts and a new
word-break class.

Updated UCD's test data for tst_QTextBoundaryFinder.  This left 57
tests failing; I have commented out the data rows for those tests,
pending someone with more knowledge addressing this.

Task-number: QTBUG-79631
Task-number: QTBUG-79418
Change-Id: Ic33d3b3551195d47a84d98e84020f57a68f0b201
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
2019-10-30 17:38:02 +01:00
Marc Mutz
effbf147a4 QUnicodeTables: use array for case folding tables
Instead of four pairs of :1 :15 bit fields, use an array of four :1,
:15 structs.  This allows to replace the case folding traits classes
with a simple enum that indexes into said array.

I don't know what the WASM #ifdef'ed code is supposed to effect (a :0
bit-field is only useful to separate adjacent bit-field into separate
memory locations for multi-threading), but I thought it safer to leave
it in, and that means the array must be a 64-bit block of its own, so
I had to move two fields around.

Saves ~4.5KiB in text size on optimized GCC 10 LTO Linux AMD64 builds.

Change-Id: Ib52cd7706342d5227b50b57545d073829c45da9a
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
2019-09-04 16:35:37 +00:00
Marc Mutz
2737b5e36a QUnicodeTables: pack Properties struct
GCC doesn't like the sequence

   : 5
   : 5
   : 8
   : 6
   : 8

and inserts a :6 padding between the :5 and the :8 and a :2 padding
between the :6 and the :8, growing the bitfield by 8 bits of embedded
padding and another byte to bring the struct back to sizeof % 2 == 0.

Fix by reshuffling the elements and adding a static_assert for the
next round.

Saves ~5KiB in QtCore executable size.

Change-Id: I4758a6f48ba389abc2aee92f60997d42ebb0e5b8
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2019-09-04 10:06:19 +02:00
Edward Welbourne
a9aa206b7b Move text-related code out of corelib/tools/ to corelib/text/
This includes byte array, string, char, unicode, locale, collation and
regular expressions.

Change-Id: I8b125fa52c8c513eb57a0f1298b91910e5a0d786
Reviewed-by: Volker Hilsheimer <volker.hilsheimer@qt.io>
2019-07-10 17:05:30 +02:00
Sona Kurazyan
ff2b2032a0 Remove usages of deprecated APIs from QtAlgorithms
Task-number: QTBUG-76491
Change-Id: I9dab736a0cbd2e86588919640c26e8ce6b3674d0
Reviewed-by: Alex Blasche <alexander.blasche@qt.io>
Reviewed-by: Leena Miettinen <riitta-leena.miettinen@qt.io>
2019-06-29 21:58:36 +02:00
Allan Sandfeld Jensen
95f787bfdc Replace Q_DECL_NOTHROW with noexcept the remaining places
The first replacement had missed objective-C++ code some places ourside
the src dir.

In C-files Q_DECL_NOTHROW is replaced with Q_DECL_NOEXCEPT as we still
need to turn it off when compiled in C mode, but can get rid of the old
NOTHROW moniker.

Change-Id: I6370f57066679c5120d0265a69e7e378e09d4759
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2019-04-09 14:48:42 +00:00
Eskil Abrahamsen Blomfeldt
01380dc267 Remove broken code from unicode generator
The current state produces uncompilable code.

Change-Id: I9a68b61866a4a416335ed4d7204c58122803fb1c
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
2019-03-18 15:17:10 +00:00
Eskil Abrahamsen Blomfeldt
e59ba35f1b Wasm: Add workaround for Emscripten compiler again
This work around was added directly to the generated files
at some point, and never to the generator it seems. So to avoid
removing the workaround again when we regenerate the next time,
we need to add it.

Task-number: QTBUG-74511
Change-Id: Ided1bd949234ba82df61c55891646823e7f72e80
Reviewed-by: Morten Johan Sørvig <morten.sorvig@qt.io>
2019-03-18 15:16:53 +00:00
Edward Welbourne
d38f635355 Clean up and update Unicode character data 3rd-party infrastructure
Document how to do an update, fix the bit-rot that had crept into
main.cpp since last it was compiled, correct the qt_attribution.json
to use the actual version number of UCD (its Revision number) instead
of the (admittedly correlated) Unicode release number.  Updated to
Release 22 (which came with Unicode 11.0.0) in the process; but this
doesn't change our actual qunicodetables.cpp (so is incidental).

Task-number: QTBUG-71281
Change-Id: Ieb7a6e1a4d49f639993f76ff82c8f12a572db3c3
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
2018-11-11 22:09:27 +00:00
Morten Johan Sørvig
29c0377f07 WebAssembly for QtBase
This is the squashed diff from wip/webassembly to dev.

Done-with: Peng Wu <peng.wu@intopalo.com>
Done-with: Sami Enne <sami.enne@intopalo.com>
Done-with: Morten Johan Sørvig <morten.sorvig@qt.io>
Started-by: Andrew Knight <andrew.knight@intopalo.com>
Change-Id: I6562433c0a38d6ec49ab675e0f104f2665f3392d
Reviewed-by: Lorn Potter <lorn.potter@gmail.com>
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
2018-08-30 06:48:33 +00:00
Lars Knoll
41b4e154d6 Update Text segmentation and line break data to Unicode 10.0
Also adjusted the text segmentation and line break algorithms
so that they can handle the new data, and pass the test suite.

Change-Id: Ib727fd80003e34e96458d7a681996de3fa3691e7
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
2018-01-03 07:47:26 +00:00
Lars Knoll
8bfabb34de Update most Unicode data to version 10.0
The text segmentation data is not being updated in this change,
as it requires additional code changes. Updating those will
come in a follow-up commit.

Change-Id: I5d6b6bc96044e8dd0c25cf6f79756e7f68bf6e7c
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
2018-01-03 07:46:31 +00:00
Lars Knoll
59febb49e4 Fix a bug in the generation of the Unicode joining property data
The code that was supposed to initialize the joining property of
characters to the correct defaults was actually applied after reading
in the data from the Unicode file, and was in a couple of cases
overwriting explicitly specified data in ArabicShaping.txt

Task-number: QTBUG-63191
Change-Id: Ie35261039b2211a827322ca11afacd9555ccefc7
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2017-12-13 19:14:49 +00:00
Jani Heikkinen
f776595cc1 Updated license headers
From Qt 5.7 -> tools & applications are lisenced under GPL v3 with some
exceptions, see
http://blog.qt.io/blog/2016/01/13/new-agreement-with-the-kde-free-qt-foundation/

Updated license headers to use new GPL-EXCEPT header instead of LGPL21 one
(in those files which will be under GPL 3 with exceptions)

Change-Id: I42a473ddc97101492a60b9287d90979d9eb35ae1
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
2016-01-21 18:55:18 +00:00
Konstantin Ritt
c267646867 Move CasexTraits from qstring.cpp to qunicodetables_p.h
and implement TitlecaseTraits as well,
to be re-used in the following commit.

Change-Id: I5c0bff42fe3b9bc9c2454cc16916cc2be87f604f
Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
2015-11-16 14:34:04 +00:00
Konstantin Ritt
3a3e2193d7 QUnicodeTables generator: Minor clean-up & stricter checks
Use static_cast instead of C-style cast;
ushort -> unsigned short;
Avoid a lvalue in qGetProp();
Minor styling fixes.

Change-Id: Ic6412e28ac9316a587d32d0e674d21f672967fd6
Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
2015-11-13 20:42:20 +00:00
Konstantin Ritt
0e1f3aab11 Update Unicode data & algorithms up to v8.0
* Georgian lari currency symbol
* A large collection of CJK unified ideographs
* Emoji symbols and symbol modifiers
* Letters to support the Ik language in Uganda, Kulango in
  the Côte d’Ivoire, and other languages of Africa
* A set of lowercase Cherokee syllables, forming case pairs
  with the existing Cherokee characters
* The Ahom script for support of the Tai Ahom language in India
* Arabic letters to support Arwi—the Tamil language written in the Arabic script

For more details, see http://www.unicode.org/versions/Unicode8.0.0/

[ChangeLog][QtCore] Unicode data updated to v.8.0

Change-Id: If255f95c9c45655b721369a116299da3cabbba0a
Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
2015-11-05 08:25:02 +00:00
Konstantin Ritt
dec451cf67 Update Unicode data up to v7.0
* Two newly adopted currency symbols:
  the Azerbaijan manat and the Russia ruble
* Pictographic symbols (including many emoji), geometric symbols,
  arrows, and ornaments originating from the Wingdings and Webdings sets
* Twenty-three new lesser-used and historic scripts
  extending support for written languages of North America, China, India,
  other Asian countries, and Africa
* Letters used in Teuthonista and other transcriptional systems,
  and a new notational set, Duployan

For more details, see http://www.unicode.org/versions/Unicode7.0.0/

The Properties struct's .*Diff members were narrowed down
to signed 15 bits and the unicodeVersion has been expanded to 8 bits.

[ChangeLog][QtCore] Unicode data updated to v.7.0

Change-Id: I93ab6f79fa3b05f61abc7279f1d046834c1c1a0b
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
2015-03-27 16:40:06 +00:00
Konstantin Ritt
31cd326bec Report a correct Unicode version used by Qt
The version bump was forgotten in edfce46a6c

Change-Id: I99501bb314baab68f037417db4b3b14e11f2de02
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
2015-03-23 07:02:25 +00:00
Jani Heikkinen
83a5694dc2 Update copyright headers
Qt copyrights are now in The Qt Company, so we could update the source
code headers accordingly. In the same go we should also fix the links to
point to qt.io.

Outdated header.LGPL removed (use header.LGPL21 instead)

Old header.LGPL3 renamed to header.LGPL3-COMM to match actual licensing
combination. New header.LGPL-COMM taken in the use file which were
using old header.LGPL3 (src/plugins/platforms/android/extract.cpp)

Added new header.LGPL3 containing Commercial + LGPLv3 + GPLv2 license
combination

Change-Id: I6f49b819a8a20cc4f88b794a8f6726d975e8ffbe
Reviewed-by: Matti Paaso <matti.paaso@theqtcompany.com>
2015-02-11 06:49:51 +00:00
Marc Mutz
a627c8628f QUnicodeTables: mark all functions as nothrow
Yes, this is necessary. The noexcept operator looks for noexcept tagging,
not at the contents of the function to determine whether to return true.

The more conditionally-noexcept functions are used, the more important it
becomes that low-level functions are correctly marked noexcept. In that, it
is like constexpr.

Change-Id: I4bca178444d1fd7caf3a92f996b1536eebdb5014
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2015-01-10 08:42:55 +01:00
Matti Paaso
974c210835 Update license headers and add new license files
- Renamed LICENSE.LGPL to LICENSE.LGPLv21
- Added LICENSE.LGPLv3
- Removed LICENSE.GPL

Change-Id: Iec3406e3eb3f133be549092015cefe33d259a3f2
Reviewed-by: Iikka Eklund <iikka.eklund@digia.com>
2014-09-24 12:26:19 +02:00
Thiago Macieira
2fe90a61bd Mark all QChar static functions as [[gnu::const]]
The GCC documentation says that a const function is not allowed to read
global memory. This needs to be clarified: it's not allowed to read RW
global memory. It's fine to read read-only memory, as that is equivalent
to just pure code.

The QChar static out-of-line functions only lookup a property of the
given Unicode character and always return the same value.

The only exception is the decomposition() function, which returns a
QString and is therefore not allowed to be marked const.

Change-Id: Id36b2f84a1b8ff9db5acf1d4e59e8b3811068cff
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Giuseppe D'Angelo <giuseppe.dangelo@kdab.com>
2014-07-07 03:51:43 +02:00
Konstantin Ritt
b80fcbdba6 Introduce QChar::JoiningType enum and QChar::joiningType() method
This aimed to disctinct joining types "L", "T", and "U" from just "U".
Unicode 6.3.0 has introduced a character with joining type "L" and
Unicode 7.0 will add a few more characters of joining type "L", so
we'll have to deal with it anyways.

[ChangeLog][QtCore][QChar] Added JoiningType enum and joiningType()
 method that deprecates the old QChar::Joining enum and joining() method.

Change-Id: I4be3a3f745d944e689feb9b62d4ca86d1cf371b0
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2014-01-29 23:19:47 +01:00
Konstantin Ritt
edfce46a6c Update the Unicode Data and Algorithms up to Unicode 6.3.0
* Mongolian and Phags-pa characters have been given a Joining_Type
  classification for contextual shaping. As a part of these additions,
  one Phags-pa character has the Joining_Type value of L (Left Joining),
  which no character had been assigned before.
* The unassigned code points in the Currency Symbols block have been
  given the Bidi_Class property value ET and the Line_Break property
  value PR, to help implementations support new currency symbols,
  when they are encoded.
* Hebrew letters and basic punctuation marks have been assigned
  the newly introduced Word_Break property values Hebrew_Letter,
  Single_Quote, and Double_Quote.
* The Bidi_Class property has been extended with four new values
  for directional isolates.
For more details, see http://www.unicode.org/versions/Unicode6.3.0/

Change-Id: Iad62d02edc58a8497898dcd6d6c70d5aece317ea
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
2014-01-14 15:38:57 +01:00
Konstantin Ritt
339aff06f9 [1/2] Implement Unicode Normalization Form Quick Check (NF QC)
Make Unicode tables generator take QuickCheck data from DerivedNormalizationProps.txt
into account and generate NF QC bits.
\sa http://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms

Change-Id: Ib73bd42ddb8f99d0be0aff609711943c52dd9c24
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
2013-08-13 03:51:07 +02:00
Frederik Gladhorn
c608ec8254 Merge remote-tracking branch 'origin/stable' into dev
Conflicts:
	src/corelib/io/qsavefile_p.h
	src/corelib/tools/qregularexpression.cpp
	src/gui/util/qvalidator.cpp
	src/gui/util/qvalidator.h

Change-Id: I58fdf0358bd86e2fad5d9ad0556f3d3f1f535825
2013-01-22 18:40:13 +01:00
Sergio Ahumada
48e0c4df23 Update copyright year in Digia's license headers
Change-Id: Ic804938fc352291d011800d21e549c10acac66fb
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
2013-01-18 09:07:35 +01:00
Konstantin Ritt
9b0fab6b62 Update Qt internals to use QChar::Script
...and remove the outdated QUnicodeTables::Script enum.
QFontEngineData now has one extra slot that never used
(engines[QChar::Script_Inherited]). engines[QChar::Script_Unknown],
if accessed, would be set with a Box engine instance, and could be used
as a minor optimization some time later.

In order to preserve the existing behavior, we map all scripts up to Latin to Common.

Change-Id: Ide4182a0f8447b4bf25713ecc3fe8097b8fed040
Reviewed-by: Pierre Rossi <pierre.rossi@gmail.com>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2012-12-21 19:01:35 +01:00
Konstantin Ritt
f7639c0a6d Add QChar::Script enum
...where the values are not aliased to Common script.

The old QUnicodeTables::Script enum was retained for compatibility reasons
until Qt internals are updated to use QChar::script().
Using QChar::Script instead of QUnicodeTables::Script would improve both
the text analysis (itemization, boundary finding) and the text shaping quality.
This also a required step for switching to Hurfbuzz-NG.

/* This adds 6668 more .rodata bytes */

Change-Id: I5aa3d12c550528d0052542436990f8d0779ea8e5
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@digia.com>
2012-12-20 14:48:32 +01:00
Konstantin Ritt
2fbb69a093 QTBF: Fix issue with no splitting the words at "." (FULL STOP)
As of Unicode 5.1, some punctuation marks were mapped to MidLetter and MidNumLet
for better URL and abbreviations handling which caused "hi.there" to be treated
like if it were just a single word;
until we have the Unicode Text Segmentation tailoring mechanism, retain
the old behavior by remapping (some of) those characters back to their old values.

Change-Id: I49dea6064f2ea40a82fc0b1bc3c4f0b4e803919f
Reviewed-by: David Faure <david.faure@kdab.com>
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
2012-11-23 11:59:50 +01:00
Konstantin Ritt
2672c4fa91 Update the Unicode Data and Algorithms up to Unicode 6.2
Version 6.2 of the Unicode Standard is a special release
dedicated to the early publication of the newly encoded Turkish lira sign.
In addition, there are some significant changes to the Unicode algorithms
for text segmentation and line breaking to improve breaking for emoji symbols.

For more details, see http://www.unicode.org/versions/Unicode6.2.0/

Change-Id: I21cfd4f307e41b41a19d36cce87f7a44c2661bc2
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
2012-10-09 03:04:41 +02:00
Iikka Eklund
be15856f61 Change copyrights from Nokia to Digia
Change copyrights and license headers from Nokia to Digia

Change-Id: If1cc974286d29fd01ec6c19dd4719a67f4c3f00e
Reviewed-by: Lars Knoll <lars.knoll@digia.com>
Reviewed-by: Sergio Ahumada <sergio.ahumada@digia.com>
2012-09-22 19:20:11 +02:00
Konstantin Ritt
b57e2162ef QUnicodeTables: some internal API renamings
enums GraphemeBreak, WordBreak, and SentenceBreak has been renamed to
GraphemeBreakClass, WordBreakClass, and SentenceBreakClass respectively,
their values has been renamed to contain a '_' as logical enum-value separator
(just like many other nums in Qt, e.g. LineBreakClass);
*BreakFormat has been replaced with *Break_Extend (some format characters are
kind of subtype of the extender characters, not vice versa).

Change-Id: I9ddbcf8848da87409736c2d6d1798a62fa28cab8
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
2012-06-22 09:47:59 +02:00
Konstantin Ritt
c1329fba13 Clean-up the Unicode tables generator code and the generated header
This fixes the blocks and memory consumption reports, the whitespace issues
and makes the code a bit cleaner.

Since I'm the only one who does change this code, such a no-op commit
could not hurt anyone or even git blame ;)

Change-Id: Ib069f925a3791c82e16c368c8392bcffbfd68c53
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2012-06-22 09:47:59 +02:00
Konstantin Ritt
d8c04d60db Make QUnicodeTables::script() support SMP code points
Instead of expanding the scripts table with script values for the code points
>= 0x10000, it has been merged with the properties table in order to
increase perfomance of the script itemization code (not affected yet).
(Stats: the properties table grew up in 97428-89800 = 7628 bytes;
        the old scripts table was of size 7680 bytes)

The outdated ScriptsInitial.txt and ScriptsCorrections.txt file has been removed
(they were just empty, the "corrigendum" script corrections should be applied
to Scripts.txt directly, *no customization allowed*!).

More script testcases has been added - at least one per supported script.

Task-number: QTBUG-6530

Change-Id: I40a9e76f681e2dd552fd4c61af0808d043962e79
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
2012-06-14 05:22:13 +02:00
Konstantin Ritt
12e0319213 Line Breaking Algorithm: handle the Object Replacement Character
See http://www.unicode.org/reports/tr14/#CB
and http://www.unicode.org/reports/tr14/#LB20 for details

Change-Id: Ice0aa2b2ce81f6e39839a353240420436eddd754
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
2012-06-10 15:58:13 +02:00
Konstantin Ritt
60e1892d83 Update the qunicodetables generator to deal with UCD 6.1 files
Change-Id: If22018ff83cfc6b9c984f689648da038fce11d84
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
2012-06-10 15:57:49 +02:00
Konstantin Ritt
2b15c1b30f Move ScriptSentinel enum from header to .cpp
Change-Id: Ic74e8e2471e92aa2014735f6ab0bb4f3b88de206
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
2012-05-25 21:48:44 +02:00
Konstantin Ritt
ba300f42bd QChar: add isSurrogate() and isNonCharacter() to the public API
+ QChar::LastValidCodePoint enum value that supercede the UNICODE_LAST_CODEPOINT macro
replace uses of hardcoded values with the new API; remove leftovers

Change-Id: I1395c9840b85fcb6b08e241b131794a98773c952
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2012-05-16 04:24:56 +02:00
Konstantin Ritt
3fe02eaa12 significant unicodetables generator performance optimization
since the entire range of a valid unicode code points is in use, QHash
is suboptimal and could be replaced with QList;
taking the value by ref and not inserting it back to the map + not calculating
the default value over and over gains us up to 60% performance boost!

Change-Id: I48c54a8e88472cf76c79c0aac44e65eeefa44861
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
2012-05-11 10:38:25 +02:00
Konstantin Ritt
8c0048a377 add some useful methods to QUnicodeTables::
in order to reduce code duplication and prepare the ground for upcoming changes

Change-Id: I980244149f65384c9484bbec4682de8b7b848b08
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
2012-05-10 11:34:25 +02:00
Konstantin Ritt
46b78113b2 add support for non-BMP ligatures
> http://www.unicode.org/versions/Unicode5.2.0/
D. Character Additions:
There are three new characters in the newly-encoded Kaithi script that will
require changes in implementations which make hard-coded assumptions about
composition during normalization. Most new characters added to the standard
with decompositions cannot be generated by the operations toNFC() or toNFKC),
but these three can. Implementers should check their code carefully
to ensure that it handles these three characters correctly.
U+1109A KAITHI LETTER DDDHA
U+1109C KAITHI LETTER RHA
U+110AB KAITHI LETTER VA

UCD 6.1 adds two more of them:
U+1112E CHAKMA VOWEL SIGN O
U+1112F CHAKMA VOWEL SIGN AU

Change-Id: I781a26848078d8b83a182b0fd4e681be2a6d9a27
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
2012-05-04 15:24:52 +02:00
Konstantin Ritt
f948bb3c6c qunicodetables generator: improve the output and the generated code
better memory usage report;
an additional asserts with conditions the implementation is depends on;
a namespace for the internal static data;
styling fixes

Change-Id: Id4048ff6104c56b5f590f9ac6fbf7c0bce79ec47
Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2012-04-24 21:45:00 +02:00