Commit Graph

2252 Commits

Author SHA1 Message Date
Behdad Esfahbod
71fd5e80ad [Indic] Further adjust base algorithm for Sinhala
Apparently if there is C,V,ZWJ,C, the first C will be base, but if
it's C,ZWJ,V,C, the second one will be.

Note that Uniscribe implements this differently, by breaking syllable in
the case of C,ZWJ,V,C and putting the first consonant in one syllable
and the rest in the next syllable.

Sinhala failures down from 208 to 158 (0.0581209%).  No changes to
Khmer.
2012-07-24 00:21:16 -04:00
Behdad Esfahbod
73d71cc527 [Indic] End Vowel-based syllable at ZWJ
One Devanagari test regressed, plus 10 Malayalam (at 1545 now).

Fixed 120 Sinhala failures.  Now at 208 (0.0765136%).
2012-07-24 00:09:12 -04:00
Behdad Esfahbod
34c215036f [Indic] Improve Sinhala base algorithm and reph positioning
Sinhala does not have half forms.  And most (all?) consonants can be
base, except when preceded by ZWJ, which would request a subjoined form.
Hence switch the base algorithm to categorize with Khmer, start search
at start, and stop at a ZWJ.

Also, mark all pos=base consonants after base to be subjoined.  Mark
base itself to have pos=base.

Finally, adjust Sinhala's reph position to after-main.

Brings down Sinhala failures from 455 to 328 (0.120656%).
2012-07-23 23:51:29 -04:00
Behdad Esfahbod
2ec934c6c2 [Indic] Change "unknown" position to end of syllable 2012-07-23 23:49:04 -04:00
Behdad Esfahbod
b70021f7c8 When removing zero-width marks, don't remove ligatures
If a mark ligated, it probably should NOT be removed.
2012-07-23 20:18:17 -04:00
Behdad Esfahbod
49c5ec5144 Minor refactoring 2012-07-23 20:14:13 -04:00
Behdad Esfahbod
c3e6fdc379 [Indic] Improve check on ligatures
Only skip actual ligatures, not marks in-between ligature components.
2012-07-23 20:11:42 -04:00
Behdad Esfahbod
771a8f5028 [Indic] exclude ligatures when matching on Indic category
If, say, a H,ZWJ,C ligature was formed, we don't want the code to detec
that as a Halant.  So, ignore ligatures when matching category in
final_reordering.

Sinhala failures down from 514 to 455 (0.167374%).
2012-07-23 20:09:30 -04:00
Behdad Esfahbod
d1af9e82e5 [GSUB/GPOS] Const correctness 2012-07-23 19:55:35 -04:00
Behdad Esfahbod
baacd090df [Indic] Minor refactoring 2012-07-23 19:51:48 -04:00
Behdad Esfahbod
c7c4de2fb9 [Indic] Remove syllable length check before sorting
We now limit syllable lengths in the machine.  No need to match here.
2012-07-23 18:25:02 -04:00
Behdad Esfahbod
9fa052733e [Indic] Limit syllables to at most five consonants
Seems to be about what Uniscribe does.  Not exactly.  But close enough.
More consonants will start a new cluster.

A few scripts went way down in failures.  In particular:

  - Devanagari failures went down from 490 to 56.
  - Telugu went down from 113 to 49.

Other scripts went down slightly or didn't change.  New numbers:

BENGALI: 353908 out of 354285 tests passed. 377 failed (0.106412%)
DEVANAGARI: 693572 out of 693628 tests passed. 56 failed (0.00807349%)
GUJARATI: 366485 out of 366506 tests passed. 21 failed (0.00572978%)
GURMUKHI: 60750 out of 60809 tests passed. 59 failed (0.0970251%)
KANNADA: 950730 out of 951913 tests passed. 1183 failed (0.124276%)
KHMER: 298613 out of 299124 tests passed. 511 failed (0.170832%)
MALAYALAM: 1046881 out of 1048416 tests passed. 1535 failed (0.146411%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271333 out of 271847 tests passed. 514 failed (0.189077%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970524 out of 970573 tests passed. 49 failed (0.00504856%)

Some of the remaining Telugu and Devanagari issues seem to be Uniscribe
eating Anusvara when placed before a non-joiner.  Ouch!
2012-07-23 18:19:17 -04:00
Behdad Esfahbod
093cd58326 [Thai] Fix SARA AM handling
Oops, thinko.
2012-07-23 14:04:42 -04:00
Behdad Esfahbod
42848453bf [Thai] Reorder U+0E3A THAI VOWEL SIGN PHINTHU
Uniscribe reorders U+0E3A to be after U+0E38 and U+0E39.  We do that by
modifying the ccc for U+0E3A.

Fixes the two remaining Thai failures (see previous commit).
2012-07-23 13:52:07 -04:00
Behdad Esfahbod
4a7f4f3e56 [Thai] Adjust SARA AM reordering to match Uniscribe
Adjust the list of marks before SARA AM that get the reordering
treatment.  Also adjust cluster formation to match Uniscribe.

With Wikipedia test data, now I see:

  - For Thai, with the Angsana New font from Win7, I see 54 failures out
    of over 4M tests  (0.00129107%).  Of the 54, two are legitimate
    reordering issues (fix coming soon), and the other 52 are simply
    Uniscribe using a zero-width space char instead of an unknown
    character for missing glyphs.  No idea why.  The missing-glyph
    sequences include one that is a Thai character followed by an Arabic
    Sokun.  Someone confused it with Nikhahit I assume!

  - For Lao, with the Dokchampa font from Win7, 33 tests fail out of
    54k (0.0615167%).  All seem to be insignificant mark positioning
    with two marks on a base.  Have to investigate.
2012-07-23 13:15:33 -04:00
Behdad Esfahbod
2cc933aff9 [Indic] Fix cluster formation with left-matras and conjunct forms
Test case was: <U+0D15,U+0D4D,U+0D15,U+0D4A>.
2012-07-23 08:23:44 -04:00
Behdad Esfahbod
e6b01a878c [Indic] Further streamline cluster formation
This should address all possible cluster misformations that I had in
mind.
2012-07-23 00:11:26 -04:00
Behdad Esfahbod
7b2a7dadd6 [Indic] Merge clusters before sorting
This should fix any instabilities in cluster formation that we were
speculating may happen with surrounding syllables.  Or most of it
perhaps.
2012-07-22 23:58:55 -04:00
Behdad Esfahbod
abb3239ef9 [Indic] Update clusters for left-matra even if matra didn't move
Fixes crashes reported with left matra under
non-uniscribe-bug-compatibilty mode.
2012-07-22 23:55:19 -04:00
Behdad Esfahbod
60554f14d8 [Indic] Merge in Malayalam tests
From:
http://silpa.org.in/pub/tests/hb/ml/ml-harfbuzz-testdata.txt
2012-07-22 23:23:56 -04:00
Behdad Esfahbod
5c7081770c [Indic] Add extensive Sinhala tests
Generated by:
http://git.savannah.gnu.org/cgit/sinhala.git/plain/utils/gen-unicode-sinhala.py
2012-07-22 23:20:27 -04:00
Behdad Esfahbod
2efe4707b1 [Indic] Add Sinhala tests
Merge tests from:
http://git.savannah.gnu.org/cgit/sinhala.git/plain/patches/icu-sinhala-rendering.txt
2012-07-22 23:17:59 -04:00
Behdad Esfahbod
3d4c111b7a Add a test case 2012-07-20 19:34:39 -04:00
Behdad Esfahbod
92a1ad7bef [Indic] Stop searching for base if a post form is found before below form
Improves Bengali and Gurmukhi.  Malayalam regressed a bit.  We will deal
with that later.
2012-07-20 18:55:15 -04:00
Behdad Esfahbod
4c450c703f [Indic] Recompose Bengali Ya,Nukta
This is a bunch of hacks for now.

Improves Bengali a bit.
2012-07-20 18:13:04 -04:00
Behdad Esfahbod
e9c0f152a3 [Uniscribe] Fix script fallback
Gurmukhi failures half now.  Others changed slightly.
2012-07-20 17:37:48 -04:00
Behdad Esfahbod
5791f32915 [Indic] Allow a ZWNJ after SM's
Malayalam failures go way down.  Other scripts benefitted slightly too.
Sinhala had one or two test regressions, but...
2012-07-20 16:26:55 -04:00
Behdad Esfahbod
34ae336f3f [Indic] Improve Reph AfterMain positioning
Fixes 20 out of 48 failing Oriya tests.  Failure rate down to 0.066% now.
2012-07-20 16:17:28 -04:00
Behdad Esfahbod
bdd080431a [Indic] Reposition Oriya Candrabindu
Oriya failures down from 0.65% to 0.20%.
2012-07-20 16:03:09 -04:00
Behdad Esfahbod
5f0eaaad12 [Indic] Fix base search in final_reordering
Fixes most Malayalam failures.  Down from 1.6% to 0.38% now.  Fixes a
few more in other scripts too.
2012-07-20 15:47:24 -04:00
Behdad Esfahbod
81202bd860 [Indic] Don't attach SM/VD to other characters 2012-07-20 15:14:51 -04:00
Behdad Esfahbod
efb4ad7356 Fix compiler warnings
If x is not constant, we cannot ASSERT_STATIC on it.
2012-07-20 14:27:38 -04:00
Behdad Esfahbod
f31d97e44e [Indic] Form Telugu Reph out of Ra,Virama,ZWJ
Apparently this was approved in Feb 2012.  No font yet.
2012-07-20 14:13:35 -04:00
Behdad Esfahbod
2e193b240e [Indic] Don't split U+0AC9
Althought IndicMatraCategory.txt classifies it as Top_And_Right matra,
it does not have Unicode decomposition, and Uniscribe does not do
anything special about it either.

Gujarati failures down from 0.672% to 0.0130966%.
2012-07-20 14:02:35 -04:00
Behdad Esfahbod
30c3d5e9fc [Indic] Simplify Uniscribe cluster emulation
Now that we break syllables on Halant,ZWNJ, this code can be simplified.
2012-07-20 13:56:32 -04:00
Behdad Esfahbod
decf6ffca4 [Indic] Minor! 2012-07-20 13:51:31 -04:00
Behdad Esfahbod
9e4f94a72c [Indic] Break syllables at Halant,ZWNJ
That's really what Uniscribe does, and explains a lot of pecularities of
Halant,ZWNJ before the base.

Sent Telugu from 1% failures to 0.03%.  Improved Kannada and Malayalam
slightly.  Fixed half of Bengali, and did NOT break anything!
2012-07-20 13:48:03 -04:00
Behdad Esfahbod
2c372b80f6 [Indic] Better check for applying 'init'
Specifically, don't apply 'init' if previous char is a joiner.

Fixes some more of Bengali.
2012-07-20 13:37:48 -04:00
Behdad Esfahbod
34a7440b7c [GPOS] Don't zero mark advances
Fixes more of Telugu, Kannada, and Oriya.

May break things (outside Indic...), but we cannot think of any font relying
on this immediately.
2012-07-20 12:40:39 -04:00
Behdad Esfahbod
8ed248de77 [Indic] Minor 2012-07-20 11:42:24 -04:00
Behdad Esfahbod
d0e68dbd0b [Indic] Implement reph positioning step 5
Not tuned, just copied from step 2.  Fixes another 0.5% of Kannada
failures.  1% to go.
2012-07-20 11:25:41 -04:00
Behdad Esfahbod
a9e45c32e4 [Indic] Don't let ZWNJ at the end of syllable affect base search
Fixes a few Devanagari, half of remaining Kannada failures, quarter for
Telugu, and others slightly improved or unchanged.
2012-07-20 11:04:15 -04:00
Behdad Esfahbod
20b68e699f [Indic] Apply 'cjct' globally
Fixes 5 Devanagari failures, and no regressions.
2012-07-20 10:47:46 -04:00
Behdad Esfahbod
51e764de44 [Indic] Unbreak old scriptures
Brings down failures with Lohit-Telugu from 57% to 1.40%.
2012-07-20 10:30:24 -04:00
Behdad Esfahbod
900cf3d449 Minor 2012-07-20 10:18:23 -04:00
Behdad Esfahbod
87cd63266e [Indic] Recategorize some Kannada right matras
Kannada failures down from 3.5% to 2.93%.
2012-07-19 21:25:46 -04:00
Behdad Esfahbod
3604d64ced [Indic] Recategorize GURMUKHI ADDAK
It's not in IndicSyllabicCategory.txt.  Fixes most of Gurmukhi failures.
Failures down from 7.7% to 0.222%!
2012-07-19 21:13:04 -04:00
Behdad Esfahbod
8932858123 Minor 2012-07-19 21:02:38 -04:00
Behdad Esfahbod
47ef931f13 [buffer] Make sure out_info = info during GPOS 2012-07-19 20:52:44 -04:00
Behdad Esfahbod
ae63cf2062 Print line number during return when tracing 2012-07-19 20:45:41 -04:00