Closed Bug 662055 Opened 13 years ago Closed 12 years ago

advanced Hebrew diacritics are shown correctly only in particular order

Categories

(Core :: Layout: Text and Fonts, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla13

People

(Reporter: amir.aharoni, Assigned: jfkthame)

Attachments

(8 files, 4 obsolete files)

User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:6.0a2) Gecko/20110603 Firefox/6.0a2
Build Identifier: Mozilla/5.0 (X11; Linux x86_64; rv:6.0a2) Gecko/20110603 Firefox/6.0a2

In Hebrew there are advanced diacritics that are supposed to be displayed the same no matter what their logical order is. This works correctly on Windows 7 and in Chromium on Linux, but it works incorrectly in Firefox on Linux.

In the attached file the Hebrew word appears twice, with its diacritics in a different order each time, but both instances are supposed to be displayed the same. I intentionally used character numbers in the HTML source code to make sure that no software applies any normalization to them.

It's important to note that this will only work with fonts which appropriately support Hebrew diacritics. I suggest Taamey Frank CLM, which can be downloaded here: http://culmus.sourceforge.net/taamim/index.html .

Reproducible: Always

Steps to Reproduce:
1. Install the Taamey Frank CLM font.
2. Open the attached file in the browser.
3. Check how the Hebrew word is displayed.

Actual Results:  
In Firefox the first word is displayed incorrectly - the diacritics are garbled.

Expected Results:  
Both words must be displayed identically.
Attachment #537354 - Attachment mime type: text/plain → text/html
Forgot to add: On Windows XP it's broken in all browsers, so that's probably an XP issue. But on Linux it's only broken in Firefox, so it may be a Gecko issue.
Attached file testcase with embedded font (obsolete) —
I had to zip the testcase in order to pack it with the font file. Sorry for that.
Attached file testcase with embedded font (obsolete) —
Attachment #537440 - Attachment is obsolete: true
Comment on attachment 559162 [details]
testcase with embedded font

This testcase doesn't work unless the font is installed.
> Warning: Error in parsing value for 'src'.  Skipped to next declaration.
> Source File: https://bug662055.bugzilla.mozilla.org/attachment.cgi?id=559162
> Line: 8
Specifically, it is missing "local(...)" in the @font-face rule, and it also lacks a comma between the url sources. Thus,

  @font-face {
    font-family: TaameyFrankCLM;
    src: "Taamey Frank CLM", url(res/TaameyFrankCLM.ttf) url(data:;base64,.......

should be more like this:

  @font-face {
    font-family: TaameyFrankCLM;
    src: local("Taamey Frank CLM"), url(res/TaameyFrankCLM.ttf), url(data:;base64,.......

(Also, it seems wasteful to give both the URL to a file on the server _and_ a data URL.)
The embedded font doesn't load for me in Firefox (nightly build) - the Web console reports "Error in parsing value for 'src'.  Skipped to next declaration." Apparently we can't handle the newlines within the base64 string properly. (This may be a bug - I thought they ought to be ignored.)
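One way to sidestep the newline question when preparing such a testcase is to generate the data URL on a single line. The following is only a sketch: the helper name `font_data_url` and the default MIME type are assumptions for illustration, not something taken from the attached testcase.

```python
import base64


def font_data_url(font_bytes, mime_type="font/ttf"):
    """Build a single-line data: URL for embedding a font in @font-face.

    base64.b64encode() returns the encoded payload without any line
    breaks, so the resulting URL contains no embedded newlines.
    The default mime_type is an assumption; adjust it as needed.
    """
    payload = base64.b64encode(font_bytes).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


if __name__ == "__main__":
    # Dummy bytes stand in for the contents of a real font file.
    print(font_data_url(b"abc"))
```

The returned string can then be pasted into the `url(...)` part of the `src` descriptor.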
I see the same issue on the latest build of Aurora for mobile now. Except that this build seems to support Hebrew diacritics as well as Firefox on desktop Linux.
Amir: it looks like people are still having trouble viewing your testcase (see comment 8). It might help if you were to diagnose the problem and perhaps upload a fixed one, or file a bug on Firefox?

Gerv
Attached file The recommended font
Adding the font to be installed.
Embedding the font using url() may be a separate issue. This can be easily reproduced by installing the attached font locally and opening the first attached html file.

It can also be tested here:
http://translatewiki.net/wiki/User:Amire80/anna
A similar, but distinct issue: Bug 721821 - Hebrew diacritics are displayed incorrectly in SVG.
The issue here is that the font only implements support for the diacritics in a particular order (which happens _not_ to be the canonical order of normalised Unicode text).

Ideally, font developers would include the necessary lookups (whether using GSUB glyph composition, GPOS mark positioning, or both) to handle text presented in any (canonically-equivalent) order. This would make them immune to the possibility of normalisation being applied to the text, rather than assuming a specific order. However, in practice this is rarely done; most font developers implement only the character ordering that they consider most natural, without regard to the canonical ordering specified by Unicode. (A similar issue applies to certain diacritics in Arabic script, too.)

For this particular font, I believe the harfbuzz update in bug 695857 will resolve the issue; with this applied, the two samples from comment 1 both render identically, with the dagesh properly placed. This is because the harfbuzz update adds some normalisation support, and the font includes precomposed consonant+dagesh glyphs, which get used here. (However, it would probably not resolve the issue for a font that relied only on dynamic mark positioning, but supported only "logical" and not "canonical" ordering for the combining mark characters.)
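The difference between "logical" and canonical order can be inspected with Python's standard `unicodedata` module. This is a small illustration only; the bet + dagesh + qamats sequence is chosen here as an example and is not the word from the attached testcase.

```python
import unicodedata

BET, DAGESH, QAMATS = "\u05D1", "\u05BC", "\u05B8"


def combining_classes(text):
    """Return (character name, canonical combining class) per code point."""
    return [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in text]


# "Logical" order, as typically typed: consonant, dagesh, then vowel.
logical = BET + DAGESH + QAMATS

# Canonical order sorts the marks by combining class, so the vowel
# (lower class) ends up before the dagesh (higher class).
canonical = unicodedata.normalize("NFD", logical)

if __name__ == "__main__":
    print(combining_classes(logical))
    print(combining_classes(canonical))
```

A font that only has lookups for the first sequence will not match the second one, even though the two are canonically equivalent.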
Status: UNCONFIRMED → NEW
Depends on: 695857
Ever confirmed: true
(In reply to Jonathan Kew (:jfkthame) from comment #14)
> For this particular font, I believe the harfbuzz update in bug 695857 will
> resolve the issue

Not without bug 722139, it won't ;-)
Depends on: 722139
No longer depends on: 695857
Depends on: 695857
(In reply to Jonathan Kew (:jfkthame) from comment #14)
> For this particular font, I believe the harfbuzz update in bug 695857 will
> resolve the issue; with this applied, the two samples from comment 1 both
> render identically, with the dagesh properly placed.

Identically but not correctly -- see the screenshot. The dagesh is fine, but the meteg and qamats are overlapping.
Since the harfbuzz update landed, we get _consistent_ behavior, because it reorders diacritics into canonical order. Unfortunately, the canonical order is not the order supported by most font designers, who assume "logical" order.

Harfbuzz has a fix for the Arabic equivalent of this issue (the ordering of SHADDA relative to vowels), but lacks the Hebrew version (for DAGESH). This patch provides that support, and fixes the rendering with this font and other similarly-constructed fonts that assume "logical" or "linguistic" ordering.
Assignee: nobody → jfkthame
Attachment #597834 - Flags: review?(mozilla)
After reviewing the SBL Hebrew manual (as recommended by comments at http://forum.fontlab.com/archive-old-microsoft-volt-group/vista-and-diacritic-ordering-t6751.0.html), the Hebrew marks need a more complex permutation, particularly to bring the shin/sin dots, rafe and holam into the appropriate positions.

Maybe we should combine this with the Arabic permutation, and do the whole bunch with a single lookup table?
Attachment #597834 - Attachment is obsolete: true
Attachment #597834 - Flags: review?(mozilla)
Attachment #597909 - Flags: review?(mozilla)
Comment on attachment 597909 [details] [diff] [review]
[harfbuzz] patch v2, permute ordering of Hebrew diacritics

LGTM.  There's the typo in the comments but you know that already.  I'll upstream a slightly modified but equivalent version.
Attachment #597909 - Flags: review?(mozilla) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/20df766a9922

Forgot to fix the comment before pushing, so fixed that in a followup changeset:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2e4d93627d9f
Target Milestone: --- → mozilla13
Most things are looking good with this patch, including the complex diacritic combinations in https://bug404426.bugzilla.mozilla.org/attachment.cgi?id=402560. 

One case that still does not work is https://bug637772.bugzilla.mozilla.org/attachment.cgi?id=515942 - the Hebrew word for Jerusalem which is normally spelled in the Bible with two diacritics under the lamed. 

This is the case mentioned at http://en.wiktionary.org/wiki/Appendix:Unicode_normalization#Issues: 'the problem is that the diacritics should not /have/ a canonical ordering, because the two orderings are not actually equivalent (that is, the two diacritics should have the same value for the Canonical_Combining_Class (ccc) property, but instead they have different ones). For example, Hebrew לִַ ("lai") is mistakenly normalized to לִַ ("lia").'
Hmm, I see that the SBL Hebrew font manual recommends working round the יְרוּשָׁלִַם problem by using U+034F COMBINING GRAPHEME JOINER. This indeed improves the rendering in SBL Hebrew itself and some other fonts; but sites with the Biblical text, e.g. http://tanach.us/Tanach.xml?Jov7:11#Ps137:5-137:6 or http://mechon-mamre.org/i/t/t26d7.htm don't seem to be doing that.
The sequences <PATAH, HIRIQ> and <HIRIQ, PATAH> are (for better or worse) defined by Unicode to be canonically equivalent, and therefore no distinction can (reliably) be made between them. And the combining classes mean that under normalization, they'll end up in the order <HIRIQ, PATAH>. To maintain the order <PATAH, HIRIQ>, therefore, CGJ can be inserted between them to introduce a distinction that would otherwise not exist.
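The effect described above can be checked with Python's `unicodedata` module. This is a minimal sketch using only the lamed with its two vowels rather than the whole word; the variable names are illustrative.

```python
import unicodedata

LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0


def nfd(text):
    """Apply canonical decomposition, which includes canonical reordering."""
    return unicodedata.normalize("NFD", text)


plain = LAMED + PATAH + HIRIQ
guarded = LAMED + PATAH + CGJ + HIRIQ

# Without CGJ, normalization swaps the two vowels into <HIRIQ, PATAH>.
plain_is_reordered = nfd(plain) == LAMED + HIRIQ + PATAH

# CGJ has combining class 0, so marks are not reordered across it.
guarded_is_unchanged = nfd(guarded) == guarded

if __name__ == "__main__":
    print(plain_is_reordered, guarded_is_unchanged)
```

Because CGJ is a separate character in the text, it also has to be tolerated by searching, comparison and font lookups, which is part of why authors rarely insert it.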

I'm wondering, though, whether in practice the only case where these two marks co-occur (in either order) is this example. If so, perhaps we should consider permuting the classes such that they will (always, unless blocked by CGJ) appear in the order <PATAH, HIRIQ>. This would presumably improve the rendering of sites such as you mention, but it would mean that if anyone does ever want the <HIRIQ, PATAH> order maintained, they'd need to use CGJ in _this_ case instead. WDYT, Simon? Behdad?
(In reply to Simon Montagu from comment #21)
> Most things are looking good with this patch, including the complex
> diacritic combinations in
> https://bug404426.bugzilla.mozilla.org/attachment.cgi?id=402560. 

Nice.

Simon, can you please create me a test file of the cases you care about?  Just a plain text file with one test per line.  That would immensely help.


> One case that still does not work is
> https://bug637772.bugzilla.mozilla.org/attachment.cgi?id=515942 - the Hebrew
> word for Jerusalem which is normally spelled in the Bible with two
> diacritics under the lamed. 
> 
> This is the case mentioned at
> http://en.wiktionary.org/wiki/Appendix:Unicode_normalization#Issues: 'the
> problem is that the diacritics should not /have/ a canonical ordering,
> because the two orderings are not actually equivalent (that is, the two
> diacritics should have the same value for the Canonical_Combining_Class
> (ccc) property, but instead they have different ones). For example, Hebrew
> לִַ ("lai") is mistakenly normalized to לִַ ("lia").'

Interesting.  Can we make them have the same class?  Or that would break other combinations?
(In reply to Jonathan Kew (:jfkthame) from comment #23)
> The sequences <PATAH, HIRIQ> and <HIRIQ, PATAH> are (for better or worse)
> defined by Unicode to be canonically equivalent, and therefore no
> distinction can (reliably) be made between them. And the combining classes
> mean that under normalization, they'll end up in the order <HIRIQ, PATAH>.
> To maintain the order <PATAH, HIRIQ>, therefore, CGJ can be inserted between
> them to introduce a distinction that would otherwise not exist.
> 
> I'm wondering, though, whether in practice the only case where these two
> marks co-occur (in either order) is this example. If so, perhaps we should
> consider permuting the classes such that they will (always, unless blocked
> by CGJ) appear in the order <PATAH, HIRIQ>. This would presumably improve
> the rendering of sites such as you mention, but it would mean that if anyone
> does ever want the <HIRIQ, PATAH> order maintained, they'd need to use CGJ
> in _this_ case instead. WDYT, Simon? Behdad?


Sounds reasonable but I don't know Hebrew rendering enough to really assess that.
Alternatively, if uniscribe does reordering, we should match that.
(In reply to Jonathan Kew (:jfkthame) from comment #23)
> I'm wondering, though, whether in practice the only case where these two
> marks co-occur (in either order) is this example. If so, perhaps we should
> consider permuting the classes such that they will (always, unless blocked
> by CGJ) appear in the order <PATAH, HIRIQ>. This would presumably improve
> the rendering of sites such as you mention, but it would mean that if anyone
> does ever want the <HIRIQ, PATAH> order maintained, they'd need to use CGJ
> in _this_ case instead. WDYT, Simon? Behdad?

That sounds very reasonable to me, but note that for grammatical reasons, the word can occur with any of <PATAH, HIRIQ>, <QAMATS, HIRIQ>, <PATACH, SHEVA> or <QAMATS, SHEVA>.

http://tanach.us/Tanach.xml?#Ps137:6-137:6
http://tanach.us/Tanach.xml?#Ps137:5-137:5
http://tanach.us/Tanach.xml?#1%20Kings10:2-10:2
http://tanach.us/Tanach.xml?#2%20Kings9:28-9:28

Other than these, I'm not aware of any other cases where one consonant can have more than one vowel diacritic in Hebrew, so any case that comes up will presumably be some kind of nonce-form, which I have no problem in requiring CGJ for. Amir, what do you think?
Target Milestone: mozilla13 → ---
I also cannot recall any other such word.

In my edition of Gesenius' Hebrew Grammar I use the CGJ, see https://en.wikisource.org/wiki/Page:Gesenius%27_Hebrew_Grammar_%281910_Kautzsch-Cowley_edition%29.djvu/90 .
(In reply to Behdad Esfahbod from comment #24)

> Can we make them have the same class?  Or that would break
> other combinations?

That would fix the immediate problem, in that it would allow the preferred logical but non-canonical order (which renders correctly) to be maintained, but I think it would actually be doing users a disservice in the longer term. The problem is that if we conflate any combining classes in this way, then we will _not_ ensure that canonically-equivalent strings render identically. And this means that users may be misled into believing they can use such strings in contrastive ways, as they see a difference in the displayed form. However, because the strings are in fact equivalent _by definition_, any process is free to apply normalization to the data on the understanding that it will not make any significant or semantic change - and so the distinction that the user might be trying to make is illusory and their data inherently fragile.

I think it's important, therefore, that (as http://www.unicode.org/faq/normalization.html#9 says) we should only permute, never split or combine, the CC values. This way we preserve the Unicode-defined equivalence or distinction between encoded sequences, and continue to render equivalent sequences in the same way, so that we don't create an illusion of a difference that cannot be reliably maintained.
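To illustrate why permuting is safe while merging is not, here is a rough sketch. The function `reorder_marks` and the numeric values in the two mapping tables are hypothetical and chosen only for the demonstration; they are not the table used in harfbuzz.

```python
import unicodedata

PATAH, HIRIQ = "\u05B7", "\u05B4"  # combining classes 17 and 14


def reorder_marks(marks, class_map):
    """Stable-sort a run of marks by a remapped combining class.

    class_map maps original classes to modified ones; classes that are
    not listed keep their original value.
    """
    def key(ch):
        ccc = unicodedata.combining(ch)
        return class_map.get(ccc, ccc)

    return sorted(marks, key=key)  # sorted() is stable


# Hypothetical tables: one permutes the two classes, one merges them.
PERMUTED = {14: 23, 17: 20}  # distinct targets, order swapped
MERGED = {14: 20, 17: 20}    # both classes collapsed into one

typed_a = [PATAH, HIRIQ]
typed_b = [HIRIQ, PATAH]  # canonically equivalent to typed_a

if __name__ == "__main__":
    # Permuted: both inputs come out in the same order.
    print(reorder_marks(typed_a, PERMUTED) == reorder_marks(typed_b, PERMUTED))
    # Merged: the tie keeps the input order, so the results differ.
    print(reorder_marks(typed_a, MERGED) == reorder_marks(typed_b, MERGED))
```

With the merged table the shaper would show a visible difference between two strings that any normalizing process is entitled to treat as the same.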
(In reply to Simon Montagu from comment #27)
> That sounds very reasonable to me, but note that for grammatical reasons,
> the word can occur with any of <PATAH, HIRIQ>, <QAMATS, HIRIQ>, <PATACH,
> SHEVA> or <QAMATS, SHEVA>.

OK. We currently (as of yesterday) have the order

  sheva, hataf segol, hataf patah, hataf qamats, hiriq, tsere, segol, patah, qamats

from original classes 10..18, moved to 15..23 because of the marks we want closer to the base.

We could simply move patah and qamats to the head of this section of the list, before sheva. Or move sheva and hiriq to the end, following qamats. Either way, we'd get the relevant pairs in the desired order.

Simon (or Amir), do you have any opinion on the most sensible approach here? If in fact no other pairs of these marks ever co-occur, then it doesn't matter precisely which order we use for the classes; but maybe you have a sense for what's most logical (e.g., what about tsere and segol? does it seem more plausible to put them before or after patah and qamats, supposing someone ever wanted to use them together?).
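The second option (moving sheva and hiriq after qamats) can be sketched as follows. This is only a model of the vowel section of the list as described in this comment: the helper names are made up, and the numbers simply continue the 15..23 range mentioned above rather than being copied from the actual patch.

```python
# Vowel marks in their original class order (ccc 10..18), per this comment.
VOWELS = [
    "sheva", "hataf segol", "hataf patah", "hataf qamats",
    "hiriq", "tsere", "segol", "patah", "qamats",
]


def assign_classes(order, first=15):
    """Give each mark a modified class, starting at `first`."""
    return {name: first + i for i, name in enumerate(order)}


def shaped_order(marks, classes):
    """Order a group of marks the way a class-sorting shaper would."""
    return sorted(marks, key=classes.__getitem__)


current = assign_classes(VOWELS)

# Option from the comment: move sheva and hiriq to the end of the section.
moved = ("sheva", "hiriq")
proposed = assign_classes([v for v in VOWELS if v not in moved] + list(moved))

# The pairs that occur in the Biblical spellings of Jerusalem.
WANTED = [
    ("patah", "hiriq"), ("qamats", "hiriq"),
    ("patah", "sheva"), ("qamats", "sheva"),
]

if __name__ == "__main__":
    for pair in WANTED:
        print(pair, shaped_order(pair, current), shaped_order(pair, proposed))
```

Under the proposed assignment each wanted pair keeps its order regardless of how it was typed, while the other five vowels keep their relative order.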
I can't think of anything else.
https://hg.mozilla.org/mozilla-central/rev/20df766a9922
https://hg.mozilla.org/mozilla-central/rev/2e4d93627d9f
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla13
Is it supposed to be in Nightly already?
(In reply to Amir Aharoni from comment #33)
> Is it supposed to be in Nightly already?

The patches just merged to mozilla-central, which means it should appear in _tomorrow's_ Nightly; today's has already been built before this went in.
Thanks a lot.

It works correctly in Nightly 2012-02-18 if harfbuzz.scripts is 4, but it is broken if it's 3. In the previous Nightly it was broken in both cases.
Right - by default, we currently use Uniscribe or DirectWrite shaping for Hebrew on Windows, which won't be helped by the fix here, as it's outside our control.

Note that the harfbuzz.scripts value is actually a combination of bit flags, so to _add_ Hebrew [4], you should change the default setting of 3 ("simple" scripts [1] plus Arabic [2]) to 7; otherwise you're inadvertently switching it _off_ for the scripts where we expect to use it by default.
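The bit-flag arithmetic can be spelled out like this. The flag values 1, 2 and 4 come from the comment above; the class and member names are invented for the illustration and do not correspond to identifiers in the Gecko source.

```python
from enum import IntFlag


class ShapingScripts(IntFlag):
    """Hypothetical names for the bits of the harfbuzz.scripts pref."""
    SIMPLE = 1
    ARABIC = 2
    HEBREW = 4


default = ShapingScripts(3)                     # simple scripts + Arabic
with_hebrew = default | ShapingScripts.HEBREW   # add Hebrew, keep the rest
only_hebrew = ShapingScripts(4)                 # Hebrew alone

if __name__ == "__main__":
    print(int(with_hebrew))
    print(ShapingScripts.ARABIC in with_hebrew)
    print(ShapingScripts.ARABIC in only_hebrew)
```

Setting the pref to 4 therefore enables Hebrew but clears the bits for the simple scripts and Arabic, which is why 7 is the value to use for testing.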

(Bug 722139 is about using harfbuzz by default for Hebrew; I think we may be at the point where we can make that switch shortly.)
Another problem that remains can be seen with the Arial and Courier fonts from WinXP, for example: if the text contains sequences such as <consonant,dagesh>, the dagesh is generally not placed well, as these fonts have no mark positioning data. They do contain glyphs for the Unicode presentation forms in the FBxx block, but we don't currently use those to render text that was entered in decomposed form. I've filed bug 728866 about this issue.
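The relationship between the decomposed sequences and the FBxx presentation forms is recorded in the Unicode character data, so a lookup table can be derived from it. The sketch below is a hypothetical illustration of that idea using Python's `unicodedata`; it is not how Gecko or harfbuzz implement anything, and the function name and the code point range scanned are assumptions.

```python
import unicodedata


def presentation_form_map(first=0xFB1D, last=0xFB4F):
    """Map canonically decomposed sequences to Hebrew presentation forms.

    Only characters whose canonical decomposition (NFD) differs from the
    character itself are included; compatibility-only forms are skipped.
    """
    table = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch:
            table[decomposed] = ch
    return table


FORMS = presentation_form_map()
BET_DAGESH = "\u05D1\u05BC"  # <bet, dagesh> in decomposed form

if __name__ == "__main__":
    precomposed = FORMS.get(BET_DAGESH)
    print(unicodedata.name(precomposed) if precomposed else "no form found")
    # NFC does not recompose these forms, so text stays decomposed
    # unless a renderer substitutes the presentation form itself.
    print(unicodedata.normalize("NFC", BET_DAGESH) == BET_DAGESH)
```

Since normalization never produces these precomposed characters, a font without mark positioning data only benefits from them if the shaper performs the substitution on its own.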
I'm re-opening this for a small followup, as suggested in comments 21 and following. I think we should adjust the mark-class permutation such that HIRIQ and SHEVA move after PATAH and QAMATS. This will enable the Biblical spelling of Jerusalem to render as expected, without requiring authors to explicitly insert CGJ between the co-occurring marks.

Note that both Microsoft (Uniscribe, DirectWrite) and Apple (Core Text) engines allow users to enter these combinations without CGJ, and give the desired rendering; i.e. they are _not_ forcing these marks into canonical order. (It appears that they are not sorting at all within the set of vowels below the consonant, and therefore they do _not_ render the NFC-ordered form of the text properly with the SBL or SIL fonts; however, this should be regarded as a bug - they fail to render canonically-equivalent texts identically - rather than a feature.)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
With this adjustment, the "natural" spelling of Jerusalem renders correctly without requiring CGJ - which many authors fail to insert, as they are unaware of the issue, and neither MS nor Apple engines fully normalize the mark order. (I suspect they just reorder the consonant-modifier marks such as dagesh and sin/shin dots, and leave the vowels untouched).

(In the event that an author actually *wants* sheva/hiriq to appear before patah/qamats, they'd need to use CGJ to maintain that order. But as far as we're aware, these combinations never occur in real text - and the available fonts don't handle most of them well anyway.)
Attachment #599633 - Flags: review?(smontagu)
Comment on attachment 599633 [details] [diff] [review]
patch, move sheva and hiriq after other vowels for shaping

Review of attachment 599633 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me. 

BTW, Amir pointed me to another problematical case: meteg to the left and right of a vowel as in section 16g at http://en.wikisource.org/wiki/Page:Gesenius%27_Hebrew_Grammar_%281910_Kautzsch-Cowley_edition%29.djvu/89. I think that that is another case where it is reasonable to require CGJ for correct ordering (see also http://www.gentlewisdom.org/qaya/academic/hebrew/Meteg.html).
Attachment #599633 - Flags: review?(smontagu) → review+
Pushed followup:
https://hg.mozilla.org/integration/mozilla-inbound/rev/046998d673fc

Behdad, this updated permutation is what I'd recommend you take in upstream harfbuzz, too.

(In reply to Simon Montagu from comment #40)
> BTW, Amir pointed me to another problematical case: meteg to the left and
> right of a vowel...
Yes, there's no getting away from the requirement to use control codes for users who want to differentiate the various orderings here - otherwise, normalization says they're all the same.
Thanks Jonathan.  I'll pick this now.

Simon, Amir, can you please compile me a test suite?  Just a text file with one test text per line, of all the cases you want to make sure HarfBuzz handles.
All picked up and cleaned up upstream.  Still waiting for test data though.
Attached file Hebrew testcases (obsolete) —
Here are a bunch of Hebrew testcases in a plain text file.
This screenshot is with a current build from Mozilla-Inbound including attachment 599633 [details] [diff] [review]
Sorry, I should have said that the screenshot is with the font set to SBL Hebrew.
https://hg.mozilla.org/mozilla-central/rev/046998d673fc
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Thanks.  Pushed test cases to upstream harfbuzz.