Archæology

The assorted finds of Artefact Publishing

The good and the not so good

I now have a patched version of Mozilla which uses Pango to display difficult scripts. This is good, because now I can see lots of Hindi pages the way they are meant to be. It is also good (but not as much) because it has shown up a bug: Mozilla seems to regard Sanskrit as a language whose script is western. It’s easy to appreciate that this is not in fact the case.

But wait, you say, what about those times when I transliterate the Sanskrit in devanāgarī into the Roman alphabet (with accents)? Surely then the Sanskrit is in a western font? Well, yes, and there we see the problem with associating languages with scripts. Really, the two should be kept completely distinct, and the browser should keep track of what font you wish to use for which Unicode code range (say, for devanāgarī, or Arabic script). All solved, right?
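To make the idea concrete, here is a minimal sketch of a per-range font preference table. The font names and the `font_for_char` helper are placeholders invented for illustration, not anything Mozilla actually provides; only the block boundaries come from Unicode.

```python
# Hypothetical per-range font preferences; the font names are placeholders.
FONT_BY_RANGE = [
    (0x0000, 0x024F, "SomeLatinFont"),       # Basic Latin through Latin Extended-B
    (0x0600, 0x06FF, "SomeArabicFont"),      # Arabic block
    (0x0900, 0x097F, "SomeDevanagariFont"),  # Devanagari block
    (0x1E00, 0x1EFF, "SomeLatinFont"),       # Latin Extended Additional
]
DEFAULT_FONT = "SomeFallbackFont"


def font_for_char(ch):
    """Return the preferred font for one character, based only on its code point."""
    cp = ord(ch)
    for start, end, font in FONT_BY_RANGE:
        if start <= cp <= end:
            return font
    return DEFAULT_FONT
```

With such a table, the accented Roman letters of a transliteration and the devanāgarī original each get their own font, whatever language the text is marked as.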

Well, sadly, no, that wouldn’t do either. Unicode’s handling of Chinese and Japanese scripts uses the same code points for characters which are common to both. However, despite this commonality, they are meant to be handled differently depending on which language is using the character. Make sense? In short, it puts a big obstacle in the way of separating script from language, and that can’t be good.

That said, I make no judgement on the handling of CJK in Unicode, given that I know next to nothing about the languages, their native scripts, or the Unicode treatment and history thereof.

The main thing, though? I can use that Mozilla build to actually see the Sanskrit I sprinkle around here displayed correctly (provided I set the right western font, which makes the English rather hard to read). That is a very good thing indeed.

Posted by jamie on January 29, 2004 15:41+13:00

Comments

Congratulations on Mozilla, enjoy!

Concerning your remarks about languages versus scripts, I noticed earlier that you sometimes add language markup to non‐English phrases written in transliteration. “Wat Pah Nanachat” in the Hut in the Woods article, for instance, is marked‐up as language Thai, which has the result that my Mozilla‐based browser (Galeon) picks the Latin glyphs for it from some Thai font instead of from my normal Latin font — and that looks quite funny.

S.

Posted by: Stefan on January 29, 2004 16:20+13:00

Yes, I’ve noticed the strange font my browser uses on such instances also, but decided that it was a small price to pay for doing the right thing. Of course, the question of whether such language markup is the right thing in the case of transliterations is open to debate — such as the one happening on comp.infosystems.www.authoring.html currently (Message-ID: <Xns9478C40A5472jkorpelacstutfi@193.229.0.31> is a good place to start).

I maintain that regardless of the difficulties, regardless of poor implementations, it’s a good thing to state what language a text is written in, irrespective of script.

Posted by: Jamie on January 29, 2004 19:56+13:00

To update myself, in the bug I raised on Mozilla there is a pointer to the latest draft of Tags for Identifying Languages, the successor to RFC 3066. This allows for a script subcode from ISO 15924, so that one might write sa-Latn for Sanskrit transliterated into Latin script.
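As a rough illustration of how such a tag separates language from script, the sketch below splits a tag into its language and optional script subcode. It is deliberately simplified: it assumes the script subcode, when present, is a four-letter second subtag, and it ignores regions, variants and everything else in the draft. The function name is made up for this example.

```python
def split_language_tag(tag):
    """Split a tag such as 'sa-Latn' into (language, script or None)."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = None
    # Assumption: a four-letter alphabetic second subtag is an ISO 15924 script code.
    if len(subtags) > 1 and len(subtags[1]) == 4 and subtags[1].isalpha():
        script = subtags[1].title()
    return language, script


print(split_language_tag("sa-Latn"))  # ('sa', 'Latn')
print(split_language_tag("en-GB"))    # ('en', None)
```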

Posted by: Jamie on January 29, 2004 20:47+13:00

Well, here’s a question.

Bordering on geeky for me: what is your everyday browser of choice? Is it Mozilla?

Posted by: sue on January 30, 2004 09:30+13:00

Sue, I use Mozilla Firebird for my general browsing. I’m tempted now to use the special build of Mozilla mentioned, but I suspect I will just bring that out on special occasions.

Posted by: Jamie on January 30, 2004 14:07+13:00

I don’t see why the situation with CJK should be any different from the situation with the Roman character set, which multiple languages use. Yes, there are different simplifications of the same character in Japanese and mainland Chinese, but this suggests that all you need to do in the browser is have a map from (language × character-range) to appropriate font.

And then, if you assert that your Roman characters are actually Sanskrit, the browser can choose a different font for them. But then, it sounds as if this is what already happens. So, what’s the problem anyway?
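A minimal sketch of the map described above might look like the following. The table contents, font names and `pick_font` helper are placeholders chosen for illustration, not any browser's actual configuration; the only hard facts are the Unicode block boundaries.

```python
CJK_UNIFIED = (0x4E00, 0x9FFF)  # CJK Unified Ideographs block
LATIN = (0x0000, 0x024F)        # Basic Latin through Latin Extended-B

# Keys are (language, character range); a language of None is the
# language-independent default for that range. Font names are hypothetical.
FONT_MAP = {
    ("ja", CJK_UNIFIED): "SomeJapaneseFont",
    ("zh", CJK_UNIFIED): "SomeChineseFont",
    (None, CJK_UNIFIED): "SomeChineseFont",
    (None, LATIN): "SomeLatinFont",
}


def pick_font(language, ch):
    """Prefer a language-specific entry, then fall back to the default for the range."""
    cp = ord(ch)
    for lang in (language, None):
        for (entry_lang, (start, end)), font in FONT_MAP.items():
            if entry_lang == lang and start <= cp <= end:
                return font
    return "SomeFallbackFont"
```

Here the same ideograph is given a different font under `ja` and `zh`, while Roman letters marked as Sanskrit (`sa`) simply fall through to the ordinary Latin font because no Sanskrit-specific entry exists.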

Posted by: Michael on February 3, 2004 12:34+13:00

Well, you said yourself that you need to match language with script in order to get the right results. However, there is no standard way of doing that just yet (but see the link I posted above), and even if there were it makes the rendering issue much more complicated – you can’t just plonk down some Unicode characters and figure that you’ll get the right result.

As for Sanskrit transliterated into the Roman alphabet, I almost certainly wouldn’t want it displayed in a different font, certainly not by default. This is one of the majority of cases where the script has no, and needs no, connection to language.

Posted by: Jamie on February 3, 2004 13:05+13:00

...it makes the rendering issue much more complicated.

For whom? The browser implementor, or the author? It seems to me that it should be pretty easy for the author as long as they are willing to specify a language for what they write. The browser then just has to implement the map I described, so that the right things happen with CJK characters that differ from language to language. Elsewhere, there probably isn’t an issue at all.

Posted by: Michael on February 3, 2004 16:20+13:00

It isn’t an insurmountable problem in environments where the language can easily be specified (though it is more work for the browser implementor, and while authors should really be specifying language regardless, most do not), but what of situations where it isn’t possible to specify the language, as in a plain text file?

The issue is raised in the Unicode Chinese, Japanese and Korean FAQ — the Unicode Consortium are essentially putting theory over practice (though by the sounds of it only with real problems in a small number of cases).

Posted by: Jamie on February 3, 2004 19:02+13:00