Archæology

The assorted finds of Artefact Publishing

It’s gone all curly (hopefully)

On learning that U+2019 ( ’ ) is the preferred Unicode character for the apostrophe, I have gone and changed all my ASCII apostrophe characters to that. For good measure, I changed the double quotation marks to “ and ” as well. This included specifying content for the :before and :after pseudo-elements of the HTML Q element in the stylesheet.
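A bulk change like this can be scripted. The following is only a minimal sketch in Python, not the method actually used for this site: the function name `curl_quotes` is hypothetical, and the rule for telling opening from closing double quotes is a naive assumption that will get some cases wrong.

```python
import re

APOSTROPHE = "\u2019"    # ’ RIGHT SINGLE QUOTATION MARK
OPEN_DOUBLE = "\u201c"   # “ LEFT DOUBLE QUOTATION MARK
CLOSE_DOUBLE = "\u201d"  # ” RIGHT DOUBLE QUOTATION MARK


def curl_quotes(text: str) -> str:
    """Replace ASCII ' and " with curly equivalents (naive heuristic)."""
    # Every ASCII apostrophe becomes U+2019. Single-quoted phrases are
    # not handled: an opening ' would wrongly become ’ as well.
    text = text.replace("'", APOSTROPHE)
    # A double quote at the start of the text, or directly after
    # whitespace or an opening bracket, is treated as an opening quote.
    text = re.sub(r'(?<![^\s(\[{])"', OPEN_DOUBLE, text)
    # Any double quote still left is treated as a closing quote.
    return text.replace('"', CLOSE_DOUBLE)


print(curl_quotes('He said "it\'s fine".'))
```

Anything inside markup (attribute values, code samples) would need to be excluded first, which this sketch does not attempt.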

So far I haven’t created a situation where the quotation marks produced by the Q element should be specific to a language which uses its own quotation symbols (for example French’s guillemets « and »). In such a case the Q element must be given its own language attribute, even if it occurs within an element which has the same language attribute set, which is just more work than should be necessary.

Anyway, I’m not convinced these new characters look better in my font, and they’ll cause problems for people whose fonts lack them, and it’s certainly slower and more error-prone writing ’ instead of '. However, I’ll put up with a lot for even a semi-spurious ideal of correctness.

Posted by jamie on June 6, 2003 12:10+12:00

Comments

I learnt about this problem from Markus Kuhn at http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html (he has quite a lot of useful stuff on his pages) and subsequently tried to make the change in my own practice. The HTML mnemonics are not so difficult to remember:

&lsquo; &rsquo; &ldquo; &rdquo;

for (left-right)(single-double) quote mark.

Posted by: Michael on June 10, 2003 11:39+12:00

Ah, but it’s not quite so simple. I originally was using the entity references (&rsquo;, etc), but then realised that on my RSS feed, with XML escaping turned off, it failed to validate (since XML doesn’t recognise those entity references). So, I use the decimal notation (&#8217;, etc). While the order is the same, obviously, it’s harder to remember numbers. But I cope.
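For anyone who would rather look the numbers up than remember them, here is a small sketch using the entity table in Python’s standard library. The helper name `decimal_reference` is made up for illustration.

```python
from html.entities import name2codepoint


def decimal_reference(name: str) -> str:
    """Return the decimal character reference for an HTML entity name."""
    return f"&#{name2codepoint[name]};"


# The four quotation-mark mnemonics and their numeric equivalents.
for name in ("lsquo", "rsquo", "ldquo", "rdquo"):
    cp = name2codepoint[name]
    print(f"&{name};  {decimal_reference(name)}  U+{cp:04X}  {chr(cp)}")
```

The decimal values are simply the Unicode code points written in base ten, so &#8217; is U+2019.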

Posted by: Jamie on June 10, 2003 11:50+12:00

Urk. I'm pretty confident that I don't understand what you mean by "XML escaping turned off", but it sounds like you could use the entity references if you turned it on. Is this not possible for RSS feeds, and if not, why not?

Yours in baffled, bemused, and somewhat belligerent astonishment. (Just why is the world so unreasonable? :-)

Posted by: Michael on June 11, 2003 12:26+12:00

The XML escaping essentially means that the special XML characters, like &, are turned into a non-special form, like &amp;. So, if I did have XML escaping on for my RSS feed, it would take &#8217; and turn it into &amp;#8217;, which isn’t very useful. With it turned off, the XML is invalid if I use &rsquo;, because that is not a predefined XML entity, whereas &#8217; is a numeric character reference and needs no declaration.

This is entirely reasonable behaviour. XML allows for Unicode characters through the numerical references, and isn’t burdened by a whole bunch of predefined entities which map to the same thing. Does that make sense?
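Both behaviours can be checked with Python’s standard library. This is a sketch only: the `description` element and the helper `parses` are illustrative, and a bare XML parser stands in for a real feed validator.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses(xml_text: str) -> bool:
    """Return True if xml_text is well-formed according to the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Escaping on: the & is escaped, so the reference no longer resolves
# to a character and readers would see the literal text "&#8217;".
print(escape("it&#8217;s"))

# Escaping off, named entity: rejected, because rsquo is not declared.
print(parses("<description>it&rsquo;s</description>"))

# Escaping off, numeric character reference: accepted.
print(parses("<description>it&#8217;s</description>"))
```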

Posted by: Jamie on June 11, 2003 15:14+12:00

So in what sense can XHTML be considered XML? (Let me guess, none whatsoever, despite my preconceptions. :-)

XHTML uses and accepts these entity references OK. So, does this mean you could change the details of the XML being exported through RSS to make it more XHTMLish and thus able to use entity references?

Posted by: Michael on June 11, 2003 15:31+12:00

RSS is an XML vocabulary and unrelated to XHTML. The reason XHTML allows entity references is because they are defined in the XHTML DTD; see http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd.
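The effect of a declaration can be seen with a small Python sketch. As an assumption made for brevity, it declares a single entity in an internal DTD subset instead of loading the XHTML DTD, and the `p` element and the helper `text_of` are illustrative only.

```python
import xml.etree.ElementTree as ET

# The same content twice: once with rsquo declared, once without.
WITH_DECLARATION = (
    '<!DOCTYPE p [<!ENTITY rsquo "&#8217;">]>'
    "<p>it&rsquo;s</p>"
)
WITHOUT_DECLARATION = "<p>it&rsquo;s</p>"


def text_of(xml_text: str) -> str:
    """Parse xml_text and return the text of its root element."""
    return ET.fromstring(xml_text).text


print(text_of(WITH_DECLARATION))  # the entity expands to U+2019

try:
    text_of(WITHOUT_DECLARATION)
except ET.ParseError as err:
    print("rejected:", err)  # the parser reports an undefined entity
```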

XHTML is an XML vocabulary. Where it becomes a bogosity is in serving it as text/html; serving it as text/xml is merely problematic. See http://www.hixie.ch/advocacy/xhtml.

And now I notice that the automatic markup of URLs as links is useless as it captures following punctuation. Time to tinker.
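One common fix is to match the URL greedily and then hand any trailing punctuation back to the surrounding text. The sketch below is in Python and is not Movable Type’s code; the pattern, the `TRAILING` set, and the function name `autolink` are all assumptions.

```python
import re
from html import escape

# Rough URL pattern: scheme, then anything up to whitespace, <, > or ".
URL_RE = re.compile(r"https?://[^\s<>\"]+")

# Characters that usually belong to the sentence, not the URL.
TRAILING = ".,;:!?)"


def autolink(text: str) -> str:
    """Wrap URLs in <a> tags, leaving trailing punctuation outside."""

    def repl(match: re.Match) -> str:
        url = match.group(0)
        stripped = url.rstrip(TRAILING)
        tail = url[len(stripped):]  # the punctuation that was removed
        safe = escape(stripped)     # escapes & and quotes for HTML
        return f'<a href="{safe}">{safe}</a>{tail}'

    return URL_RE.sub(repl, text)


print(autolink("See http://www.hixie.ch/advocacy/xhtml."))
```

A URL that really does end in one of those characters, such as a closing parenthesis, will be cut short, so this trades one kind of error for a rarer one. The sketch also leaves the surrounding text unescaped.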

Posted by: Jamie on June 11, 2003 15:57+12:00

I thought you might say something like that.

So here's the inevitable follow-up: why doesn't RSS define the same entity references in its DTD as XHTML? Given that RSS is used to transfer HTML snippets, this seems a pretty reasonable thing to do...

Posted by: Michael on June 12, 2003 10:50+12:00

Well, the DTD for RSS 0.91 defines some entities for this sort of thing, but not the ones in question. As for other versions of RSS, who knows? There doesn’t seem to be a DTD anywhere, except presumably in the bowels of the validators, where I can’t be bothered checking. Of course these validators must work by assuming that what is fed to them is some sort of RSS, since there is no DTD in the feeds generated by Movable Type. That statement should tell you all you need to know about the way RSS has developed and is used.

Posted by: Jamie on June 12, 2003 11:18+12:00

Why do you use entities for quotation marks and other non‐ASCII characters anyway? On my own website (http://staff.washington.edu/baums/), I just mark pages as UTF-8‐encoded and then use actual characters in the text — no entities for me.
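To an XML parser the two approaches are equivalent, as this small Python sketch suggests. The `p` element is illustrative, and the document carries no encoding declaration, so the parser falls back to XML’s default of UTF-8.

```python
import xml.etree.ElementTree as ET

# The same apostrophe, once as a literal UTF-8 character and once as a
# numeric character reference in otherwise pure-ASCII markup.
literal = "<p>it\u2019s</p>".encode("utf-8")
reference = b"<p>it&#8217;s</p>"

# U+2019 occupies three bytes in UTF-8.
print("\u2019".encode("utf-8"))

# Both documents yield the same text once parsed.
print(ET.fromstring(literal).text == ET.fromstring(reference).text)
```

The literal form only works if every tool in the chain agrees that the bytes are UTF-8, which is exactly what the comment form on this page may not do.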

PS. There seems to be a chance that your comment‐posting system does not handle UTF-8 properly, but I’ll try anyway.

Posted by: Stefan Baums on December 16, 2003 16:09+13:00

I would generally prefer to use the characters themselves rather than character references, but I haven’t made the necessary changes to the CGI parts of Movable Type to use UTF-8 so that such things will display nicely when I am editing.

An earlier issue that continues to some extent is the lack of an easy and fast means of entering such characters: I can type &#8212; faster than I can do bits of trickery to get the character itself.

Posted by: Jamie on December 16, 2003 20:28+13:00

I write my webpages in GNU Emacs using an input method I wrote myself in Emacs Lisp (available from and explained on my website). But you are right. What is severely lacking from the free software world is a good and well‐documented system‐wide input method system, and an easy‐to‐use program to design your own input methods for that system. The current situation with Xmodmap (not flexible enough), XIM (horrendously undocumented) and any number of unstable other solutions (IIIMF, you name it) is just atrocious.

Posted by: Stefan Baums on December 16, 2003 21:29+13:00