Aug 14


I am pleased to announce the release of Rupakara, a font that supports the new INDIAN RUPEE SIGN, as well as the letters used to transliterate Indic scripts into Latin script. I was inspired to make this font available by Unni Koroth of Foradian Technologies, who wrote to describe my work to help encode the character.

The article about the INDIAN RUPEE SIGN on the Hindi edition of the Wikipedia spells my name माइकल ऍवरन Māikal Ĕvaran (though this would be read Māikala Ĕvarana in Sanskrit). Later someone corrected this to ऍवरसन Ĕvarasan and finally to एवर्सन Evarsan.

Unni Koroth blogged a notice about the UTC decision and also blogged an interview with me about Rupakara.

I’ve just learned that tomorrow, 15 August, is India’s Independence Day. I am happy to dedicate Rupakara as a gift to India on this auspicious day.

Follow-up, 22 August: Some folks at the Management Scholars Academy of India have blogged about using Rupakara.

Jul 23

The International Phonetic Alphabet is based on the Latin alphabet A-Z, with a lot of extensions. There are extensions like “Latin script a” ɑ, like “Latin epsilon” ɛ, like “Latin gamma” ɣ, like “Latin eng” ŋ, like “Latin phi” ɸ, and so on. Notice the following:

  • Latin ɛ is fairly similar to Greek ε, though its capital is Ɛ and the Greek’s is Ε.
  • Latin ɣ is rather different from Greek γ, being symmetrical with a loop; its capital is Ɣ and the Greek’s is Γ.
  • Latin ɸ is distinctly different from Greek φ, having strong serifs in its ascender and descender; it has no capital and the Greek’s capital is Φ.

And this is fine. These Latin letters were “disunified” from Greek a long time ago, and the UCS contains all of them as uniquely encoded characters. Three letters, however, were not disunified, and are problematic.

  • U+03B2 ( β ) GREEK SMALL LETTER BETA
  • U+03B8 ( θ ) GREEK SMALL LETTER THETA
  • U+03C7 ( χ ) GREEK SMALL LETTER CHI

Now the first and third of these do have non-Greek shapes, just as Latin phi does. Here’s an example from Daniel Jones’ Outline of English Phonetics (1932)—click on the image to see it larger if you like:
Latin beta from Jones 1932

Now, the serifs on that beta’s descender are very atypical indeed in Greek typography. Moreover, the fact that the letter is unified with Greek can cause some trouble in sorting multilingual data, since in a typical English or German or French sort (for instance) the Latin alphabet sorts first, then the Greek alphabet, then the Cyrillic, and then others. In practice this means that β does not sort after b (where one might expect it), but after z.
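A rough way to see the effect: Python's default string comparison orders by code point. That is only a stand-in for a real language-aware sort, but it does put the whole basic Latin alphabet ahead of Greek. The sample letters below are chosen purely for illustration.

```python
# Python compares strings by code point. This is NOT a language-aware
# collation, only a simple stand-in that places Latin letters before Greek.
letters = ["z", "β", "b", "c"]   # β is U+03B2 GREEK SMALL LETTER BETA

ordered = sorted(letters)
print(ordered)  # ['b', 'c', 'z', 'β']
```

The beta lands after z rather than next to b, which is the behaviour described above.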

The IPA chi can also differ from the typical Greek chi. In the 1949 Handbook of the IPA, the serifs on the letter are on the top-right to bottom-left branch of the x; the other branch is curved.
Latin beta from Jones 1932
A point to remember is that the intent of the IPA chi was originally not that it was unified with Greek chi, but rather that it was different:

The non-roman letters of the International Phonetic Alphabet have been designed as far as possible to harmonise well with the roman letters. The Association does not recognise makeshift letters; it recognises only letters which have been carefully cut so as to be in harmony with the other letters. For instance, the Greek letters included in the International Phonetic Alphabet are cut in roman adaptations.

Let’s compare capital and small Latin Xx, Greek Χχ, and that IPA chi. Now it’s possible that, because Greek fonts have been in use for a good while, some people might prefer a greekish glyph to a latinish glyph. Nevertheless, take note of the weight of that older IPA chi, and compare it to the “stretched x” shape.
Exes and chis
But in fact there’s another reason to encode a Latin chi. Lepsius made use of it in his transcription of Chukchi, and there its capital is entirely different from the capital used in Greek. Now, there is precedent for just this kind of thing being a reason to disunify: Cyrillic Ԛ and ԛ (used in Kurdish) were disunified from Latin Q and q because the capital Cyrillic one sometimes looks like an oversized small one.
Latin beta from Jones 1932
So, what it looks like is that we have the following—Latin x, Greek chi, and Latin chi (both greekish and latinish glyphs are shown):
Exes and chis
Let’s assume that LATIN LETTER CHI and LATIN LETTER BETA get encoded (leaving aside the question of THETA for now). Now the big question for the IPA is, what should be done when they are? The current recommendation is “use GREEK LETTER CHI”, but of course there’s no alternative. When there is… well, I for one would prefer a Latin letter that sorts between x and y, rather than a Greek letter that sorts between φ and ψ.

There is certainly data out there using the Greek letters β and χ and θ. Of course, there is also data out there using non-Unicode fonts, or SAMPA, or other things. In my opinion, the right thing to do is bite the bullet, get Latin beta, chi, and theta encoded, and get the recommendations promulgated through fonts and keyboard drivers. But I do not know what the view of the International Phonetic Association might be.

Here is an example of some functionality related to this. I created a number of folders named “a_la”, where the “_” is replaced by various letters.

Sorting folders
It’s easy to see that in the Mac OS, Latin letters sort before Greek. Thorn þ sorts correctly after z. Eth ð after d. IPA ɡ after g, followed by IPA gamma ɣ. Small capital ɪ and Latin iota ɩ follow i, as expected. Then, after þ, we see that the Greek alphabet appears in its correct order. But I am sure that I want IPA beta to sort after b, not after þ, and likewise IPA chi after x. I am torn between wanting IPA theta to sort after t or after þ, but probably the former. Anyway, I want a disunification of these three IPA letters from Greek.
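The ordering wished for here can be sketched with a small custom sort key. This is only an illustration under stated assumptions: the names `TAILORING`, `char_key` and `sort_key` are hypothetical, the placement of theta after t follows the tentative preference above, and everything not listed falls back to plain code point order, which is not how the Mac OS (or any real collation) sorts.

```python
# Hypothetical tailoring: each IPA letter is treated as sorting immediately
# after a chosen Latin base letter. Unlisted characters keep code point order.
TAILORING = {
    "β": ("b", 1),  # IPA beta right after b
    "θ": ("t", 1),  # IPA theta right after t (a tentative choice)
    "χ": ("x", 1),  # IPA chi right after x
}

def char_key(ch):
    """Return a (base letter, rank) pair used to compare single characters."""
    return TAILORING.get(ch, (ch, 0))

def sort_key(name):
    """Build a per-character key for a whole folder name."""
    return [char_key(ch) for ch in name]

# Folder names in the style of the experiment above: "a_la" with "_" replaced.
folders = [f"a{ch}la" for ch in "zχxθtβbþ"]
tailored = sorted(folders, key=sort_key)
print(tailored)
```

With this key, aβla follows abla, aθla follows atla, and aχla follows axla, while þ still sorts after z.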

Nov 19

The other week I worked on a project to “rehabilitate” two already-encoded letters that are badly specified, and which cause problems to people using Cyrillic in the UCS. Not problems just for the end user, but problems for implementers as well. The characters in question are U+0478 CYRILLIC CAPITAL LETTER UK, U+0479 CYRILLIC SMALL LETTER UK, U+047C CYRILLIC CAPITAL LETTER OMEGA WITH TITLO, U+047D CYRILLIC SMALL LETTER OMEGA WITH TITLO. The exciting story is found in this document.

My idea was to come up with practical solutions that will avoid ambiguity. On the other hand, theoretical perfection is something we don’t have the luxury for. We are doing damage control on bad choices made more than a decade ago! I am sure we would not have made those mistakes were we encoding Cyrillic for the first time today.

Today, I think we would have encoded a BROAD OMEGA and used diacritics for the beautiful omega or other things, and we would have encoded MONOGRAPH UK and left digraph UK to be encoded as a string of characters, Cyrillic о and у. Solution 2b and 3b in my document were attempts to achieve that situation, which would have been ideal, in my view.

The UTC was conservative on the side of stability, and more or less chose solutions 2a and 3a. (It’s not done till it’s published of course.) I had a concern that if they choose 2a, it will be possible to represent beautiful omega both as 047D and as BROAD OMEGA with two diacritics, and those will not be equivalent, which would cause ambiguity in text representation. (Of course, we have this now with OMEGA WITH TITLO, so the situation would not be worse than it is today.)

I thought that the case against 3a is a good deal stronger. A number of vendors are happy shipping monograph glyphs for 0479, and this poses no security issues. Looking at the Cyrillic fonts shipping with Windows XP, however, I found that all but one of them avoid encoding this character at all. My guess is that this is a question of security. So… we still have a problem here, since digraph UK can be represented by two letters, or (in principle) by this UK. I am thinking that the best solution for security’s sake is to recommend that the reference glyphs for 0479 be drawn with half-width letters, to distinguish the character and make it unappealing to use at all. This is tantamount to deprecation; if everyone does this in their fonts, it would be a real solution.
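The double representation can be checked with Python's standard `unicodedata` module. This is a minimal sketch; the variable names are only illustrative, and the result reflects the Unicode data shipped with whichever Python version runs it.

```python
import unicodedata

UK = "\u0479"              # CYRILLIC SMALL LETTER UK
DIGRAPH = "\u043E\u0443"   # CYRILLIC SMALL LETTER O + CYRILLIC SMALL LETTER U

# For each normalization form, record whether the single character and the
# two-letter sequence normalize to the same string.
results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    results[form] = (
        unicodedata.normalize(form, UK) == unicodedata.normalize(form, DIGRAPH)
    )

print(unicodedata.name(UK))
print(results)
```

If every value comes back False, normalization never unifies the two spellings, so text that looks the same as a digraph can be stored in two different ways.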

Oct 25

Last week I finalized the proposal to encode the Avestan script, which I had a lot of fun working on with Roozbeh. I also helped put together a proposal for a Bopomofo character (with Andrew West) and a proposal for eight more Arabic characters, with Roozbeh again and with his wife Elnaz.

This week, I’m wrestling with Old Cyrillic and Meitei Mayek.
