On the Nepali Language and Unicode

The Nepali language gets very little representation on the internet. Take, for example, the Nepali Wikipedia, which has about 33 thousand articles. The Esperanto Wikipedia boasts 8 times that number (at around two hundred thousand articles), which is kind of sad, because Esperanto is an artificial language created by one person in the 19th century. It is spoken by a meager 2 million people worldwide. Compare this to the Nepali language, which has more than 25 million speakers.

It’s as they say, “whoever controls the media controls the mind”. And the media of the 21st century, the internet, is so desperately out of Nepali’s hands that not only do we read the news, articles, the weather, and novels in English, but we go on to caption our photos of warm, intimate moments in English. Our pop culture is stuffed full of references to Hindi media. My little brother, who is not even 10, will sniff my phone from a mile away and hide out in some corner to watch Motu Patlu on YouTube. He might not be as fluent in Hindi as he is in Nepali. But in his little world, Hindi is the language that superheroes speak. And Nepali? Only his annoying brother.

I’m not saying that I’m any better. If anything, I’m far worse. I’m one of those people who, after being told a phone number, says, “Okay, repeat that. But in English”. Ask me to translate this very post to Nepali and watch how fast I run away. My friends are, frankly, no better.

The problems are much worse for other languages of Nepal. The Newari language is listed by UNESCO as being “definitely endangered”. The Kusunda language has speakers in single-digit numbers, which is particularly unfortunate because Kusunda happens to be a language isolate, so once it’s gone, we’ll lose any chance of accurately reconstructing it. Half of the 123 languages of Nepal are endangered.

In this post, I will talk about Unicode for the Nepali language, in the hope of shedding some technical light on it for the uninitiated. The idea is to let interested people know of the tools and ideas in Unicode, especially in the context of the Nepali language, to encourage adoption and usage.

Some background

Note: The numbers starting with ‘0x’ and ‘U+’ are hexadecimal numbers. The ‘U+’ prefix additionally implies that the number following it is a Unicode code point.

Computers fundamentally think in numbers. To a computer, a picture is a 2D matrix of color intensity at each pixel as recorded by the camera. An audio recording is, similarly, a sequence of amplitudes of a sound wave recorded several thousand times each second. It goes without saying that text is also read by the computer as a sequence of numbers.

Of course, what the numbers mean to the computer is entirely dependent on the program that reads the data. In the case of a picture, the numbers are used to vary the brightness of color elements on the monitor. For audio, the data sequence is used to proportionally adjust the voltage across the speaker coil, which causes the membrane to vibrate and create sound. For text, these numbers will be used to do a wide variety of tasks. If you’re typing, numbers corresponding to the letters you press will be stored in memory. These numbers will be used to look up characters inside the font file, which stores a visual glyph for each displayable number. That glyph will be drawn on the screen (after many more steps).

So, it seems that for the computer system to support text at all, characters will have to have a standardized mapping to numbers. If the font file and the keyboard don’t agree on what the number 65 is, then the wrong character will be displayed on the monitor.

ASCII is one such standard mapping of numbers to characters. It assigns the numbers from 0 to 127 to English letters and symbols. Because the early development of computers was centralized in the US and UK, ASCII encoding became the primary method of text representation on computers.
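As a small illustration of this mapping, Python's built-in ord and chr convert between a character and its number. This is only a sketch; the helper name is_ascii is made up for this example and is not part of any standard.

```python
# ord() gives the number behind a character; chr() goes the other way.
print(ord("A"))   # 65
print(chr(97))    # a


def is_ascii(text):
    """Return True if every character falls in the ASCII range 0..127.

    Hypothetical helper name, used only for this illustration.
    """
    return all(ord(ch) < 128 for ch in text)


print(is_ascii("Kathmandu"))  # True
print(is_ascii("काठमाडौं"))    # False
```

English text passes the check, while anything written in Devanagari falls outside the 0 to 127 range.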

Computers almost universally store data in blocks of 8 bits (a byte), which can represent 256 numbers (2^8). ASCII only uses half of that (128 characters). So, with ASCII, half of the representational space of a byte goes unused. There have been several extensions to ASCII which make use of the unused space from 128 to 255.

So how do you type Nepali in ASCII? Well, you don’t. ASCII is, by its very definition, American. In the early ’90s, the Bureau of Indian Standards came up with its very own encoding scheme called ISCII. ISCII is an extension of ASCII, so it retains all characters from ASCII in their rightful place. From 128 to 255, it stores characters used in the languages of India. Interestingly, ISCII actually unified a bunch of similar scripts like Devanagari, Tamil, Bengali, and Oriya. So, the number 164 (0xA4) in ISCII can mean any of the following: अ, அ, অ, ଅ and more. The idea was that, since all these scripts are descendants of the same Brahmi script, they behave similarly and represent similar sounds. That is true: all four of the above letters represent the same /ʌ/ sound.

To switch between scripts in ISCII, you just change the font rendering the numbers. If a Devanagari font is selected, the computer displays Devanagari characters. If a Tamil font is selected, it displays the same text in Tamil. The equivalent for the English language would be to assign numbers to the 26 English letters, and switch cases by changing fonts. The phonetic information stays the same but the characters representing it look different. ISCII used a special ATR byte to specify which script a fragment of text was in. ISCII is important historically, because the Unicode specification for Devanagari and other Indic scripts is based almost entirely on ISCII. The ATR byte was not adopted by Unicode because font attributes are not a part of Unicode.
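Because the Unicode Indic blocks kept ISCII's parallel ordering, a similar "switch" can be sketched today by moving a character to the same position in another script's block. The snippet below is a rough illustration and not a real transliterator: the function name and the table are my own, and since not every position is filled in every script, some shifted code points will be unassigned.

```python
# First code point of each script's Unicode block. The blocks share ISCII's
# ordering, so the same offset inside each block tends to hold the "same" letter.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
}


def shift_script(ch, source, target):
    """Move a character to the same offset in another script's block.

    Hypothetical helper for illustration; it does no validity checks.
    """
    offset = ord(ch) - BLOCK_START[source]
    return chr(BLOCK_START[target] + offset)


for script in ("tamil", "bengali", "oriya"):
    print(script, shift_script("अ", "devanagari", script))
```

Starting from the Devanagari अ, this should print the Tamil, Bengali and Oriya letters listed earlier.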

In Nepal, however, the government didn’t initiate any efforts. What I think happened instead was that people or groups (like Muni Shakya) independently assigned the 256 numbers to various characters and ligatures. Which character a specific number represented was completely up to the font you chose. That must have been messy. Fonts are meant to change how the text looks, not what it represents!

The table below shows the character mapping of two Nepali fonts from the late ’90s. You’ll notice how the same cell in the two tables doesn’t encode the same character. Both learning to type in these fonts and changing between them must have been an inconvenience.

unicodenepalione.png

So far, I have talked exclusively about the representation of text. Typing was a different beast entirely. Suppose that you were using the Annapurna font. Since keyboards come in the standard QWERTY layout, pressing the A key (which is internally ASCII 65) would make some Nepali character appear. But on changing the font to Sabdatara, that same character would change to a द्ब. The image below demonstrates how typing कलम in the Sabdatara font and then changing it to Annapurna results in garbage.

kalamin3fonts.png

But thankfully, over time these mappings stabilized into what is now known as the Traditional or Remington’s keyboard layout. Generally, if someone over 30 tells you that they can type in Nepali, what they probably mean is that they can type in the traditional keyboard layout. It is what you type in if you use the Sagarmatha or the Preeti font.

Unicode

Naturally, India was not alone in wanting an encoding for its scripts. The late ’80s and the ’90s saw the rise of encoding schemes such as PASCII, VSCII, JIS X, many ISO standards, and many more, each developed by a particular country or a group of countries to encode their languages. All this was fine and dandy for the systems back then. But then came the Internet.

The Internet pretty much broke everything. Computers all over the world were expected to understand each other, but they spoke in different encodings. If you didn’t encode your data in the same format as the receiving computer was set to, your text would render as garbage. The problem was that a number in one computer in one country represented something entirely different in another computer in another country.
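The effect is easy to reproduce with Python's built-in codecs. The sketch below reads one byte under two legacy encodings, and then decodes UTF-8 bytes with the wrong codec to produce the kind of garbage described above. The sample word is just an example.

```python
# One and the same byte, read under two different legacy encodings.
raw = bytes([0xC0])
print(raw.decode("latin-1"))  # À (Western European)
print(raw.decode("cp1251"))   # А (Cyrillic capital A, not the Latin letter)

# Text encoded as UTF-8 but decoded by a receiver that assumes Latin-1.
text = "नेपाली"
garbled = text.encode("utf-8").decode("latin-1")
print(repr(garbled))

# No bytes were lost, so undoing the wrong step recovers the original text.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == text)  # True
```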

Unicode was designed from the ground up to support all scripts of the world. It currently has 1,37,994 characters from 150 scripts. Unicode assigns each character its own unique identifying number. For example, the number 2325 (U+0915) uniquely identifies the Devanagari letter क, the number 70658 (U+11402) identifies the Newari letter 𑐂, the number 65 (0x41) represents the English capital letter A and the number 24859 (0x611b) the unified Chinese, Japanese and Korean ideograph 愛. These numbers mean the same thing no matter where in the world you are or what computer you are using. These numbers are also immutable, meaning that Unicode won’t ever change what a number means, hence maintaining backward compatibility.
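These numbers can be checked from any Python prompt. A minimal sketch using the standard unicodedata module follows; the fallback string is there because an older Python build may not know the names of recently added characters.

```python
import unicodedata

for ch in ["क", "𑐂", "A", "愛"]:
    cp = ord(ch)  # the code point as a plain integer
    name = unicodedata.name(ch, "<name unknown to this Python build>")
    print(f"{ch}  decimal {cp}  U+{cp:04X}  {name}")
```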

codepoint_table.png

This table lists all Nepali characters in the Unicode Devanagari block with their code point numbers.

Characters of the same script live in the same block. A block is essentially a range of numbers. For example, Devanagari characters live in the U+0900 to U+097F block. Newa is in the U+11400 to U+1147F block. The Limbu writing system, the Sirijanga script, lives in the U+1900 to U+194F block.
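Since a block is just a range, finding a character's block is a simple comparison. The sketch below covers only three blocks, with ranges as given in the Unicode code charts; the table and the function name block_of are made up for this illustration.

```python
# A few blocks as (first, last) code point ranges, per the Unicode charts.
BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Limbu": (0x1900, 0x194F),
    "Newa": (0x11400, 0x1147F),
}


def block_of(ch):
    """Return the name of the block containing ch, or None if not listed.

    Hypothetical helper; a real implementation would load the full block list.
    """
    cp = ord(ch)
    for name, (first, last) in BLOCKS.items():
        if first <= cp <= last:
            return name
    return None


print(block_of("क"))  # Devanagari
print(block_of("𑐂"))  # Newa
print(block_of("A"))  # None
```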

The advantage of separating scripts into blocks is that each block can get its own kind of treatment from the software. We know that different scripts behave differently. In Devanagari, some characters go above the previous character, some under. Some change the character itself. In Tibetan, diacritics stack up vertically. A single Urdu character, when typed, can change how the entire word looks. Not to mention that scripts have different directions of reading. Chinese is written from top to bottom. Urdu is written from right to left. All these quirks can be handled by software based on what block the characters come from.

The Devanagari block is shown below. It was taken from Wikipedia.

wikipediaunicodetable.png

Now, if you look at the table above, you will notice a lot of characters missing. Where is क्ष? Or द्य? Or all the half forms like क्‍, ग्‍, and the rest? Well, because Unicode was based on ISCII, and since Hindi considers क्ष, त्र and ज्ञ to be compound characters, they aren’t allocated separate code points. That doesn’t mean you cannot type them, though. You just have to input the constituent characters and the text-rendering system will take care of the rest. क् + ष makes क्ष, त् + र makes त्र and ज् + ञ makes ज्ञ automatically.
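To see that these compound characters really are sequences, one can build them from their constituents in Python. A minimal sketch follows; whether the result is drawn as a single conjunct depends on the font and text renderer, not on the code.

```python
HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA

ksha = "क" + HALANT + "ष"
tra = "त" + HALANT + "र"
gya = "ज" + HALANT + "ञ"

for conjunct in (ksha, tra, gya):
    # Each conjunct looks like one letter but is stored as three code points.
    print(conjunct, len(conjunct), [f"U+{ord(c):04X}" for c in conjunct])
```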

consonantforms.png

The Unicode specification on how software should handle compound characters in Indic scripts.

So then, what if you want to type out क्‌ explicitly? Unicode provides a character called the Zero Width Non-Joiner (ZWNJ), zero width because it doesn’t take up any space in the text. Placed after the Halant (्), it prevents the Halant from changing the preceding character to its half form. Similarly, there is a Zero Width Joiner (ZWJ), which explicitly asks for the half form to be displayed, even if there isn’t a character next to it.
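In code, both joiners are ordinary (if invisible) code points: U+200C for the ZWNJ and U+200D for the ZWJ. The sketch below builds the three variants; how each one is actually drawn depends on the font and shaping engine, so the comments describe the requested form, not a guaranteed result.

```python
HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER

conjunct = "क" + HALANT + "ष"                # normally shaped as a conjunct
explicit_halant = "क" + HALANT + ZWNJ + "ष"  # asks for a visible halant on क
half_form = "क" + HALANT + ZWJ               # asks for the half form of क

variants = [
    ("conjunct", conjunct),
    ("explicit halant", explicit_halant),
    ("half form", half_form),
]
for label, text in variants:
    print(label, text, [f"U+{ord(c):04X}" for c in text])
```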

Now, if you look at the table for the Sabdatara and Annapurna fonts above, you’ll notice that the half forms are explicitly encoded. That is probably an artifact from the printing press era. When typesetting for the printing press, or when using a Linotype, a क followed by a ् wouldn’t magically change into क्‍. You would have to select the correct type blocks manually. That kind of thinking could have carried over into the font design process.

But computers are smarter than that. They will easily substitute the glyphs for you. The only thing to keep in mind is to use your ZWJs and ZWNJs when you want the other forms.

Unicode intends to depict the underlying characters rather than the renderings. Think of it this way: प्राप्त is the rendering and प + ् + र + ा + प + ् + त is its underlying character composition. Unicode delegates the rendering to the display system. This model of text encoding is called the Virama-based model (virama meaning halanta), and is not unique to Devanagari.
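The underlying composition can be inspected directly. This small sketch joins the seven characters into the word and then lists what is actually stored, using only the standard unicodedata module.

```python
import unicodedata

# The seven underlying characters, in typing order.
parts = ["प", "्", "र", "ा", "प", "्", "त"]
word = "".join(parts)

print(word, "is stored as", len(word), "code points:")
for ch in word:
    print(f"  U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

What the screen shows as a few joined shapes is, in memory, a flat sequence of seven code points.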

The Unicode 12.0 specification has an insightful image to clarify further.

purtiunicode.png

UTF

Unicode is mostly concerned with assigning abstract numbers to characters, and describing their properties. A Unicode Transformation Format, or UTF, is how the Unicode specification is put into practice. UTFs are the algorithms that convert these abstract numbers into the bits and bytes that are eventually stored in memory, processed or transmitted.

Only three Unicode Transformation Formats are in widespread use: UTF-8, UTF-16 and UTF-32. The numbers in their names signify the size, in bits, of the basic unit each one works with. So UTF-8 works in units of 8 bits (one byte) and uses one to four of them per character, UTF-16 uses one or two 2-byte units, and UTF-32 always uses 4 bytes. The figure below will try to demonstrate how UTF-8 works. It’s okay if it ends up confusing you instead. Unless you are writing multilingual computer programs, it is not generally necessary to know how these encoding schemes work. Just remember that almost all of the internet uses UTF-8, because it is backward compatible with ASCII and compact for most web content.

utf8rocketinggraph.png
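For readers who prefer code to diagrams, here is a hand-written sketch of the UTF-8 bit-packing rules. It is illustrative only: the function name is my own, it skips validity checks (such as rejecting surrogates), and real programs should simply call str.encode("utf-8").

```python
def utf8_encode(code_point):
    """Encode one Unicode code point into UTF-8 bytes by hand (sketch)."""
    if code_point < 0x80:      # up to 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:     # up to 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:   # up to 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ["A", "क", "𑐂"]:
    encoded = utf8_encode(ord(ch))
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(ch, hex_bytes, len(encoded), "byte(s)")
```

An English letter takes one byte, a Devanagari letter three, and a Newa letter four, which is why the byte count of a UTF-8 file does not equal its character count.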

Nepali Input

Unless you can already type in the traditional layout, I recommend that you try out the Phonetic layout by Nepali Language Technology Kendra. I’m sure any reasonably dedicated person can learn this layout in a day.

As it turns out, it is a fairly simple exercise to create a new keyboard layout from scratch. I was able to make myself a modified version of the Phonetic layout in less than an hour. To make your own layout, download and install the Microsoft Keyboard Layout Creator tool.

The Nepal Language Technology Kendra website also provides Nepali keyboard layouts for Linux. On Android phones, you can install either Hamro Keyboard or Google Indic Keyboard, both of which are excellent. On iPhones and iMacs, simply activating the required input method in Settings should suffice.

Fonts

compositionsdevanagari.png

Google Fonts has a usable selection of fonts for Devanagari. To filter the fonts by language, find the Language dropdown and select Devanagari (even though it isn’t a language).

Featured photo of Linotype by werner moser from FreeImages.com