Making मनसा: a Nepali Programming Language

I’ve been work­ing on and off on a Nepali pro­gram­ming lan­guage with my friends for the last few months. It’s called मनसा (IAST: mansā) and I think it’s ready for an al­pha re­lease. If you’d like to try the lan­guage out, visit mnsa.cc - the of­fi­cial web­site. You can play around with the lan­guage right in the browser with­out hav­ing to down­load any­thing, not even a De­vana­gari key­board lay­out.

This post is a collection of random things I want to say about the language, including how the idea came about, the interesting things I learnt making the project, and the problems we faced.

How it all started

In the sixth semester, we are required to submit a team project by the end of the term. Most teams in my class were making boring things, like hotel management systems, or polling websites. If you get to choose your own project, why not do something exciting and new? We settled on making a programming language. “Why not in Nepali?” I said, and after a long debate, we settled on building a Nepali programming language.

In the be­gin­ning, we thought a Nepali pro­gram­ming lan­guage would­n’t have any prac­ti­cal use at all. We thought that all this pro­ject would ever be is a nov­elty, a fun thing to play with and for­get about, some­thing like Brainfuck or Be­funge. But as we con­tin­ued to look into things to in­clude in our pro­posal, we re­al­ized that a Nepali pro­gram­ming lan­guage might have a few very niche but very prac­ti­cal uses.

After some more deliberation, we decided that it had to be a compiled language, because that was the most difficult to make in terms of code, and the “coolest”. It was exciting and somewhat uncertain because we don’t have Compiler Construction classes in our syllabus, so we had basically zero idea on how we’d proceed. And we didn’t know how to work with Nepali characters in the program. It was a great learning experience.

The language

The lan­guage is ac­tu­ally very ten­ta­tive at this point and we’re still re­fin­ing it. You should be able to get the gen­eral idea of the syn­tax by the small ex­am­ple code be­low. If you don’t, wait for the of­fi­cial doc­u­men­ta­tion to be fin­ished. It’s in the works.

We have made a few changes to the conventional structure of a program statement to make it more expressive. For example, in function calls, the name of the function comes after the parameters, as the verb comes at the end of a sentence in Nepali. So “print Something” becomes something लेख. We similarly modified other constructs such as conditionals and loops to fit the Nepali grammar structure better. Technically, it is a very small change but it really does help make the programs more readable in Nepali. It was surprising how fluid the language becomes with changes so simple.

The compiler

The com­piler was writ­ten in C++ com­pletely from scratch. I did­n’t want to use lexer or parser gen­er­a­tors be­cause these tools hide the in­ter­est­ing and com­pli­cated de­tails. That is a good thing if your goal is to make good, main­tain­able com­pil­ers but, as a stu­dent, I wanted to ex­plore all the gory de­tails and learn every­thing I could. I should men­tion here that the se­ries of lec­tures by Alex Aiken is an amaz­ing re­source, and if you’re plan­ning on do­ing com­pil­ers too, you can at least skim through the videos to see the big pic­ture.

I used Visual Studio Code to write C++ code this time. I actually started coding using plain old vim as always, but eventually it became so unmanageable with files all over the place that I had to find a more manageable tool. I went with VS Code, and I’m glad I did because it is incredibly well made. And the C/C++ extension is so good. It apparently makes an AST for my code on the fly and checks for all kinds of errors. I had never before used such sophisticated tools for writing C++. I generally used vanilla Sublime with a vim extension, or just command line vim. But really it’s so nice. It’s like discovering geysers for the first time when you’ve been showering with cold water all your life (strangely specific metaphor? sorry).

When writ­ing the com­piler, I used many new C++17 fea­tures. I was drawn into mod­ern C++ first be­cause of con­s­t­expr which al­lows you to make func­tions that are eval­u­ated at com­pile time. I wanted to build a fast but man­age­able lex­i­cal analyser us­ing a com­po­si­tion of con­s­t­expr func­tions but I re­alised that I was try­ing to pre­ma­turely op­ti­mize again, so I con­trolled my­self. Maybe in ver­sion 0.2.
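To make the constexpr idea concrete, here is a minimal sketch of the kind of character classification a lexer could push to compile time. It is not taken from the मनसा compiler: the function names are hypothetical, and it assumes the input has already been decoded into Unicode code points.

```cpp
// Hypothetical helpers for illustration only, not from the मनसा source.

// The Devanagari block in Unicode spans U+0900..U+097F.
constexpr bool isDevanagari(char32_t cp) {
    return cp >= 0x0900 && cp <= 0x097F;
}

// The Devanagari digits occupy U+0966 (zero) through U+096F (nine).
constexpr bool isDevanagariDigit(char32_t cp) {
    return cp >= 0x0966 && cp <= 0x096F;
}

// Numeric value of a Devanagari digit, or -1 if cp is not one.
constexpr int digitValue(char32_t cp) {
    return isDevanagariDigit(cp) ? static_cast<int>(cp - 0x0966) : -1;
}

// Because the functions are constexpr, these checks are evaluated by the
// compiler; a wrong range would fail the build rather than the program.
static_assert(isDevanagari(U'\u092E'), "U+092E (the letter ma) is Devanagari");
static_assert(!isDevanagari(U'a'), "ASCII letters are not Devanagari");
static_assert(digitValue(U'\u096B') == 5, "U+096B is the digit five");
```

The same functions remain callable at run time with values that are only known then, so nothing is lost by marking them constexpr.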

I also used lambda functions (part of the language since C++11) to declare nested functions to eliminate repeated actions that are only relevant inside a specific function. Here’s an example:

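// Local helper: builds a declaration node of the given type.
// Captures stmt and the parser state by reference.
auto makeDeclNode = [&](ast::astType a) {
   stmt.reset(new ast::Ast);
   astNode declaration;
   // The assignment is intentional: continue only if a declaration list was parsed.
   if ((declaration = declList())) {
      stmt->type = a;
      stmt->left = std::move(declaration);
   } else {
      errQuit("declarations");
   }
};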

if (accept(lexer::tokenType::_int)) {
   makeDeclNode(ast::astType::_intDecl);
} else if (accept(lexer::tokenType::_str)) {
   makeDeclNode(ast::astType::_strDecl);
} else if (accept(lexer::tokenType::_bool)) {
   makeDeclNode(ast::astType::_boolDecl);
}

Maybe it’s because I’ve been coding a lot in JavaScript these days, but nested functions feel like an important language feature to have. मनसा has nested function support too.

I also tried to use C++ smart pointers everywhere. They make memory management so pleasant and take away most of the issues I used to deal with when using raw pointers. But I did find the lack of a good observer pointer idiom annoying and dangerous. Maybe they’ll make it better by C++20, but hopefully I’ll already be using Rust by then.
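The observer pointer complaint can be shown with a small sketch. The Node struct and attachLeft function below are hypothetical and only loosely modelled on the snippet above; they are not the compiler's actual AST types. The owning link is a std::unique_ptr, while the back-link is a plain raw pointer that merely observes.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical AST node for illustration.
struct Node {
    std::string label;
    std::unique_ptr<Node> left;  // owning: destroyed together with this node
    Node* parent = nullptr;      // non-owning observer: never deleted through this
};

// Attaches child as the left subtree of parent and returns a non-owning
// pointer to it. Ownership moves into parent.left.
Node* attachLeft(Node& parent, std::unique_ptr<Node> child) {
    child->parent = &parent;
    parent.left = std::move(child);
    return parent.left.get();
}
```

The pointer returned by attachLeft is valid only while the parent still owns the child. If parent.left is reset or reassigned, the observer dangles and nothing in the type system flags it, which is the danger described above.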

In v0.1, I have only implemented the lexer, parser, semantic analyzer and a rudimentary code generator which emits C++ code. In the new versions, I will progressively add new modules like an optimizer, and redo some old ones.

  1. The Lexical Analyser is a simple finite state machine which iterates over each UTF-8 encoded character and converts them into tokens. I learnt a lot about Unicode while making the lexer. I am interested in world languages and scripts, so reading the Unicode documentation and other articles was fascinating to me. If you’re interested in languages too, I cannot recommend at least skimming through the Unicode Standard enough. If you’re in love with writing scripts like I am, do read this article on Smashing Magazine. I have written about Unicode in another blog post so maybe also check that out.

    The encoding of Unicode characters is also a fascinating topic. I find UTF-8 particularly beautiful. And as it turns out, it was designed by Ken Thompson and Rob Pike from Bell Labs. I’m a big fan of the Bell Labs people.

  2. I wrote the Parser us­ing the Recursive Descent al­go­rithm. It’s crazy how sim­ple yet pow­er­ful Recursive Descent is. I learnt it us­ing only the Wikipedia ar­ti­cle, wrote my first parser in a week­end, and it works like magic. It’s just so el­e­gant. In the cur­rent code base, the parser takes up the most vol­ume at about a thou­sand lines. But I do be­lieve that I should have mixed in some Pratt pars­ing to parse the op­er­a­tors, be­cause re­cur­sive de­scent has to make a lot of func­tion calls even for triv­ial tasks, which makes it in­ef­fi­cient. Oh well, maybe next time.

    The Parser makes an AST. Because the AST is vital, I wrote some routines to generate actual pictures from the AST. I’ve detailed the process in another blog post but this is what the AST visualization looks like:

  3. The Semantic Analyser is all about re­cur­sively get­ting to all nodes in the AST and check­ing their types and mak­ing sure every­thing is ac­cord­ing to rules.

  4. I had to write my own Symbol Table, which was awe­some be­cause I got to play with STL con­tain­ers like pairs and un­ordered maps. Once you dis­cover these tools, you can never go back to hand-cod­ing data struc­tures (which I’ll admit was stu­pid, but I’m slowly fight­ing my not-in­vented-here syn­drome).

  5. The actual Code Generator is not done yet. In its place is a simple function which recursively navigates the AST and generates C++ code. For variable names, I just base64 encoded all Devanagari identifiers and replaced the illegal characters in base64 with underscore. The generated code looks like this:

int main(){
   auto v4KSo4KSv4KS_pX_pCksuCkvuCkh_pCkqATT1 = [&]() {
      cout << "\n";
   };
   int v4KSG4KSH = 100;
   string v4KSV4KWB4KSw4KS_p = "नमस्ते";
   while  (v4KSG4KSH > 0) {
      cout << v4KSV4KWB4KSw4KS_p;
      v4KSo4KSv4KS_pX_pCksuCkvuCkh_pCkqATT1();
      v4KSG4KSH= (v4KSG4KSH - 10) ;
   };
}
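For the lexer in item 1, stepping over UTF-8 input one character at a time could look roughly like the sketch below. This is not the मनसा lexer: decodeUtf8 is a hypothetical helper, and as a simplification it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes a UTF-8 string into Unicode code points.
// Stray, malformed or truncated sequences are replaced with U+FFFD.
// Simplification: overlong forms and surrogates are not rejected.
std::vector<char32_t> decodeUtf8(const std::string& in) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte announces how long the sequence is.
        if (lead < 0x80)                { len = 1; cp = lead; }         // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }  // 11110xxx
        else { out.push_back(0xFFFD); ++i; continue; }                  // stray byte

        if (i + len > in.size()) { out.push_back(0xFFFD); break; }      // truncated

        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // must be 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);                    // append six payload bits
        }
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(0xFFFD); ++i; }             // resync at the next byte
    }
    return out;
}
```

Devanagari characters all fall in the three-byte range, so the letter at U+092E arrives as the bytes E0 A4 AE and comes out of this function as a single code point that a state machine can then classify.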
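The identifier scheme from item 5 can be approximated as follows. This is a sketch under assumptions, not the actual code generator: mangle is a hypothetical name, the leading v is inferred from the generated code above, and '+', '/' and '=' are each replaced by a single underscore as the description says. The sample output contains sequences like _p, which suggests the real compiler uses a slightly richer replacement.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical sketch of identifier mangling.
// Base64-encodes the UTF-8 bytes of a name, replaces the characters that are
// not legal in a C++ identifier ('+', '/', '=') with '_', and prepends 'v'
// so the result never starts with a digit.
std::string mangle(const std::string& utf8Name) {
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out = "v";
    std::size_t i = 0;
    while (i < utf8Name.size()) {
        // Pack up to three input bytes into one 24-bit group.
        std::uint32_t group = 0;
        int bytes = 0;
        for (; bytes < 3 && i < utf8Name.size(); ++bytes, ++i) {
            std::uint32_t b = static_cast<unsigned char>(utf8Name[i]);
            group |= b << (16 - 8 * bytes);
        }
        // Emit four 6-bit symbols; positions past the input are '=' padding.
        for (int s = 0; s < 4; ++s) {
            char c = (s <= bytes) ? table[(group >> (18 - 6 * s)) & 0x3F] : '=';
            if (c == '+' || c == '/' || c == '=') c = '_';
            out.push_back(c);
        }
    }
    return out;
}
```

Collapsing three different characters into one underscore means two distinct names could in principle mangle to the same string, and consecutive underscores produce names that C++ reserves for the implementation, so a production version would want an unambiguous escape for each character.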

The Website

Making the website was like a pleasant stroll in the morning sun. It was not particularly challenging. I used Adobe XD to make a prototype and then designed it with vanilla CSS. I decided that unless there’s an online compiler, a website for a programming language is no use. I also fell in love with Haskell’s online shell. But making it was going to be difficult.

I used vanilla JavaScript, the socket.io library and CodeMirror to make the compiler part. Learning to write the syntax highlighting was somewhat complicated so I just modified a preexisting one. I wanted anybody with a computer or a mobile phone to be able to go in, type code and try it out. So I also made an in-browser keyboard layout switcher which reads input keys and converts them to Nepali based on this keyboard layout, which we also made. I have yet to make the keyboard work on the phone, but I’m so busy these days that I don’t yet have the time.

Then I made a nice key­board lay­out switch­ing drawer that I love to play with.

I apologize for the low quality GIF. I had to run the screen recording through so many online converters, all of which compressed the original in some form or another, that now it looks like it came straight from the early 2000s.

The backend part is also interesting, but it still needs lots of work. I basically set up a Linux server on Azure and ran a simple NodeJS program on it which listens for socket.io requests for programs in मनसा, compiles the program, runs it in a sandbox, and pipes the input and output to the web terminal. I have put some primitive rate limiting and protections in the code but at some point I intend to use reCAPTCHA or something to that effect to guard the API. I might also look to use cloud functions instead of running a full Linux box all the time. I wrote a blog post documenting a part of the process here.

The IDE

I’ve hated ElectronJS with a passion I only reserve for a few things in life (like the Cursed Child). The hate is a result of a few factors. I use a 6 year old Dell laptop as my computer which is so dilapidated at this point that it stutters when I have lots of Chrome tabs and VS Code running at the same time. So I appreciate parsimony from the developer’s side. I’m also kind of annoyed at this developer-first culture we developers are promoting, piggybacking simply on the fact that computers are getting faster and have more memory than before.

But then the night be­fore a ma­jor demon­stra­tion of the pro­ject I was in such a rush be­cause I had so many things to make and re­fine, that I broke all my rules and made a deal with the devil it­self. Oh mighty devil, oh ac­com­plice of the pho­ton, oh sin­ful chimera of chromium and nodejs, oh gods of chaos and asyn­chronic­ity, I im­plore you.

Anyway, the results were astounding. In about three hours, I had a perfectly working, beautiful IDE with syntax highlighting (thanks again, CodeMirror) ready for the demonstration. It could compile and run the code and upload it to Arduino too (with some avrdude magic, of course). ElectronJS is incredibly fast and simple (for the developer, at least). Even though the reasons for hating ElectronJS are still valid, I got a new perspective by actually working with it, and I’ve added it to my toolbox. Thank you, ElectronJS, for taking me back to my childhood days of Visual Basic and RAD.

The Arduino interface

मनसा can also write code for Arduino. The idea came when I was trying to teach my cousin (who’s 12) this programming language. I realized that it’s difficult to get a kid to relate to the world inside the computer. For example, when teaching the concept of looping for the first time, it’s way easier to show him an LED blinking 10 times and explain the code than to show him ten instances of ‘Hello World’.

So I added some features to the language, mainly the external mechanism which allows you to write C code inside मनसा functions. The resulting system is kind of hacky but surprisingly flexible. The embedded C code does most of the register twiddling and bit flipping, and exposes high level functions to मनसा. That code is written as a library. Then, using the import mechanism of the language, you can import these functions and access them like normal मनसा functions. I will put in a video demonstration when I have the time.

Random, interesting observations

  1. It's nice to have a small code diary, if you will, to jot down all the complex ideas that you think up while coding. I made a diary.md file at the root of the source tree to keep a chronological log of my thoughts, snippets of code, things I had to do next, bugs, and other things.
  2. C++11 range based loops are awesome. I didn't know about these before.
  3. Makefiles get very sloppy very fast and are annoying to maintain. I must learn to use a more sophisticated build system soon. Maybe CMake? The build system should be able to build on both Linux and WSL, install all required programs if they don't exist, and only compile the libraries that actually changed.
  4. Most of the 3000+ lines of code I've written are worthless. I read somewhere that lines of code are not investments; rather, they are loans. The more you add to the codebase, the more expensive you make it to maintain. It takes more time for a prospective contributor to make sense of the whole project. The more lines you write, the more bugs you hide. I would have been far, far better off using a parser generator and making the language interpreted instead of compiled. That would reduce the complexity of the system and let me focus on the big picture and on the applications of the language.
  5. At some point, you really need a second, vertical monitor.
  6. Compiling larger code bases takes far more time than you expect.