Sandboxing Unsafe Executables in Linux for an Online Compiler with Minijail

I wrote a toy compiler a few months back. I wanted people to see it, so I put the code up on GitHub. But as it turns out, not everyone is willing or able to go through the convoluted process of cloning the repository, compiling the program, installing a Nepali language keyboard and learning an obscure half-baked programming language just because some idiot put it on GitHub.

So, I started to write a web app to make the program easily accessible. The web app lets users write code in their browser, then compiles and executes the program on the server, and allows the user to send input from the browser to the server as it executes.

My first instinct was to use something like AWS Lambda to compile and run each process as a cloud function, but then I looked into the deep abyss of my wallet and found myself lost in the darkness.

Another idea was to forgo the cloud altogether. Compiling code into assembly can be done on any underpowered Virtual Private Server. I could write an implementation of a simple virtual machine in JavaScript, then add a new backend to my compiler to generate code for that virtual machine. Then I could embed the JS virtual machine in the webpage, and when the user hits compile, all the server has to do is compile the language into machine code for the virtual machine and send that back to the client. Execution becomes a client-side headache. Something like (a slightly saner version of) Brainfuck could be perfect for this kind of application. It’s relatively simple to make a Brainfuck virtual machine.
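
To give a feel for how small such a virtual machine can be, here is a minimal sketch of a Brainfuck interpreter. It is written in Python rather than JavaScript purely for illustration, and the function name, tape size and `max_steps` guard are assumptions of this sketch, not part of my compiler. The step limit is there because even a client-side VM should not be allowed to spin forever.

```python
def run_brainfuck(program, input_data="", max_steps=1_000_000):
    """Interpret a Brainfuck program and return everything it printed."""
    # Pre-compute the matching position for every bracket.
    jumps, stack = {}, []
    for pos, ch in enumerate(program):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise ValueError("unmatched ']'")
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start
    if stack:
        raise ValueError("unmatched '['")

    tape = [0] * 30000          # conventional tape size
    ptr = pc = steps = 0
    pending_input = iter(input_data)
    output = []

    while pc < len(program):
        steps += 1
        if steps > max_steps:   # guard against infinite loops
            raise RuntimeError("step limit exceeded")
        ch = program[pc]
        if ch == ">":
            ptr = (ptr + 1) % len(tape)
        elif ch == "<":
            ptr = (ptr - 1) % len(tape)
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            output.append(chr(tape[ptr]))
        elif ch == ",":
            # End of input is treated as a zero byte in this sketch.
            tape[ptr] = ord(next(pending_input, "\0")) % 256
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]      # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]      # jump back to the loop start
        pc += 1
    return "".join(output)
```

For example, `run_brainfuck("++++++++[>++++++++<-]>+.")` computes 8 × 8 + 1 = 65 in the second cell and returns `"A"`.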

Anyway, I decided that AWS Lambda is too wasteful for my needs, and that the virtual-machine-in-the-webpage idea would significantly increase code maintenance work in the future. So I tried to find some other way of executing users’ programs on the server.

Issues with executing unsafe binaries on the server

But allowing user-generated executables to run on your server raises huge security issues. Just to name a few:

  1. The executable can start an expensive infinite loop (for example: listing all prime numbers above 10^5), which makes the system unbearably slow for other processes.
  2. It can generate and store a huge amount of data which either completely fills up the RAM, the storage system, or both, causing the system to slow down or crash, and making it physically impossible for other users to store and run their executables.
  3. It can overwrite important files on the server (for example, the very node.js program responsible for compiling and executing user programs) and infect all clients with malicious payloads.
  4. It can exploit program bugs or kernel bugs to elevate its privileges to root and install stealthy rootkits or bitcoin mining software, using our servers for the attacker's benefit.

This is just the tip of the iceberg. Many other malicious attacks are possible depending on the system and infrastructure arrangement. The system designer has to be very careful in setting up a system where untrusted and potentially unsafe executables are run, without causing significant lag for genuine users.

Solutions and preventive measures

This was the first time I had to design a system like this, so I had to research quite a bit. In this post, I highlight some results of my research. First, let’s see how the above issues can be dealt with in a general way.

  1. To prevent expensive infinite loops, we can limit the maximum CPU percentage allowed to each user process and kill it after a fixed time.
  2. To prevent memory overuse, simply limit the maximum memory allowed per process.
  3. To prevent file tampering, either completely block file writes, use a sandboxing file layer which redirects all file accesses of the process to a safe temporary directory, or run the process on a virtual file system in a separate mount namespace.
  4. To make privilege escalation harder, use redundant security measures and tightly constrain the execution environment using whitelists to minimize the kernel attack surface.
  5. Execute the user process in a sandbox, which essentially manages most of the above concerns and provides more isolation.
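
As a rough sketch of points 1 and 2, the snippet below runs a command with per-process resource limits and a wall-clock timeout, using only Python's standard library on Linux. The function name `run_limited` and the default limits are assumptions made for illustration. Note that `RLIMIT_CPU` caps total CPU seconds rather than a CPU percentage; limiting the percentage is a job for cgroups.

```python
import resource
import subprocess
import sys


def run_limited(argv, cpu_seconds=2, memory_bytes=256 * 1024 * 1024,
                wall_seconds=5, stdin_text=""):
    """Run argv with CPU-time, address-space and wall-clock limits (Linux)."""

    def apply_limits():
        # Runs in the child just before exec, so only the child is limited.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))
        # Cap the size of any file the child writes (1 MiB in this sketch).
        resource.setrlimit(resource.RLIMIT_FSIZE, (1 << 20, 1 << 20))

    try:
        proc = subprocess.run(
            argv,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=wall_seconds,      # the child is killed when this expires
            preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired:
        return {"timed_out": True, "returncode": None, "stdout": ""}
    return {"timed_out": False, "returncode": proc.returncode,
            "stdout": proc.stdout}


if __name__ == "__main__":
    # A busy loop is stopped by the kernel once it has used 1 CPU second.
    print(run_limited([sys.executable, "-c", "while True: pass"],
                      cpu_seconds=1, wall_seconds=10))
```

A negative `returncode` means the child was killed by a signal, which is what happens when it exceeds the CPU limit. Resource limits alone do nothing about points 3 and 4, so they are a complement to a sandbox, not a replacement for one.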

These are, of course, just general guidelines. In my case, because (i) the users can’t directly craft the machine code that runs on the server and (ii) I have full control over and knowledge of the assembly code that is being generated, things are a little easier security-wise. But this will eventually change as I add more features and as more contributors join. So taking some time to tighten security now is more future-proof.

Features the Linux Kernel provides

I’ll start with the features already provided by a relatively recent Linux kernel by default.

  1. Users and Groups: It's obvious, but by simply ensuring that there's no case in which the unsafe binary will be run as root and that no program started with sudo executes the unsafe binary, we cut off a big chunk of attacks. But let's take it a step further. Make a new user account with limited privileges and run the unsafe binary as that user. Set up permissions accordingly so that the new user account cannot read what need not be read. Add the user only to groups which it absolutely requires.
  2. Namespaces: Namespaces are a feature of the Linux kernel that allows you to isolate a process from other processes in certain respects. For example, if you run the untrusted executable in a separate PID namespace, the process is not able to see or interact with any other process running in the system. It literally becomes the init process in its view. Namespaces allow container systems like Docker to fake isolated systems without the overhead of a full virtual machine.
  3. Control groups: Control groups (also called cgroups) allow you to allocate resources and set limits on CPU time and memory, among other things. This, when used in conjunction with namespaces, allows for effective containerization of apps.
  4. Capabilities: Capabilities in Linux allow selective provisioning of root privileges. If you really have to allow the untrusted program to do things that only root is able to, then capabilities allow you to grant only the required root-only operations to the running process.
  5. Seccomp-bpf: Seccomp-bpf stands for Secure Computing mode with Berkeley Packet Filter (although no one calls it this). Seccomp by itself (strict mode) blocks all syscalls except four (exit(), sigreturn(), and read() and write() on already open file descriptors), so unsafe compute-bound processes can be run without many risks, as almost all syscalls are blocked by the kernel. BPF is an addon to seccomp which allows you to choose exactly which syscalls to block. strace is a Linux tool that traces the syscalls made by a process, which helps when deciding which syscalls to allow.
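
To make the control groups item a bit more concrete, here is a minimal sketch of how limits are set through the cgroup v2 filesystem interface. It assumes a unified cgroup v2 hierarchy (typically mounted at /sys/fs/cgroup), root privileges, and that the memory and cpu controllers are enabled for the parent group. The helper names and the example path are made up for this sketch.

```python
from pathlib import Path


def configure_cgroup(cgroup_dir, memory_bytes, cpu_percent, period_us=100_000):
    """Create a cgroup v2 group and set its memory and CPU limits."""
    group = Path(cgroup_dir)
    # On a real cgroup filesystem, creating the directory creates the group.
    group.mkdir(parents=True, exist_ok=True)
    # memory.max: hard limit on memory usage, in bytes.
    (group / "memory.max").write_text(f"{memory_bytes}\n")
    # cpu.max: "<quota> <period>" in microseconds; 25000 100000 means 25%.
    quota_us = period_us * cpu_percent // 100
    (group / "cpu.max").write_text(f"{quota_us} {period_us}\n")


def add_process(cgroup_dir, pid):
    """Move a process into the group so the limits apply to it."""
    (Path(cgroup_dir) / "cgroup.procs").write_text(f"{pid}\n")


if __name__ == "__main__":
    import os
    # Hypothetical group name; needs root and a cgroup v2 mount to work.
    configure_cgroup("/sys/fs/cgroup/untrusted", 64 * 1024 * 1024, 25)
    add_process("/sys/fs/cgroup/untrusted", os.getpid())
```

Unlike a per-process CPU-time limit, the `cpu.max` quota throttles the whole group to a share of a core, and child processes stay in the group automatically.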

Because these features are provided natively by the Linux kernel, we can apply them using the corresponding syscalls and parameters and end up with a relatively robust sandbox. I toyed with the idea of making my own restricted micro-sandboxing program (and I really wanted to) but decided not to because I was already juggling more things than I like to.

There are programs which use these kernel features and more to sandbox applications for us. I had expected there to be many, especially in this age of cloud computing and lambda functions. These are the ones I looked at.

nsjail: I couldn’t get it to compile because of some strange protobuf dependency error. It doesn’t help that the GitHub readme doesn’t have any build steps or the versions of dependencies required. Which is a shame, because this was almost exactly what I was looking for: a lightweight application sandbox. I might look into it some more later. This is the official site for nsjail.

mbox: It doesn’t seem to work in 2019. The example usages on GitHub fail to do any kind of blocking. The -n flag still doesn’t block internet access. I’m guessing that it relies on some old kernel-specific features. It was last updated 3 or 4 years ago on GitHub. Also, reading the author’s paper and ycombinator comments, I got the feeling that he’s much more proud of his filesystem layering work than his sandboxing work. Mbox also seems more like academic/proof-of-concept work. It doesn’t fit my requirements. This is the official site for Mbox.

Docker: Heavy and not built for my use case. It is also not as security-focused, and apparently it can be broken out of. At any rate, I’m not gonna be instantiating full containers for executing sub-megabyte programs. Although Docker running small distros like Puppy Linux is an interesting idea. This is the official site for Docker.

systemd-nspawn: This is really interesting. It needs a full chown-able filesystem. I didn’t use it this time, but I’m definitely going to use it in some future project.

minijail: This little piece of software is apparently used by Google to sandbox Chromium programs. It is not as feature-rich as nsjail, so I ended up using it in conjunction with cgroups and some other programs to isolate the unsafe binary.
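
For a rough idea of what a minijail invocation looks like, the sketch below builds a `minijail0` command line and a seccomp policy file in minijail's `syscall: 1` format. This is not my exact setup: the user name, policy path and syscall list are hypothetical, and the flags are as I understand them from the minijail0 manual (`-u`/`-g` to switch user and group, `-n` for no_new_privs, `-p` for a new PID namespace, `-e` for a new network namespace, `-S` for a seccomp policy file), so check `minijail0 -h` on your system before relying on them.

```python
def seccomp_policy(allowed_syscalls):
    """Render a minijail seccomp policy that allows only the listed syscalls."""
    # Each line has the form "<syscall>: 1", where 1 means "always allow".
    return "".join(f"{name}: 1\n" for name in allowed_syscalls)


def minijail_argv(program, user="sandbox", group="sandbox",
                  policy_path="/etc/sandbox/run.policy"):
    """Build the argument list for running `program` under minijail0."""
    return [
        "minijail0",
        "-u", user,          # run as an unprivileged user (hypothetical name)
        "-g", group,         # ... and group
        "-n",                # set no_new_privs
        "-p",                # new PID namespace
        "-e",                # new network namespace (no network access)
        "-S", policy_path,   # seccomp-bpf whitelist
        "--",
        program,
    ]


if __name__ == "__main__":
    # Hypothetical whitelist; the real list should come from running the
    # program under strace and noting which syscalls it actually makes.
    print(seccomp_policy(["read", "write", "exit", "exit_group"]), end="")
    print(" ".join(minijail_argv("/tmp/user_program")))
```

The resulting list can be handed to `subprocess.run`, ideally together with a timeout, and the cgroup limits still have to be set up separately because minijail does not cover them here.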


PS: Mosh is a really cool substitute for SSH, especially when you’re using vim to code directly on the server. Plus, the fact that I don’t have to restart the SSH connection every time I wake my laptop is such a convenience.

Resources

  1. https://en.wikipedia.org/wiki/OS-level_virtualisation#Implementations
    This page has a list of many sandboxing tools that don't use full VMs.
  2. https://chromium.googlesource.com/chromiumos/docs/+/master/sandboxing.md
    Explains how minijail can be used for security. If you're using Minijail, also check out the embedded video.
  3. https://people.csail.mit.edu/nickolai/papers/kim-mbox.pdf
    The author's paper explaining Mbox.
  4. https://blogs.rdoproject.org/2015/08/hands-on-linux-sandbox-with-namespaces-and-cgroups/
    https://www.toptal.com/linux/separation-anxiety-isolating-your-system-with-linux-namespaces
    Use these two links if you're going baremetal. Also try finding pages in chromium.googlesource.com about sandboxing. They have done a lot of work in that area. A representative page is below: https://chromium.googlesource.com/chromium/src/+/master/docs/design/sandbox.md
  5. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34913.pdf
    This is the paper explaining Native Client, a sandboxing system developed by Chrome developers to allow execution of C/C++ programs in the web browser with near-native speed.
  6. https://blog.golang.org/playground
    The Go language website has an online code compiler similar to ours. In this blog post, they describe how they used Native Client (see resource 5) and other ideas to provide the safe online compilation and execution service to users. Particularly interesting is their decision to disallow any interactive input to favour caching results and to reduce CPU time per program.
    They also seem to be using a separate branch of the compiler to generate Native Client executables. I had contemplated whether I should add an extra backend to my compiler which generates secure code at compile time, but I decided that the maintenance burden was not worth it.