JIT VMs and a Compiler Book

30 07 2007

Well, a few friends and I are doing a group independent study this semester where we’ll be writing something at least similar to a JIT JavaScript virtual machine. At least, the original goal was to do this, but we’re now being sidetracked (perhaps productively so) by OpenMoko, an all-open-source smart phone intended to have a feature set comparable to (or exceeding) the iPhone. It’s wide open, and rather than being one-upped by new developments to Tamarin or by new/upcoming efforts by a company we’ve heard rumors might be doing their own JS VM (who would almost certainly beat our performance if they attempt this), we’ll just be the first on a new platform, and rather than focusing on the main goal we had in mind before (doing a JIT compiler for purely performance reasons) we can consider other issues related to using an embedded platform (memory usage being more imporatant, power consumption, etc.) as well as being the first on a new platform. It’s not very mature. But I think it will be interesting, and fun. And educational.

As a result of this, our professor asked us to each get a copy of one of Andrew W. Appel‘s compiler books and finish it by the end of August. Since many (/most/all) of us want to us either Haskell of OCaml, we decided to standardize on Modern Compiler Implementation in ML. I got my copy yesterday, and read 40 pages (I read none today, discarding it temporarily in favor of wine tasting in Napa and Sonoma). So far, the book is well written, concise, and explains both the theory behind, and practical aspects of, the topics it addresses. I’m partway through parsing, which has been a nice refresher on FSMs (deterministic and nondeterministic) and regular expressions. One thing which has bothered me slightly is that he sort of plays fast and loose with models of computation. The example which sticks for me is that he gives a pretty good explanation of FSMs, and then begins to talk about arbitrarily deeply nested comments. The main issue with them being that you can’t use a finite state automata to parse them, because you’d need an infinite number of states to handle an arbitrarily large nesting depth. He gives the (accurate) practical solution that when specifying this in ML-lex (used in the book), states can be explicitly specified, and hitting the open and close comment representations can run ML code which increments and decrements a counter. This works fine, but he continues referring to the resulting implementation as a finite state automata – which it isn’t. Adding the counter transforms it into a pushdown automata with a single (unspecified) letter stack alphabet.

I haven’t finished the whole group of parsing chapters – maybe he doubles back to this. Another consideration is that there are also C and Java versions of this text, both of which have been updated since initial publication, while the ML version has not. It’s possible this was changed in one of those newer versions.

I’m not even sure why this bothers me so much. I used to be all about getting the intuition to work, and leaving formalism and particulars to be secondary concerns. Perhaps the years of hard work from my sometimes overly-pedantic discrete math, models of computation, and algorithms TAs finally sunk in. Or maybe it was writing a theory paper. I’m turning into a stickler for formalism.

Regardless of this silly irritation, I’m still excited for the rest of the book.


Monads! (and Why Monad Tutorials Are All Awful)

26 07 2007

If you have an affinity for programming language theory (as I do), and wander around the internet searching for topics of interest, eventually you will run into references to monads. They’re neither trivial, nor as complex as the dearth of proper explanations of them might suggest.

In a nutshell, they are the PLT manifestation of a concept from category theory, and abstract away propagation of results and side effects, without actually performing side effects or leaving a pure-functional paradigm. Any attempt at description beyond that point tends to either

  • a) get terribly botched*, or
  • b) turn into a real category theory explanation, which most people don’t have the background to understand easily

and for those reasons I will refrain from attempting further explanation myself. I don’t have confidence in my own ability to explain in a non-interactive medium, and I certainly don’t have the mathematics background to formalize it.

The internet is littered with tutorials and explanations “in laymen’s terms” of what monads are, what they are for, why we care, etc. Every such tutorial I have started (at least a dozen, probably several more – all highly recommended) has either bored me to tears, struck me as completely inane, or simply confused me.

I had decided that I was simply going to have to learn abstract algebra and category theory in order to really get a handle on what they were, until a wonderful post appeared on LtU, linking to one of the earlier papers on monads. The paper, by Philip Wadler, gives not only a clear explanation of what a monad is, but also provides a strong motivation for looking for monads, an approachable description of how they are derived from a practical point of view for use in a type system, and a number of useful, nontrivial examples. Probably one of the most well-written academic papers I’ve read. You should be familiar with an ML-style type system and notation to understand the examples. Or you need be be good at picking up such things on the fly.

Monads for Functional Programming. Enjoy.

* The canonical example of a monad is for some reason the Maybe monad, though I don’t know why – it’s certainly a monad, but it’s actually not the most intuitive first example. Most of the monad explanations I found online immediately launch into using Maybe as an example of why monads aren’t hard. They’re not, but the example is usually provided far too early, without enough background, motivation, or explanation, making the reader simply feel like he or she is missing some greater complexity (which is true) which was already explained (which usually is not the case). Wadler’s example of a simple interpreter is much clearer.

Con Kolivas on Linux Kernel Development

25 07 2007

Through OSNews I found an interview in apcmag (which I’d never heard of before) with Con Kolivas, a Linux kernel developer, who has decided to leave Linux development entirely. The interview discusses how he became involved, what he did, and why he left. His explanations have a lot of good insight into the problems both with the development of the Linux kernel, as well as OS development over the past 13 years or so as well. It’s quite interesting.


OS Agnosticism Through Virtualization

21 07 2007

Today, Xen was merged into the the mainline Linux kernel.

A couple days ago, another source drop occurred for the Solaris Xen support, which it seems will relatively soon be integrated into mainline OpenSolaris (this is my assumption, I have no knowledge about any particular plans behind this).

This has me asking myself a question given that I now have the ability to run Solaris on Linux, or Linux on Solaris (whether through Xen or an LX branded zone) at native or nearly-native speeds. If this trend continues, will it really matter forever what OS you run on your desktop, or what platform you develop your software for? Mac OS X and Windows are not yet fully cooperating with Xen (I’m not aware of any OS X efforts for this at all). But I suspect that eventually they will. For desktop users and developers – servers, or machines which are part of a mass-deployment may remain true to a single OS per box. Servers need (when they’re busy) every bit of speed they can get, and need to use as little power as they can. But desktop applications can usually take a slight speed hit without issue. And Xen is a pretty small performance hit. This of course assumes you aren’t running a new Windows computer loaded with crapware – I still can’t figure out how to get all of that off my mom’s HP.

And for those of you nay-sayers talking about games not running well virtualized, or virtualized platforms not getting the hardware access they need for games, check this out. It doesn’t have to be that way.

Edit: I originally forgot that Xen has support for HVM on newer chips, not just paravirtualized guests. This makes me pretty excited. Now if I could just get OS X to run as a domU, I run out and buy that Macbook I’ve been eying as soon as possible…

Electronics and Airport Security

20 07 2007

Another blog has an interesting post detailing an encouter with airport security while carrying some home-made electronics, with some interesting observations at the end:

But these are the times we live in. A handful of people with no knowledge of physics, engineering, or pyrotechnics are responsible for determining what is and what is not safe to bring on a plane. They’re paid minimum wage and told to panic if they see something they don’t recognize. Does this make me feel safer? It doesn’t really matter. Implementing real security would bring the cost of flying up, which would likely cause a collapse of the airborne transportation network this country has worked so hard to build up.

This is the sort of thing I’m always mildly concerned I’ll have to deal with when I fly with my big bag of electronics and AC adapters – which is why I always arrive a couple hours before departure time.

Citations, Citations, Citations

20 07 2007

I recently got a paper published in the ASM Workshop ’07. Two of my co-authors presented it in person in Norway, which I was unable to attend.

This was my first academic paper, and while it could have been better, I’m fairly satisfied with it. Finishing it off sucked up most of my time the first two weeks that I was out of school for the summer. I basically left Brown on a Saturday, slept the rest of that day, watched TV and drank soda on Sunday, and on Monday began two weeks’ worth of writing, initially cleaning up the draft submission which resulted in acceptance to the conference, and eventually rewriting the paper from scratch after our advisor took a closer look at it. Two days after final submission I flew out to California to begin my internship at Sun. I must admit that most of the reason this ate those two weeks was because we all put off the revisions for so long because of finals, but from other things Shriram has said, this seems to be pretty common among both inexperienced and veteran academics.

One thing he commented on while he was helping us to revise it was that a number of the citations we had in the paper were originally wrong. Many were taken from the homepages of paper authors kind enough to provide BibTeX information. Most of the others were taken (and somewhat corrected) from CiteSeer. However, Shriram caught that a number of these were still incorrect, and said that many (most?) of the citation information given on CiteSeer is incorrect or incomplete. When I asked what the normal way to find solid citations is, whether it’s anything more than spending lots of time researching, he answered in the negative.

Why is there no central, reputable, correct repository for this information? Or alternatively, why is CiteSeer going to the trouble of archiving all of these papers, and not doing so completely correctly?

Source Control and Build Systems

18 07 2007

I’ve been thinking lately about the shortcomings with SCMs (Source Control Management systems) and build systems. Mainly, that by integrating a certain kind of SCM with a build system, builds of large projects could be made to go a bit faster, as well as gaining some semantic information.

SCMs I’ve used:

  • CVS – About as basic as you can get while still being useful. Basic checkout-modify-checkin model which forms the basis of most SCMs.
  • Subversion – A slightly nicer version of CVS for most intents and purposes. This is what I consider baseline for source control. The only difference from CVS that I’ve ever cared about is that it screws up less frequently with binary files than CVS.
  • Mercurial – A nice distributed SCM.
  • SCCS / Teamware – Sun’s internal SCM. It’s a distributed SCM with very explicit parent-child repository relationships (which are of course flexible, more later)
  • Perforce – Pretty obnoxious in my experience, and not singularly any particular SCM paradigm.

I like distributed SCMs, largely because it also provides a way for a developer to have their own version control without dealing with the main repository. There’s no need to make the SCM work with another used locally, or with another copy of the same SCM – checkins can be done on the local repository without issue. This is important for developers who may for example just be starting on an existing project and not have commit access yet, or for a developer who wants to work on a laptop or other local machine under circumstances which lack access to the main repository. I also like a noticable parent-child relationship between repositories. SCCS has this. Mercurial has a notion of a parent repository (by default the one you cloned from), but it isn’t emphasized in use of the tool, so it isn’t as semantically meaningful – this is a personal nitpick, and I don’t really have a technical reason for backing up this preference. Perforce is mostly just obnoxious in my opinion*. It is similar to a DSCM, in that each developer has their own copy of the respository which they do direct file checkouts from. It keeps checkout state on a central server, however. And to save space, it enforces a requirement that all copies of the repository be on the same filesystem, because it uses hardlinks to do an initial clone of the repository. This is novel, and certainly saves space, but in addition to making remote work impossible also creates an emormous problem. Developers do occasionally screw up their umask on UNIX systems, so they do occasionally need to chmod files. Or more commonly, an editor will allow users to easily (sometimes accidentally) override read-only protection on files by chmod’ing them independently. Then the editing begins. It’s bad enough when someone commits bad code to a main repository, and those who sync up before the changes are backed out have their workspaces broken. But when someone accidentally chmod’s a file in the Perforce situation above, it can simultaneously break building in every checked out workspace that hasn’t done a local checkout of that file, which would remove the hardlink and replace it with an actual copy. For this reason, I dislike SCMs which make hardlinks for checkouts. One thing I like about both Perforce and SCCS/Teamware is that each has an explicit list of files being edited, which is the inspiration for the main point of this post.

Build systems I’ve used:

  • make (and parallel versions of it) – The baseline build system from a UNIX/*nix perspective.
  • ant – Used for Java.

The main problem with make and ant is that they are stateless – so for partial builds, they still need to do a lot of initial work to determine what needs to be compiled, and for languages like C, linked. Both traverse the filesystem, stat’ing every file listed such that the build target is dependent on it. For small projects, this isn’t significant. For larger projects (such as OpenSolaris, it becomes noticable, and increases the build time – particularly on systems with other slowdowns for file operations, such as older systems with slow disks, or systems using network mounts for files. An incremental x86 build of OpenSolaris, on the build machines at work, takes between 10 and 40 minutes depending on the number of users on the build servers, after changing one file. Recompiling any one of the main files in the genunix module takes less than 30 seconds. Re-linking all of the object files which compose genunix doesn’t take more than a minute or so. And rebuilding cpio archives for the bfu script only takes about 2 to 3 minutes. Which means that each OS build I do takes between 2 and 8 times as long as it should. This is partly a result of the OS makefile structure – there has been some discussion recently on some OpenSolaris lists about the construction of the makefiles. But the main issue is that either way, hundreds of files are stat’ed which are known not to have changed – the build system simply lacks this information.

One possible solution to this (which I might try to code up this weekend) is to combine a distributed SCM with explicit local checkout (so it has a full list of all potentially modified files), and a build system. This would allow the tools, given a dependency graph, to not even look at source files which don’t need to be rebuilt. Only modified files and those which depend explicitly on them would need to be checked. This could shave some noticable time off the OS build system. Of course, this doesn’t address things like header-dependency unless those are added explicitly to the dependency tree requirements. But for object files, this could be a win.

* It is probably worth noting that this is from my experience at one company where Perforce is in use. I haven’t looked into it too much, so these problems could just be a particular configuration of Perforce.