Working with Large ML Code Bases

11 02 2008

It’s interesting how tool prevalence and standardization (or lack thereof) for a programming language are very strong indicators of how many people have used it to build significant projects.

My honors thesis has me hacking up ML implementations – I started with OCaml, but later switched to MLkit after realizing that it already had support for some features I was working to implement in OCaml. As you might imagine, the source for a compiler and runtime for a reasonably mature language (the ML family of languages started in the mid-70s) is a hefty chunk of code. The layout of the MLkit source tree is fairly clear (there is a conveniently-named “Compiler” directory, containing such subdirectories as “Backend/X86/”). The layout of the OCaml source is slightly less intuitive, but it comes with an informative README which points out where the various parts of the system reside. Yet even within these more focused segments of the source trees, there is a lot of code. And it’s not boilerplate – when you need functions and types for emitting most of the relevant instructions on the x86 architecture, or for handling every abstract syntax tree node of a very expressive language, you’re going to have a lot of code. But there are no good tools for managing a code base of this size in any ML descendant.

I ran into this to some degree with the class project for my software engineering class last year (CS190 – 2007’s page is missing… I need to look into that). We wrote a frontend to GDB using OCaml. The idea was to display separate, independently-controlled panes for each thread of a multithreaded program (obviously, this didn’t scale well past, oh, 5 threads). It worked reasonably well by the end of the semester, but not well enough that we would ever release it. There were a few times, in a project which totaled only a few hundred lines of code, that I wanted some mechanized help for working through other modules quickly to find what I needed. Several times I resorted to emailing a peer – hardly a horror, but not something which should have been necessary simply to find a function in a small code base. The only tools we had for managing the code base were… Subversion and Vim.

MLkit is nearly 240K lines of code in over 1000 files (that count includes comments – though, as it is primarily a research project, those are few and far between – plus a few odd line breaks). Even cutting it down to just the compiler (ignoring its dependencies elsewhere in the tree), we’re looking at almost 56K lines of code in over 150 files.

C, C++, and Java programmers have a plethora of tools at their disposal for working with code bases of this size. When I was at Sun this past summer, between cscope and having OpenGrok set up on the kernel code, finding definitions, declarations, types, etc. was usually a breeze. Both have their limitations, and their bugs, but they’re still great. And that’s not even counting the support built into major IDEs like Visual Studio or Eclipse.

ML programmers have no comparable tools that I can find.

A quick search for OCaml IDEs yields a few results: a page listing what look to be mostly-dead OCaml plugins for Eclipse (of these, ODT has apparently seen activity this year, but seems to offer little beyond syntax highlighting and a build tool), and Camelia, a from-scratch IDE built by my friend Nate when we TAed CS017 (now unfortunately renamed CSCI0170, thanks to the abomination that is SunGard’s Banner…). Searching Google for any of OCaml, SML, ML, or Haskell together with “cscope” yields nothing of use. Jane Street Capital uses OCaml. I wonder what tools they use.

Oddly enough, it seems the winner in this case may be the F# plugin for Visual Studio. It has support for following references around large code bases, and is actively maintained (as MS intends F# to be a mainstream .NET language in the near future). Unfortunately, it can also only deal with F# files, which are close to Standard ML files – but not quite close enough…

Perhaps I’ll build a cscope-alike for ML myself.

After I finish my thesis.

EDIT: A commenter on reddit suggested taking a look at ocamlbrowser. It seems to be at least mostly what I’m looking for (for OCaml), though I can’t seem to get it to do anything other than display the types of functions in the standard library modules, so I can’t say for sure. There’s a setting to change the module search path, but it doesn’t seem to change anything for me. I also find it odd that, despite ocamlbrowser being included with the OCaml base distribution (at least in Debian), no web search I can devise for the sort of task it is supposed to facilitate yields any results referencing it. This suggests that very few people use it – maybe I’m not the only one who can’t easily get it to work on an arbitrary code base. I might play with it some more once I have time.

EDIT2: Another reddit commenter mentioned otags (in the vein of ctags), which looks good. If only I were still working in OCaml :-p On the other hand, looking up the link for ctags made me look at it again (I had foolishly assumed that it, like cscope, supported only C and C++). In particular, someone has reimplemented ctags with far broader language support (and called it Exuberant Ctags). The language support list claims that it in fact works on SML! A download and compile later, I’m quite the happy camper – it works just like standard ctags. Many thanks to zem on reddit for inspiring a very fortuitous Google search.
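For anyone in the same boat, the whole workflow is minimal (this assumes the Exuberant binary is installed as ctags and that you use Vim; mkExp is a made-up identifier for illustration):

$ ctags -R .          # at the root of the source tree; .sml and .sig files should be picked up
$ vim -t mkExp        # open Vim directly at the definition of mkExp
                      # inside Vim, Ctrl-] jumps to the identifier under the cursor

It doesn’t do everything cscope does (no “find all callers”), but jumping to definitions covers most of what I was missing.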





Variant Interfaces

1 01 2008

Today I finished my first-pass reading of Colimits for Concurrent Collectors, and was struck with an interesting idea. The paper demonstrates an approach to refining specifications using the notion of colimits. I don’t fully understand the refinement process yet (hence the “first-pass reading”). The paper contains many example specifications, written in a variant of Specware, which specify a safe interface between a mutator and a garbage collector without stop-the-world semantics – this is the case study for the refinement.

The examples use an interesting pattern: a number of the specifications for portions of the system are parameterized by other specifications. A common idiom in the examples is a line similar to:

MORPHISM Cleaning-View = ONLY nodes,roots,free,marked,workset,unmark

What this line means is that the current specification, when it refers to an instance of the Cleaning-View specification, can only access the mentioned observers. This is interesting if we translate it to interfaces in an object-oriented environment. What this amounts to is essentially defining an interface (here, a specification) and then multiple subsets of this interface (here, imported subsets of observer functions). One application of this (and I think the most interesting) is using the sub-interfaces as restricted capabilities on objects. This is a reasonably common use of interfaces in OO languages.

You can already mimic this in normal object-oriented languages by defining minimal interfaces, and having larger interfaces extend the smaller ones. But what about keeping siblings with overlap in sync? A nice alternative would be the ability to specify all of the siblings together, as a set of variant interfaces (similar to the notion of variant types). For example, in pseudo-Java:

public interface Primary {
    void func1();
    void func2(int n);
    void func3(double d);

    subset Takeargs {func2, func3};
    subset Noargs {func1};
    subset Nofloatingpoint {func1, func2};
}

This example would of course need to be cleaned up a bit to deal with method overloading.

The problem with doing the equivalent of this in normal Java is keeping these “sub-interfaces” in sync with the main interface, and keeping overlapping siblings in sync, without building an arbitrarily large hierarchy of interfaces. Allowing joint specifications, and then referring to variants through names like Primary.Takeargs (to say that a variable should be of type Primary, but the current scope should only have access to the Takeargs subset), would ease this difficulty, making it slightly easier for the programmer to use interfaces for restricted access.
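Incidentally, ML-family module systems can already express a decent approximation of this by ascribing one structure to several signatures. A minimal OCaml sketch of my own (all names invented for illustration):

module type Primary = sig
  val func1 : unit -> unit
  val func2 : int -> unit
  val func3 : float -> unit
end

(* the analogue of Primary.Takeargs *)
module type Takeargs = sig
  val func2 : int -> unit
  val func3 : float -> unit
end

module Impl : Primary = struct
  let func1 () = print_endline "func1"
  let func2 n = Printf.printf "func2 %d\n" n
  let func3 f = Printf.printf "func3 %f\n" f
end

(* code that receives Restricted can no longer reach func1 *)
module Restricted : Takeargs = Impl

let () = Restricted.func2 42

It has the same weakness, though: nothing ties Takeargs to Primary, so the two signatures can silently drift apart – exactly the synchronization problem above.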

I also wonder what could be done with this as an extension to a language such as Spec#, with pre- and post-conditions on methods. You could then deduce, more easily than before, that code using only a restricted subset of an interface maintains certain invariants (such as making no modifications to an object). More importantly, you could guarantee it statically at compile time – the interface variants would be subtypes of the main interface type. Without the explicit sub- and sibling-interface relationships it would still be possible to statically verify properties of interest, since the compiler can see what methods are called on various objects, but this approach lets the programmer encode the restriction easily.

Ultimately this is just syntactic sugar for a construct which can already be used, but it encourages the programmer to make the type checker do some more verification work on the programmer’s behalf. It’s also sort of a static analogue of the E caretaker for capability delegation, or more generally the object capability model (lacking revocation, because all of this happens at compile time). I should read some of the EROS and Coyotos papers…

It would be interesting to hack up a prototype implementation of this, but I don’t feel like mucking about in the internals of another compiler (thesis work…), and unfortunately I don’t think any of the good meta-programming languages are really appropriate for this: Scheme is dynamically typed, Ruby is duck-typed (so really, dynamically typed), and you need static typing to get the benefits described here. Perhaps it’s finally time for me to learn MetaOCaml.





Typing the Stack, and Why Tag-Free Garbage Collection Doesn’t Quite Work

28 11 2007

One of the most significant barriers to implementing a completely safe garbage collector lies with collecting the stack. Heap data structures are relatively easy to walk — that is, one can generate a walk function for each variant of heap data structure in the language being collected. The stack is more difficult because it is not an explicit structure in the language. The other work I’ve read towards this (described in the post linked above) made rather awkward contortions to get around this fact: Wang and Appel CPS-converted and closure-converted their entire programs, while Wang and Fluet side-step the issue by writing a safe collector for the internals of an interpreter – which isn’t really garbage collection in the normal sense, since they just have a Scheme stack structure defined in Cyclone which they then traverse. The harder problem is to actually type the machine-native stack.
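(To make the easy half concrete: by a “walk function” I mean something like the following per-datatype traversal – a minimal OCaml sketch of my own, not code from any of the papers.)

type tree =
  | Leaf
  | Node of tree * int * tree

(* the generated walker for this one variant of heap structure:
   visit the value, then every subtree it points to *)
let rec walk_tree (visit : tree -> unit) (t : tree) : unit =
  visit t;
  match t with
  | Leaf -> ()
  | Node (l, _, r) -> walk_tree visit l; walk_tree visit r

let () = walk_tree (fun _ -> ()) (Node (Leaf, 1, Leaf))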

  • The Essence of Compiling with Continuations: Towards this, my partners and I read the work defining A-normal form, or ANF. The authors examined a number of CPS-based compilers to see what work they actually performed (many compilers use CPS as an intermediate representation to support some transformations). They found that to mitigate the code expansion of CPS transformations, and to get rid of some of the execution overhead from relatively trivial continuations, the compilers performed a number of additional reductions: pulling the continuation out of argument lists and leaving it implicit in a register, and a few other tricks. Adding these transformations together, it turns out that the sum total of these actions is essentially CPSing the source program, performing the normal reductions, and then un-CPSing the program. The authors then present an equivalent transformation directly on the source language, which is simpler to implement. This accomplishes essentially the same thing as using SSA – giving a name to every intermediate result of the computation (a small before-and-after example follows).
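To make that concrete, here is my own tiny example, in OCaml, of a direct-style expression and its A-normal form (the paper works in a Scheme-like language, but the idea is the same):

let f a b = a + b
let g x = x * 2
let h y = y - 1

(* direct style: the results of g 3, 4 + 1, and h (4 + 1) are implicit *)
let direct = f (g 3) (h (4 + 1))

(* the same computation in A-normal form: every intermediate
   result is bound to a name, much as in SSA *)
let anf =
  let t1 = g 3 in
  let t2 = 4 + 1 in
  let t3 = h t2 in
  f t1 t3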

This may help with typing the stack – we might be able to generate each frame’s type from the ANF of the program. Some more work needs to be done here; how this might be done isn’t immediately obvious. I complained that we would have to mark the stack frames in some way, and to show this was not as terrible as I thought, our advisor suggested reading a couple of papers by Benjamin Goldberg which pretty much wrapped up work on garbage collection without tags. In a nutshell, it really only works for languages lacking polymorphic primitive operators.

  • Tag-Free Garbage Collection for Strongly Typed Programming Languages: This paper describes a way to sidestep tagging the pointers on the stack that point into the heap (the notion was mentioned earlier by Appel, but implementation details were omitted). It does this by embedding pointers to frame-specific garbage collection routines right after function calls, and changing compilation so that returning from a routine jumps one extra instruction later. Goldberg says this works fine because the retl instruction on SPARC is only a pseudo-instruction which compiles to a jump with a hard-coded offset – I wonder how portable that is. This of course only works in a straightforward way for monomorphically typed languages. To extend the scheme to support polymorphic functions, they begin passing type-specific traversal functions to the frame-specific collection functions. This is extended yet again to support higher-order functions – the overall strategy is to stash pointers to correctly-typed collector procedures in well-known locations so they can be found efficiently. There is also an addendum at the end about implementing collection for parallel languages. One interesting idea presented, but not explored in much depth, is that there can be different collection procedures for different stages of each procedure. This allows optimizations such as not traversing uninitialized pointers early in the procedure, or not traversing dead local variables during collections after their last access in the function.
  • Polymorphic Type Reconstruction for Garbage Collection without Tags: This second paper revisits much of the content of the first, presenting the type reconstruction algorithm more clearly. It then addresses the fact that it is possible to have objects reachable from the garbage collection roots for which it is impossible to reconstruct the type, and proves that in the absence of polymorphic primitive operators, it is possible to reconstruct the type of any useful reachable object, because any object which matters will be passed to a primitive operation at some point, and types can be reconstructed from there. Goldberg later touches briefly on the issue of having a polymorphic equality operator (or any polymorphic primitive) – the operator needs to use the data in objects to do its job, without knowing their type – and mentions offhand that this is dealt with by implicitly passing a method dictionary to the operator, which amounts to a tag. This is still tag-free in the sense that no extra bits are used up in the pointer, but not fully tag-free, because the compiler still needs to pass explicit runtime hints about types (a small sketch of this dictionary-passing idea follows this list).
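Transplanted into OCaml, the dictionary-passing idea looks roughly like this – my own sketch, with invented names; in Goldberg’s scheme the compiler arranges for these traversal functions, rather than the programmer writing them:

(* a "tracer" knows how to visit the heap pointers inside one type *)
type 'a tracer = 'a -> unit

(* base types contain no pointers, so their tracers do nothing *)
let trace_int : int tracer = fun _ -> ()

(* the collection routine for a polymorphic structure cannot know how
   to traverse its elements, so it takes a tracer for the element type:
   the implicit "dictionary" that amounts to a type-level tag *)
let rec trace_list (trace_elt : 'a tracer) (xs : 'a list) : unit =
  match xs with
  | [] -> ()
  | x :: rest -> trace_elt x; trace_list trace_elt rest

let () = trace_list trace_int [1; 2; 3]

No bits are spent in any pointer, but the right trace_elt still has to show up at runtime – which is exactly the sense in which this is not fully tag-free.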




Solaris and OS X Playing Together

4 11 2007

After winning my iPhone in a raffle, I was told by those who raffled it off that it was purchased before the price drop, and as a result, I could get the $100 credit. What better to spend it on than upgrading to Leopard? After all, a system with DTrace, DTrace-enabled Ruby, and a friendly GUI to boot is too good to pass up when you only have to pay $40 after taxes and credit.

So I spent about 6 hours backing up my data to my Solaris box. Why so long? Well, I could not make Tiger and Solaris 10 play nicely with file sharing, so I had to resort to using scp to move 32GB of data. Not pleasant. I’d tried getting Tiger to mount NFS shares, but with little (read: no) success.

So naturally one of the first things I did with my machine after installing Leopard, Firefox, and Adium was start poking around in the settings. I quickly found a “Show Advanced” button in the Directory Utility. And what appeared? A tab for network mounts! All I did was add a quick entry of the form you’d expect, start NFS on my Solaris machine, and voila! Immediate access to my backups over NFS. This is great – it means I can easily restore my iTunes collection to the fresh install of Leopard. The ease with which this was done makes me suspect I missed something glaringly obvious in Tiger.

Of course, a few things remain to be dealt with:

  • Supporting read/write access – just unifying the UID of my account on both machines.
  • Denying read access to other clients – probably just changing the permissions in my home directory.
  • Getting my Mac to dynamically resolve the hostname of my S10 box – we use DHCP on my home network, and I don’t want to change this.

Now, here’s to hoping this will get me to do more work on my Ruby on Rails project while I’m flying between coasts over the next few weeks!

A successful NFS configuration on OS X Leopard





Provably Correct Garbage Collection

25 10 2007

Working towards my honors thesis, my partners and I have read several interesting papers on using type systems to prove correctness (safety) of garbage collection implementations:

  • Type-Preserving Garbage Collectors: In this paper, Wang and Appel present a way of writing garbage collectors in the language being collected, relying on a statically checked type system to prove the safety of the collector. They implement a copying collector as a function written in the ML-like language they are collecting, basically by parameterizing all types by memory regions. Each data type and each language function is parameterized over a single region, treated as a type. The garbage collection routines are a basic collector plus per-datatype collector routines parameterized over a from-space and a to-space. Each copy function deep-copies a structure in the from-space, and returns a copy located in the to-space (a loose sketch of this region discipline follows this list). They address pointer sharing by providing an implementation with forwarding pointers (for which they perform an odd version of subtyping to guarantee the user program can’t modify the forwarding pointers in objects, since the collector is written in the language being collected). To guarantee that the collector returns no pointers to data in the from-space, they introduce a language construct ‘only’, which takes one or more regions and a body, and executes the body with only the specified regions in scope. The problem with this is that if you forget to use ‘only’, you could actually return the original from-space pointer. The paper is fairly well-written, and the work is presented as a series of progressively more complete garbage collector functions. A fair amount of time is spent addressing particulars of their implementation, because they make a number of simplifying assumptions: the language supported is first-order only, and programs are written in CPS with closures converted to data structures. Also presented are performance results, which are interesting but largely inconclusive, because the system they present is intended only to show that it is possible to write a safe garbage collector, and not meant to be practical. Many optimizations are mentioned throughout the paper as future work.
  • Implementation and Performance Evaluation of a Safe Runtime System in Cyclone: This paper by Fluet and Wang aims to show that a similar approach can actually be implemented with reasonable performance. They implement a Scheme interpreter in Cyclone (a safe dialect of C), and build a copying collector using Cyclone’s region primitives. The approach is relatively straightforward (and largely similar to the approach above) with one exception: instead of a safe upcast with a construct which prevents later downcasts, they use linear types (unique pointers) along with some under-specified functions to create a sequence of region types, to support forwarding pointers. They compare their interpreter, running some standard Scheme benchmarks, to a couple of others. Their interpreter running their garbage collector consistently (if slightly) outperforms their interpreter using the standard Cyclone conservative collector. That is fairly impressive, and shows that the overhead of safe collection alone is not a significant bottleneck. Comparisons to MzScheme, however, are not so great: with the exception of the continuation-based benchmark, MzScheme far outperforms their implementation, by a factor of 2 to 3. And this was in 2004, before MzScheme had a JIT compiler. This may be a result of the structure of their interpreter; it is a very basic interpreter, which performs a predefined number of steps on an abstract machine structure, checks whether a collection is necessary, and repeats. They also compare the total amount of unsafe code in each implementation, and come in at about 1/6th the amount of unsafe code of the next ‘cleanest’ implementation. Most of that apparently comes from a highly optimized version of malloc() used to implement the region primitives.
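The flavor of the region-typing trick can be sketched in plain OCaml with phantom types – this is my own loose illustration, not either paper’s system (theirs have real region primitives and handle forwarding pointers, which this ignores):

(* phantom types standing in for the two semispaces *)
type from_space
type to_space

module Region : sig
  type ('a, 'r) ptr
  val alloc_to : 'a -> ('a, to_space) ptr   (* allocation targets to-space *)
  val read : ('a, 'r) ptr -> 'a             (* reads work in any region *)
end = struct
  type ('a, 'r) ptr = 'a                    (* the region exists only in the types *)
  let alloc_to x = x
  let read p = p
end

open Region

(* a per-datatype copy routine: the type alone guarantees the result
   lives in to-space, so returning a from-space pointer won't compile *)
let copy_pair (p : (int * int, from_space) ptr) : (int * int, to_space) ptr =
  let (a, b) = read p in
  alloc_to (a, b)

With ptr kept abstract, client code simply has no way to forge a to-space pointer from a from-space one – which is the property the papers get from their region systems (a full collector would, of course, also need a story for allocating into from-space and flipping the spaces).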

Both papers write their main programs in slightly odd ways – the Wang/Appel paper does CPS and closure conversion to make stacks and environments safely traversable, and the Fluet/Wang paper sort of cheats by writing a straight interpreter in a separate language. Not to diminish the latter – they still write a garbage collector which is (itself) free of unsafe code, which is wonderful. But the constructions as state machines, and the CPS and closure conversions, definitely hurt performance. They’re necessary, however, in order to safely collect stacks and environments. My thesis advisor told my partners and me that he may be aware of a way to walk these structures without the efficiency loss of the explicit structures used in these papers. So next on the block: Continuations from Generalized Stack Inspection. That, and figuring out just how the two mysteriously under-specified primitives in the Cyclone paper really work (time to read the corresponding technical report), and contemplating the problems of forgetting to use the ‘only’ construct in the first paper.





Preparing to Build a Rails App

24 10 2007

I’ve been thinking for some time about building a system to organize all of the papers I need (or just want) to read, whether I’ve read them, etc., along with some indexing and tagging system so I can search for papers both by topic, and by my motivation for reading them.

As I’ve been enjoying playing with Ruby lately while working on crawling online postings of course textbooks for Mocha, and there’s been so much hullabaloo about Ruby on Rails, I’ve decided to give Rails a shot.

The instructions given for getting up and running are pretty straightforward, and short. Downloading, building, and installing Ruby was no problem. Downloading and installing RubyGems was also without incident. Pretty smooth so far. The next step in the instructions looks very straightforward: run gem install rails --include-dependencies. This yielded an error about not finding the rails package, which seems like a pretty big problem for one of Ruby’s flagship apps. Googling revealed that for some reason, you sometimes need to purge and regenerate your package cache before this works. So I did that. Still no luck, but the error this time is that the package server refused my connection. More googling yielded the helpful suggestion “Your error message seems like a Rubyforge problem. It happens occasionally. Just try it again later.” For something that bills itself as being akin to apt-get, it has a long way to go when a user’s first interaction is:

[colin@sliver:~/Desktop/rubygems-0.9.4]$ gem install rails --include-dependencies
Bulk updating Gem source index for: http://gems.rubyforge.org
ERROR:  While executing gem ... (Gem::GemNotFoundException)
    Could not find rails (> 0) in any repository
[colin@sliver:~/Desktop/rubygems-0.9.4]$ gem env
RubyGems Environment:
  - VERSION: 0.9.4 (0.9.4)
  - INSTALLATION DIRECTORY: /usr/local/lib/ruby/gems/1.8
  - GEM PATH:
     - /usr/local/lib/ruby/gems/1.8
  - REMOTE SOURCES:
     - http://gems.rubyforge.org
[colin@sliver:~/Desktop/rubygems-0.9.4]$ rm -f /usr/local/lib/ruby/gems/1.8/source_cache
[colin@sliver:~/Desktop/rubygems-0.9.4]$ gem install rails --include-dependencies
Bulk updating Gem source index for: http://gems.rubyforge.org
ERROR:  While executing gem ... (Gem::GemNotFoundException)
    Could not find rails (> 0) in any repository
[colin@sliver:~/Desktop/rubygems-0.9.4]$ gem update
Updating installed gems...
Bulk updating Gem source index for: http://gems.rubyforge.org
Gems: [] updated
[colin@sliver:~/Desktop/rubygems-0.9.4]$ gem install rails --include-dependencies
ERROR:  While executing gem ... (Gem::RemoteFetcher::FetchError)
    Errno::ECONNREFUSED reading http://gems.rubyforge.org/gems/rake-0.7.3.gem
[colin@sliver:~/Desktop/rubygems-0.9.4]$ gem install rails --include-dependencies
ERROR:  While executing gem ... (Gem::RemoteFetcher::FetchError)
    Errno::ECONNREFUSED reading http://gems.rubyforge.org/gems/actionwebservice-1.2.5.gem
[colin@sliver:~/Desktop/rubygems-0.9.4]$ gem install rails --include-dependencies
ERROR:  While executing gem ... (Gem::RemoteFetcher::FetchError)
    Errno::ECONNREFUSED reading http://gems.rubyforge.org/gems/rails-1.2.5.gem
[colin@sliver:~/Desktop/rubygems-0.9.4]$ gem install rails --include-dependencies
ERROR:  While executing gem ... (Gem::RemoteFetcher::FetchError)
    Errno::ECONNREFUSED reading http://gems.rubyforge.org/gems/rails-1.2.5.gem
[colin@sliver:~/Desktop/rubygems-0.9.4]$ gem install rails --include-dependencies
Successfully installed rails-1.2.5
[colin@sliver:~/Desktop/rubygems-0.9.4]$

Hm. At least the next step, generating a simple Rails app, works fine.





Slight Changes

22 10 2007

Some slight changes have been brewing for some time at this blog.

Mainly, I will start writing here about as many of the papers I read as I can. The motivation for this is twofold: increasing my frequency of writing, and understanding papers better. My thesis advisor has been telling me for some time now that I need to write a one-paragraph summary of a paper after I finish reading it, that I should not need to refer back to the paper in order to do this, and that I should not move on to the next paper without producing a summary for the previous one. I’ll try to group these into sets of relevant papers, which also tie together my thoughts on the papers and the ideas I get from them.

The next few papers will be on Singularity and provably correct garbage collectors. The JIT JavaScript VM project I was originally going to pursue with friends this semester split into multiple groups when another team member decided to pursue a different project and took half the group with him. I intended to stay with the original project, but only two others made the same choice, and three people is not enough to implement a complete JIT compiler from scratch in 3 months. We meandered through a number of topics, and still don’t have a clear final direction (which is worrisome considering this is supposed to turn into my honors thesis). We’ve settled more or less into the broad area of cross-language memory management and runtime systems. We’ve read a number of papers: several on conservative garbage collection, a couple on the Haskell FFI, the Xerox PARC Portable Common Runtime, and a few others. I’ll try to write about all of them at some point, again grouped into logical sets. That’s all for now – I have to spend time getting ready for upcoming midterms and interviews.