Summer Reading

8 06 2011

I’m headed back to Microsoft for the summer, to do some program analysis work.  I’ve got quite a lengthy commute from where I live in Seattle, which should be made much more bearable by the Microsoft Connector bus.  Since I won’t be driving, I should have plenty of time to get some reading done.  My reading goals for the summer are to work my way through:

  • Type Theory and Functional Programming: An older (1991) book on type theory, in the sense of (dependent) constructive type theory, which so far also does a good job of connecting type theory to logic.  It’s hard to find a thorough reference that builds up to dependent types as they relate to proofs and programs, from first principles of logic.  I believe there’s some overlap between this book and the next (Coq’Art), but TTFP is so far a much gentler introduction than the first few chapters of Coq’Art.  And since it’s out of print, the author has generously posted a PDF version online, so it’s convenient and free in addition to useful.
  • Coq’Art: Interactive Theorem Proving and Program Development: The reference on using Coq for proofs and program development.  Thorough, but dense so far.  It seems mostly focused on the use of Coq, but covers the underlying Calculus of Inductive Constructions as well.  After discovering first-hand how painful it can be to maintain a proof by hand as a formal system evolves, I’d really like to formalize all of my future proofs in a mechanized proof assistant.  And I have a desire to do some work on a Ynot-style embedding of a richly typed language.
  • Lambda-Calculus and Combinators: An Introduction: Covers what you’d expect from the title: connections between the lambda calculus and combinatory logic.
  • Semantics Engineering with PLT Redex: Covers some core background on specifying abstract machines (normally spread across many older papers), as well as the use of PLT Redex.  The authors do a good job of articulating the structure and purpose of the book here.
Quite a substantial undertaking to be done all outside of work (while still having a life), but I should at least be able to make a dent in things.  The fact that there’s some overlap in material among all of these books should also make it easier to get through all of them (and help the overlapping material sink in).  I’m already almost a third of the way through TTFP.  I’m not sure what order I’ll read the others, but I’m excited about all of them.




Problems with C and C++ Separate Compilation

11 11 2008

After graduation, a couple months of watching television, driving cross-country (if you get the chance you should drive across northern Wyoming), settling in at Microsoft and living in Seattle, I’m back.  And I’m annoyed at C.

C is a fantastic language in many ways.  It is essentially an abstract assembly language.  Almost any general-purpose operation which can be done in assembly can be done in C, and it makes building large, relatively portable systems much easier.  The only things which can’t be done directly in C are operations on specific registers (and it’s easy enough to link in short assembly routines when that’s necessary).

Most of my early interest in programming languages and most of my problems when I first started doing systems work were related to basic typing issues: the ugliness of casting things to void pointers and back, the conversions between various integer types, and other relatively mundane C errors which are easy to make and hard to debug.  I came to believe that the features other languages offer beyond type system and memory safety improvements, while extremely useful, were mostly great conveniences rather than fundamental improvements.

But the past several months have changed my mind.  While the ease of turning a pointer to one type of object into a pointer to another type in the same place is certainly a bane as often as it is a boon, a reasonably experienced C programmer begins to recognize common symptoms for such problems.  A more serious, though less frequently encountered problem has to do with type identity and versioning.

Consider the case where you write an application (in C) to use an external library.  Your application interfaces with this library through two means: #include-ing its public header, and being linked to the library’s object file.  Initially these two interfaces will probably be consistent (if, for example, you just installed the library).  Now move forward a couple of months.  Update your library.  Did your update include both the object file and the header file?  If not, then any size or layout changes to the library’s data types might cause non-obvious errors; your application will happily compile and link, but the results you get back from the library may not be what you expect.

What if it’s your library, or just an object file in your project?  These tend to have a fair amount of turnover.  Most moderately-sized projects use separate compilation to isolate code changes and avoid recompiling code that hasn’t changed.  But when tying these object files together, there are no checks to ensure that data structures exchanged between object files are consistent; the C compilation model assumes that your data structure definitions are stable, or that you recompile from scratch every time.  It also makes the reasonable assumption that the same compiler is used for every object file.  On the off chance you violate that expectation (perhaps with a compiler update), memory layouts of the same structure definition may differ between object files.
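A minimal sketch of the failure mode (the file, type, and function names are invented for illustration):

/* point.h, version 1 -- the layout lib.o was compiled against months ago */
typedef struct {
    int x;
    int y;
} Point;
int point_sum(const Point *p);   /* defined in lib.c */

/* point.h, version 2 -- a field was later added at the front */
typedef struct {
    int id;     /* new field; x and y now live at different offsets */
    int x;
    int y;
} Point;
int point_sum(const Point *p);

/* app.c -- recompiled against version 2, but still linked with the stale lib.o */
#include "point.h"
#include <stdio.h>

int main(void) {
    Point p = { 7, 1, 2 };       /* id = 7, x = 1, y = 2 */
    /* lib.o still reads ints at offsets 0 and 4, so it computes 7 + 1 = 8
       rather than 1 + 2 = 3.  Everything compiles and links cleanly. */
    printf("%d\n", point_sum(&p));
    return 0;
}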

It’s possible to work around this problem with a build system if you track every header file dependency explicitly.  For large projects, this can be difficult.  Especially with fast-moving projects, it’s easy to add an include to a .c file without remembering to add the dependency to the build system configuration.  Once this missing dependency goes unnoticed for some time it becomes considerably more difficult to track down, and developers end up either spending their time debugging the build system or giving up on the broken incremental build and rebuilding from scratch every time.

Another permutation of the same problem is that of unrelated structures with the same name.  It’s easy to imagine a large system with two subsystems defining structures named CALLBACK_ARGS.  What happens when one section of code needs to interact with both of these systems?  If all appropriate headers are included, then the name collision will be detected.  If only one of the conflicting headers is included, then depending on how the headers are organized it becomes trivially easy to pass the wrong structure to a function.  Especially when working on a new system, it usually seems reasonable to assume that structures of the same name are the same semantic (and in-memory) structure.
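A sketch of how that can play out (all names invented), with subsystem 2’s public header declaring its entry point but leaving the structure definition to “some other header”:

/* subsys1.h */
struct CALLBACK_ARGS {              /* subsystem 1's layout */
    void *context;
    int   flags;
};
void s1_register(struct CALLBACK_ARGS *args);

/* subsys2.h -- declares the function only; subsystem 2's own CALLBACK_ARGS,
   with a completely different layout, is defined in its private headers */
void s2_dispatch(struct CALLBACK_ARGS *args);

/* caller.c */
#include "subsys1.h"
#include "subsys2.h"

void notify_both(void) {
    struct CALLBACK_ARGS args = { 0, 0 };
    s1_register(&args);   /* fine: this is the layout subsystem 1 expects */
    s2_dispatch(&args);   /* also compiles and links, but subsystem 2 was
                             built against a different CALLBACK_ARGS       */
}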

Namespaces can help alleviate the same-name problem: including only one structure’s header and trying to pass that to another function will result in an error complaining about passing an argument of type Subsystem1::CALLBACK_ARGS* to a function expecting a Subsystem2::CALLBACK_ARGS*.  This doesn’t actually prevent you from declaring two structures of the same name in the same namespace in separate header files, but if namespaces are used judiciously to separate subsystems then the likelihood of doing so accidentally is greatly reduced.

The versioning problem is a direct result of how #include works in C.  Rather than being a direct part of the language, #include is a preprocessor directive equivalent to “take the text of the specified file and pretend I typed it in place right here, then pass that result to the actual compiler.”  At its core, a C compiler handles only one file at a time, so it doesn’t actually know anything about other object files (or at least, it doesn’t directly use information about other object files).  That’s the linker’s job, and the linker knows nothing about structures per se – only matching symbolic references.

One solution is to store all structure layout information in object files, and generate code for accessing those structures once at link time.  This slows the linking process, but prevents the mismatched definition problem; all code for accessing the structure is generated at the same time from the same definition.  This blurs the distinction between compiler and linker, but adds great value.
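As a crude stopgap, you can approximate a sliver of this in plain C by hand: encode a layout version into the symbol name, so a header/object mismatch turns into a link error rather than silent corruption.  A quick sketch (names invented, and keeping the version current is entirely up to the programmer’s discipline):

/* point.h */
typedef struct {
    int id;      /* added in layout version 2 */
    int x;
    int y;
} Point;

/* Bump the suffix whenever the layout changes.  An object file compiled
   against a stale header still references point_sum_layout_v1, which no
   longer exists anywhere, so the link fails loudly. */
#define point_sum point_sum_layout_v2
int point_sum(const Point *p);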

Generating structure-access code at static link time is relatively cheap and straightforward.  Doing it at load-link time is a bit trickier.  While compilers and static linkers can play any tricks they want for code which only interacts directly with itself, dynamically linked executable formats must be defined in standard ways, limiting what can be done.  I don’t know of any major executable formats which support this (most were designed in the heyday of C and C++, when they were still the best languages around), but that is a matter of format standards rather than a technical limitation.  This would be more expensive than current dynamic linking, but doable.  A compiler could choose to use a richer format for its own object files and then resort to standard formats when asked to generate a standard library or executable.  OCaml does this; if Test.cmx and Mod.cmx were compiled against differing interface files for a Test module’s data structure, the linker catches the mismatch:

Yggdrasil:caml colin$ ocamlopt Test.cmx Mod.cmx
Files Mod.cmx and Test.cmx make inconsistent assumptions over interface Test
Yggdrasil:caml colin$ 

Unfortunately C and C++ have a compilation and linking model which is now so well-established that any proposal to fix this in the standards for those languages would likely meet with significant resistance.  Though at the same time, I can’t think of any desired C/C++ semantics that this would break, so maybe it could happen.





WordPress for iPhone Bug

23 07 2008

The few of you who subscribe to this blog’s RSS feed may have noticed a post with the title “!$title$!” preceding this post briefly.  Apparently this is a known bug in the recently released WordPress application for the iPhone.  For many users, as soon as you add your account to the program, or if you open a new post and then back out of it, the program makes a test entry and leaves it there.  This bug is so easy to trigger that I’m surprised it made it out the door (or that a random ‘test post’ function made it into the final product at all).





Done with Undergrad, Procedural Generation, Model Checking

17 05 2008

I’m finally done with undergrad.  It feels like 4 years just blew by in the blink of an eye.  And come fall, I’ll be working, but also applying to grad schools.  What a ride.

Incidentally, I did finish my thesis.  It’s here: Type-Safe Stack Inspection for Garbage Collector Implementation.  Every time I write that title now I wish I’d cut it off as “Type-Safe Stack Inspection.”  Oh well, too late now.  I meant to post about it right after I finished, but the final edits ran into finals period, which consumed the past two weeks of my life.

On to other projects.

Currently, Brendan, Rob and I are working on building a fairly ambitious mapping tool.  We decided to use procedural modeling to implement a multiple-level-of-detail, real-time procedural generator for, well, planets.  The idea is that we’ll start by rendering a planet with continents and oceans on screen, and the user can zoom in – to nearly any level of detail.  Moving closer to the planet gradually adds greater levels of detail: rivers, mountains and valleys, then cities and roads, then city streets and buildings, and possibly even building interiors.  We think we can do it quickly and without using unreasonable amounts of memory by taking advantage of the fact that procedural generation is deterministic if you reuse the same random number generator seed.  So we can cache seeds for areas, along with constraints on the edges of some areas, and use a negligible amount of memory to store enough information that you can back out or cross the planet, have the old region freed from memory, then return and see the same things.  We’re aiming pretty high, so we’ll see what happens.
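The core trick looks something like this (a rough sketch, not our actual code; the hash and generator here are just placeholders):

#include <stdint.h>

/* Mix the world seed with a region's integer coordinates to get a
   per-region seed.  Any decent integer hash would do. */
static uint64_t region_seed(uint64_t world_seed, int64_t rx, int64_t ry) {
    uint64_t h = world_seed ^ (rx * 0x9E3779B97F4A7C15ULL)
                            ^ (ry * 0xC2B2AE3D27D4EB4FULL);
    h ^= h >> 33;  h *= 0xFF51AFD7ED558CCDULL;  h ^= h >> 33;
    return h ? h : 1;                 /* keep the generator state non-zero */
}

/* A tiny deterministic generator (xorshift64), seeded per region. */
static uint64_t next_rand(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;  x ^= x >> 7;  x ^= x << 17;
    return *state = x;
}

/* Regenerating a region always yields the same terrain, so only the seed
   (plus any constraints on shared edges) needs caching, not the geometry. */
void generate_region(uint64_t world_seed, int64_t rx, int64_t ry) {
    uint64_t state = region_seed(world_seed, rx, ry);
    for (int i = 0; i < 64; i++) {
        uint64_t r = next_rand(&state);
        (void)r;   /* ... place rivers, mountains, cities, ... */
    }
}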

I also picked up a copy of a book my advisor told me about, Principles of Model Checking.  It was just released earlier this month, and it’s great so far.  I read a couple of the original LTL and CTL model checking papers (LTL: Model Checking Using Automata Theory, CTL: Automatic Verification of Finite-State Concurrent Systems Using Temporal Logic Specifications) a couple years ago in a seminar, but I didn’t have enough background in CS to really have all the details quite click for me back then (those papers were actually the first time I saw automata in CS, before I took a theory of computation class).  This book is shaping up to be a good refresher for me (also filling in gaps from what I missed the first time I looked at this material) and seems like it would be a good introduction for those new to model checking as well.  Sadly there is not yet a place to look inside it online.





Mercurial + MLKit + OS X = Failure, plus workaround

21 03 2008

My thesis, as I’ve mentioned several times before, has me hacking up the source to the ML Kit compiler. Because the current build arrangement I’ve been using requires a great deal of RAM, I’ve been using my desktop machine and CS department machines exclusively for building. However, I’m going home for part of spring break this coming week, there’s no computer with appropriate specs at home, and editing code over SSH is rather painful because of the high latency involved. The obvious solution, since I’m using Mercurial to manage my thesis, was to do a checkout on my laptop (a G4 Powerbook lacking enough RAM to build, and lacking the correct processor architecture to test), edit locally, and just push intermediate versions back to the CS systems to build there. This would result in some useless crap in the commit logs, but it beats the alternatives. I had an existing checkout from back in January, when I was using OCaml, so I went to update and saw the following:

Yggdrasil:thesis colin$ hg up
0 files updated, 0 files merged, 0 files removed, 0 files unresolved
Yggdrasil:thesis colin$ hg pull
remote: ####################
pulling from ssh://cs/thesis/
searching for changes
adding changesets
adding manifests
adding file changes
added 44 changesets with 18762 changes to 17825 files (run 'hg update' to get a working copy)
remote:
remote: Forwarding you to cslab0a
remote:
remote: ####################
remote:
Yggdrasil:thesis colin$ hg up
abort: case-folding collision between mlkit/kit/basis/MLB/RI_PROF/Splaymap.sml.d and mlkit/kit/basis/MLB/RI_PROF/SPLAYMAP.sml.d
Yggdrasil:thesis colin$

I saw this, noticed it was a temp file (ML Kit makes an MLB directory for temp files) and hg rm’d all such files in the CS copy, committed, and tried again. Same result. Poked around online, and after a bit, found a message from someone with a similar problem, with a similar case difference between two distinct files, and a response pointing out the problem with Mac OS X’s case-insensitive filesystem.

When I installed the latest OS X, I intentionally opted for case-insensitive because many Mac programs have a habit of breaking on case-sensitive filesystems. And now I can’t have a copy of the ML Kit source on my machine. It’s common practice in SML programs to use all-caps names for files containing signatures (akin to interfaces, for the non-ML folk) and “upper” camel case for implementations. So I can’t just rename a couple of files; it seems you just can’t have the ML Kit source on most OS X installations’ local disks.

So was it time to give up on this? No, of course not – I need to work on my damn thesis while I’m home. Screw Apple and its crappy filesystem design choice.

Fortunately Apple did make one excellent, excellent design choice with OS X, which was making disk images a standard, common object. So I used the Disk Utility tool to make a disk image containing a case-sensitive filesystem, mounted it, and did a checkout inside that. It’s pretty stupid that I needed to do this, but at least I can work on my thesis over break without running Vim over SSH.

Now if you’ll excuse me, my friend Owen is heading off to Apple’s filesystem team next year, and I need to go give him grief for his group’s terrible design choice.





Amazon’s Book Recommendation System is Quirky

15 03 2008

Like many others, I have an Amazon.com wishlist, allow them to track my purchases, and occasionally tell them that I own a particular book if it is recommended to me.  Amazon uses this along with purchase sets made by other customers to generate “better” recommendations for books I might find interesting.  One thing its recommendation algorithm doesn’t seem to take into account is general subject knowledge – it knows of subcategories, and purchases of multiple items by the same people, but has no concept of actual relationships between books in a category.  This is actually a pretty reasonable thing to omit since there’s a great deal of such knowledge which could be taken into account, and the work/reward ratio for maintaining that is probably not great.  But as a result, I’ve had some rather interesting recommendations.

Also, I’m linking to books for clarity, but there’s no Amazon referral crap going on.

Mathematics texts.  Books on my Amazon wishlist: Foundations of Mathematical Logic, The Calculi of Lambda Conversion, Mathematical Foundations of Information Theory, Categories for the Working Mathematician, Basic Category Theory for Computer Scientists, among others.  Some of the more amusing mathematics recommendations from Amazon: Pre-Algebra, Algebra 2.

Literature.  Books on my Amazon Wishlist: Going After Cacciato, Letters to a Young Poet, The Norton Anthology of Poetry, ‘Tis: A Memoir, The Dharma Bums, among others.  Amazon recommends: Vocabulary Workshop: Level C.

These are just the few that I dug up after spending a couple of minutes searching my emailed recommendations; I’ve seen plenty more out-of-place recommendations upon logging into Amazon.com.  While it’s certainly true that some people will buy books at both extremes of a field’s complexity (perhaps something advanced for themselves or a friend, and something basic for a child), it’s still amusing to see this disparity in recommendations for a single individual, given how they’re presented.  Amazon.com presents recommendations as a set of items you should be interested in according to Amazon’s “understanding” of your personal interests, when it’s really just correlating your purchases and explicit interests with the purchases of other users.





Giving up on Ruby on Rails for Python and Django

24 02 2008

Well, a while back I wrote about starting to use Ruby on Rails to build myself a web application to track technical papers which I had read or planned to read. I gave up on RoR after a (very short) time period. Now, despite preferring to code in Ruby over Python, I’ve switched to Django. Why?

  • While I am capable of writing SQL myself, I really, really loathe it.
  • The Django documentation is easier to find and deal with.
  • Django seems to do more busywork for you than Rails does.
  • It’s easy to poke around in the intermediate stages of development.

One of the pain points of RoR is the fact that you still have to deal directly with your database engine (likely, as in my case, MySQL). I’ve learned to deal with SQL after working on Mocha for a while. That doesn’t mean I like it. In fact, it’s still a pain. And the worst part of dealing with SQL is setting up tables, which is in fact the only database work Rails requires you to do – the worst part. It then does some magic to connect your Ruby to the schema you’ve specified. Django lets you define the data model directly in Python. From there, it automatically creates the SQL tables, including creating join tables for many-to-many relationships, which is where I hit the main roadblock with Rails; I knew how to define the tables for the many-to-many relationship, but for whatever reason couldn’t easily find information on making a many-to-many relationship explicit in the Ruby model. The only SQL I had to do with Django was enough to create a new database, and ‘CREATE DATABASE whatever’ isn’t much to do, even if you’ve never used SQL directly before.

The many-to-many relationships documentation for Django is still a bit confusing, but clear enough that you can play with things, and unlike RoR, I could find it. The RoR documentation seems to consist primarily of the class API docs (only useful once you already mostly know what you’re doing) and somewhat mediocre 3rd-party tutorials. The Django documentation is actually organized as a manual. Overall the Django documentation is clear, and the initial tutorial on the Django site is incredibly useful – I’ve built everything thus far simply following along with the tutorial and tweaking things as needed – following links to more detailed documentation on what data types are available, using many-to-many instead of many-to-one, etc. Django also creates far fewer files, and almost all of them have an immediately obvious purpose. There’s less magic in the system, without making you do more work. It also makes setting up an administrative interface much easier, and includes a module for easy authentication setup for your application in the default install.

This is not to say that picking up Django has been nothing but butterflies and ponies; I’ve run into a few rough edges.

One design decision in Django that I’m not sure I’m happy with (though I’m not unhappy with it either) is the decision to orient it towards a “newsroom model,” as the tutorial describes it. This means the architecture was designed around the assumption that there are two distinct sets of users: publishers and readers. Django includes a big module (the one just mentioned above) for managing users and creating an administrative interface (‘django.contrib.admin’) and the documentation suggests its use. While you’re certainly free not to use it, its presence encourages its use, thereby encouraging the same model the developers had in mind. This isn’t necessarily a bad thing. I’m not sure it makes a difference to me, though it definitely encourages a split interface between the two sides of the application, which might not be right for all cases.

The real rough edges I’ve been rubbing up against are mostly regarding the many-to-many relationship. While I wasn’t expecting everything to be perfect, the disparity in easy flexibility between normal object fields and many-to-many fields is surprising. For example, the administrative interface Django suggests includes support for a lot of easy UI additions, such as searching on fields. This works great for standard fields, and not at all for many-to-many fields. In my case I have a many-to-many mapping relating papers and authors. For other fields in the Paper or Author models, I can add a search_fields = [‘title’] and make it easy to search the titles in the UI. Doing the same for the author list (a many-to-many field) does nothing. The semantics aren’t clear, but there doesn’t seem to be an easy way to add this functionality. The other issue is the manager class which is actually returned for member accesses to that field of the Paper object; it masquerades as a list, but isn’t one. It has an iterator, and can be used normally in that way, but most of the standard list uses don’t apply. In my case, I wanted to use string.join() to concatenate the author names in a listing. I can accumulate a string with a for loop, but that’s not the most natural or idiomatic way to express that action. Little things like that.

If it seems like I’m doing something boneheaded, or missing some part of the documentation on Django’s many-to-many relations, please tell me. Also, as a final note, it’s probably worth mentioning that the version of Django in Debian and Ubuntu’s repositories is slightly older than the version the main tutorial is for. I followed a link from that tutorial to the tutorial for the previous release.





Working with Large ML Code Bases

11 02 2008

It’s interesting how tool prevalence and standardization (or lack thereof) for a programming language are very strong indicators of how many people have used it to build significant projects.

My honors thesis has me hacking up ML implementations – I started with OCaml, but later switched to MLkit after realizing that it already had support for some features I was working to implement in OCaml. As you might imagine, the source for a compiler and runtime for reasonably mature languages (the ML family of languages started in the mid-70s) is a reasonably hefty chunk of code. The layout of the MLkit source tree is fairly clear (there is a conveniently-named “Compiler” directory, containing such subdirectories as “Backend/X86/”). The layout of the OCaml source is slightly less intuitive, but it comes with an informative README which points out where various parts of the system reside. Yet even within these more focused segments of the source trees, there is a lot of code. And it’s not boilerplate – when you need functions and types for spitting out most of the relevant instructions on the x86 architecture, or handling every abstract syntax tree node of a very expressive language, you’re going to have a lot of code. But there are no good tools for managing a code base of this size in any ML descendant.

I ran into this to some degree with the class project for my software engineering class last year (CS190 – 2007’s page is missing… I need to look into that). We wrote a frontend to GDB using OCaml. The idea was to display separate, independently-controlled panes for each thread of a multithreaded program (obviously, this didn’t scale well past, oh, 5 threads). It worked reasonably well at the end of the semester, but not well enough that we would ever release it. There were a few times, in a project which totaled only a few hundred lines of code, that I wanted some mechanized help working through other modules quickly to find what I needed. Several times I resorted to emailing a peer – hardly a terrible horror, but not something which should have been necessary simply to find a function in a small code base. The only tools we had for managing the code base were… Subversion and Vim.

MLkit is nearly 240K lines of code in over 1000 files (that count includes comments and the odd blank line, though as it is primarily a research project, comments are few and far between). Even cutting it down to just the compiler (ignoring its dependencies elsewhere in the tree) we’re looking at almost 56K lines of code in over 150 files.

C, C++ and Java programmers have a plethora of tools at their disposal for working with code bases of this size. When I was at Sun this past summer, between cscope and having OpenGrok set up on the kernel code, finding definitions, declarations, types, etc. was usually a breeze. Both have their limitations, and their bugs, but they’re still great. And that’s not even using a lot of the IDE support present in major IDEs like Visual Studio or Eclipse.

ML programmers have no reliable systems which I can find.

A quick search for OCaml IDEs yields a few results. A page listing a few of what look to be mostly-dead OCaml plugins for Eclipse. Of these ODT has apparently seen activity this year, but seems to be mostly syntax highlighting and a build tool. A from-scratch IDE, Camelia, built by my friend Nate when we TAed CS017 (now unfortunately renamed CSCI0170, thanks to the abomination that is Sungard‘s Banner…). Searching for any of OCaml, SML, ML, or Haskell with “cscope” on Google yields nothing of use. Jane Street Capital uses OCaml. I wonder what tools they use.

Oddly enough, it seems the winner in this case may be the F# plugin for Visual Studio. It has support for following references around large code bases, and is actively maintained (as MS intends it to be a mainstream .NET language in the near future). Unfortunately, it can also only deal with F# files, which are close to, but not quite close enough to, Standard ML files…

Perhaps I’ll build a cscope-alike for ML myself.

After I finish my thesis.

EDIT: A commenter on reddit suggested taking a look at ocamlbrowser. It seems to be at least mostly what I’m looking for (for OCaml), though I can’t seem to get it to do anything other than display the types for functions in the standard module library, so I can’t say for sure. There’s a setting to change the module search path, but it doesn’t seem to change anything for me. I also find it odd that despite the fact that it is included with the OCaml base distribution (at least in Debian), no web search for any query I can think of which expresses the sort of task ocamlbrowser is supposed to facilitate yields any results referencing it. This suggests that very few people use it – maybe I’m not the only one who can’t get it to work on an arbitrary code base easily. Might play with it some more once I get some more time.

EDIT2: Another reddit commenter mentioned otags (in the vein of ctags), which looks good.  If only I were still working in OCaml :-p On the other hand, looking up the link for ctags made me look at that again (I had foolishly assumed that it, like cscope, supported only C and C++).  In particular, apparently someone reimplemented ctags for more language support (and called it Exuberant Ctags).  The language support list claims that it in fact works on SML!  A download and compile later, I’m quite the happy camper!  It works just like the standard ctags.  Many thanks to zem on reddit for inspiring a very fortuitous Google search.





Variant Interfaces

1 01 2008

Today I finished my first pass reading of Colimits for Concurrent Collectors, and was struck with an interesting idea. The paper is a demonstration of an approach to refine specifications using the notion of colimits. I don’t fully understand the refinement process yet (hence the “first pass reading”). The paper contains many example specifications written in a variant of Specware, used to specify a safe interface between a mutator and garbage collector without using stop-the-world semantics – this is the case study for the refinement.

An interesting pattern of sorts is used. A number of the specifications for portions of the system are parameterized by other specifications. A common idiom in the examples is a line similar to:

MORPHISM Cleaning-View = ONLY nodes,roots,free,marked,workset,unmark

What this line means is that the current specification, when it refers to an instance of the Cleaning-View specification, can only access the mentioned observers. This is interesting if we translate it to interfaces in an object-oriented environment. What this amounts to is essentially defining an interface (here, a specification) and then multiple subsets of this interface (here, imported subsets of observer functions). One application of this (and I think the most interesting) is using the sub-interfaces as restricted capabilities on objects. This is a reasonably common use of interfaces in OO languages.

You can already mimic this in normal object-oriented languages by defining the minimal interfaces and having the larger ones extend them. But what about keeping siblings with overlap in sync? A nice alternative would be to be able to specify all of the siblings together, as a set of variant interfaces (similar to the notion of variant types). For example, in pseudo-Java:

public interface Primary {
    void func1();
    void func2(int x);
    void func3(double d);

    subset Takeargs {func2, func3};
    subset Noargs {func1};
    subset Nofloatingpoint {func1, func2};
}

This example would of course need to be cleaned up a bit to deal with method overloading.

The problem with doing the equivalent of this in normal Java is keeping these “sub-interfaces” in sync with the main interface, and keeping overlapping siblings in sync without just having an arbitrarily large hierarchy of interfaces. Allowing joint specifications, and then referring to variants through names like Primary.Takeargs (to say that a variable should be of type Primary, but the current scope should only have access to the Takeargs subset) would ease this difficulty, making the use of interfaces for restricted control slightly easier to deal with as a programmer.

I also wonder what could be done with this as an extension to a language such as Spec#, with pre- and post-conditions on methods. You could then deduce, more easily than before, that objects using only a restricted subset of an interface maintain certain invariants (such as making no modifications to an object). More importantly, you could guarantee it statically at compile time – the interface variants would be subtypes of the main interface type. Without the explicit sub- and sibling-interface relationships it would still be possible to statically verify properties of interest, as the compiler can see what methods are called on various objects, but this approach lets the programmer encode this restriction easily.

Ultimately this is just syntactic sugar for a construct which can already be used, but it encourages the programmer to make the type checker do some more verification work for the programmer. It’s also sort of a static version of something similar to the E caretaker for capability delegation, or more generally the object capability model (lacking revocation, because all of this happens at compile time). I should read some of the EROS and Coyotos papers…

It would be interesting to hack up a prototype implementation of this, but I don’t feel like mucking about right in the internals of another compiler (thesis work…), and unfortunately I don’t think any of the good meta-programming languages are really appropriate for this. Scheme is dynamically-typed, Ruby is duck-typed (so really, dynamically typed), and you need static typing to get the programming benefits of this. Perhaps it’s finally time for me to learn MetaOCaml.





Typing the Stack, and Why Tag-Free Garbage Collection Doesn’t Quite Work

28 11 2007

One of the most significant barriers to implementing a completely safe garbage collector lies with collecting the stack. Heap data structures are relatively easy to walk — that is to say, one can generate a walk function for each variant of heap data structure in the language being collected. The stack is more difficult because it is not an explicit structure in the language. The other work I’ve read towards this (described in the post linked above) made rather awkward contortions to get around this fact. Wang and Appel CPS-converted and closure-converted their entire programs. Wang and Fluet side-step the issue by writing a safe collector for the internals of an interpreter – which isn’t really quite garbage collection in the normal sense, since they just have a Scheme stack structure defined in Cyclone which they then traverse. The harder problem is to actually type the machine-native stack.

  • The Essence of Compiling with Continuations: Towards this, my partners and I read the work defining A-Normal Form, or ANF. The authors examined a number of CPS-based compilers to see what work they actually performed (many compilers use CPS as an intermediate representation to support some transformations). They found that to mitigate the code expansion of CPS transformations, and to get rid of some of the execution overhead from relatively trivial continuations, the compilers performed a number of additional reductions, pulling the continuation out of argument lists and leaving it implicit in a register, and a few other tricks. Taken together, these transformations amount to CPSing the source program, performing the normal reductions, and then un-CPSing the program. The authors then present an equivalent transformation directly on the source language, which is simpler to implement. This essentially accomplishes the same thing as using SSA – giving a name to every intermediate result of the computation (a small C-flavored illustration of that naming follows this item).
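To make “naming every intermediate result” concrete, here is the shape of the transformation in a C-like setting (just an illustration, not how the paper presents it):

int f(int), g(int), h(int);

/* Source: nested calls hide the intermediate results. */
int before(int x, int y) {
    return f(g(x) + h(y) * 2);
}

/* ANF-style: every intermediate value gets a name, and every argument is a
   trivial expression (a variable or constant), much like SSA temporaries. */
int after(int x, int y) {
    int t1 = g(x);
    int t2 = h(y);
    int t3 = t2 * 2;
    int t4 = t1 + t3;
    return f(t4);
}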

This may help with typing the stack – we might be able to generate each frame’s type from the ANF of the program. Obviously some more work needs to be done here; how this might be done isn’t immediately obvious. I complained that we would have to mark the stack frames in some way, and to suggest this was not as terrible as I thought, our advisor suggested reading a couple papers by Benjamin Goldberg which pretty much wrapped up work on garbage collection without tags. In a nutshell, it really only works for languages lacking polymorphic operators.

  • Tag-Free Garbage Collection for Strongly Typed Programming Languages: This paper describes a way to sidestep tagging pointers on the stack into the heap (the notion was mentioned earlier by Appel, but implementation details were omitted). It does this basically by embedding frame-specific garbage collection routine pointers right after function calls. It then changes compilation so that returning from a routine goes an extra instruction later. Goldberg says that this works fine because the retl instruction on SPARC is only a pseudo-instruction which compiles to a jump with a hard-coded offset – I wonder how portable this is. This of course only works in a straightforward way for monomorphically typed languages. To extend the scheme to support polymorphic functions, they begin passing type-specific traversal functions to the frame-specific collection functions. This is then extended yet again to support higher-order functions – the overall strategy is to stash location-specific pointers in select locations to provide ready access to correctly-typed collector procedures in an efficient manner. There is also an addendum at the end about implementing collection for parallel languages. One interesting idea presented which isn’t explored in much depth is that there can be different collection procedures for different stages of each procedure. This allows for optimizations such as not traversing uninitialized pointers early in the procedure, or not traversing dead local variables during collections after their last access in the function. (A rough, simplified sketch of the per-frame root-map idea appears after this list.)
  • Polymorphic Type Reconstruction for Garbage Collection without Tags: This second paper revisits much of the content of the first paper, more clearly presenting the type reconstruction algorithm. It then addresses the fact that it is possible to have objects reachable from the garbage collection roots for which it is impossible to reconstruct the type, and proves that in the absence of polymorphic primitive operators, it is possible to reconstruct the type of any useful reachable object, because any object which matters will be passed to a primitive operation at some point, and types can be reconstructed from there. He later touches briefly on the issue of having a polymorphic equality operator (or any polymorphic primitive), and the problem with the operator needing to use data of the objects to do its job without knowing the type, and mentions offhand that this is dealt with by implicitly passing a method dictionary to the operator, which amounts to a tag. This is still tag-free in the sense that there are no extra bits used up in the pointer, but not fully tag-free because the compiler still needs to pass explicit runtime hints about types.
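To pin down the per-frame idea from the first paper, here is a simplified, portable sketch. This is not Goldberg’s actual mechanism (his collector finds the right routine through return addresses, with no per-frame bookkeeping at runtime); it is just an explicit shadow-stack version of “each frame carries a statically known description of where its pointers live,” with invented names:

#include <stddef.h>

/* One descriptor per function, describing which locals hold heap pointers.
   A compiler would emit these; here they are written by hand. */
struct frame_map {
    int   num_roots;
    void **roots[8];              /* addresses of the pointer-holding locals */
    struct frame_map *caller;     /* link to the caller's registered frame   */
};

static struct frame_map *top_frame = NULL;   /* the shadow stack */

static void push_frame(struct frame_map *f) { f->caller = top_frame; top_frame = f; }
static void pop_frame(void)                 { top_frame = top_frame->caller; }

void trace_pointer(void *p);      /* provided by the heap tracer */

/* The collector walks frame descriptors instead of guessing at raw words. */
void scan_stack_roots(void) {
    for (struct frame_map *f = top_frame; f != NULL; f = f->caller)
        for (int i = 0; i < f->num_roots; i++)
            trace_pointer(*f->roots[i]);
}

/* Example mutator function: it registers its pointer-holding locals before
   doing anything that might trigger a collection. */
void example(void) {
    void *a = NULL, *b = NULL;
    struct frame_map f = { 2, { &a, &b }, NULL };
    push_frame(&f);
    /* ... allocate, call other functions, possibly collect ... */
    pop_frame();
}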