Summer Reading

8 06 2011

I’m headed back to Microsoft for the summer to do some program analysis work.  I’ve got quite a lengthy commute from where I live in Seattle, which should be made much more bearable by the Microsoft Connector bus.  Since I won’t be driving, I should have plenty of time to get some reading done.  My reading goal for the summer is to work my way through:

  • Type Theory and Functional Programming: An older (1991) book on type theory, as in (dependent) constructive type theory, that so far also does a good job of connecting type theory to logic.  It’s hard to find a thorough reference that builds up from first principles of logic to dependent types as they relate to proofs and programs.  I believe there’s some overlap between this book and the next (Coq’Art), but TTFP is so far a much gentler introduction than the first few chapters of Coq’Art.  And since it’s out of print, the author has generously posted a PDF version online, so it’s convenient and free in addition to useful.
  • Coq’Art: Interactive Theorem Proving and Program Development: The reference on using Coq for proofs and program development.  Thorough, but dense so far.  It seems mostly focused on the use of Coq, but covers the underlying Calculus of Inductive Constructions as well.  After discovering first-hand how painful it can be to maintain a proof by hand as a formal system evolves, I’d really like to formalize all of my future proofs in a mechanized proof assistant.  I’d also like to do some work on a Ynot-style embedding of a richly typed language.
  • Lambda-Calculus and Combinators: An Introduction: Covers what you’d expect from the title: connections between the lambda calculus and combinatory logic.
  • Semantics Engineering with PLT Redex: Covers some core background on specifying abstract machines (normally spread across many older papers), as well as the use of PLT Redex.  The authors do a good job of articulating the structure and purpose of the book here.
Quite a substantial undertaking to complete entirely outside of work (while still having a life), but I should at least be able to make a dent in it.  The overlap in material among these books should also make it easier to get through all of them (and help the overlapping material sink in).  I’m already almost a third of the way through TTFP.  I’m not sure in what order I’ll read the others, but I’m excited about all of them.




Problems with C and C++ Separate Compilation

11 11 2008

After graduation, a couple months of watching television, driving cross-country (if you get the chance you should drive across northern Wyoming), settling in at Microsoft and living in Seattle, I’m back.  And I’m annoyed at C.

C is a fantastic language in many ways.  It is essentially an abstract assembly language.  Almost any general-purpose operation which can be done in assembly can be done in C, and it makes building large, relatively portable systems much easier.  The only things which can’t be done directly in C are operations on specific registers (and it’s easy enough to link in short assembly routines when that’s necessary).

Most of my early interest in programming languages, and most of my problems when I first started doing systems work, were related to basic typing issues: the ugliness of casting things to void pointers and back, the conversions between various integer types, and other relatively mundane C errors which are easy to make and hard to debug.  I came to believe that other languages’ features beyond type system and memory safety improvements, while extremely useful, were mostly great conveniences rather than fundamental improvements.

But the past several months have changed my mind.  While the ease of turning a pointer to one type of object into a pointer to another type in the same place is certainly a bane as often as it is a boon, a reasonably experienced C programmer learns to recognize the common symptoms of such problems.  A more serious, though less frequently encountered, problem has to do with type identity and versioning.

Consider the case where you write an application (in C) that uses an external library.  Your application interfaces with this library through two means: #include-ing its public header, and being linked against the library’s object file.  Initially these two interfaces will probably agree (if, for example, you just installed the library).  Now move forward a couple of months.  Update your library.  Did your update include both the object file and the header file?  If not, then any size or layout changes to the library’s data types can cause non-obvious errors: your application will happily compile and link, but the results you get back from the library may not be what you expect.
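As a minimal sketch of the hazard (names are hypothetical, and both layouts are collapsed into one file so the effect is directly observable), pretend lib_get_x() lives in a library object file compiled when the header still declared point_v1, while the application has since been rebuilt against a header declaring point_v2:

#include <stdio.h>

struct point_v1 { int x; int y; };              /* layout the library was built with */
struct point_v2 { int id; int x; int y; };      /* layout the application now uses   */

/* "Library" code: compiled believing offset 0 holds x. */
int lib_get_x(void *p) { return ((struct point_v1 *)p)->x; }

int main(void) {
    struct point_v2 p = { 7, 42, 99 };          /* id = 7, x = 42, y = 99 */
    /* The application passes a point_v2; the "library" reads p.id as x. */
    printf("expected x = 42, library saw x = %d\n", lib_get_x(&p));
    return 0;
}

This compiles and links without a single complaint about the mismatch; the check simply doesn’t exist anywhere in the toolchain.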

What if it’s your library, or just an object file in your project?  These tend to have a fair amount of turnover.  Most moderately-sized projects use separate compilation to isolate code changes and avoid recompiling code that hasn’t changed.  But when tying these object files together, there are no checks to ensure that data structures exchanged between object files are consistent; the C compilation model assumes that your data structure definitions are stable, or that you recompile from scratch every time.  It also makes the reasonable assumption that the same compiler is used for every object file.  On the off chance you violate that expectation (perhaps with a compiler update), memory layouts of the same structure definition may differ between object files.

It’s possible to work around this problem with a build system, if you track every header file dependency explicitly.  For large projects, this can be difficult.  Especially on fast-moving projects, it’s easy to add an #include to a .c file without remembering to add the dependency to the build system configuration.  Once this missing dependency goes unnoticed for some time, it becomes considerably more difficult to track down, and developers end up either spending their time debugging the build system or abandoning the broken incremental build in favor of rebuilding from scratch every time.

Another permutation of the same problem is that of unrelated structures with the same name.  It’s easy to imagine a large system with two subsystems each defining a structure named CALLBACK_ARGS.  What happens when one section of code needs to interact with both subsystems?  If all the appropriate headers are included, the name collision will be detected.  If only one of the conflicting headers is included, then depending on how the headers are organized it becomes trivially easy to pass the wrong structure to a function.  Especially when working in a new system, it usually seems reasonable to assume that structures with the same name are the same semantic (and in-memory) structure.

Namespaces can help alleviate the same-name problem: including only one structure’s header and trying to pass that structure to the other subsystem’s function will result in an error complaining about passing an argument of type Subsystem1::CALLBACK_ARGS* to a function expecting a Subsystem2::CALLBACK_ARGS*.  This doesn’t actually prevent you from declaring two structures of the same name in the same namespace in separate header files, but if namespaces are used judiciously to separate subsystems, the likelihood of doing so accidentally is greatly reduced.
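Here’s a minimal C++ sketch (subsystem and field names are hypothetical) of the mix-up being caught at compile time:

#include <cstdio>

namespace Subsystem1 {
    struct CALLBACK_ARGS { int socket_fd; };
    void dispatch(const CALLBACK_ARGS *args) { std::printf("fd %d\n", args->socket_fd); }
}

namespace Subsystem2 {
    struct CALLBACK_ARGS { int widget_id; };
    void dispatch(const CALLBACK_ARGS *args) { std::printf("widget %d\n", args->widget_id); }
}

int main() {
    Subsystem1::CALLBACK_ARGS net_args = { 3 };
    Subsystem2::CALLBACK_ARGS ui_args = { 7 };
    Subsystem1::dispatch(&net_args);
    Subsystem2::dispatch(&ui_args);
    /* Subsystem2::dispatch(&net_args); fails to compile: cannot convert
       Subsystem1::CALLBACK_ARGS* to Subsystem2::CALLBACK_ARGS*          */
    return 0;
}

In plain C, where both structures would be bare CALLBACK_ARGS, the erroneous call would compile fine as long as only one of the headers was in scope.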

The versioning problem is a direct result of how #include works in C.  Rather than being a direct part of the language, #include is a preprocessor directive equivalent to “take the text of the specified file and pretend I typed it in place right here, then pass that result to the actual compiler.”  At its core most C compilers only handle single files at a time, so they don’t actually know anything about other object files (or at least, they don’t directly use information about other object files).  That’s the linker’s job, and the linker knows nothing about structures per se – only matching symbolic references.

One solution is to store all structure layout information in object files, and generate code for accessing those structures once at link time.  This slows the linking process, but prevents the mismatched definition problem; all code for accessing the structure is generated at the same time from the same definition.  This blurs the distinction between compiler and linker, but adds great value.

Doing this at compile time for static linking is relatively cheap and straightforward.  Doing it at load-link time is a bit trickier.  While compilers and static linkers can play any tricks they want for code which only interacts directly with itself, dynamically linked executable formats must be defined in standard ways, limiting what can be done.  I don’t know of any major executable format which supports this (most were designed in the heyday of C and C++, when they were still the best languages around), but that is a matter of format standards rather than a technical limitation.  This would be more expensive than current dynamic linking, but doable.  A compiler could choose to use a richer format for its own object files and then fall back to standard formats when asked to generate a standard library or executable.  OCaml does this; here Test.cmx and Mod.cmx were compiled against differing interface files for the Test module’s data structure:

Yggdrasil:caml colin$ ocamlopt Test.cmx Mod.cmx
Files Mod.cmx and Test.cmx make inconsistent assumptions over interface Test
Yggdrasil:caml colin$ 

Unfortunately, C and C++ have a compilation and linking model which is now so well established that I suspect any proposal to fix this in those languages’ standards would meet with significant resistance.  At the same time, though, I can’t think of any desired C/C++ semantics this would break, so maybe it could happen.





WordPress for iPhone Bug

23 07 2008

The few of you who subscribe to this blog’s RSS feed may have noticed a post titled “!$title$!” briefly preceding this one.  Apparently this is a known bug in the recently released WordPress application for the iPhone: for many users, as soon as you add your account, or if you open a new post and then back out of it, the app creates a test entry and leaves it.  This bug is so easy to trigger that I’m surprised it made it out the door (or that a stray ‘test post’ function made it into the final product at all).





Done with Undergrad, Procedural Generation, Model Checking

17 05 2008

I’m finally done with undergrad.  It feels like 4 years just blew by in the blink of an eye.  And come fall, I’ll be working, but also applying to grad schools.  What a ride.

Incidentally, I did finish my thesis.  It’s here: Type-Safe Stack Inspection for Garbage Collector Implementation.  Every time I write that title now I wish I’d cut it off as “Type-Safe Stack Inspection.”  Oh well, too late now.  I meant to post about it right after I finished, but the final edits ran into finals period, which consumed the past two weeks of my life.

On to other projects.

Currently, Brendan, Rob and I are working on building a fairly ambitious mapping tool.  We decided to use procedural modeling to implement a multiple-level-of-detail, real-time procedural generator for, well, planets.  The idea is that eventually we’ll start by rendering a planet with continents and oceans on screen, and the user can zoom in – to nearly any level of detail.  Moving closer to the planet gradually adds greater and greater levels of detail: laying out rivers, mountains and valleys, then cities and roads, then city streets and buildings, and possibly even building interiors.  We think we can do it quickly and without using unreasonable amounts of memory by taking advantage of the fact that procedural generation is deterministic if you reuse the same random number generator seed.  So we can cache seeds for areas, along with constraints on the edges of some areas, and use a negligible amount of memory to store enough information that you can back out or cross the planet, have the old region freed from memory, then return and see the same things.  We’re aiming pretty high, so we’ll see what happens.
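As a minimal sketch of the seed idea (the mixing function and names here are just illustrative, not our actual design): each region’s seed is derived deterministically from its parent’s seed and its index, so a region can be freed and later regenerated identically.

#include <cstdint>
#include <cstdio>
#include <random>

/* Mix a parent seed and a child index into a new 64-bit seed
   (a splitmix64-style finalizer). */
uint64_t child_seed(uint64_t parent, uint64_t index) {
    uint64_t z = parent ^ (index + 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Generate some terrain parameter for a region purely from its seed. */
double region_height(uint64_t seed) {
    std::mt19937_64 rng(seed);
    return std::uniform_real_distribution<double>(0.0, 1.0)(rng);
}

int main() {
    const uint64_t planet_seed = 42;
    double h1 = region_height(child_seed(planet_seed, 17));  /* zoom in on region 17 */
    /* ...region 17 is freed as the viewer moves away, then revisited... */
    double h2 = region_height(child_seed(planet_seed, 17));
    std::printf("%f == %f\n", h1, h2);                       /* identical on every visit */
    return 0;
}

Only the seeds (and any constraints imposed on region edges) need to stay resident; everything else can be regenerated on demand.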

I also picked up a copy of a book my advisor told me about, Principles of Model Checking.  It was just released earlier this month, and it’s great so far.  I read a couple of the original LTL and CTL model checking papers (LTL: Model Checking Using Automata Theory, CTL: Automatic Verification of Finite-State Concurrent Systems Using Temporal Logic Specifications) a couple of years ago in a seminar, but I didn’t have enough CS background back then for all the details to really click (those papers were actually my first encounter with automata in CS, before I took a theory of computation class).  This book is shaping up to be a good refresher for me (also filling in gaps from what I missed the first time I looked at this material), and it seems like it would be a good introduction for those new to model checking as well.  Sadly, there is not yet a way to look inside it online.





Mercurial + MLKit + OS X = Failure, plus workaround

21 03 2008

My thesis, as I’ve mentioned several times before, has me hacking up the source to the ML Kit compiler. Because the current build arrangement I’ve been using requires a great deal of RAM, I’ve been using my desktop machine and CS department machines exclusively for building. However, I’m going home for part of spring break this coming week, there’s no computer with appropriate specs at home, and editing code over SSH is rather painful because of the high latency involved. The obvious solution, since I’m using Mercurial to manage my thesis, was to make a checkout on my laptop (a G4 PowerBook lacking enough RAM to build, and lacking the correct processor architecture to test), edit locally, and just push intermediate versions back to the CS systems to build there. This would result in some useless crap in the commit logs, but it beats the alternatives. I had an existing checkout from back in January, when I was still using OCaml, so I went to update and saw the following:

Yggdrasil:thesis colin$ hg up
0 files updated, 0 files merged, 0 files removed, 0 files unresolved
Yggdrasil:thesis colin$ hg pull
remote: ####################
pulling from ssh://cs/thesis/
searching for changes
adding changesets
adding manifests
adding file changes
added 44 changesets with 18762 changes to 17825 files (run 'hg update' to get a working copy)
remote:
remote: Forwarding you to cslab0a
remote:
remote: ####################
remote:
Yggdrasil:thesis colin$ hg up
abort: case-folding collision between mlkit/kit/basis/MLB/RI_PROF/Splaymap.sml.d and mlkit/kit/basis/MLB/RI_PROF/SPLAYMAP.sml.d
Yggdrasil:thesis colin$

I saw this, noticed these were temp files (the ML Kit makes an MLB directory for temp files), hg rm’d all such files in the CS copy, committed, and tried again. Same result. I poked around online and, after a bit, found a message from someone with a similar problem – a similar case difference between two distinct files – and a response pointing out the culprit: Mac OS X’s case-insensitive filesystem.

When I installed the latest OS X, I intentionally opted for the case-insensitive filesystem because many Mac programs have a habit of breaking on case-sensitive ones. And now I can’t have a copy of the ML Kit source on my machine. It’s common practice in SML programs to use all-caps names for files containing signatures (akin to interfaces, for the non-ML folk) and “upper” camel case for implementations. So I can’t just rename a couple of files; it seems you simply can’t have the ML Kit source on most OS X installations’ local disks.

So was it time to give up on this? No, of course not – I need to work on my damn thesis while I’m home. Screw Apple and its crappy filesystem design choice.

Fortunately, Apple did make one excellent, excellent design choice with OS X: making disk images a standard, common object. So I used Disk Utility to make a disk image containing a case-sensitive filesystem, mounted it, and did a checkout inside that. It’s pretty stupid that I needed to do this, but at least I can work on my thesis over break without running Vim over SSH.
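(For anyone hitting the same wall: I believe the equivalent can be done from the command line with hdiutil; the size and names here are arbitrary:

hdiutil create -size 2g -fs "Case-sensitive Journaled HFS+" -volname Thesis thesis.dmg
hdiutil attach thesis.dmg

and the image mounts under /Volumes/Thesis.)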

Now if you’ll excuse me, my friend Owen is heading off to Apple’s filesystem team next year, and I need to go give him grief for his group’s terrible design choice.





Amazon’s Book Recommendation System is Quirky

15 03 2008

Like many others, I have an Amazon.com wishlist, allow Amazon to track my purchases, and occasionally tell it that I own a particular book when one is recommended to me.  Amazon uses this, along with the purchase sets of other customers, to generate “better” recommendations for books I might find interesting.  One thing its recommendation algorithm doesn’t seem to take into account is general subject knowledge – it knows about subcategories, and about purchases of multiple items by the same people, but it has no concept of actual relationships between books in a category.  This is actually a pretty reasonable thing to omit, since there’s a great deal of such knowledge which could be taken into account, and the work/reward ratio for maintaining it is probably not great.  But as a result, I’ve had some rather interesting recommendations.

Also, I’m linking to books for clarity, but there’s no Amazon referral crap going on.

Mathematics texts.  Books on my Amazon wishlist: Foundations of Mathematical Logic, The Calculi of Lambda Conversion, Mathematical Foundations of Information Theory, Categories for the Working Mathematician, Basic Category Theory for Computer Scientists, among others.  Some of the more amusing mathematics recommendations from Amazon: Pre-Algebra, Algebra 2.

Literature.  Books on my Amazon Wishlist: Going After Cacciato, Letters to a Young Poet, The Norton Anthology of Poetry, ‘Tis: A Memoir, The Dharma Bums, among others.  Amazon recommends: Vocabulary Workshop: Level C.

These are just the few I dug up in a couple of minutes searching my emailed recommendations; I’ve seen plenty more out-of-place recommendations upon logging into Amazon.com.  While it’s certainly true that some people buy books at both extremes of a field’s complexity (perhaps something advanced for themselves or a friend, and something basic for a child), the disparity is still an amusing quirk because of the presentation: Amazon presents recommendations as a set of items you should be interested in according to its “understanding” of your personal interests, when it’s really just correlating your purchases and explicit interests with the purchases of other users.





Giving up on Ruby on Rails for Python and Django

24 02 2008

Well, a while back I wrote about starting to use Ruby on Rails to build myself a web application to track technical papers I had read or planned to read. I gave up on RoR after a (very short) while. Now, despite preferring to code in Ruby over Python, I’ve switched to Django. Why?

  • While I am capable of writing SQL myself, I really, really loathe it.
  • The Django documentation is easier to find and deal with.
  • Django seems to do more busywork for you than Rails does.
  • It’s easy to poke around in the intermediate stages of development.

One of the pain points of RoR is that you still have to deal directly with your database engine (likely, as in my case, MySQL). I’ve learned to deal with SQL after working on Mocha for a while. That doesn’t mean I like it; in fact, it’s still a pain. And the worst part of dealing with SQL is setting up tables – which happens to be exactly the piece of database work Rails still requires you to do. It then does some magic to connect your Ruby code to the schema you’ve specified. Django instead lets you define the data model directly in Python. From there, it automatically creates the SQL tables, including the join tables for many-to-many relationships, which is where I hit the main roadblock with Rails: I knew how to define the tables for a many-to-many relationship, but for whatever reason I couldn’t easily find information on making that relationship explicit in the Ruby model. The only SQL I had to write with Django was enough to create a new database, and ‘CREATE DATABASE whatever’ isn’t much to do, even if you’ve never used SQL directly before.

The many-to-many relationship documentation for Django is still a bit confusing, but it’s clear enough that you can play with things, and unlike RoR’s, I could find it. The RoR documentation seems to consist primarily of class API docs (only useful once you already mostly know what you’re doing) and somewhat mediocre third-party tutorials. The Django documentation is actually organized as a manual. Overall it’s clear, and the initial tutorial on the Django site is incredibly useful – I’ve built everything thus far by simply following along with the tutorial and tweaking things as needed, following links to more detailed documentation on what data types are available, using many-to-many instead of many-to-one, etc. Django also creates far fewer files, and almost all of them have an immediately obvious purpose. There’s less magic in the system, without making you do more work. It also makes setting up an administrative interface much easier, and the default install includes a module for easy authentication setup for your application.

This is not to say that picking up Django has been nothing but butterflies and ponies; I’ve run into a few rough edges.

One design decision in Django that I’m not sure I’m happy with (though I’m not unhappy with it either) is the decision to orient it towards a “newsroom model,” as the tutorial describes it. This means the architecture was designed with the intention that there are two distinct sets of users: publishers and readers. Django includes a big module for managing users and creating an administrative interface (‘django.contrib.admin’, the one just mentioned above), and the documentation suggests its use. While you’re certainly free not to use it, its presence encourages its use, and thereby encourages the same model the developers had in mind. This isn’t necessarily a bad thing, and I’m not sure it makes a difference to me, though it definitely encourages a split interface between the two sides of the application, which might not be right for all cases.

The real rough edges I’ve been rubbing up against mostly concern the many-to-many relationship. While I wasn’t expecting everything to be perfect, the disparity in flexibility between normal object fields and many-to-many fields is surprising. For example, the administrative interface Django suggests supports a lot of easy UI additions, such as searching on fields. This works great for standard fields, and not at all for many-to-many fields. In my case I have a many-to-many mapping relating papers and authors. For other fields in the Paper or Author models, I can add a search_fields = ['title'] and make it easy to search the titles in the UI. Doing the same for the author list (a many-to-many field) does nothing. The semantics aren’t clear, and there doesn’t seem to be an easy way to add this functionality. The other issue is the manager class which is actually returned for member accesses to that field of a Paper object: it masquerades as a list, but isn’t one. It has an iterator, and can be used normally in that way, but most of the standard list operations don’t apply. In my case, I wanted to use string.join() to concatenate the author names in a listing. I can accumulate a string with a for loop, but that’s not the most natural or idiomatic way to express that action. Little things like that.

Please, if it seems like I’m doing something boneheaded, or missing some part of the documentation on Django’s many-to-many relations, tell me. As a final note, it’s probably worth mentioning that the version of Django in Debian and Ubuntu’s repositories is slightly older than the one the main tutorial covers; I followed a link from that tutorial to the tutorial for the previous release.







