Problems with C and C++ Separate Compilation

11 11 2008

After graduation, a couple months of watching television, driving cross-country (if you get the chance you should drive across northern Wyoming), settling in at Microsoft and living in Seattle, I’m back.  And I’m annoyed at C.

C is a fantastic language in many ways.  It is essentially an abstract assembly language.  Almost any general-purpose operation which can be done in assembly can be done in C, and it makes building large, relatively portable systems much easier.  The only things which can’t be done directly in C are operations on specific registers (and it’s easy enough to link in short assembly routines when that’s necessary).

Most of my early interest in programming languages, and most of my problems when I first started doing systems work, were related to basic typing issues: the ugliness of casting things to void pointers and back, the conversions between various integer types, and other relatively mundane C errors which are easy to make and hard to debug.  I came to believe that the features other languages offer beyond type-system and memory-safety improvements, while extremely useful, were mostly great conveniences rather than fundamental improvements.
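
As a minimal sketch of the kind of bug I mean (hypothetical code, but representative): a value gets smuggled through a void pointer and read back as the wrong type, and the compiler has no grounds to object:

#include <stdio.h>

/* A generic callback interface: the argument's real type is erased. */
void print_arg(void *arg)
{
    /* Bug: the caller actually passed a long*, not an int*.  On a
       64-bit big-endian machine this prints 0; on little-endian it
       happens to print 42, so the bug can hide for years. */
    printf("%d\n", *(int *)arg);
}

int main(void)
{
    long value = 42;
    print_arg(&value);   /* compiles cleanly -- void* erases the mismatch */
    return 0;
}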

But the past several months have changed my mind.  While the ease of turning a pointer to one type of object into a pointer to another type is certainly a bane as often as it is a boon, a reasonably experienced C programmer learns to recognize the common symptoms of such problems.  A more serious, though less frequently encountered, problem has to do with type identity and versioning.

Consider the case where you write an application (in C) to use an external library.  Your application interfaces with this library through two means: #include-ing its public header, and being linked against the library’s object file.  Initially these two interfaces will probably be consistent (if, for example, you just installed the library).  Now move forward a couple of months and update your library.  Did your update include both the object file and the header file?  If not, then any size or layout changes to the library’s data types may cause non-obvious errors: your application will happily compile and link, but the results you get back from the library may not be what you expect.
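
A minimal sketch of the failure mode (a hypothetical widget library; all names are illustrative).  Suppose the update inserted a field into a structure, but only the object file was refreshed:

/* widget.h, version 1 -- what the application was compiled against: */
struct widget { int id; int value; };

/* widget.h, version 2 -- what the updated library was compiled against: */
struct widget { int id; int flags; int value; };

/* Application code, still sitting in app.o, reads value at offset 4.
   The updated library now stores it at offset 8.  Everything compiles
   and links cleanly; the application simply reads the wrong word. */
int read_value(struct widget *w) { return w->value; }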

What if it’s your library, or just an object file in your project?  These tend to have a fair amount of turnover.  Most moderately-sized projects use separate compilation to isolate code changes and avoid recompiling code that hasn’t changed.  But when tying these object files together, there are no checks to ensure that data structures exchanged between object files are consistent; the C compilation model assumes either that your data structure definitions are stable, or that you recompile from scratch every time.  It also makes the reasonable assumption that the same compiler is used for every object file.  On the off chance you violate that expectation (perhaps with a compiler update), the memory layout of the same structure definition may differ between object files.

It’s possible to work around this problem with a build system, if you track every header file dependency explicitly.  For large projects this can be difficult.  Especially on fast-moving projects, it’s easy to add an #include to a .c file without remembering to add the corresponding dependency to the build system configuration.  Once this missing dependency goes unnoticed for some time it becomes considerably harder to track down, and developers end up either spending their time debugging the build system or abandoning the broken incremental build and rebuilding from scratch every time.

Another permutation of the same problem is that of unrelated structures with the same name.  It’s easy to imagine a large system with two subsystems defining structures named CALLBACK_ARGS.  What happens when one section of code needs to interact with both of these systems?  If all appropriate headers are included, then the name collision will be detected.  If only one of the conflicting headers is included, then depending on how the headers are organized it becomes trivially easy to pass the wrong structure to a function.  Especially when working on a new system, it usually seems reasonable to assume that structures of the same name are the same semantic (and in-memory) structure.
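
A sketch of how this plays out (hypothetical subsystems and headers):

/* subsystem2/callback.h -- the only definition this file ever sees: */
struct CALLBACK_ARGS { int widget_id; int event_code; };

/* subsystem1/net.h -- declares the function but not the structure: */
void net_dispatch(struct CALLBACK_ARGS *args);

/* With only those two headers included, this compiles without complaint: */
void on_click(struct CALLBACK_ARGS *ui_args)
{
    net_dispatch(ui_args);   /* subsystem1 expected its own CALLBACK_ARGS;
                                it gets subsystem2's layout instead */
}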

Namespaces can help alleviate the same-name problem: including only one structure’s header and trying to pass that to another function will result in an error complaining about passing an argument of type Subsystem1::CALLBACK_ARGS* to a function expecting a Subsystem2::CALLBACK_ARGS*.  This doesn’t actually prevent you from declaring two structures of the same name in the same namespace in separate header files, but if namespaces are used judiciously to separate subsystems then the likelihood of doing so accidentally is greatly reduced.
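
In C++, a sketch of the same situation with namespaces (hypothetical names again):

namespace Subsystem1 { struct CALLBACK_ARGS { int socket_fd; void *buffer; }; }
namespace Subsystem2 { struct CALLBACK_ARGS { int widget_id; int event_code; }; }

void net_dispatch(Subsystem1::CALLBACK_ARGS *args);

void on_click(Subsystem2::CALLBACK_ARGS *ui_args)
{
    net_dispatch(ui_args);   // error: cannot convert Subsystem2::CALLBACK_ARGS*
                             //        to Subsystem1::CALLBACK_ARGS*
}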

The versioning problem is a direct result of how #include works in C.  Rather than being part of the language proper, #include is a preprocessor directive meaning “take the text of the specified file and pretend it was typed in place right here, then pass the result to the actual compiler.”  At its core, a C compiler handles one translation unit at a time, so it knows nothing about other object files (or at least, it doesn’t directly use information about them).  That’s the linker’s job, and the linker knows nothing about structures per se – only about matching symbolic references.
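
The textual nature of #include is easy to demonstrate.  A contrived sketch: suppose a hypothetical answer.h contains only the characters “42”; then this compiles, because the preprocessor splices that text into the middle of the declaration before the compiler proper ever runs:

/* main.c -- the compiler never sees answer.h as anything but pasted text: */
int the_answer =
#include "answer.h"
;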

One solution is to store all structure layout information in object files, and generate code for accessing those structures once at link time.  This slows the linking process, but prevents the mismatched definition problem; all code for accessing the structure is generated at the same time from the same definition.  This blurs the distinction between compiler and linker, but adds great value.
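
Short of changing the toolchain, a crude approximation of the detection half is possible today by encoding a hand-maintained layout version into a symbol name, so a mismatch breaks the link instead of the program.  This is only a hypothetical sketch (the Linux kernel’s CONFIG_MODVERSIONS does something similar, attaching CRCs of type signatures to exported symbols):

/* widget.h -- bump this whenever struct widget's layout changes: */
#define WIDGET_LAYOUT_VERSION 2
struct widget { int id; int flags; int value; };

#define GLUE2(a, b) a##b
#define GLUE(a, b)  GLUE2(a, b)

/* Every file including this header emits a reference to a symbol whose
   name encodes the layout version it was compiled against (the variable
   is unused; it exists only to force the reference): */
extern char GLUE(widget_layout_v, WIDGET_LAYOUT_VERSION);
static char *widget_layout_check = &GLUE(widget_layout_v, WIDGET_LAYOUT_VERSION);

/* widget.c (the library) is the only place the symbol is defined: */
char GLUE(widget_layout_v, WIDGET_LAYOUT_VERSION);

/* An app.o still built against version 1 references widget_layout_v1,
   which no longer exists, so the link fails loudly rather than silently
   reading the wrong offsets. */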

Doing this at compile time for static linking is relatively cheap and straightforward.  Doing it at load-link time is a bit trickier.  While compilers and static linkers can play any tricks they want with code that only interacts with itself, dynamically linked executable formats must be defined in standard ways, limiting what can be done.  I don’t know of any major executable format which supports this (most were designed in the heyday of C and C++, when they were still the best languages around), but that is a matter of format standards rather than a technical limitation.  This would be more expensive than current dynamic linking, but doable.  A compiler could choose to use a richer format for its own object files and then fall back to standard formats when asked to generate a standard library or executable.  OCaml does this; here Test.cmx and Mod.cmx were compiled to objects against differing interface files for the Test module’s data structure:

Yggdrasil:caml colin$ ocamlopt Test.cmx Mod.cmx
Files Mod.cmx and Test.cmx make inconsistent assumptions over interface Test
Yggdrasil:caml colin$ 

Unfortunately, the C and C++ compilation and linking model is now so well-established that I suspect any proposal to fix this in the language standards would meet with significant resistance.  At the same time, I can’t think of any desired C/C++ semantics this would break, so maybe it could happen.





Exciting Upcoming OS Improvements

19 10 2007

I must apologize for the lack of posts – the first few weeks of the semester, plus the job hunt, are extremely time-consuming. I have half-finished drafts of 3 articles, but no time to do the solid revising and research they need. I try to make my technical entries very precise, accurate, and backed by links to reputable sources.

But this is just a brief entry, because I’m excited for upcoming updates to three wonderful OSes:

Mac OS X Leopard
This has been anticipated for some time. It ships on the 26th of October, if you weren’t already aware, and has some wonderful features. Time Machine initially excited me the most. I heard about it while watching a periodically updated blog post of WWDC 2006 coverage with coworkers at NetApp, and to us it immediately suggested that Apple had finally wised up a bit and implemented snapshots in one of their filesystems. The fact that Time Machine requires an external hard drive makes it clear that this isn’t quite the case, which is a bit surprising given that Apple has acknowledged that Leopard has at least some support for ZFS. Supposedly Leopard will only support reading from ZFS – alas, my dreams of a dual-boot Solaris/OS X Macbook with a shared ZFS pool will have to wait for another day.

More exciting to me is the addition of DTrace to Mac OS X in the form of Instruments, a snazzy GUI on top of DTrace. This is going to be a killer developer application. DTrace is very powerful and fairly flexible, but has a bit of a learning curve for more advanced uses. I’m very optimistic about how discoverable an Apple GUI can make it.

And of course, after many years of using multiple desktops on Linux, they’re finally in OS X. For those who can’t wait, or don’t want to upgrade just for multiple desktops, Desktop Manager and Virtue Desktops work reasonably well, though for obvious reasons they’re no longer under development.

That said, my Powerbook is finally giving out, so I’m probably just going to buy a new Macbook the next time the hardware is updated, and get Leopard that way instead of shelling out money for an upgrade. If it weren’t time for a hardware upgrade, I think I’d still do it just to have DTrace on my Mac.

OpenSolaris Project Indiana
What is Project Indiana? Many things, but primarily two: an effort to create an all-open-source version of OpenSolaris (which currently includes some binary blobs in order to run well), and a place to prototype things like the new installer, stable ZFS root and boot, and the new package system. It was uncertain when a prototype of all these things would arrive, but it seems a developer release will be available in the next couple of weeks. This is enough to make me hold off on finishing the customization of the new workstation I just got; I’m going to wait and install this development version from scratch. I’m sure I’ll run into plenty of bugs, but that’s fine – it’s exciting! An additional benefit of doing a reinstall is that I can make an extra slice for doing live upgrades of my system, which the preinstalled configuration doesn’t support. I can’t wait.

[Update: Found a very thorough description of Project Indiana.]

KGDB in Linux
Despite being postponed, it looks like a proper kernel debugger is headed into the mainline Linux kernel. Linux has actually had kernel debuggers for some time, but they were external patches. Being in the mainline kernel will mean better stability and will likely increase use among kernel developers.

This doesn’t directly impact end users, because most users don’t debug kernels. It does affect them indirectly, though, because it will help kernel developers find (and fix) bugs faster. For kernel developers, a proper kernel debugger is a blessing. I used kmdb extensively this summer working in the Solaris Kernel Group, and I can’t imagine how frustrated I would have been without it. Being able to step through kernel code makes it almost as easy to debug as userland code (with some exceptions, obviously). Mac OS X has had a well-integrated kernel debugger (two, in fact) for some time as well.

I’m really glad Linux is finally going to integrate this – the Apple documentation on kernel debuggers is spartan, Solaris is still (unfortunately) not as easy to get up and running as Linux, and anyone who wants to hack on a kernel benefits greatly from a solid kernel debugger. Hopefully being in mainline will make kernel work more approachable and encourage more people to jump the gap.







