Kenton's Weekend Projects: August 2010

Sunday, August 29, 2010

No coding this weekend

I spent this weekend on things far more important than neat projects. I expect to get back to work next weekend.

Last Monday, though, I did manage to write up some proof-of-concept code to intercept open() system calls via LD_PRELOAD, as described in last week's post. It works pretty well. I'm excited to actually apply this technique next weekend.

Monday, August 23, 2010

Ekam: No longer restricted to building self, and can be taught new rules

Two big updates to Ekam this weekend! As always, the source code is at:

http://code.google.com/p/ekam/

Updates

Arbitrary Code

Ekam is no longer restricted to only building code in the "ekam" namespace. It will now build pretty much any C++ code you throw at it, so long as the code depends only on the standard C/C++ library. In fact, when I pointed Ekam at the Protocol Buffers source code, it successfully churned out protoc! (It didn't do as well compiling the tests, though.)

I ended up accomplishing this not by indexing libraries, as I had planned, but instead by changing the behavior when a dependency is not found. Now, if the link action cannot find any object files defining a particular symbol, it just goes ahead and tries to link anyway to see what happens. Only after failing does it decide to wait for the symbols to become available -- and it retries again every time a symbol for which it was waiting shows up. So, once all non-libc symbols are available, the link succeeds.

Now, this has the down side that Ekam will go through a lot of failed linker invocations. But, I look at this as an optimization problem, not a correctness problem. We can use heuristics to make Ekam avoid too many redundant link attempts. For example, we can remember what happened last time. Or, an easier approach would be to just make sure failed operations are not tried again until everything else is done.

Related to this change, Ekam will now re-run even a successful operation if dependencies that were not originally available become available. I talked about why you'd want to do this last week.

Defining new rules

You can now implement new build rules by simply writing a shell script and putting it in the source tree, as I proposed last week. You can now see what the actual code for compiling C++ looks like (linking is not yet implemented as a script). Of course, you don't have to write these as shell scripts. Any executable whose name ends in .ekam-rule will be picked up by Ekam -- even binaries that were themselves compiled by Ekam.

Refactoring

I seem to spend a lot of time refactoring. Working on C++ outside of Google is almost like learning a whole new language. I get to use whatever style I want, including whatever C++ features I want. I've been refining a very specific style, and I keep realizing ways to improve it that require going back and rewriting a bunch of stuff. Some key aspects of the style I'm using:

Exceptions are allowed, but expected never to occur in normal usage.
Ownership of pointers, and passing of that ownership, is explicit. I have a template class called OwnedPtr for this. In fact, I never use new -- instead, I call OwnedPtr::allocate() (passing it the desired constructor parameters). It's not possible to release ownership of the pointer, except by moving it to another OwnedPtr. Thus, the only way for memory to become unreachable without being deleted is by creating an ownership cycle. Yet, I don't use reference counting, since it is notoriously slow in the presence of multiple cores.
I'm using single-threaded event-driven I/O, since Ekam itself is just a dispatcher and does not need to utilize multiple cores. Events need to be cancelable. Originally, I had every asynchronous method optionally return a canceler object which could be called to cancel the operation. This got surprisingly hairy to implement, since the event effectively had to cancel the canceler when complete. Also, events owned their callbacks, which made a lot of things awkward and tended to lead to cyclic ownership (doh). What turned out to be much simpler was to simply have every asynchronous method return an object representing the ongoing operation. To cancel the operation, delete the object. With this approach, deleting a high-level object naturally causes everything it is doing to be canceled via cascading destructors, with no need to really keep track of cancellation.

Ideas

Currently, the compile action does not yet resolve header dependencies -- if you need to use special include directories you must specify CXXFLAGS manually. This is, of course, bad. But how can Ekam detect header dependencies?

One approach would be to attempt to compile and, if not successful, try to parse the error messages to determine missing includes. This is, of course, going to be pretty brittle -- it would need to understand every compiler's output, possibly for every locale.

Another approach would be to write some sort of an "include scanner" which looks for #include directives. It would not necessarily have to evaluate branches (#ifdefs and such), since it could just conservatively look for all the headers that are mentioned and take whatever it can find. However, this would still be pretty complicated to write and slow to run, and it wouldn't be able to handle things like macro-derived header names (yes, you can do that).

So here's my idea: Run the compiler, but use LD_PRELOAD to inject into it a custom implementation of open(). This implementation will basically send an RPC to Ekam (using the already-implement plugin interface) asking where to find the file, then replace the path with what Ekam sends back. The injected open() could pay attention only to paths in a certain directory which would be added to the compiler's include path. Problem solved! And better yet, this solution naturally extends to other kinds of actions and programming languages.

Sunday, August 15, 2010

Ekam: Works on Linux and OSX, and looks pretty

Above is what Ekam looks like when it is compiling some code. The fuchsia lines are currently running, while blue lines are completed. Errors, if there were any, would be red, and passing tests would be green (although Ekam does not yet run tests).

This weekend I:

Implemented the fancy output above.
Got Ekam working on Linux and OSX.
Did a ton of refactoring to make components more reusable. This means that visible progress ought to come much quicker when I next work on it.

Some critical features are still missing:

Ekam starts fresh every time it runs; it does not detect what tasks can be skipped.
Ekam can only really be used to build itself, as it assumes that any symbol that doesn't contain the word "ekam" (i.e. is not in the "ekam" namespace) is satisfied by libc. To fix this I will need to make Ekam actually index libc to determine what symbols can be ignored.
The only way to add new rule types (e.g. to support a new language) is to edit the Ekam source code.
Ekam does not run tests.

As always, source code is at: http://ekam.googlecode.com/

Defining new rules

I've been thinking about this, and I think the interface Ekam will provide for defining new rules will be at the process level. So, you must write a program that implements your rule. You can write your rule implementation in any language, but the easiest approach will probably be to do it as a shell script.

For example, a simplistic version of the rule to compile a C++ source file might look like:

#! /bin/sh

set -e

if $# == 0; then
  # Ekam is querying the script.
  # Tell it that we like .cpp files.
  echo triggerOnFilePattern '*.cpp'
  exit 0
fi

INPUT=$1

BASENAME=`basename $INPUT .cpp`

# Ask Ekam where to put the output file.
echo newOutput ${BASENAME}.o
read OUTPUT

# Compile, making sure all logs go to stderr.
c++ $CXXFLAGS -c $INPUT -o $OUTPUT >&2

# Tell Ekam about the symbols provided by the output file.
nm $OUTPUT | grep '[^ ]*  *[ABCDGRSTV] ' |
  sed -e 's/^[^ ]*  *. \(.*\)$/provide c++symbol:\1/g'

# Tell Ekam we succeeded.
echo success

You would then give this script a name like c++compile.ekamrule. When Ekam sees .ekamrule files during its normal exploration of the source tree, it would automatically query them and then start using them.

Of course, shell scripts are not the most efficient beasts. You might later decide that this script really needs to be replaced by something written in a lower-level language. Of course, Ekam doesn't care what language the program is written in -- you could easily use Python instead, or even C++. (Of course, to write the rule for compiling C++ in C++ would necessitate some bootstrapping.)

Resolving conflicts

Several people have asked what happens in cases where, say, two different libraries define the same symbol (e.g. various malloc() implementations). I've been thinking about this, and I think I'm settling on a solution involving preferences. Essentially, somewhere you would declare "This package prefers tcmalloc over libc.". The declaration would probably go in a file called ekamhints and would apply to the directory in which in resides and all its subdirectories. With this hint in place, if tcmalloc is available (either installed, or encountered in the source tree), Ekam will notice that both it and libc provide the symbol "malloc" which, of course, lots of other things depend on. It will then consult the preference declaration, see that tcmalloc is preferred, and use that.

When no preferences are declared, Ekam could infer preferences based on the distance between the dependency and the dependent. For example, a symbol declared in the same package is likely to be preferred over one declared elsewhere, and Ekam can use that preference automatically. Also, symbols defined in the source code tree are likely preferred over those defined in installed libraries. Additionally, preferences could be configured by the user, e.g. to force code to use tcmalloc even if the developer provided no hint to do so.

Note that this preferences system allows for an interesting alternative to the feature tests traditionally done by configure scripts: Write multiple implementations of your functionality using different features, placing each in a different file. Declare preferences stating which implementation to prefer. Then, let Ekam try to build them all. The ones that fail to compile will be discarded, and the most-prefered implementation from those remaining will be used. Obviously, this approach won't work everywhere, but it does cover a fairly wide variety of use cases.

Build ordering

There is still a tricky issue here, though: Ekam doesn't actually know what code will provide what symbols until it compiles said code. So, for example, say that you plop tcmalloc into your code tree in the hopes that Ekam will automagically link it in to your code. You run Ekam, and it happens to compile your code first, before ever looking into the tcmalloc directory. When it links your binaries, it doesn't know about tcmalloc's "malloc" implementation, so it just uses libc's. Only later on does it discover tcmalloc.

This is a hard problem. The easy solution would be to provide a way to tell Ekam "Please compile the tcmalloc package first." or "The symbol 'malloc' will be provided by the tcmalloc directory; do not link any binaries depending on it until that directory is ready.". But, these feel too much like writing an old-fashion build file to me. I'd like things to be more automatic.

Another option, then, is to simply let Ekam do the wrong thing initially -- let it link those binaries without tcmalloc -- and then have it re-do those steps later when it realizes it has better options. This may sound like a lot of redundant work, but is it? It's only the link steps that would have to be repeated, and only the ones which happened to occur before tcmalloc became available. But perhaps one of the binaries is a code generator: will we have to regenerate and re-compile all of its output? Regenerate, yes, but not recompile -- assuming the output is identical to before, Ekam can figure out not to repeat any actions that depend on it. Finally, and most importantly, note that Ekam can remember what happened the last time it built the code. Once it has seen tcmalloc once, it can remember that in the future it should avoid linking anything that uses malloc until tcmalloc has been built.

Given these heuristics, I think this approach could work out quite nicely, and requires minimal user intervention.

Tuesday, August 10, 2010

FreeBSD LAN Party, and some Ekam work

My weekend was mostly consumed by a LAN party. And the most interesting part: I used FreeBSD exclusively for the whole party.

Games that ran natively:

Quake (darkplaces port)

Games that completely worked under Wine:

Duke Nukem 3D (xDuke port)
Alien Swarm
StarCraft 2
Unreal Tournament 2004*
Unreal Tournament 3*

Games that ran but had significant problems under Wine:

Left 4 Dead 2: Disconnected after a random time period claiming that the server could not authenticate me with Steam. Bug 23286, still undiagnosed.
Team Fortress 2*: Fonts in configuration windows did not render. However, in-game fonts rendered fine, so the game was playable after copying over my configs from a Windows installation. Bug 19522, also undiagnosed.

Games that did not run at all under Wine:

none

I'm seriously considering running the game stations in my future house on FreeBSD (or Linux) so that I can customize them better (and avoid paying for Windows).

* Games that I installed and tested, but were not played at the party.

Ekam work

I did spend some time working on Ekam, but it was mostly refactoring. Nothing interesting to report. (I should have it working on Linux and OSX soon, though.)

Monday, August 2, 2010

Kake: A build system with no build files

UPDATE: Renamed to "Ekam" because "Kake" apparently looks like a misspelling of a vulger German word. Link below updated.

I finally got some real coding done this weekend.

http://code.google.com/p/ekam/

Kake is a build system which automatically figures out what to build and how to build it purely based on the source code. No separate "makefile" is needed.

Kake works by exploration. For example, when it encounters a file ending in ".cpp", it tries to compile the file. If there are missing includes, Kake continues to explore until it finds headers matching them. When Kake builds an object file and discovers that it contains a "main" symbol, it tries to link it as an executable, searching for other object files to satisfy all symbol references therein.

You might ask, "But wouldn't that be really slow for a big codebase?". Not necessarily. Results of previous exploration can be cached. These caches may be submitted to version control along with the code. Only the parts of the code which you modify must be explored again. Meanwhile, you don't need to waste any time messing around with makefiles. So, overall, time ought to be saved.

Current status

Currently, Kake is barely self-hosting: it knows how to compile C++ files, and it knows how to look for a "main" function and link a binary from it. There is a hack in the code right now to make it ignore any symbols that aren't in the "kake2" namespace since Kake does not yet know anything about libraries (not even the C/C++ runtime libraries).

That said, when Kake first built itself, it noticed that one of the source files was completely unused, and so did not bother to include it. Kake is already smarter than me.

Kake currently requires FreeBSD, because I used kqueue for events and libmd to calculate SHA-256 hashes. I thought libmd was standard but apparently not. That's easy enough to fix when I get a chance, after which Kake should run on OSX (which has kqueue), but will need some work to run on Linux or Windows (no kqueue).

There is no documentation, but you probably don't want to actually try using the existing code anyway. I'll post more when it is more usable.

Future plans

First and foremost, I need to finish the C++ support, including support for libraries. I also need to make Kake not rebuild everything every time it is run -- it should remember what it did last time. Kake should also automatically run any tests that it finds and report the results nicely.

Eventually I'd like Kake to run continuously in the background, watching as you make changes to your code, and automatically rebuilding stuff as needed. When you actually run the "kake" command, it will usually be able to give you an immediate report of all known problems, since it has already done the work. If you just saved a change to a widely-used header, you might have to wait.

Then I'd like to integrate Kake into Eclipse, so that C++ development can feel more like Java (which Eclipse builds continuously).

I'd like to support other languages (especially Java) in addition to C++. I hope to write a plugin system which makes it easy to extend Kake with rules for building other languages.

Kake should eventually support generating makefiles based on its exploration, so that you may ship those makefiles with your release packages for people who don't already have Kake.

Kake will, of course, support code generators, including complex cases where the code generator is itself built from sources in the same tree. Protocol Buffers are an excellent test case.

To scale to large codebases, I'd like to develop a system where many Kake users can share some central database which keeps track of build entities in submitted code, so that Kake need not actually explore the whole code base just to resolve dependencies for the part you are working on.