
Tuesday, February 22, 2011

Converting Ekam to C++0x

I converted Ekam to C++0x. As always, all code is at:

http://code.google.com/p/ekam/

Note that Ekam now requires a C++0x compiler. Namely, it needs GCC 4.6, which is not officially released yet. I didn't have much trouble compiling and using the latest snapshot, but I realize that it is probably more work than most people want to do. Hopefully 4.6 will be officially released soon.

Introduction

When writing Ekam, with no company style guide to stop me, I have found myself developing a very specific and unique style of C++ programming with a heavy reliance on RAII. Some features of this style:

  • I never use operator new directly (much less malloc()), but instead use a wrapper which initializes an OwnedPtr. This class is like scoped_ptr in that it wraps a pointer and automatically deletes it when the OwnedPtr is destroyed. However, unlike scoped_ptr, there is no way to release a pointer from an OwnedPtr except by transferring it to another OwnedPtr. Thus, the only way that an object pointed to by an OwnedPtr could ever be leaked (i.e. become unreachable without being reclaimed) is if you constructed an OwnedPtr cycle. This is actually quite hard to do by accident -- much harder than creating a cycle of regular pointers.
  • Ekam is heavily event-driven. Any function call which starts an asynchronous operation returns an OwnedPtr<AsyncOperation>. Deleting this object cancels the operation.
  • All OS handles (e.g. file descriptors) are wrapped in objects that automatically close them.

These features turn out to work extremely well together.

A common problem in multi-tasking C++ code (whether based on threads or events) is that cancellation is very difficult. Typically, an asynchronous operation calls some callback at some future time, and the caller is expected to ensure that the callback's context is still valid at the time that it is called. If you're lucky, the operation can be canceled by calling some separate cancel() function. However, it's often the case that this function simply causes the callback to complete sooner, because it's considered too easy to leak memory if an expected callback is never called. So, you still have to wait for the callback.

So what happens if you really just want to kill off an entire large, complex chunk of your program all at once? It turns out this is something I need to do in Ekam. If a build action is in progress and one of its inputs changes, the action should be immediately halted. But actions can involve arbitrary code that can get fairly complex. What can Ekam do about it?

Well, with the style I've been using, cancellation is actually quite easy. Because all allocated objects must be anchored to another object via an OwnedPtr, if you delete a high-level object, you can be pretty sure that all the objects underneath will be cleanly deleted. And because asynchronous operations are themselves represented using objects, and deleting those objects cancels the corresponding operations, it's nearly impossible to accidentally leave an operation running after its context has been deleted.

Problem: OwnedPtr transferral

So what does this have to do with C++0x? Well, there are some parts of my style that turn out to be a bit awkward.

Transferring an OwnedPtr to another OwnedPtr looked like this:

OwnedPtr<MyObject> ptr1, ptr2;
//...
ptr1.adopt(&ptr2);

Looks fine, but this means that the way to pass ownership into a function call is by passing a pointer to an OwnedPtr, getting a little weird:

void Foo::takeOwnership(OwnedPtr<Bar>* barToAdopt) {
  this->bar.adopt(barToAdopt);
}

...

Foo foo;
OwnedPtr<Bar> bar;
...
foo.takeOwnership(&bar);

Returning an OwnedPtr is even more awkward:

void Foo::releaseOwnership(OwnedPtr<Bar>* output) {
  output->adopt(&this->bar);
}

...

OwnedPtr<Bar> bar;
foo.releaseOwnership(&bar);
bar->doSomething();

Furthermore, the way to allocate an owned object was through a method of OwnedPtr itself, which was kind of weird to call:

OwnedPtr<Bar> bar;
bar.allocate(constructorParam1, constructorParam2);
foo.takeOwnership(&bar);

This turned out to be particularly ugly when allocating a subclass:

OwnedPtr<BarSub> barSub;
barSub.allocate(constructorParam1, constructorParam2);
OwnedPtr<Bar> bar;
bar.adopt(&barSub);
foo.takeOwnership(&bar);

So I made a shortcut for that:

OwnedPtr<Bar> bar;
bar.allocateSubclass<BarSub>(
    constructorParam1, constructorParam2);
foo.takeOwnership(&bar);

Still, dealing with OwnedPtrs remained difficult. They just didn't flow right with the rest of the language.

Rvalue references

This is all solved by C++0x's new "rvalue references" feature. When a function takes an "rvalue reference" as a parameter, it only accepts references to values which are safe to clobber, either because the value is an unnamed temporary (which will be destroyed immediately when the function returns) or because the caller has explicitly indicated that it's OK to clobber the value.

Most of the literature on rvalue references talks about how they can be used to avoid unnecessary copies and to implement "perfect forwarding". These are nice, but what I really want is to implement a type that can only be moved, not copied. OwnedPtrs explicitly prohibit copying, since this would lead to double-deletion. However, moving an OwnedPtr is perfectly safe. By implementing move semantics using rvalue references, I was able to make it possible to pass OwnedPtrs around using natural syntax, without any risk of unexpected ownership stealing (as with the old auto_ptr).

Now the code samples look like this:

// Transferring ownership.
OwnedPtr<MyObject> ptr1, ptr2;
...
ptr1 = ptr2.release();

// Passing ownership to a method.
void Foo::takeOwnership(OwnedPtr<Bar> bar) {
  this->bar = bar.release();
}
...
Foo foo;
OwnedPtr<Bar> bar;
...
foo.takeOwnership(bar.release());

// Returning ownership from a method.
OwnedPtr<Bar> Foo::releaseOwnership() {
  return this->bar.release();
}
...
OwnedPtr<Bar> bar = foo.releaseOwnership();
bar->doSomething();

// Allocating an object.
OwnedPtr<Bar> bar = newOwned<Bar>(
    constructorParam1, constructorParam2);

// Allocating a subclass.
OwnedPtr<Bar> bar = newOwned<BarSub>(
    constructorParam1, constructorParam2);

So much nicer! Notice that the release() method is always used in contexts where ownership is being transferred away from a named OwnedPtr. This makes it very clear what is going on and avoids accidents. Notice also that release() is NOT needed if the OwnedPtr is an unnamed temporary, which allows complex expressions to be written relatively naturally.

Problem: Callbacks

While working better than typical callback-based systems, my style for asynchronous operations in Ekam was still fundamentally based on callbacks. This typically involved a lot of boilerplate. For example, here is some code to implement an asynchronous read, based on the EventManager interface which provides asynchronous notification of readability:

class ReadCallback {
public:
  virtual ~ReadCallback() {}

  virtual void done(size_t actual) = 0;
  virtual void error(int number) = 0;
};

OwnedPtr<AsyncOperation> readAsync(
    EventManager* eventManager,
    int fd, void* buffer, size_t size,
    ReadCallback* callback) {
  class ReadOperation: public EventManager::IoCallback,
                       public AsyncOperation {
  public:
    ReadOperation(int fd, void* buffer, size_t size, 
                  ReadCallback* callback)
        : fd(fd), buffer(buffer), size(size),
          callback(callback) {}
    ~ReadOperation() {}
    
    OwnedPtr<AsyncOperation> inner;
  
    // implements IoCallback
    virtual void ready() {
      ssize_t n = read(fd, buffer, size);
      if (n < 0) {
        callback->error(errno);
      } else {
        callback->done(n);
      }
    }
    
  private:
    int fd;
    void* buffer;
    size_t size;
    ReadCallback* callback;
  };

  OwnedPtr<ReadOperation> result =
      newOwned<ReadOperation>(
        fd, buffer, size, callback);
  result->inner = eventManager->onReadable(fd, result.get());
  return result.release();
}

That's a lot of code to do something pretty trivial. Additionally, the fact that callbacks transfer control from lower-level objects to higher-level ones causes some problems:

  • Exceptions can't be used, because they would propagate in the wrong direction.
  • When the callback returns, the calling object may have been destroyed. Detecting this situation is hard, and delaying destruction if needed is harder. Most callback callers are lucky enough not to have anything else to do after the call, but this isn't always the case.

C++0x introduces lambdas. Using them, I implemented E-style promises. Here's what the new code looks like:

Promise<size_t> readAsync(
    EventManager* eventManager,
    int fd, void* buffer, size_t size) {
  return eventManager->when(eventManager->onReadable(fd))(
    [=](Void) -> size_t {
      ssize_t n = read(fd, buffer, size);
      if (n < 0) {
        throw OsError("read", errno);
      } else {
        return n;
      }
    });
}

Isn't that pretty? It does all the same things as the previous code sample, but with so much less code. Here's another example which calls the above:

Promise<size_t> readPromise = readAsync(
    eventManager, fd, buffer, size);
Promise<void> pendingOp =
    eventManager->when(readPromise)(
      [=](size_t actual) {
        // Copy to stdout.
        write(STDOUT_FILENO, buffer, actual);
      }, [](MaybeException error) {
        try {
          // Force exception to be rethrown.
          error.get();
        } catch (const OsError& e) {
          fprintf(stderr, "%s\n", e.what());
        }
      });

Some points:

  • The return value of when() is another promise, for the result of the lambda.
  • The lambda can return another promise instead of a value. In this case the new promise will replace the old one.
  • You can pass multiple promises to when(). The lambda will be called when all have completed.
  • If you give two lambdas to when(), the second one is called in case of exceptions. Otherwise, exceptions simply propagate to the lambda returned by when().
  • Promise callbacks are never executed synchronously; they always go through an event queue. Therefore, the body of a promise callback can delete objects without worrying that they are in use up the stack.
  • when() takes ownership of all of its arguments (using rvalue reference "move" semantics). You can actually pass things other than promises to it; they will simply be passed through to the callback. This is useful for making sure state required by the callback is not destroyed in the meantime.
  • If you destroy a promise without passing it to when(), whatever asynchronous operation it was bound to is canceled. Even if the promise was already fulfilled and the callback is simply sitting on the event queue, it will be removed and will never be called.

Having refactored all of my code to use promises, I do find them quite a bit easier to use. For example, it turns out that much of the complication in using Linux's inotify interface, which I whined about a few months ago, completely went away when I started using promises, because I didn't need to worry about callbacks interfering with each other.

Conclusion

C++ is still a horribly over-complicated language, and C++0x only makes that worse. The implementation of promises is a ridiculous mess of template magic that is pretty inscrutable. However, for those who deeply understand C++, C++0x provides some very powerful features. I'm pretty happy with the results.

Sunday, December 5, 2010

Ekam Eclipse Plugin

Ekam now has an Eclipse plugin! As always, all code is at:

http://code.google.com/p/ekam/

A story

So, let me start with a story. A little earlier today, I was making a small edit to Ekam. I made a change, saved the file, then saw an error marker (a red squiggly underline) appear on the line of code I had just written. I hovered over it and a tooltip popped up describing a compiler error. I figured out what the problem was, edited another file to fix it, and saved that. The error marker in the first file immediately went away.

It wasn't until several moments later that I realized that Eclipse's CDT (C++ developer tools) does not actually mark errors like this in real time. Normally you have to run a build first, and let Eclipse collect errors from the build output. It's normally only in Java that you get such fast feedback as you type.

What had actually happened here is that when I saved the file, Ekam immediately built it, then it sent the error info to my Ekam Eclipse plugin, which in turn marked the error in my editor. This was the first time I had actually edited Ekam code with the new plugin active. It literally took less than a minute for the plugin to start saving me time, and I didn't even notice until after the fact because the process seemed so natural.

As soon as I realized what had just happened, I started squealing with delight.

Features

Here's what the plugin does so far:

  • Connects to a running Ekam process in continuous-build mode.
  • Individual build actions are displayed in a tree view, corresponding to the source tree.
  • Output from each action is shown and errors in GCC format can be clicked to browse to the location of the error in the source code.
  • Errors are also actively marked using Eclipse "markers", meaning you get the squiggly red underline in your text editor.
  • Passing tests have a big green icon while failures (whether tests or compiles) have a big red icon (and also cause parent directories to be marked, so you can find problems in a large tree).

UI is Weird

I have very little experience with UI code. One thing I'm finding interesting is that when I run my code, I immediately notice obvious UX problems that weren't at all obvious when I was imagining things in my head. Little tweaks can make a huge difference. I really have to try things to find out what works.

Time to Apply

Neither Ekam as a whole nor the Eclipse plugin in particular is anywhere near complete. However, I think they are working well enough now that it's time to work on something else using Ekam. I could go on forever adding features that I think are useful, but actually trying to use it will tell me exactly what the biggest pain points are. I will then go back and tweak Ekam as needed to facilitate whatever I'm working on.

So, next week, I plan to work on Captain Proto, which I've put off for too long.

Sunday, October 24, 2010

Ekam continuously builds

I just pushed some changes to Ekam adding a mode in which it monitors your source code and immediately rebuilds when a file changes. As usual, code is at:

http://code.google.com/p/ekam/

Just run Ekam with -c to put it in continuous mode. When it finishes building, it will wait for changes and then rebuild. It also notices changes which happen mid-build and accounts for them, so you don't need to worry about confusing it as you edit.

I've only implemented this for FreeBSD and OSX so far, but things are still broken on Linux anyway (see previous post).

kqueue/EVFILT_VNODE vs. inotify

Speaking of Linux and monitoring directory trees, it turns out that Linux has a completely different interface for doing this vs. FreeBSD/OSX. The latter systems implement kqueue, which, via the EVFILT_VNODE filter, can monitor changes to a file pointed at by an open file descriptor. This includes directories -- adding or removing a file from a directory counts as modifying the directory.

Linux, on the other hand, has inotify, a completely different interface that, in the end, does the same thing.

There is a key difference, though: kqueue requires you to have an open file descriptor for every file and directory you wish to monitor. inotify takes a path instead. Normally I prefer interfaces based on file descriptors because they are more object-oriented. However, the number of file descriptors that may be opened at one time is typically limited to a few thousand -- easily less than the number of files in a large source tree. On OSX (which seems to have lower limits than FreeBSD) I was not even able to monitor the protobuf source tree without hitting the limit.

Furthermore, watching a directory using inotify also means watching all files in the directory. While I have to assume that this is NOT transitive (so you can't watch an entire tree with one call), it still (presumably) greatly reduces the overhead of watching an entire tree, since the OS doesn't have to flag every file individually.

I'm still waiting for confirmation from the freebsd-questions mailing list on whether EVFILT_VNODE is really so much less scalable, but lots of Googling didn't produce any answers. I just see lots of people asking for inotify on FreeBSD and being told to use EVFILT_VNODE instead, as if it were an equal replacement.

This may mean that FreeBSD and OSX will have to use some sort of gimpy polling strategy instead, but that will have its own scalability issues as the directory tree gets large. Sigh.

I may give up and switch my desktop to Linux. Fully-supported Chrome and Eclipse would be nice, as well as a more robust package update mechanism (I'm getting sick of compiling things). Figuring out how to do everything on FreeBSD was fun, but getting a bit old now.

Monday, October 11, 2010

Ekam finds includes, runs tests

Sorry for missing the last couple weeks. I have been working on Ekam, but didn't have time to get a whole lot done and didn't feel like I had anything worth talking about. But now, an update! As always, the code can be found at:

http://code.google.com/p/ekam/

Ekam now compiles and runs tests:


Here you can see it running some tests from protocol buffers. It automatically detects object files which register tests with Google Test and links them against gtest_main (if necessary). Then, it runs them, highlighting passing tests in green, and printing error logs of failing tests which show just the failures:

By the way, compile failures look pretty nice too:

Is that... word wrap? And highlighting the word "error" so you can see it easily? *gasp* Unthinkable!

Intercepting open()

Ekam now intercepts open() and other filesystem calls made by the compiler. It works by using LD_PRELOAD (DYLD_INSERT_LIBRARIES on Mac) to inject a custom shared library into the compiler process. This library defines its own implementation of open() which makes an RPC to the Ekam process in order to map file names to their actual physical locations. Essentially, it creates a virtual filesystem, but is lighter-weight than FUSE. This serves two purposes:

  1. Ekam can detect what the dependencies of an action are, e.g. what headers are included by a C++ source file, based on what files the compiler tries to open. With this information it can construct the build graph and determine when actions need to be re-run. (See previous blog entries about how Ekam manages dependencies.)
  2. Ekam can locate dependencies just-in-time, rather than telling the compiler where to look ahead of time. For example, Ekam does not need to set up the C++ compiler's include path to cover all the directories of everything the code might include. Instead, the include path contains only a single "magic" directory. When the compiler tries to open a file in that directory, Ekam searches for that file among all public headers it knows about across the source tree.

The down side of this approach is that every OS has quirks that must be worked around. FreeBSD seems to be the least quirky: the only thing I found weird about this system is that I had to define both open() and _open() to catch everything (and similarly for all other system calls I wanted to intercept). OSX and Linux, meanwhile, have some ridiculous hacks going on in which certain system calls are remapped to different names at compile time, so the injected library has to be sure to intercept the remapped name instead of the original. I'm still running into some trouble on Linux where some calls to open() seem to bypass my code, while others hit it just fine. I need to spend more time working on it, but I do not have a Linux machine at home. (Maybe someone else would like to volunteer?)

Try it out

If you are running FreeBSD or OSX, you can try using Ekam to build protobufs or your own code (currently broken on Linux as described above). I've updated the quick start guide with instructions.

Monday, September 13, 2010

Ekam: Core model improvements

Finally got some Ekam work done again. As always, code can be found at:

http://code.google.com/p/ekam/

This weekend and last were spent slightly re-designing the basic model used to track dependencies between actions. I think it is simpler and more versatile now.

Ekam build model

I haven't really discussed Ekam's core logic before, only the features built on top of it. Let's do that now. Here are the basic objects that Ekam works with:

  • Files: Obviously, there are some set of input (source) files, some intermediate files, and some output files. Actually, at the moment, there is no real support for "outputs" -- everything goes to the "intermediate files" directory (tmp) and you have to dig out the one you're interested in.
  • Tags: Each file has a set of tags. Each tag is just a text string (or rather, a hash of a text string, for efficiency). Tags can mean anything, but usually each tag indicates something that the file provides which something else might depend on.
  • Actions: An action takes some set of input files and produces some set of output files. The inputs may be source files, or they may be the outputs of other actions. An action is specific to a particular set of files -- e.g. each C++ source file has a separate "compile" action. An action may search for inputs by tag, and may add new tags to any file (inputs and outputs). Note that applying tags to inputs is useful for implementing actions which simply scan existing files to see what they provide.
  • Rules: A rule describes how to construct a particular action given a particular input file with a particular tag. In fact, currently the class representing a rule in Ekam is called ActionFactory. Each rule defines some set of "trigger" tags in which it is interested, and whenever Ekam encounters one of those tags, the rule is asked to generate an action based on the file that defined the tag.

Example

There is one rule which defines how to compile C++ source files. This rule triggers on the tag "filetype:.cpp", so Ekam calls it whenever it sees a file name with the .cpp extension. The rule compiles the file to produce an object file, and adds a tag to the object file for every C++ symbol defined within it.

Meanwhile, another rule defines how to link object files into a binary. This rule triggers on the tag "c++symbol:main" to pick up object files which define a main() function. When it finds one, it checks that object file to see what external symbols it references, and then asks Ekam to find other object files with the corresponding tags. It does this recursively until it can't find any more objects, then attempts to link (even if some symbols weren't found).

If not all objects needed by the binary have been compiled yet, then this link will fail. That's fine, because Ekam remembers what tags the action asked for. If, later on, one of the missing tags shows up, Ekam will retry the link action that failed before, to see if the new tag makes a difference. Assuming the source code is complete, the link should eventually succeed. If not, once Ekam has nothing left to do, it will report to the user whatever errors remain.

Note that Ekam will retry an action any time any of the tags it asked for change. So, for example, say the binary calls malloc(). The link action may have searched for "c++symbol:malloc" and found nothing. But, the link may have succeeded despite this, because malloc() is defined by the C runtime. Later on, Ekam might find some other definition of malloc() elsewhere. When it does, it will re-run the link action from before to make sure it gets a chance to link in this new malloc() implementation instead.

Say Ekam then finds yet another malloc() implementation somewhere else. Ekam will then try to decide which malloc() is preferred by the link action. By default, the file which is closest to the action's trigger file will be used. So when linking foo/bar/main.o, Ekam will prefer foo/mem/myMalloc.cpp over bar/anotherMalloc.cpp -- the file name with the longest common prefix is preferred. Ties are broken by alphabetical ordering. In the future, it will be possible for a package to explicitly specify preferences when the default is not good enough.

The work I did over the last two weekends made Ekam able to handle and choose between multiple instances of the same tag.

Up next

  • I still have the code laying around to intercept open() calls from arbitrary processes. I intend to use this to intercept the compiler's attempts to search for included headers, and translate those into Ekam tag lookups. Once Ekam responds with a file name, the intercepted open() will open that file instead. Thus the compile action will not need to know ahead of time what the dependencies are, in order to construct an include path.
  • I would like to implement the mechanism by which preferences are specified sometime soon. I think the mechanism should also provide a way to define visibility of files and/or tags defined within a package, in order to prevent others from depending on your internal implementation details.
  • I need to make Ekam into a daemon that runs continuously in the background, detecting when source files change and immediately rebuilding. This will allow Ekam to perform incremental builds, which currently it does not do. Longer term, Ekam should persist its state somehow, but I think simply running as a daemon should be good enough for most users. Rebuilding from scratch once a day or so is not so bad, right?
  • Rules, rules, rules! Implement fully-featured C++ rules (supporting libraries and such), Java rules, Protobufs, etc.
  • Documentation. Including user documentation, implementation documentation, and code comments.

After the above, I think Ekam will be useful enough to start actually using.

Monday, August 23, 2010

Ekam: No longer restricted to building self, and can be taught new rules

Two big updates to Ekam this weekend! As always, the source code is at:

http://code.google.com/p/ekam/

Updates

Arbitrary Code

Ekam is no longer restricted to only building code in the "ekam" namespace. It will now build pretty much any C++ code you throw at it, so long as the code depends only on the standard C/C++ library. In fact, when I pointed Ekam at the Protocol Buffers source code, it successfully churned out protoc! (It didn't do as well compiling the tests, though.)

I ended up accomplishing this not by indexing libraries, as I had planned, but instead by changing the behavior when a dependency is not found. Now, if the link action cannot find any object files defining a particular symbol, it just goes ahead and tries to link anyway to see what happens. Only after failing does it decide to wait for the symbols to become available -- and it retries again every time a symbol for which it was waiting shows up. So, once all non-libc symbols are available, the link succeeds.

Now, this has the down side that Ekam will go through a lot of failed linker invocations. But, I look at this as an optimization problem, not a correctness problem. We can use heuristics to make Ekam avoid too many redundant link attempts. For example, we can remember what happened last time. Or, an easier approach would be to just make sure failed operations are not tried again until everything else is done.

Related to this change, Ekam will now re-run even a successful operation if dependencies that were not originally available become available. I talked about why you'd want to do this last week.

Defining new rules

You can now implement new build rules by simply writing a shell script and putting it in the source tree, as I proposed last week. You can now see what the actual code for compiling C++ looks like (linking is not yet implemented as a script). Of course, you don't have to write these as shell scripts. Any executable whose name ends in .ekam-rule will be picked up by Ekam -- even binaries that were themselves compiled by Ekam.

Refactoring

I seem to spend a lot of time refactoring. Working on C++ outside of Google is almost like learning a whole new language. I get to use whatever style I want, including whatever C++ features I want. I've been refining a very specific style, and I keep realizing ways to improve it that require going back and rewriting a bunch of stuff. Some key aspects of the style I'm using:

  • Exceptions are allowed, but expected never to occur in normal usage.
  • Ownership of pointers, and passing of that ownership, is explicit. I have a template class called OwnedPtr for this. In fact, I never use new -- instead, I call OwnedPtr::allocate() (passing it the desired constructor parameters). It's not possible to release ownership of the pointer, except by moving it to another OwnedPtr. Thus, the only way for memory to become unreachable without being deleted is by creating an ownership cycle. Yet, I don't use reference counting, since it is notoriously slow in the presence of multiple cores.
  • I'm using single-threaded event-driven I/O, since Ekam itself is just a dispatcher and does not need to utilize multiple cores. Events need to be cancelable. Originally, I had every asynchronous method optionally return a canceler object which could be called to cancel the operation. This got surprisingly hairy to implement, since the event effectively had to cancel the canceler when complete. Also, events owned their callbacks, which made a lot of things awkward and tended to lead to cyclic ownership (doh). What turned out to be much simpler was to simply have every asynchronous method return an object representing the ongoing operation. To cancel the operation, delete the object. With this approach, deleting a high-level object naturally causes everything it is doing to be canceled via cascading destructors, with no need to really keep track of cancellation.

Ideas

Currently, the compile action does not yet resolve header dependencies -- if you need to use special include directories you must specify CXXFLAGS manually. This is, of course, bad. But how can Ekam detect header dependencies?

One approach would be to attempt to compile and, if not successful, try to parse the error messages to determine missing includes. This is, of course, going to be pretty brittle -- it would need to understand every compiler's output, possibly for every locale.

Another approach would be to write some sort of an "include scanner" which looks for #include directives. It would not necessarily have to evaluate branches (#ifdefs and such), since it could just conservatively look for all the headers that are mentioned and take whatever it can find. However, this would still be pretty complicated to write and slow to run, and it wouldn't be able to handle things like macro-derived header names (yes, you can do that).

So here's my idea: Run the compiler, but use LD_PRELOAD to inject into it a custom implementation of open(). This implementation will basically send an RPC to Ekam (using the already-implement plugin interface) asking where to find the file, then replace the path with what Ekam sends back. The injected open() could pay attention only to paths in a certain directory which would be added to the compiler's include path. Problem solved! And better yet, this solution naturally extends to other kinds of actions and programming languages.

Sunday, August 15, 2010

Ekam: Works on Linux and OSX, and looks pretty


Above is what Ekam looks like when it is compiling some code. The fuchsia lines are currently running, while the blue lines have completed. Errors, if there were any, would be red, and passing tests would be green (although Ekam does not yet run tests).

This weekend I:

  • Implemented the fancy output above.
  • Got Ekam working on Linux and OSX.
  • Did a ton of refactoring to make components more reusable. This means that visible progress ought to come much quicker when I next work on it.

Some critical features are still missing:

  • Ekam starts fresh every time it runs; it does not detect what tasks can be skipped.
  • Ekam can only really be used to build itself, as it assumes that any symbol that doesn't contain the word "ekam" (i.e. is not in the "ekam" namespace) is satisfied by libc. To fix this I will need to make Ekam actually index libc to determine what symbols can be ignored.
  • The only way to add new rule types (e.g. to support a new language) is to edit the Ekam source code.
  • Ekam does not run tests.

As always, source code is at: http://ekam.googlecode.com/

Defining new rules

I've been thinking about this, and I think the interface Ekam will provide for defining new rules will be at the process level. So, you must write a program that implements your rule. You can write your rule implementation in any language, but the easiest approach will probably be to do it as a shell script.

For example, a simplistic version of the rule to compile a C++ source file might look like:

#! /bin/sh

set -e

if [ $# -eq 0 ]; then
  # Ekam is querying the script.
  # Tell it that we like .cpp files.
  echo triggerOnFilePattern '*.cpp'
  exit 0
fi

INPUT="$1"

BASENAME=$(basename "$INPUT" .cpp)

# Ask Ekam where to put the output file.
echo newOutput "${BASENAME}.o"
read OUTPUT

# Compile, making sure all logs go to stderr.
c++ $CXXFLAGS -c "$INPUT" -o "$OUTPUT" >&2

# Tell Ekam about the symbols provided by the output file.
nm $OUTPUT | grep '[^ ]*  *[ABCDGRSTV] ' |
  sed -e 's/^[^ ]*  *. \(.*\)$/provide c++symbol:\1/g'

# Tell Ekam we succeeded.
echo success

You would then give this script a name like c++compile.ekamrule. When Ekam sees .ekamrule files during its normal exploration of the source tree, it would automatically query them and then start using them.

Of course, shell scripts are not the most efficient beasts. You might later decide that this script really needs to be replaced by something written in a lower-level language. Ekam doesn't care what language the program is written in -- you could easily use Python instead, or even C++. (Naturally, writing the rule for compiling C++ in C++ would necessitate some bootstrapping.)

Resolving conflicts

Several people have asked what happens in cases where, say, two different libraries define the same symbol (e.g. various malloc() implementations). I've been thinking about this, and I think I'm settling on a solution involving preferences. Essentially, somewhere you would declare "This package prefers tcmalloc over libc." The declaration would probably go in a file called ekamhints and would apply to the directory in which it resides and all its subdirectories. With this hint in place, if tcmalloc is available (either installed, or encountered in the source tree), Ekam will notice that both it and libc provide the symbol "malloc" which, of course, lots of other things depend on. It will then consult the preference declaration, see that tcmalloc is preferred, and use that.
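For illustration, such a hints file might look something like this -- the syntax here is entirely hypothetical, since no ekamhints format has actually been specified:

```
# ekamhints -- applies to this directory and all its subdirectories.
# Prefer tcmalloc's malloc over libc's when both are available.
prefer tcmalloc over libc
```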

When no preferences are declared, Ekam could infer preferences based on the distance between the dependency and the dependent. For example, a symbol declared in the same package is likely to be preferred over one declared elsewhere, and Ekam can use that preference automatically. Also, symbols defined in the source code tree are likely preferred over those defined in installed libraries. Additionally, preferences could be configured by the user, e.g. to force code to use tcmalloc even if the developer provided no hint to do so.

Note that this preferences system allows for an interesting alternative to the feature tests traditionally done by configure scripts: Write multiple implementations of your functionality using different features, placing each in a different file. Declare preferences stating which implementation to prefer. Then, let Ekam try to build them all. The ones that fail to compile will be discarded, and the most-preferred implementation from those remaining will be used. Obviously, this approach won't work everywhere, but it does cover a fairly wide variety of use cases.

Build ordering

There is still a tricky issue here, though: Ekam doesn't actually know what code will provide what symbols until it compiles said code. So, for example, say that you plop tcmalloc into your code tree in the hopes that Ekam will automagically link it in to your code. You run Ekam, and it happens to compile your code first, before ever looking into the tcmalloc directory. When it links your binaries, it doesn't know about tcmalloc's "malloc" implementation, so it just uses libc's. Only later on does it discover tcmalloc.

This is a hard problem. The easy solution would be to provide a way to tell Ekam "Please compile the tcmalloc package first." or "The symbol 'malloc' will be provided by the tcmalloc directory; do not link any binaries depending on it until that directory is ready." But these feel too much like writing an old-fashioned build file to me. I'd like things to be more automatic.

Another option, then, is to simply let Ekam do the wrong thing initially -- let it link those binaries without tcmalloc -- and then have it re-do those steps later when it realizes it has better options. This may sound like a lot of redundant work, but is it? It's only the link steps that would have to be repeated, and only the ones which happened to occur before tcmalloc became available. But perhaps one of the binaries is a code generator: will we have to regenerate and re-compile all of its output? Regenerate, yes, but not recompile -- assuming the output is identical to before, Ekam can figure out not to repeat any actions that depend on it. Finally, and most importantly, note that Ekam can remember what happened the last time it built the code. Once it has seen tcmalloc once, it can remember that in the future it should avoid linking anything that uses malloc until tcmalloc has been built.

Given these heuristics, I think this approach could work out quite nicely, and requires minimal user intervention.

Monday, August 2, 2010

Kake: A build system with no build files

UPDATE: Renamed to "Ekam" because "Kake" apparently looks like a misspelling of a vulgar German word. Link below updated.

I finally got some real coding done this weekend.

http://code.google.com/p/ekam/

Kake is a build system which automatically figures out what to build and how to build it purely based on the source code. No separate "makefile" is needed.

Kake works by exploration. For example, when it encounters a file ending in ".cpp", it tries to compile the file. If there are missing includes, Kake continues to explore until it finds headers matching them. When Kake builds an object file and discovers that it contains a "main" symbol, it tries to link it as an executable, searching for other object files to satisfy all symbol references therein.

You might ask, "But wouldn't that be really slow for a big codebase?". Not necessarily. Results of previous exploration can be cached. These caches may be submitted to version control along with the code. Only the parts of the code which you modify must be explored again. Meanwhile, you don't need to waste any time messing around with makefiles. So, overall, time ought to be saved.

Current status

Currently, Kake is barely self-hosting: it knows how to compile C++ files, and it knows how to look for a "main" function and link a binary from it. There is a hack in the code right now to make it ignore any symbols that aren't in the "kake2" namespace since Kake does not yet know anything about libraries (not even the C/C++ runtime libraries).

That said, when Kake first built itself, it noticed that one of the source files was completely unused, and so did not bother to include it. Kake is already smarter than me.

Kake currently requires FreeBSD, because I used kqueue for events and libmd to calculate SHA-256 hashes. I thought libmd was standard but apparently not. That's easy enough to fix when I get a chance, after which Kake should run on OSX (which has kqueue), but will need some work to run on Linux or Windows (no kqueue).

There is no documentation, but you probably don't want to actually try using the existing code anyway. I'll post more when it is more usable.

Future plans

First and foremost, I need to finish the C++ support, including support for libraries. I also need to make Kake not rebuild everything every time it is run -- it should remember what it did last time. Kake should also automatically run any tests that it finds and report the results nicely.

Eventually I'd like Kake to run continuously in the background, watching as you make changes to your code, and automatically rebuilding stuff as needed. When you actually run the "kake" command, it will usually be able to give you an immediate report of all known problems, since it has already done the work. If you just saved a change to a widely-used header, you might have to wait.

Then I'd like to integrate Kake into Eclipse, so that C++ development can feel more like Java (which Eclipse builds continuously).

I'd like to support other languages (especially Java) in addition to C++. I hope to write a plugin system which makes it easy to extend Kake with rules for building other languages.

Kake should eventually support generating makefiles based on its exploration, so that you may ship those makefiles with your release packages for people who don't already have Kake.

Kake will, of course, support code generators, including complex cases where the code generator is itself built from sources in the same tree. Protocol Buffers are an excellent test case.

To scale to large codebases, I'd like to develop a system where many Kake users can share some central database which keeps track of build entities in submitted code, so that Kake need not actually explore the whole code base just to resolve dependencies for the part you are working on.