Sunday, August 15, 2010

Ekam: Works on Linux and OSX, and looks pretty

Above is what Ekam looks like when it is compiling some code. The fuchsia lines are currently running, while blue lines are completed. Errors, if there were any, would be red, and passing tests would be green (although Ekam does not yet run tests).

This weekend I:

  • Implemented the fancy output above.
  • Got Ekam working on Linux and OSX.
  • Did a ton of refactoring to make components more reusable. This means that visible progress ought to come much quicker when I next work on it.

Some critical features are still missing:

  • Ekam starts fresh every time it runs; it does not detect what tasks can be skipped.
  • Ekam can only really be used to build itself, as it assumes that any symbol that doesn't contain the word "ekam" (i.e. is not in the "ekam" namespace) is satisfied by libc. To fix this I will need to make Ekam actually index libc to determine what symbols can be ignored.
  • The only way to add new rule types (e.g. to support a new language) is to edit the Ekam source code.
  • Ekam does not run tests.

As always, source code is at:

Defining new rules

I've been thinking about this, and I think the interface Ekam will provide for defining new rules will be at the process level. So, you must write a program that implements your rule. You can write your rule implementation in any language, but the easiest approach will probably be to do it as a shell script.

For example, a simplistic version of the rule to compile a C++ source file might look like:

#! /bin/sh

set -e

if $# == 0; then
  # Ekam is querying the script.
  # Tell it that we like .cpp files.
  echo triggerOnFilePattern '*.cpp'
  exit 0


BASENAME=`basename $INPUT .cpp`

# Ask Ekam where to put the output file.
echo newOutput ${BASENAME}.o

# Compile, making sure all logs go to stderr.
c++ $CXXFLAGS -c $INPUT -o $OUTPUT >&2

# Tell Ekam about the symbols provided by the output file.
nm $OUTPUT | grep '[^ ]*  *[ABCDGRSTV] ' |
  sed -e 's/^[^ ]*  *. \(.*\)$/provide c++symbol:\1/g'

# Tell Ekam we succeeded.
echo success

You would then give this script a name like c++compile.ekamrule. When Ekam sees .ekamrule files during its normal exploration of the source tree, it would automatically query them and then start using them.

Of course, shell scripts are not the most efficient beasts. You might later decide that this script really needs to be replaced by something written in a lower-level language. Of course, Ekam doesn't care what language the program is written in -- you could easily use Python instead, or even C++. (Of course, to write the rule for compiling C++ in C++ would necessitate some bootstrapping.)

Resolving conflicts

Several people have asked what happens in cases where, say, two different libraries define the same symbol (e.g. various malloc() implementations). I've been thinking about this, and I think I'm settling on a solution involving preferences. Essentially, somewhere you would declare "This package prefers tcmalloc over libc.". The declaration would probably go in a file called ekamhints and would apply to the directory in which in resides and all its subdirectories. With this hint in place, if tcmalloc is available (either installed, or encountered in the source tree), Ekam will notice that both it and libc provide the symbol "malloc" which, of course, lots of other things depend on. It will then consult the preference declaration, see that tcmalloc is preferred, and use that.

When no preferences are declared, Ekam could infer preferences based on the distance between the dependency and the dependent. For example, a symbol declared in the same package is likely to be preferred over one declared elsewhere, and Ekam can use that preference automatically. Also, symbols defined in the source code tree are likely preferred over those defined in installed libraries. Additionally, preferences could be configured by the user, e.g. to force code to use tcmalloc even if the developer provided no hint to do so.

Note that this preferences system allows for an interesting alternative to the feature tests traditionally done by configure scripts: Write multiple implementations of your functionality using different features, placing each in a different file. Declare preferences stating which implementation to prefer. Then, let Ekam try to build them all. The ones that fail to compile will be discarded, and the most-prefered implementation from those remaining will be used. Obviously, this approach won't work everywhere, but it does cover a fairly wide variety of use cases.

Build ordering

There is still a tricky issue here, though: Ekam doesn't actually know what code will provide what symbols until it compiles said code. So, for example, say that you plop tcmalloc into your code tree in the hopes that Ekam will automagically link it in to your code. You run Ekam, and it happens to compile your code first, before ever looking into the tcmalloc directory. When it links your binaries, it doesn't know about tcmalloc's "malloc" implementation, so it just uses libc's. Only later on does it discover tcmalloc.

This is a hard problem. The easy solution would be to provide a way to tell Ekam "Please compile the tcmalloc package first." or "The symbol 'malloc' will be provided by the tcmalloc directory; do not link any binaries depending on it until that directory is ready.". But, these feel too much like writing an old-fashion build file to me. I'd like things to be more automatic.

Another option, then, is to simply let Ekam do the wrong thing initially -- let it link those binaries without tcmalloc -- and then have it re-do those steps later when it realizes it has better options. This may sound like a lot of redundant work, but is it? It's only the link steps that would have to be repeated, and only the ones which happened to occur before tcmalloc became available. But perhaps one of the binaries is a code generator: will we have to regenerate and re-compile all of its output? Regenerate, yes, but not recompile -- assuming the output is identical to before, Ekam can figure out not to repeat any actions that depend on it. Finally, and most importantly, note that Ekam can remember what happened the last time it built the code. Once it has seen tcmalloc once, it can remember that in the future it should avoid linking anything that uses malloc until tcmalloc has been built.

Given these heuristics, I think this approach could work out quite nicely, and requires minimal user intervention.

No comments:

Post a Comment