Monday, September 13, 2010

Ekam: Core model improvements

Finally got some Ekam work done again. As always, code can be found at:

http://code.google.com/p/ekam/

This weekend and last were spent slightly re-designing the basic model used to track dependencies between actions. I think it is simpler and more versatile now.

Ekam build model

I haven't really discussed Ekam's core logic before, only the features built on top of it. Let's do that now. Here are the basic objects that Ekam works with:

  • Files: Obviously, there is some set of input (source) files, some intermediate files, and some output files. Actually, at the moment, there is no real support for "outputs" -- everything goes to the "intermediate files" directory (tmp) and you have to dig out the one you're interested in.
  • Tags: Each file has a set of tags. Each tag is just a text string (or rather, a hash of a text string, for efficiency). Tags can mean anything, but usually each tag indicates something that the file provides which something else might depend on.
  • Actions: An action takes some set of input files and produces some set of output files. The inputs may be source files, or they may be the outputs of other actions. An action is specific to a particular set of files -- e.g. each C++ source file has a separate "compile" action. An action may search for inputs by tag, and may add new tags to any file (inputs and outputs). Note that applying tags to inputs is useful for implementing actions which simply scan existing files to see what they provide.
  • Rules: A rule describes how to construct a particular action given a particular input file with a particular tag. In fact, currently the class representing a rule in Ekam is called ActionFactory. Each rule defines some set of "trigger" tags in which it is interested, and whenever Ekam encounters one of those tags, the rule is asked to generate an action based on the file that defined the tag.
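
To make the relationship between these objects concrete, here is a rough sketch in Python. It is not Ekam's actual code (Ekam is not written in Python), and every name in it (File, Action, Rule, Driver, add_tag) is made up for illustration, apart from the idea that a rule plays the role of ActionFactory. Tags are kept as plain strings here rather than hashes, and actions are only queued, never run.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class File:
    path: str
    tags: Set[str] = field(default_factory=set)  # plain strings, not hashes


@dataclass
class Action:
    description: str
    trigger: File  # the file whose tag caused this action to be created


@dataclass
class Rule:
    """Hypothetical stand-in for what the post calls ActionFactory."""
    name: str
    trigger_tags: Set[str]
    make_action: Callable[[File, str], Action]


class Driver:
    """Hypothetical dispatcher: offers each newly seen tag to every rule."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = list(rules)
        self.pending: List[Action] = []

    def add_tag(self, file: File, tag: str) -> None:
        if tag in file.tags:
            return  # already known, nothing new to trigger
        file.tags.add(tag)
        for rule in self.rules:
            if tag in rule.trigger_tags:
                self.pending.append(rule.make_action(file, tag))


compile_rule = Rule("compile", {"filetype:.cpp"},
                    lambda f, t: Action(f"compile {f.path}", f))
link_rule = Rule("link", {"c++symbol:main"},
                 lambda f, t: Action(f"link {f.path}", f))

driver = Driver([compile_rule, link_rule])
source = File("foo/bar/main.cpp")
driver.add_tag(source, "filetype:.cpp")
obj = File("tmp/foo/bar/main.o")
driver.add_tag(obj, "c++symbol:main")
descriptions = [a.description for a in driver.pending]
```

After this runs, descriptions holds one compile action for the source file and one link action for the object file, in that order.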

Example

There is one rule which defines how to compile C++ source files. This rule triggers on the tag "filetype:.cpp", so Ekam calls it whenever it sees a file name with the .cpp extension. The rule compiles the file to produce an object file, and adds a tag to the object file for every C++ symbol defined within it.

Meanwhile, another rule defines how to link object files into a binary. This rule triggers on the tag "c++symbol:main" to pick up object files which define a main() function. When it finds one, it checks that object file to see what external symbols it references, and then asks Ekam to find other object files with the corresponding tags. It does this recursively until it can't find any more objects, then attempts to link (even if some symbols weren't found).
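
The recursive search can be sketched roughly as follows. This is an illustration in Python under simplifying assumptions, not the real link rule: the function name collect_objects and both dictionaries are hypothetical, each tag maps to at most one object file, and the external symbols of each object are given up front instead of being read from the object file.

```python
from typing import Dict, List, Set, Tuple


def collect_objects(main_obj: str,
                    provided_by: Dict[str, str],
                    references: Dict[str, Set[str]]) -> Tuple[List[str], Set[str]]:
    """Follow symbol references outward from main_obj.

    provided_by maps a tag such as "c++symbol:foo" to the object defining it.
    references maps an object file to the external symbols it uses.
    Returns the objects to link and the symbols nobody provided.
    """
    chosen = [main_obj]
    seen = {main_obj}
    missing: Set[str] = set()
    queue = [main_obj]
    while queue:
        obj = queue.pop(0)
        for symbol in sorted(references.get(obj, ())):
            provider = provided_by.get("c++symbol:" + symbol)
            if provider is None:
                missing.add(symbol)  # link anyway; it may come from elsewhere
            elif provider not in seen:
                seen.add(provider)
                chosen.append(provider)
                queue.append(provider)
    return chosen, missing


references = {
    "main.o": {"parse", "malloc"},
    "parse.o": {"lex"},
    "lex.o": set(),
}
provided_by = {"c++symbol:parse": "parse.o", "c++symbol:lex": "lex.o"}
objects, missing = collect_objects("main.o", provided_by, references)
```

Here objects ends up as main.o, parse.o, lex.o, and missing contains only malloc, which the link would still be attempted without.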

If not all objects needed by the binary have been compiled yet, then this link will fail. That's fine, because Ekam remembers what tags the action asked for. If, later on, one of the missing tags shows up, Ekam will retry the link action that failed before, to see if the new tag makes a difference. Assuming the source code is complete, the link should eventually succeed. If not, once Ekam has nothing left to do, it will report to the user whatever errors remain.

Note that Ekam will retry an action whenever any of the tags it asked for changes. So, for example, say the binary calls malloc(). The link action may have searched for "c++symbol:malloc" and found nothing. But the link may have succeeded despite this, because malloc() is defined by the C runtime. Later on, Ekam might find some other definition of malloc() elsewhere. When it does, it will re-run the link action from before to make sure it gets a chance to link in this new malloc() implementation instead.
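
The bookkeeping behind this retry behavior might look something like the sketch below. It is an assumption about the general shape, not a description of Ekam's internals: TagTracker, lookup and add_provider are hypothetical names, and actions are represented by plain strings.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class TagTracker:
    """Remembers which actions asked for which tags (hypothetical sketch)."""

    def __init__(self) -> None:
        self.providers: DefaultDict[str, Set[str]] = defaultdict(set)
        self.askers: DefaultDict[str, Set[str]] = defaultdict(set)

    def lookup(self, action: str, tag: str) -> List[str]:
        # Record the request even when nothing is found, so the action
        # can be retried if a provider shows up later.
        self.askers[tag].add(action)
        return sorted(self.providers[tag])

    def add_provider(self, tag: str, path: str) -> List[str]:
        """Register a file carrying this tag; return the actions to re-run."""
        if path in self.providers[tag]:
            return []
        self.providers[tag].add(path)
        return sorted(self.askers[tag])


tracker = TagTracker()
found = tracker.lookup("link foo/bar/main.o", "c++symbol:malloc")
rerun = tracker.add_provider("c++symbol:malloc", "foo/mem/myMalloc.o")
unrelated = tracker.add_provider("c++symbol:free", "foo/mem/myMalloc.o")
```

The first lookup finds nothing, but because the request was recorded, the later malloc provider puts the link action back on the list to re-run. The free tag was never asked for, so it triggers nothing.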

Say Ekam then finds yet another malloc() implementation somewhere else. Ekam will then try to decide which malloc() is preferred by the link action. By default, the file which is closest to the action's trigger file will be used. So when linking foo/bar/main.o, Ekam will prefer foo/mem/myMalloc.cpp over bar/anotherMalloc.cpp -- the file name with the longest common prefix is preferred. Ties are broken by alphabetical ordering. In the future, it will be possible for a package to explicitly specify preferences when the default is not good enough.
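
The default preference can be sketched as a sort key. The version below is a guess at the details: it assumes the common prefix is measured in whole path components rather than characters, and the function names are made up for illustration.

```python
from typing import Iterable


def common_components(a: str, b: str) -> int:
    """Number of leading path components shared by a and b."""
    count = 0
    for x, y in zip(a.split("/"), b.split("/")):
        if x != y:
            break
        count += 1
    return count


def preferred(trigger: str, candidates: Iterable[str]) -> str:
    """Pick the candidate closest to the trigger file.

    Longest shared prefix wins; ties are broken alphabetically.
    """
    return min(candidates,
               key=lambda c: (-common_components(trigger, c), c))


choice = preferred("foo/bar/main.o",
                   ["bar/anotherMalloc.cpp", "foo/mem/myMalloc.cpp"])
tie = preferred("foo/bar/main.o", ["foo/b.cpp", "foo/a.cpp"])
```

With these inputs, choice is foo/mem/myMalloc.cpp, matching the example above, and tie is foo/a.cpp because both candidates share only the foo component with the trigger.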

The work I did over the last two weekends made Ekam able to handle and choose between multiple instances of the same tag.

Up next

  • I still have the code lying around to intercept open() calls from arbitrary processes. I intend to use this to intercept the compiler's attempts to search for included headers, and translate those into Ekam tag lookups. Once Ekam responds with a file name, the intercepted open() will open that file instead. Thus the compile action will not need to know ahead of time what the dependencies are in order to construct an include path.
  • I would like to implement the mechanism by which preferences are specified sometime soon. I think the mechanism should also provide a way to define visibility of files and/or tags defined within a package, in order to prevent others from depending on your internal implementation details.
  • I need to make Ekam into a daemon that runs continuously in the background, detecting when source files change and immediately rebuilding. This will allow Ekam to perform incremental builds, which currently it does not do. Longer term, Ekam should persist its state somehow, but I think simply running as a daemon should be good enough for most users. Rebuilding from scratch once a day or so is not so bad, right?
  • Rules, rules, rules! Implement fully-featured C++ rules (supporting libraries and such), Java rules, Protobufs, etc.
  • Documentation, including user documentation, implementation documentation, and code comments.

After the above, I think Ekam will be useful enough to start actually using.