Implementing Metadata in Cloje
This week I've been planning Cloje and trying to figure out how I might approach implementing certain Clojure features. Even if I'm not going to write the code for several weeks, I want to avoid painting myself into a corner.
One such feature is support for metadata. I see three potential approaches one might take when implementing metadata:
- Wrap the object with a metadata container
- Inject metadata into the object
- Store the metadata separately from the object
Each of these solutions is pretty hairy, and there are significant trade-offs involved. So, I wrote this blog post to explore these approaches and hopefully help me decide which to use.
Wrap the object with a metadata container
My first inclination was to wrap objects in a metadata container. The container would be a structure with two slots: one slot for the object being wrapped, and another slot for a hash map containing the metadata.
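To make the two-slot container concrete, here is a minimal sketch. It is written in Python rather than Cloje's host Scheme, purely to show the shape of the idea, and the names Meta, with_meta, meta, and strip_meta are hypothetical rather than an actual Cloje API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Meta:
    """Hypothetical two-slot container: the wrapped object plus its metadata."""
    obj: Any
    meta: dict = field(default_factory=dict)


def strip_meta(value):
    """Return the wrapped object, or value itself if it is not wrapped."""
    return value.obj if isinstance(value, Meta) else value


def with_meta(value, metadata):
    """Return a new container holding value and a copy of metadata."""
    # Unwrap first so containers never nest.
    return Meta(strip_meta(value), dict(metadata))


def meta(value):
    """Return the metadata of value, or None if it is not wrapped."""
    return value.meta if isinstance(value, Meta) else None


v = with_meta([1, 2, 3], {"doc": "a vector"})
print(meta(v))        # the metadata map
print(strip_meta(v))  # the original list
```

Any value can be wrapped this way, but every consumer of v has to remember to call strip_meta before treating it as a list.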
The advantages of this approach are:
- The container structure is simple and portable to implement.
- Any type of object can have metadata. You don't have to design new types just to add a slot for metadata.
- The metadata "travels with" the object, so you don't have to worry much about the metadata lingering after the object has been garbage collected.
The disadvantage of this approach is that the object must be extracted from the container to use it. That has some nasty implications:
- Cloje's code, and perhaps some Cloje users' code, would be littered with calls to strip-meta (or whatever the extraction function is called).
- You could not simply "call" a function that has metadata, you would have to extract and then call it. So instead of (foo bar) you would have to write ((strip-meta foo) bar) or (invoke foo bar). Every time.
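The calling problem can be sketched as follows. This is an illustrative Python analogue, not Scheme, and Meta, strip_meta, and invoke are hypothetical names standing in for whatever Cloje would actually provide.

```python
class Meta:
    """Hypothetical container: a wrapped object plus a metadata map."""

    def __init__(self, obj, meta):
        self.obj = obj
        self.meta = meta


def strip_meta(value):
    """Return the wrapped object, or value itself if it is not wrapped."""
    return value.obj if isinstance(value, Meta) else value


def invoke(f, *args):
    """Unwrap f if needed, then call it with args."""
    return strip_meta(f)(*args)


def double(x):
    return x * 2


foo = Meta(double, {"doc": "doubles its argument"})

# Calling foo(21) directly would fail: the container is not callable.
print(strip_meta(foo)(21))  # extract, then call
print(invoke(foo, 21))      # or go through invoke
print(invoke(double, 21))   # invoke also accepts unwrapped functions
```

The last line shows why invoke is the friendlier of the two spellings: the caller does not need to know whether the function happens to be wrapped.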
Until recently, I thought that perhaps a clever reader macro could solve two problems at once: calling functions with metadata, and IFn (Clojure's ability to "call" certain non-function types as if they were functions). The reader macro would change how s-expressions are read, so that a function call like (foo bar) would be read as (invoke foo bar), and invoke would be a function that would (if necessary) extract foo's object from the metadata container, and "call" it.
Besides the fact that "change how s-expressions are read" is full of nasty pitfalls, this plan would break macros. All macros. And special forms. You could no longer write (if x y z) because it would be read as (invoke if x y z), and the implementation would complain that the variable if is unbound.
To work around that, Cloje would need to know, when it sees code like (foo bar), whether foo is a macro or not. Either the reader macro would need to handle macro calls specially (by not putting invoke in front of them), or invoke would need to be a macro so that (invoke foo bar) could expand to (foo bar) if foo is a macro.
Either way, that would entail some sort of macro registry, to record which symbols are macros/special forms, and which symbols are not. Scheme implementations must have their own internal macro registries, but the question is whether Cloje would be able to query that registry. As far as I know, there is no portable way to do that in standard Scheme, so either Cloje would have to use implementation-specific ways (thus making Cloje less portable), or Cloje would need its own macro registry, which would probably have to be implemented in implementation-specific ways (also making Cloje less portable).
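A toy version of the "reader consults a macro registry" idea might look like the sketch below. It is Python rather than Scheme, forms are modeled as nested lists with strings standing in for symbols, and MACROS and rewrite are hypothetical names. The registry here is a hand-written set, which is exactly the part that has no portable equivalent in standard Scheme.

```python
# Hypothetical registry of names that must not be routed through invoke.
MACROS = {"if", "quote"}


def rewrite(form):
    """Rewrite (f x ...) to (invoke f x ...) unless f is a registered macro.

    Forms are nested Python lists; symbols are modeled as strings.
    """
    if not isinstance(form, list) or not form:
        return form                      # atoms and () are left alone
    head, *rest = form
    if head == "quote":
        return form                      # never rewrite quoted data
    rewritten = [rewrite(x) for x in rest]
    if isinstance(head, str) and head in MACROS:
        return [head, *rewritten]        # macro or special form: no invoke
    return ["invoke", rewrite(head), *rewritten]


print(rewrite(["foo", "bar"]))
print(rewrite(["if", "x", ["f", "y"], "z"]))
```

Even this toy only works because if evaluates all of its subforms as ordinary expressions. A form with binding syntax, such as let, would need per-macro knowledge of which subforms are code, which is one more pitfall of this approach.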
And, that still leaves the problem that Cloje's code, and some Cloje users' code, would be littered with calls to strip-meta, anywhere you allow a value that could possibly have metadata.
Inject metadata into the object
Another approach would be to inject the metadata right into the object. This would require that the object has a slot to support such metadata.
I'm guessing that this is the approach that Clojure uses, based on the fact that only certain types of objects support metadata: symbols, collections, and functions. The error message if you try to add metadata to an unsupported type is "Metadata can only be applied to IMetas", so apparently there is an IMeta interface in Clojure's implementation, although it doesn't seem to be publicly documented.
Frankly, building metadata support into types is the best choice, if you're implementing a whole new language like Clojure. Cloje's situation is different. I'm trying to build an API on top of existing languages, and those existing languages already have many of the types I need, and I'd like to retain compatibility with those native types if possible.
I happen to know that CHICKEN has some support for something like metadata, on certain types of objects. You can use extend-procedure to attach an extra data slot to a function, although it only supports one slot, so Cloje metadata would clobber any other uses of that slot. Also on CHICKEN, symbols can have property lists, so you could have a "cloje-meta" property on symbols. But because symbols are interned in Scheme but not interned in Clojure, this would result in a semantic discrepancy. And it wouldn't be any help with other kinds of objects. Plus, these are not necessarily portable to other host languages.
I could take the Clojure route, and create new data types with a slot for metadata built in. I'll be creating new data types anyway, to add hash-sets, and possibly persistent immutable data structures in the future.
But in the cases of symbols, vectors, hash maps, and functions, which already exist in the host language, there doesn't seem to be any advantage to creating multiple new types, rather than the single general-purpose metadata container structure described in the first approach. You still have to extract the base value before using it, so the code is still littered with extraction function calls. Except now you have N different extraction functions instead of 1!
Store the metadata separately from the object
The final approach is to store the metadata separately from the object, for example in a central registry.
The advantages of this are:
- Any type of object can have metadata. You don't have to build in a metadata slot.
- You don't have to perform any extraction or conversion when using the object normally.
- If you never use metadata, you pay no cost.
But the disadvantages are:
- You can't have separate metadata on two symbols with the same name. (I call this a "disadvantage" because it differs from Clojure.)
- You have to be very careful in how you implement the central registry, or you'll get memory leaks or other badness.
The second issue is the more significant one.
For example, you might think of using a hash table as the registry, with the object being the key and the metadata being the value. But if it's a strongly-referenced hash table (like Clojure's maps), the object would never be garbage collected — even if your code isn't using the object anymore, it's still being referenced by the registry.
There is such a thing as a weakly-referenced hash table, which doesn't prevent the object from being garbage collected. Racket supports hash tables with weak keys. CHICKEN doesn't, but it does support weak "locatives", so I think it would be feasible to roll your own weak data structure.
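The difference between a strong and a weak registry can be seen in a small experiment. This sketch uses Python's standard weakref.WeakKeyDictionary as a stand-in for a weak-keyed hash table; it is only an analogue of what a Scheme host would offer, and Thing is a hypothetical placeholder type.

```python
import gc
import weakref


class Thing:
    """Stand-in for an object that can carry metadata."""


strong_registry = {}                         # ordinary dict: keys held strongly
weak_registry = weakref.WeakKeyDictionary()  # keys held weakly

a = Thing()
b = Thing()
strong_registry[a] = {"doc": "kept alive by the registry"}
weak_registry[b] = {"doc": "does not keep b alive"}

del a, b      # the program no longer uses either object
gc.collect()

print(len(strong_registry))  # the entry for a is still there
print(len(weak_registry))    # the entry for b has been dropped
```

Host limitations show up here too: in Python, plain list and dict instances cannot be weakly referenced, so a registry like this would not cover every built-in type.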
But there's another possible memory leak: if the original object gets garbage collected, but the metadata lingers in the registry. So you would need some way to clear out old metadata. Ideally you would want a hook into the garbage collector, so you can perform some cleanup code when necessary. But if you were really desperate, you could loop through your weak hash table (or whatever structure) and remove any metadata for weak references that are broken.
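The "loop through and remove broken references" fallback could be sketched like this. It is Python for illustration only; registry, set_meta, get_meta, and sweep are hypothetical names, and keying by id(obj) is an assumption made to keep the example short.

```python
import gc
import weakref


class Thing:
    """Stand-in for an object that can carry metadata."""


# Hypothetical registry: id(obj) -> (weak reference to obj, metadata).
registry = {}


def set_meta(obj, metadata):
    registry[id(obj)] = (weakref.ref(obj), metadata)


def get_meta(obj):
    entry = registry.get(id(obj))
    # Guard against a stale entry whose id was reused by a newer object.
    if entry is not None and entry[0]() is obj:
        return entry[1]
    return None


def sweep():
    """Remove entries whose weak reference is broken; return how many."""
    dead = [key for key, (ref, _) in registry.items() if ref() is None]
    for key in dead:
        del registry[key]
    return len(dead)


a, b = Thing(), Thing()
set_meta(a, {"doc": "a"})
set_meta(b, {"doc": "b"})
del b
gc.collect()
removed = sweep()
print(removed, len(registry))
```

Python's weakref.ref also accepts an optional callback that runs when the referent is finalized, which is the closest thing here to a garbage collector hook. Whether a given Scheme host offers something comparable would need to be checked per implementation.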
I am not a data structures expert, so there may be some structure that would be a better solution to this problem than a weak hash table. If you know about such things, please add a comment and let me know.
Final Thoughts
So, there are the three potential approaches I thought of to implement metadata in Cloje. All of them have pretty significant trade-offs, so it's not clear which approach to use.
Of course, there's no rule that says I have to choose just one approach and use it for everything. For example, I could use a metadata registry for defn'd functions, but use metadata containers for everything else. That way, you don't have to use invoke for every function call. Plus, defn'd functions are very unlikely to be garbage collected, so it's less vital that the registry promptly and efficiently dispose of lingering metadata.
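A rough sketch of that hybrid, again in Python for illustration and with hypothetical names throughout: functions get their metadata from a weak-keyed registry and stay directly callable, while everything else is wrapped in a container.

```python
import weakref


class Meta:
    """Hypothetical container used for non-function values."""

    def __init__(self, obj, meta):
        self.obj = obj
        self.meta = meta


# Registry used for functions only; keys are held weakly.
_function_meta = weakref.WeakKeyDictionary()


def strip_meta(value):
    return value.obj if isinstance(value, Meta) else value


def with_meta(value, metadata):
    if callable(value):
        _function_meta[value] = dict(metadata)
        return value                 # still directly callable, no invoke
    return Meta(strip_meta(value), dict(metadata))


def meta(value):
    if isinstance(value, Meta):
        return value.meta
    if callable(value):
        return _function_meta.get(value)
    return None


def inc(x):
    return x + 1


f = with_meta(inc, {"doc": "adds one"})
v = with_meta([1, 2], {"doc": "a pair"})
print(f(1), meta(f))
print(strip_meta(v), meta(v))
```

One caveat with this sketch: attaching metadata to a function updates the registry entry for that same function instead of producing a new value, which differs from Clojure's with-meta returning a new object.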
I do need to do some more analysis to see how often / in what situations Cloje users would need to use strip-meta to access raw values. Ideally, Cloje would do that all behind the scenes.
Of course, if all else fails, there's always a fourth approach to supporting metadata in Cloje: don't!