Cloje types vs host types, revisited

2015-07-14

While preparing the Cloje 0.2 release last week, I came across a passage I had written as part of the goals for Cloje:

Importing all of Cloje will allow you to write code in a language very similar to Clojure. Or if you prefer, you will be able to pick and choose which pieces of Cloje to import, so that you are still mostly writing code in Scheme/Lisp, but with some added features from Clojure.

I had written that over a month ago, long before I had explored host interop issues. Reading it again during the release preparations made me wonder if I had neglected this use case when considering the host interop API design.

My consideration of host interop was from the angle of someone who was writing in Cloje, and wanted to interact with certain host language features or libraries. I hadn't really considered someone who was writing in the host language, and wanted to integrate certain Cloje features or libraries. It made me wonder whether it was accurate to say that users would be able to "mostly write code in Scheme/Lisp", considering that much of Cloje will probably depend on a bunch of new types.

This got me thinking on a more fundamental level about Cloje's stance regarding host types. Should it be possible (and acceptable/idiomatic) for Cloje users to decide to write Cloje programs that primarily (or exclusively) use host types? Or are Cloje users expected to primarily use Cloje types, and only use host types for host interop?

Note: When I talk about "Cloje types" and "host types" in this post, I am mostly thinking of lists, vectors, hash maps/tables, and strings. Those are the types where Cloje and the host language would have "colliding" types (analogous types with different implementations), and thus the source of possible trouble. Certain other types (such as symbols, numbers, and functions) are the same type in Cloje as on the host, so no conversion (implicit or explicit) would be needed.

To help me organize my thoughts and guide my decision, I have sketched out a few scenarios for different stances Cloje might take regarding host types, and explored some of the implications of each scenario.

Scenario 1: Strictly Cloje types

In this scenario, the only Cloje functions that would work with non-Cloje types are the functions used to convert host types to Cloje types. Other functions would throw a type error if given a host type.

As a designer/implementer, this is the most appealing scenario to me. The semantics are crisp and clean. There are no implicit conversions, and thus no ambiguity about what type will be returned. When implementing algorithms, I don't have to worry about mutability, or differences in behavior (e.g. of host hash tables, which vary between Scheme implementations). Plus, this scenario would have the simplest code and the fewest necessary test cases (with the possible exception of scenario 4), which means less work to implement and maintain. Finally, this scenario would leave the most room for backwards compatible change in the future, because the API could later be made less strict without breaking existing code.

But considering the use case of someone writing "mostly in Scheme/Lisp", this feels very rigid. It establishes a strict separation between host code and Cloje code. It would be very tedious to interweave host functions and Cloje functions within the same section of code, because everything would be wrapped with explicit conversion functions.

Also, I think this scenario would be rather frustrating when using string literals and quoted lists. I cannot implement reader syntax to create Cloje string literals or quoted Cloje lists, because the same syntaxes are used by the host language. Therefore, under this scenario many users would probably create host strings/lists by accident, especially at the REPL, and become frustrated when Cloje throws a type error.

Examples:

(vector? #(1 2)) would return false.
cloje-vector? would not exist (it would be called vector?). host-vector? might exist, for convenience when writing host interop code.
(mapv my-fn #(1 2)) would throw a type error.
(into '() my-coll) would throw a type error (because '() is a host list).
(string/reverse "foo") would throw a type error (because "foo" is a host string).
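To make the strict behavior concrete, here is a minimal sketch, assuming an R7RS Scheme. The names cloje-vec, host->cloje-vec, and strict-mapv are hypothetical stand-ins invented for illustration, not Cloje's actual API, and the record simply wraps a host list rather than a real persistent vector.

```scheme
(import (scheme base))

;; Hypothetical stand-in for a Cloje vector: a record wrapping a host list.
;; A real implementation would use a persistent data structure instead.
(define-record-type cloje-vec
  (make-cloje-vec items)
  cloje-vec?
  (items cloje-vec-items))

;; Explicit conversion: the only function here that accepts a host vector.
(define (host->cloje-vec v)
  (make-cloje-vec (vector->list v)))

;; Strict mapv: anything other than a Cloje vector is a type error.
(define (strict-mapv f coll)
  (if (cloje-vec? coll)
      (make-cloje-vec (map f (cloje-vec-items coll)))
      (error "strict-mapv: expected a Cloje vector" coll)))

(define (add1 n) (+ n 1))

;; Accepted, because the host vector was converted explicitly first.
(define ok (strict-mapv add1 (host->cloje-vec (vector 1 2))))

;; Rejected: a bare host vector raises an error. The guard only records
;; the outcome so that the example can keep running.
(define rejected?
  (guard (e (#t #t))
    (strict-mapv add1 (vector 1 2))
    #f))
```

Every call site that touches host data has to go through host->cloje-vec, which is exactly the tedium described above.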

Scenario 2: Implicit conversion

In this scenario, Cloje functions would accept certain host types, but implicitly convert them to the corresponding Cloje type before doing any work on them. Cloje functions would return Cloje types even if originally given host types.

This is essentially the robustness principle: be conservative in what you send, be liberal in what you accept. That sounds nice in theory (who doesn't like robustness?) but it can backfire. For example, it can hide mistakes, where you are doing something wrong, but you never realize it because the system silently accepts the input anyway. That can lead to maintenance headaches later, for everyone involved. (One example of this: early web browsers accepted invalid HTML, which led to a lot of invalid HTML being published, which made it necessary for all browsers to continue supporting invalid HTML indefinitely.)

So, keeping in mind that "robustness" is not necessarily always a good thing, is it a good thing in this specific case? Is the user convenience of (sometimes) not having to do explicit type conversion worth the muddier semantics (sometimes the type you get back won't be the type you originally gave) and increased implementation complexity?

Normally I would be inclined to say no, it's not worth it. Users should prefer Cloje types, and only use host types for host interop, and type conversion should be explicit. But the situation with string literals and quoted lists really stinks. As a matter of practicality, I need to accommodate host strings and host lists.

If I'm going to implicitly convert host strings and host lists, I suppose for consistency I should also implicitly convert host vectors. Host hash tables are less clear, because of the potential for key collision due to different equality tests.
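The key collision concern can be illustrated without any hash table library. This is only a sketch, assuming an R7RS Scheme: an association list stands in for a host table, with assq playing the role of an identity-based (eq?) lookup and assoc the role of a contents-based (equal?) lookup. The variable names are made up for the example.

```scheme
(import (scheme base))

;; Two distinct string objects with the same contents.
(define k1 (string #\a))
(define k2 (string #\a))

;; Stand-in for a host table that distinguishes keys by identity:
;; both keys can coexist because they are different objects.
(define host-table (list (cons k1 'first) (cons k2 'second)))

;; Identity-based lookup (eq?) finds the entry stored under k2.
(define by-identity (cdr (assq k2 host-table)))

;; Contents-based lookup (equal?) stops at k1, which has the same contents.
(define by-contents (cdr (assoc k2 host-table)))
```

Here by-identity is second while by-contents is first. A conversion into a map that compares keys by contents would have to merge or drop one of the two entries, and the caller would not be told which.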

Examples:

(vector? #(1 2)) would return true, as if the host vector had been implicitly converted.
cloje-vector? and host-vector? would allow users to distinguish between Cloje vectors and host vectors.
(mapv my-fn #(1 2)) would return a Cloje vector.
(into '() my-coll) would return a Cloje list.
(string/reverse "foo") would return a Cloje string.
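A rough sketch of the implicit conversion approach, again assuming an R7RS Scheme and using hypothetical names (cloje-vec, coerce-to-cloje-vec, lenient-mapv) that are not taken from Cloje itself:

```scheme
(import (scheme base))

;; Hypothetical stand-in for a Cloje vector (a record wrapping a host list).
(define-record-type cloje-vec
  (make-cloje-vec items)
  cloje-vec?
  (items cloje-vec-items))

;; Normalize the argument: Cloje vectors pass through unchanged,
;; host vectors are converted, anything else is an error.
(define (coerce-to-cloje-vec x)
  (cond ((cloje-vec? x) x)
        ((vector? x) (make-cloje-vec (vector->list x)))
        (else (error "cannot convert to a Cloje vector" x))))

;; Accepts either kind of vector, but always returns a Cloje vector.
(define (lenient-mapv f coll)
  (make-cloje-vec (map f (cloje-vec-items (coerce-to-cloje-vec coll)))))

;; A host vector goes in, a Cloje vector comes out.
(define result (lenient-mapv (lambda (n) (* n 10)) (vector 1 2)))
```

The caller never sees an error for the host vector, which is convenient, but also means a forgotten explicit conversion goes unnoticed.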

Scenario 3: First-class host types

In this scenario, host types are treated as first-class, equally valid and important as Cloje types. Many Cloje functions would operate on host types and return the same host type as given.

Instead of rejecting host types like in scenario 1, or merely accommodating them like in scenario 2, this scenario embraces and fully supports them. This would be extremely convenient for host interop, and for integrating Cloje into codebases written in the host language.

But, as a designer and implementer, this scenario makes me nervous. Host types like lists, vectors, hash tables, and strings are (usually) mutable. I would either have to do a lot of defensive copying, or foist a lot of risk onto users.

If I do defensive copying, many functions would be much less efficient when given host types, and some might have different semantics (returning a copy instead of an identical object). If I don't do defensive copying, the mutability of the objects would nullify Cloje's safety guarantees. It would be up to the user to ensure that no mutation occurs, neither in their own code nor in any library functions they call. Otherwise, the whole thing could fall apart. Admittedly, that is business as usual for most languages, but I don't necessarily want to perpetuate the problem.
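The trade-off can be seen in a small sketch, assuming an R7RS Scheme. The two "view" constructors are hypothetical helpers written for this example. One shares the caller's vector and the other takes a defensive copy.

```scheme
(import (scheme base))

;; Shares the caller's vector: cheap, but later mutation shows through.
(define (make-shared-view v)
  (lambda (i) (vector-ref v i)))

;; Copies the vector first: immune to later mutation, at O(n) cost.
(define (make-copied-view v)
  (let ((snapshot (vector-copy v)))
    (lambda (i) (vector-ref snapshot i))))

;; Built with `vector` rather than a literal, so that it is mutable.
(define host-vec (vector 1 2 3))

(define shared (make-shared-view host-vec))
(define copied (make-copied-view host-vec))

;; Someone mutates the host vector after the views were created.
(vector-set! host-vec 0 99)

(define seen-by-shared (shared 0)) ; 99: the mutation leaked through
(define seen-by-copied (copied 0)) ; 1: the snapshot is unaffected
```

The copied view stays correct but pays for a full copy up front, and it no longer refers to the identical object the caller passed in.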

I might be okay with allowing unsafe host types, if I could be sure the user really knew what they were doing. But unfortunately, Cloje wouldn't be able to tell whether the user was using host types intentionally, or by accident because they forgot to perform explicit conversion somewhere in their code. I suppose, in theory, I could program an option into Cloje to enable support for host types, disabled by default. But I'm not particularly inclined to make Cloje more complicated, merely to give users the option to shoot themselves in the foot.

The implementation of Cloje would be much more complex in this scenario, with some functions having twice as many branches (and test cases), because host types would often have to be handled differently from Cloje types.

Examples:

(vector? #(1 2)) would return true, because a host vector is a valid kind of vector.
cloje-vector? and host-vector? would allow users to distinguish between Cloje vectors and host vectors.
(mapv my-fn #(1 2)) would return a host vector.
(into '() my-coll) would return a host list.
(string/reverse "foo") would return a host string.
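Here is a sketch of what "return the same host type as given" might look like, assuming an R7RS Scheme. The names cloje-vec and flexible-mapv are hypothetical. It also hints at where the extra branches and test cases would come from.

```scheme
(import (scheme base))

;; Hypothetical stand-in for a Cloje vector (a record wrapping a host list).
(define-record-type cloje-vec
  (make-cloje-vec items)
  cloje-vec?
  (items cloje-vec-items))

;; One branch per supported representation; each branch needs its own tests.
(define (flexible-mapv f coll)
  (cond ((cloje-vec? coll)                ; Cloje in, Cloje out
         (make-cloje-vec (map f (cloje-vec-items coll))))
        ((vector? coll)                   ; host in, host out
         (vector-map f coll))
        (else (error "flexible-mapv: unsupported collection" coll))))

(define (add1 n) (+ n 1))

(define from-host (flexible-mapv add1 (vector 1 2)))
(define from-cloje (flexible-mapv add1 (make-cloje-vec (list 1 2))))
```

With only one function and two representations this looks harmless, but the same split would repeat across most of the collection functions.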

Scenario 4: Strictly host types

In this scenario, Cloje would use host types almost exclusively. New types (like hash sets) may be added to fill gaps, but Cloje would use host types whenever available. There would be no persistent immutable data structures.

This is, in fact, how Cloje works right now (as of 0.2). But that is only a temporary state of affairs to allow the project to gain some momentum before tackling the hard stuff. This scenario would mean making it a permanent matter of policy.

This scenario has all the inefficiency and risk of mutable types described in scenario 3, and users would not even have the option of using immutable types (except on Racket, which offers immutable variants of many types). On the plus side, host interop would be super easy, and I wouldn't have to learn how to implement a hash array mapped trie!

Joking aside, this is not an acceptable scenario. Immutable data structures are a Good Thing™ and I want to encourage their use. They are also one of the best and most fundamental aspects of Clojure. A clone of Clojure that doesn't even offer immutable data structures would be missing the point.

Final thoughts

After considering these scenarios, I'm leaning towards something in between scenarios 1 and 2.

I mentioned in scenario 2 that if I'm going to implicitly convert strings and lists, then for consistency's sake I "should" also implicitly convert vectors, and maybe hash tables. But really, I could implicitly convert lists and strings, yet require explicit conversion for vectors or hash tables.

Much like scenario 1, users would be expected to use Cloje types in general, but implicit conversion for host lists and host strings would be provided as a concession to practicality, because otherwise it would be painful to use string literals and quoted lists. The implicit conversion of those two types would be a special case, not a general stance.
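As a rough illustration of that special case, here is a sketch assuming an R7RS Scheme. The record types and the coerce-arg helper are hypothetical names chosen for the example, not a committed design. Host lists and host strings are converted implicitly, while everything else must already be a Cloje type.

```scheme
(import (scheme base))

;; Hypothetical stand-ins for Cloje's own list and string types.
(define-record-type cloje-list
  (make-cloje-list items)
  cloje-list?
  (items cloje-list-items))

(define-record-type cloje-string
  (make-cloje-string chars)
  cloje-string?
  (chars cloje-string-chars))

;; Implicit conversion is limited to host lists and host strings.
;; Both are copied, so later mutation of the host object has no effect.
(define (coerce-arg x)
  (cond ((or (cloje-list? x) (cloje-string? x)) x)
        ((list? x) (make-cloje-list (list-copy x)))
        ((string? x) (make-cloje-string (string-copy x)))
        (else (error "explicit conversion required" x))))

(define from-quoted-list (coerce-arg '(1 2)))
(define from-literal (coerce-arg "foo"))

;; Host vectors get no special treatment and are rejected.
(define vector-rejected?
  (guard (e (#t #t))
    (coerce-arg (vector 1 2))
    #f))
```

Quoted lists and string literals typed at the REPL would then just work, while a stray host vector or hash table would still produce an immediate error.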

As a designer, it feels a bit icky and inconsistent to treat those two types differently from the others, but it might be the best compromise between clarity and practicality. It addresses the most serious usability issues of scenario 1, while still encouraging the use of immutable Cloje types, and avoiding unsafe implicit conversions of hash tables.

What about the use case I mentioned earlier, of "mostly writing code in Scheme/Lisp, but with some added features from Clojure"? This is still possible, although perhaps not as convenient as you might wish. Rather than casually weaving host code and Cloje code together, you would probably want to maintain a well-defined separation, where some parts of the codebase are Cloje-oriented and other parts are host-oriented. Supporting tighter integration between the host and Cloje would, unfortunately, have negative effects that outweigh the benefits.

John Croisant
