Surprises in clojure.set

2015-06-16 • 2385 words

[Update August 2, 2015: Parts of this post have been revised to "tone down the snark". This post was a byproduct of frustration with the design of the clojure.set API, but in some places I had expressed that frustration in ways that were merely cathartic and sensational, not constructive or informative.]

Lately I have been implementing the clojure.set API for Cloje, and as usual, Clojure's APIs are full of surprises, particularly if you pass unexpected types.

Indeed, every clojure.set function I have looked at so far has its own peculiar way of handling this obvious edge case. In general, the lack of type checking results in unexpected or unpredictable behavior that may confuse the user and make bugs harder to diagnose.

Let's start with some fairly mundane examples, then work our way up to the truly puzzling. I'll then explain why this is a problem, and describe how I'm addressing it in Cloje.

difference

What happens if you try to get the difference between a set and a non-set collection? Let's try it:

user=> (clojure.set/difference #{1 2 3} '(1 3))
#{2}

Hey, that's pretty good! That is one of the two possible results I would expect, the other being a type error. What happens if we change the first argument from a set to a list?

user=> (clojure.set/difference '(1 2 3) '(1 3))
  ClassCastException clojure.lang.PersistentList cannot be cast to clojure.lang.IPersistentSet
  clojure.core/disj (core.clj:1449)

There you go, a type error. Okay, that's not too bad. What happens if we pass in just one list?

user=> (clojure.set/difference '(1 2 3))
(1 2 3)

That's strange. It returned a list, even though the function is supposed to return a set. Out of morbid curiosity, what happens if we give it something that isn't even a collection?

user=> (clojure.set/difference 'wat)
wat

Oh dear. Apparently if you give it a single argument, it simply returns that argument, even if it is a completely invalid type. This behavior could easily mask some bugs in user code, making them harder to diagnose.

intersection

How about the intersection function? How does it handle arguments that aren't sets?

user=> (clojure.set/intersection #{1 2 3} '(1 3))
(1 3)

So, whereas difference returned a set that had the "correct" result, intersection here has returned a list with the "correct" result. But wait, let's try something slightly different, adding another element to the second argument:

user=> (clojure.set/intersection #{1 2 3} '(1 3 5))
IllegalArgumentException contains? not supported on type: clojure.lang.PersistentList
  clojure.lang.RT.contains (RT.java:724)

Woops. Even though the argument types are the same in both examples, merely changing the contents of the second argument has caused a totally different behavior. As we will later see with union, this difference in behavior is triggered by the difference in relative size of the two collections. My sympathies to any programmer who discovers this oddity in production.

For the sake of completeness, let's see what happens if the first argument is not a set:

user=> (clojure.set/intersection '(1 2 3) '(1 3))
IllegalArgumentException contains? not supported on type: clojure.lang.PersistentList
  clojure.lang.RT.contains (RT.java:724)

Okay, the error message is a little confusing (I didn't call contains?, I called intersection), but at least it tells us that the type is wrong somewhere.

What if we call it with a single argument that is not a set?

user=> (clojure.set/intersection '(1 2 3))
(1 2 3)
user=> (clojure.set/intersection 'sigh)
sigh

Just like difference, it merely returns the argument, even if it is a completely invalid type.

There are some more surprises if you pass a vector, thanks to the fact that intersection calls contains? under the hood. But I don't want to ruin the next surprise.

subset? and superset?

=> (clojure.set/subset? #{0 1} [3 4])
true

Huh? How can the set #{0 1} be considered a subset of the vector [3 4], when they have no elements in common? If you change the vector to a list, you'll get another surprise, which will also give you a hint about what's going on:

=> (clojure.set/subset? #{0 1} '(3 4))
IllegalArgumentException contains? not supported on type: clojure.lang.PersistentList
 clojure.lang.RT.contains (RT.java:724)

Aha! Apparently subset? calls the function contains? on the second collection to test whether each element from the first collection exists in the second collection. If you are familiar with the contains? function, you may know that it has a very peculiar behavior when used on a vector. Despite its name, the contains? function does not actually test whether a vector contains the given item, it tests whether the given item is a valid (i.e. in-bounds) index of the vector.

So, because 0 and 1 are valid indexes of [3 4], contains? returns true in both cases, so subset? returns true. Thus, the clojure.set API claims that #{0 1} is a subset of [3 4], despite what your math teacher might tell you.

But it's not all bad. If you swap the types so that the second argument is a set, the function is perfectly willing to accommodate you, and returns sensible results:

user=> (clojure.set/subset? [0 1] #{3 4})
false
user=> (clojure.set/subset? [3 4] #{3 4})
true

The superset? function has similar surprises, but the order of arguments is flipped:

user=> (clojure.set/superset? [3 4] [0 1])
true
user=> (clojure.set/superset? [3 4] [3 4])
false
user=> (clojure.set/superset? #{3 4} [3 4])
true

select

user=> (clojure.set/select even? '(2 4))
(2 4)

Hey, that's not so bad! The select function returned all the elements that passed the predicate, just like it's supposed to. It's kind of weird that it returned a list instead of a set, but whatever.

But wait, that list only contained elements that passed the predicate anyway. What if we pass it a collection with some elements that fail the predicate?

user => (clojure.set/select even? '(2 4 5))
ClassCastException clojure.lang.PersistentList cannot be cast to clojure.lang.IPersistentSet
  clojure.core/disj (core.clj:1449)

Oops. Apparently if all the elements of the given collection pass the predicate, the select function simply returns the argument unchanged, even if it is not a set. It only throws a type error if an element fails the predicate.

In other words, the return type of the select function, and indeed whether it returns at all or throws an error, depends not just on the argument types, and not just on the contents of the arguments, but on the interplay of those contents with the predicate function.

It's not hard to imagine a situation where your code works most of the time, but a call to select, nested several layers deep in a data processing routine, mysteriously crashes when a user enters a certain value that causes the predicate to return false. Good luck debugging that.

union

The union function is a prime specimen, both for the unpredictability of its results, and the clearly documented history of how it came to be this way.

user=> (clojure.set/union '#{no} '[no way] '{type checks} '(here))
[no way no [type checks] here]

There's simply no way that a a vector containing nondistinct elements could be considered a sensible return value for a set union function. But wait, it gets better! Consider this slightly modified code, where the first argument is now a vector, and the second argument is now a set:

user=> (clojure.set/union '[no] '#{no way} '{type checks} '(here))
#{no [type checks] way here}

It works! The function has returned a set containing the distinct elements from all the given collections, as you might expect. But why does this code do what we expect, when the other very similar code does not? If anything, a reasonable user would expect the previous example to return a set, because its first argument was a set, and this example to return a vector, since its first argument is a vector. What is going on?

It turns out that this is the result of a well-intentioned optimization, combined with Clojure's naïve attitude about types. From what I understand by reading the issue description, the union function picks out the collection with the most elements as the starting point, then merges the remaining collections into it. The idea is that this will usually result in fewer intermediary collections, because the larger collection is more likely to already contain the elements of the smaller collections. (The intersection function, as we saw above, does a similar thing, but it uses the smallest collection as the starting point.)

This makes intuitive sense, and works great when the largest collection is a set (as in our second example), but doesn't work so well when the largest collection is not a set (as in our first example). And in cases where you don't know the sizes of the collections ahead of time, its behavior is completely unpredictable — it may work one day, then fail the next.

The patch author provided multiple warnings that this approach would result in unexpected behavior if the function is used with non-sets, and suggested adding type checks or conversion. Rich Hickey said he was "not in favor" of type checks or other accommodations for handling non-sets. It's unclear whether he understood what the implications of that decision would be.

What's wrong with this?

The clojure.set API is intended to be used only with sets, and the examples above involve calling them with arguments that are not sets. What's the big deal if they behave strangely when given arguments of the wrong type?

When you're designing an API, you have to consider edge cases. And in a dynamically typed language like Clojure, one of the most obvious edge cases is, "What if this function is called with an argument of the wrong type?" In general, there are two sensible approaches an API designer might take to handle such edge cases:

Be strict: throw an error to inform the user that the wrong type of argument was passed.
Be accommodating: automatically convert to a compatible type if possible, then perform the algorithm.

There are valid arguments to be made about which of these approaches is preferable in a given situation. But Clojure took neither approach. Instead, Clojure's approach can be described as "Be naïve: attempt to perform the algorithm without doing any type checks or conversion".

This approach results in a user-hostile system. If you call a function with the wrong type of argument — even accidentally or indirectly (e.g. you use a library that uses clojure.set under the hood) — this approach will cause you to suffer more than is necessary. Instead of helping you figure out what went wrong, the API will sometimes silently return nonsense results, obscuring the problem and making it harder to debug. Rather than the API being a predictable tool that you can rely on, it becomes a risk that you must guard against.

We can do better

The clojure.set API is not the only part of Clojure that has this problem. As I described previously, the clojure.string API also exhibits undocumented and inconsistent behavior when given unexpected types (although the impact there is, thankfully, less severe). Because Clojure's APIs have been this way so long, it would be perilous to try to fix them, because that might break existing software that relies (intentionally or unintentionally) on the current behavior.

That was one of the reasons I decided not to faithfully emulate Clojure anymore. My experience with clojure.set has confirmed that I made the right decision. Instead of being doomed to repeat Clojure's mistakes, I now have the opportunity to learn from them.

My implementation of the set API marks the first situation where I have deliberately introduced discrepancies between Clojure and Cloje. They behave the same when given sets (thus offering an easy workaround for compatibility), but I made a conscious decision to fix the way the API handles non-sets.

I mentioned earlier that there are two general approaches: being strict (throwing an error), or being accommodating (automatically converting). I am generally inclined to make APIs strict, because that tends to reveal bugs in user code sooner and more clearly. The error indicates immediately where the invalid type was passed, rather than propagating the error "further down the line".

But in the case of the set API, I decided to make the API accommodating. For one thing, Clojure is generally an accommodating language, especially for collections. There are already many generic operations that accept any type that implements the collection interface or sequence interface. And there are common use cases where a user may want to perform a union, intersection, etc. using collections other than sets.

In Clojure, users must explicitly convert all collections to sets ahead of time, just to avoid the sort of unexpected behavior I have described above. But many of the algorithms do not inherently require all arguments to be sets. They can operate correctly and efficiently using one set as the accumulator, and merely require the other data structures to support linear iteration. So, we can avoid unnecessary intermediary structures if you let the implementation figure out what, if any, conversion needs to be done.

So, for all the set operations I have implemented thus far — union, difference, intersection, select, subset?, superset? — I have decided that they will accept arguments of any seqable collection (set, list, vector, map, string), sequence, or nil — much like many generic collection functions in Clojure. And they will return a set, even if given non-set arguments. But, they will throw a type error if given a non-seqable type, such as a symbol or a number.

I have not yet considered the relational algebra functions in the clojure.set namespace (index, join, project, rename) or the other functions (map-invert, rename-keys), which I assume are helpers for the relational algebra functions. So, I don't know what peculiar behavior they may have, or what the best approach is for handling unexpected types.

I am also still considering what to do with disj. It is not in the clojure.set namespace, but in Clojure it is only defined to work on sets, and it is very similar to difference. But its counterpart, conj, accepts any collection type, and returns a collection of that same type. So if I made disj accept other collections, should it always return a set, or should it return the same collection type that was given? How should it handle sequences? There is a lot to consider. For now I will probably make it throw an error if given anything other than a set, like Clojure does.

John Croisant