Data Structures and Serialization

2012-02-24 • 852 words

Lately, I've been tackling several separate, but interconnected, systems in Ambienome: data structures, serialization, object selection, and the physics engine. These are complex topics, so I've written this in two parts. In this first part, I talk about data structures and serialization, and in the second part I talk about the physics engine and object selection.

Data structure simply means the way some data is structured or organized, either in memory while the program is running, or when the data is saved to a file or sent over a network. Is it stored in a hash table? An array? An instance of a class? What slots/members does the class have? How are different pieces of data related to each other? And so on.

Serialization is a topic very closely related to data structure. Serialization involves converting from one data structure to a simpler data structure, usually some human-redable text or a raw binary sequence with some specific order or pattern. The reverse process, deserialization, involves converting the simpler data structure back into the original structure (or a similar one). Serialization is most often used when saving data to a file or sending it over a network.

Personally, I consider binary serialization to be a measure of last resort, to be used only when performance is absolutely critical. Binary formats are notoriously fragile and difficult to extend, especially if they are poorly designed.

(I recently worked on a project where the previous programmer wrote code to "serialize" a C struct by copying a chunk of raw bytes from memory and sending it over the network. To "deserialize" it, they would — I kid you not — copy the bytes into memory and type-cast it to the struct type! They had given absolutely no thought to portability, extensibility, validation, or robustness. And the person who wrote it was ostensibly a professional programmer, and had been paid for the work.)

So, I use human-readable text formats whenever possible. Probably the two most popular formats for this kind of serialization are XML and JSON. But Lisp already has a simple, human-readable, portable, built-in serialization format: S-expressions, the same format used for Lisp code. I'm still experimenting with the structure, but here's an example of a simple scene as it is serialized now:

(:instance :scene
 :props nil
 :creatures
 ((:instance :creature
   :name "Creature 1"
   :body
   (:instance :part
    :props
    ((:shape :label "Shape" :value :rectangle)
     (:angle :label "Angle" :type :angle :value 0)
     (:scale :label "Scale" :type :real :value 120)
     (:size :label "Size" :type :vec :value #C(1.0d0 1.0d0))
     (:pos :label "Position" :type :vec :value #C(75.0d0 150.0d0))
     (:depth :label "Depth" :type :real :value 0)
     (:color :label "Color" :type :color :value #(0.7 0.1 0.6))
     (:name :label "Name" :type :string :value nil))
    :children nil))))

It's human-readable, very compact for how much information it expresses, and Lisp intrinsically knows how to read and print it. I might write a little bit of code to get rid of the colons (which indicate Lisp keyword symbols) to make it even easier for humans to read.

I should point out that even though the structure is serialized as lists, the data is actually stored in the program as class instances, arrays, and hash tables. They are converted to lists before being serialized, then inserted back into an array or hash table when deserialized. In addition to making the output easier for humans to read, adding this layer of abstraction means I can change the internal data structure while easily maintaining compatibility with older versions.

Besides decisions about whether to use lists, arrays, hash tables, or whatnot, I've also been experimenting with how the various kinds of objects relate to each other. As it is now, there are three main kinds of objects:

Scene objects, which represent a particular scenario of creatures, scene-wide behaviors, and properties such as background color and the scene's creator.
Creature objects, which represent a collection of parts with some behaviors and properties such as its name and creator. A creature directly holds a main part, called the body, of which all the other shapes are "descendants".
Parts, which represent a shape, like a circle or triangle, with properties such as its size, color, density. A part can have "children", meaning they are linked together in a hierarchy.

One noteworthy thing about these data structures is the way object properties are stored. Instead of storing color, position, size, etc. directly as slots/members of a class instance, they are all stored together in the object's "props" hash table. This makes it more flexible, as props can be added or removed at any time. At some point, scene creators will be able to add their own props that are used by custom behaviors. For that same purpose, props are stored with metadata about their type, label, and so on.

There's another kind of object that I may be adding soon: joints, which are used with the physics engine to connect parts that move independently, but are joined together with a spring, pivot, or whatnot. But, I might just make joints another type of part. I'll talk about this and other physics engine stuff in my next post.

John Croisant

Data Structures and Serialization