Abstract data types

One of the hard parts about computer programming is that, in general, programs are bigger than brains. Unless you have an unusually capacious brain, it is unlikely that you will be able to understand even a modestly large program in its entirety. So in order to be able to write and debug large programs, it is important to be able to break it up into pieces, where each piece can be treated as a tool whose use and description is simpler (and therefor fits in your brain better) than its actual code. Then you can forget about what is happening inside that piece, and just concentrate on how to use it from the outside.

This process of wrapping functionality up in a box and forgetting about its internals is called abstraction, and it is the single most important concept in computer science. In these notes we will describe a particular kind of abstraction, the construction of abstract data types or ADTs. Abstract data types are data types whose implementation is not visible to their user; from the outside, all the user knows about an ADT is what operations can be performed on it and what those operations are supposed to do.

ADTs have an outside and an inside. The outside is called the interface; it consists of the minimal set of type and function declarations needed to use the ADT. The inside is called the implementation; it consists of type and function definitions, and sometime auxiliary data or helper functions, that are not visible to users of the ADT. This separation between interface and implementation is called the abstraction barrier, and allows the implementation to change without affecting the rest of the program.

What joins the implementation to the interface is an abstraction function. This is a function (in the mathematical sense) that takes any state of the implementation and trims off any irrelevant details to leave behind an idealized pictures of what the data type is doing. For example, a linked list implementation translates to a sequence abstract data type by forgetting about the pointers used to hook up the elements and just keeping the sequence of elements themselves. To exclude bad states of the implementation (for example, a singly-linked list that loops back on itself instead of having a terminating null pointer), we may have a representation invariant, which is just some property of the implementation that is always true. Representation invariants are also useful for detecting when we’ve bungled our implementation, and a good debugging strategy for misbehaving abstract data type implementations is often to look for the first point at which they violated some property that we thought was an invariant.

Some programming language include very strong mechanisms for enforcing abstraction barriers. C relies somewhat more on politeness, and as a programmer you violate an abstraction barrier (by using details of an implementation that are supposed to be hidden) at your peril. In C, the interface will typically consist of function and type declarations contained in a header file, with implementation made up of the corresponding function definitions (and possibly a few extra static functions) in one or more .c files. The opaque struct technique can be used to hide implementation details of the type.

Boxed vs unboxed types

Object-oriented languages like Java make a distinction between boxed types, which are allocated on the heap and accessed through pointers, and unboxed types or primitive types, which are small enough to be passed around directly.

In C terms, a boxed type would typically be implemented as a struct allocated using malloc, often by a constructor function that takes responsibility for doing the allocation, initializing the underlying struct as appropriate, and returning a pointer to the new struct. This gives an instance of the type that lives until we decide to free it. A separate destructor function will need to be provided to free any space used by the struct; this is especially important if the data structure requires more than one call to malloc, although in simple cases it may be enough for the destructor to just be a simple wrapper around free. When passing an instance of a boxed type to or from a function, we copy a pointer to the heap-allocated structure. This structure is not itself copied, which is more efficient than copying if the structure is large, and which allows the function to modify the structure if needed without having to explicitly return an updated version. Typically the functions only expected to modify the underlying structure will be the mutators provided by the abstract data type interface, and other functions that wish to operate on the structure will use these mutators to apply their changes. Similarly, the interface will often provide accessor functions to allow other functions the ability to obtain information about an instance without breaking the abstraction barrier that hides the actual implementation.

An unboxed type doesn’t use the heap, and an instance of an unboxed type is typically not a pointer but an actual C type that implements an instance directly. The built-in types like int and double are examples of unboxed types, but we can also construct unboxed types on top of structs. This works best when the structs are small enough that copying them around is a cheap operation and where the intended interpretation of an instance is as a constant value. An example might be an RGBA color value packed into a 32-bit struct:

struct rgba {
    uint8_t red;
    uint8_t green;
    uint8_t blue;
    uint8_t alpha;
};

This has the same size as an int on a typical machine, so allocating in on the heap and passing around a pointer to it would add a lot of unnecessary space overhead. Depending on how much abstraction we want to enforce, we might still expect the user to access values of this type only through functions provided by a library. For example, we might expect that the user only creates new RGBA colors through functions we provide, as in:

struct rgba c;

c = RBGAfromValues(122, 11, 13, 255);  // good
c = RGBAfromName("orange");            // good

c.red = 12;                            // please don't do this

Unfortunately, we can’t apply the opaque struct trick to unboxed structs, so there is no mechanism in the C language to exclude the bad case above. This makes unboxed types preferable if we want a (somewhat) enforceable abstraction barrier.

Even without the issue of abstraction barriers, most of the data structures we will build will be large enough and complex enough that it makes sense to implement them as boxed types just to avoid copying costs. But there may be occasions where the overhead of a boxed type is too much, so using an unboxed type makes more sense.

A danger with unboxed types

It is technically possible to have an unboxed type that includes pointers to instances of boxed types. This is almost always a mistake, as it may lead to multiple instances of the unboxed type all pointing to a single instance of the boxed type. An example would be if we defined a sized byte string as follows:

    struct bytes {
        size_t size; // size of data
        char *data;  // malloc'd block
    };

    // bad approach
    struct bytes b;
    b = bytesFromString("hi there");  // returns struct bytes
    struct bytes b2;
    b2 = b;

    // add some more text to the end
    bytesAppendString(&b2, ", world!");

This is legal C, but now we start with at least two struct bytes instances floating around that point to the same data block. So we will have to be vary careful to decide which one to call a destructor on, as calling a destructor on one of the instances will invalidate all the others. The issue is further confused if we have mutators (like bytesAppendString) that can replace the data field or change the size field in an instance: what happens to b when we call bytesAppendString on b2?

A safer approach is to make our bytes type boxed:

    struct bytes {
        size_t size; // size of data
        char *data;  // malloc'd block
    };

    // better approach
    struct bytes *b;
    b = bytesFromString("hi there");  // returns struct bytes *
    struct bytes *b2;
    b2 = b;

    // add some more text to the end
    bytesAppendString(b2, ", world!");

Now there is just a single instance of struct bytes that we happen to have two pointers to. We will still need to be careful about invalidating one of these pointers when we call a destructor on the other, but calling bytesAppendString will have the expected effect on the single underlying object.

A sequence type

Too much abstraction at once can be hard to take, so let’s look at a concrete example of an abstract data type. This ADT will represent an infinite sequence of ints. Each instance of the Sequence type supports a single operation seq_next that returns the next int in the sequence. We will also need to provide one or more constructor functions to generate new Sequences, and a destructorfunction to tear them down.

Here is an example of a typical use of a Sequence:

void
seq_print(Sequence s, int limit)
{
    int i;

    for(i = seq_next(s); i < limit; i = seq_next(s)) {
        printf("%d\n", i);
    }
}

Note that seq_print doesn’t need to know anything at all about what a Sequence is or how seq_next works in order to print out all the values in the sequence until it hits one greater than or equal to limit. This is a good thing— it means that we can use with any implementation of Sequence we like, and we don’t have to change it if Sequence or seq_next changes.

Interface

In C, the interface of an abstract data type will usually be declared in a header file, which is included both in the file that implements the ADT (so that the compiler can check that the declarations match up with the actual definitions in the implementation. Here’s a header file for sequences:

/* opaque struct: hides actual components of struct sequence,
 * which are defined in sequence.c */
typedef struct sequence *Sequence;

/* constructors */
/* all our constructors return a null pointer on allocation failure */

/* returns a Sequence representing init, init+1, init+2, ... */
Sequence seq_create(int init);

/* returns a Sequence representing init, init+step, init+2*step, ... */
Sequence seq_create_step(int init, int step);

/* destructor */
/* destroys a Sequence, recovering all interally-allocated data */
void seq_destroy(Sequence);

/* accessor */
/* returns the first element in a sequence not previously returned */
int seq_next(Sequence);

examples/ADT/sequence/sequence.h

Here we have defined two different constructors for Sequences, one of which gives slightly more control over the sequence than the other. If we were willing to put more work into the implementation, we could imagine building a very complicated Sequence type that supported a much wider variety of sequences (for example, sequences generated by functions or sequences read from files); but we’ll try to keep things simple for now. We can always add more functionality later, since the users won’t notice if the Sequence type changes internally.

Implementation

The implementation of an ADT in C is typically contained in one (or sometimes more than one) .c file. This file can be compiled and linked into any program that needs to use the ADT. Here is our implementation of Sequence:

#include <stdlib.h>

#include "sequence.h"

struct sequence {
    int next;   /* next value to return */
    int step;   /* how much to increment next by */
};

Sequence
seq_create(int init)
{
    return seq_create_step(init, 1);
}

Sequence
seq_create_step(int init, int step)
{
    Sequence s;

    s = malloc(sizeof(*s));
    if(s == 0) return 0;
    s->next = init;
    s->step = step;
    return s;
}

void
seq_destroy(Sequence s)
{
    free(s);
}

int
seq_next(Sequence s)
{
    int ret;            /* saves the old value before we increment it */

    ret = s->next;
    s->next += s->step;

    return ret;
}

examples/ADT/sequence/sequence.c

Things to note here: the definition of struct sequence appears only in this file; this means that only the functions defined here can (easily) access the next and step components. This protects Sequences to a limited extent from outside interference, and defends against users who might try to “violate the abstraction boundary” by examining the components of a Sequence directly. It also means that if we change the components or meaning of the components in struct sequence, we only have to fix the functions defined in sequence.c.

Compiling and linking

Now that we have sequence.h and sequence.c, how do we use them? Let’s suppose we have a simple main program:

#include <stdio.h>

#include "sequence.h"


void
seq_print(Sequence s, int limit)
{
    int i;

    for(i = seq_next(s); i < limit; i = seq_next(s)) {
        printf("%d\n", i);
    }
}


int
main(int argc, char **argv)
{
    Sequence s;
    Sequence s2;

    puts("Stepping by 1:");

    s = seq_create(0);
    seq_print(s, 5);
    seq_destroy(s);

    puts("Now stepping by 3:");

    s2 = seq_create_step(1, 3);
    seq_print(s2, 20);
    seq_destroy(s2);

    return 0;
}

examples/ADT/sequence/main.c

We can compile main.c and sequence.c together into a single binary with the command gcc main.c sequence.c. Or we can build a Makefile which will compile the two files separately and then link them. Using make may be more efficient, especially for large programs consisting of many components, since if we make any changes make will only recompile those files we have changed. So here is our Makefile:

CC=c99
CFLAGS=-g3 -Wall

all: seqprinter

seqprinter: main.o sequence.o
	$(CC) $(CFLAGS) -o [email protected] $^

test: seqprinter
	./seqprinter

# these rules say to rebuild main.o and sequence.o if sequence.h changes
main.o: main.c sequence.h
sequence.o: sequence.c sequence.h

clean:
	$(RM) -f seqprinter *.o

examples/ADT/sequence/Makefile

And now running make test produces this output. Notice how the built-in make variables [email protected] and $^ expand out to the left-hand side and right-hand side of the dependency line for building seqprinter.

$ make test
gcc -g3 -Wall   -c -o main.o main.c
gcc -g3 -Wall   -c -o sequence.o sequence.c
gcc -g3 -Wall -o seqprinter main.o sequence.o
./seqprinter
Stepping by 1:
0
1
2
3
4
Now stepping by 3:
1
4
7
10
13
16
19

Designing abstract data types

Now we’ve seen how to implement an abstract data type. How do we choose when to use when, and what operations to give it? Let’s try answering the second question first.

Parnas’s Principle

Parnas’s Principle is a statement of the fundamental idea of information hiding, which says that abstraction boundaries should be as narrow as possible:

  • The developer of a software component must provide the intended user with all the information needed to make effective use of the services provided by the component, and should provide no other information.
  • The developer of a software component must be provided with all the information necessary to carry out the given responsibilities assigned to the component, and should be provided with no other information.

(David Parnas, “On the Criteria to Be Used in Decomposing Systems into Modules,” Communications of the ACM, 15(12): 1059–1062, 1972.)

For ADTs, this means we should provide as few functions for accessing and modifying the ADT as we can get away with. The Sequence type we defined early has a particularly narrow interface; the developer of Sequence (whoever is writing sequence.c) needs to know nothing about what its user wants except for the arguments passed in to seq_create or seq_create_step, and the user only needs to be able to call seq_next. More complicated ADTs might provide larger sets of operations, but in general we know that an ADT provides a successful abstraction when the operations are all “natural” ones given our high-level description. If we find ourselves writing a lot of extra operations to let users tinker with the guts of our implementation, that may be a sign that either we aren’t taking our abstraction barrier seriously enough, or that we need to put the abstraction barrier in a different place.

When to build an abstract data type

The short answer: Whenever you can.

A better answer: The best heuristic I know for deciding what ADTs to include in a program is to write down a description of how your program is going to work. For each noun or noun phrase in the description, either identify a built-in data type to implement it or design an abstract data type.

For example: a grade database maintains a list of students, and for each student it keeps a list of grades. So here we might want data types to represent:

  • A list of students,
  • A student,
  • A list of grades,
  • A grade.

If grades are simple, we might be able to make them just be ints (or maybe doubles); to be on the safe side, we should probably create a Grade type with a typedef. The other types are likely to be more complicated. Each student might have in addition to their grades a long list of other attributes, such as a name, an email address, etc. By wrapping students up as abstract data types we can extend these attributes if we need to, or allow for very general implementations (say, by allowing a student to have an arbitrary list of keyword-attribute pairs). The two kinds of lists are likely to be examples of sequence types; we’ll be seeing a lot of ways to implement these as the course progresses. If we want to perform the same kinds of operations on both lists, we might want to try to implement them as a single list data type, which then is specialized to hold either students or grades; this is not always easy to do in C, but we’ll see examples of how to do this, too.

Whether or not this set of four types is the set we will finally use, writing it down gives us a place to start writing our program. We can start writing interface files for each of the data types, and then evolve their implementations and the main program in parallel, adjusting the interfaces as we find that we have provided too little (or too much) data for each component to do what it must.


Licenses and Attributions


Speak Your Mind