Structured data types

C has two kinds of structured data types: structs and unions. A struct holds multiple values, called fields, in consecutive memory locations, and implements what in type theory is called a product type: the set of possible values is the Cartesian product of the sets of possible values for its fields. In contrast, a union has multiple fields but they are all stored in the same location: effectively, this means that only one field at a time can hold a value, making a union a sum type whose set of possible values is the union of the sets of possible values for each of its fields. Unlike what happens in more sensible programming languages, unions are not tagged: unless you keep track of this information somewhere else, you can’t tell which field in a union is being used, and you can store a value of one type in a union and try to read it back as a different type, and C won’t complain.

Structs

A struct is a way to define a type that consists of one or more other types pasted together. Here’s a typical struct definition:

struct string {
    int length;
    char *data;
};

This defines a new type struct string that can be used anywhere you would use a simple type like int or float. When you declare a variable with type struct string, the compiler allocates enough space to hold both an int and a char *.
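One way to check the claim about space is with sizeof. The sketch below uses a made-up helper name, structStringHoldsBothFields, and only tests that the struct is at least as big as its two fields put together; the exact size depends on your compiler and machine, and may be larger because of padding.

```c
struct string {
    int length;
    char *data;
};

/* returns 1 if a struct string is big enough to hold both of its fields */
/* (it may be bigger than the sum, because the compiler can add padding) */
int
structStringHoldsBothFields(void)
{
    return sizeof(struct string) >= sizeof(int) + sizeof(char *);
}
```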

Note that this declaration has a semicolon after the close brace. This can be confusing, since most close braces in C appear at the end of function bodies or compound statements, and are not followed by a semicolon. If you get strange error messages for lines following a struct definition, it’s worth checking to make sure that the semicolon is there.

You can get at the individual components of a struct using the . operator, like this:

#include <stdio.h>

struct string {
    int length;
    char *data;
};

int
main(int argc, char **argv)
{
    struct string s;

    s.length = 4;
    s.data = "this string is a lot longer than you think";

    puts(s.data);

    return 0;
}

examples/structs/structExample.c

Variables of type struct can be assigned to, passed into functions, and returned from functions, just like any other type. Each such operation is applied componentwise; for example, s1 = s2; is equivalent to s1.length = s2.length; s1.data = s2.data;.
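One consequence of componentwise copying is that a pointer field is copied as a pointer: the copy and the original end up sharing whatever the pointer points to. Here is a small sketch of this, using a hypothetical function copyString that is not part of the examples above:

```c
struct string {
    int length;
    char *data;
};

/* returns a copy of s */
/* the argument and the return value are both copied field by field */
struct string
copyString(struct string s)
{
    struct string s2;

    s2 = s;     /* same as s2.length = s.length; s2.data = s.data; */

    return s2;
}
```

After t = copyString(s), changing t.length has no effect on s.length, but t.data and s.data point to the same characters.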

These operations are not used as often as you might think: typically, instead of copying around entire structures, C programs pass around pointers, as is done with arrays. Pointers to structs are common enough in C that a special syntax is provided for dereferencing them. Suppose we have:

    struct string s;            /* a struct */
    struct string *sp;          /* a pointer to a struct */

    s.length = 4;
    s.data = "another overly long string";

    sp = &s;                    /* sp now points to s */

We can then refer to elements of the struct string that sp points to (i.e. s) in either of two ways:

    puts((*sp).data);
    puts(sp->data);

The second is more common, since it involves typing fewer parentheses. It is an error to write *sp.data in this case; since . binds tighter than *, the compiler will attempt to evaluate sp.data first and generate an error, since sp doesn’t have a data field.
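The difference between passing a struct and passing a pointer to a struct matters most when a function wants to change something. The two functions below are made up for illustration: the first gets its own copy of the struct, so its assignment is lost when it returns, while the second changes the caller's struct through the pointer.

```c
struct string {
    int length;
    char *data;
};

/* gets a copy of the struct; the caller's struct is unchanged */
void
truncateCopy(struct string s)
{
    s.length = 0;
}

/* gets a pointer to the struct; the caller's struct is modified */
void
truncateInPlace(struct string *sp)
{
    sp->length = 0;     /* same as (*sp).length = 0 */
}
```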

Pointers to structs are commonly used in defining abstract data types, since it is possible to declare that a function returns e.g. a struct string * without specifying the components of a struct string. (All pointers to structs in C have the same size and structure, so the compiler doesn’t need to know the components to pass around the address.) Hiding the components discourages code that shouldn’t look at them from doing so, and can be used, for example, to enforce consistency between fields.

For example, suppose we wanted to define a struct string * type that held counted strings that could only be accessed through a restricted interface that prevented (for example) the user from changing the string or its length. We might create a file myString.h that contained the declarations:

/* make a struct string * that holds a copy of s */
/* returns 0 if malloc fails */
struct string *makeString(const char *s);

/* destroy a struct string * */
void destroyString(struct string *);

/* return the length of a struct string * */
int stringLength(struct string *);

/* return the character at position index in the struct string * */
/* or returns -1 if index is out of bounds */
int stringCharAt(struct string *s, int index);

examples/myString/myString.h

and then the actual implementation in myString.c would be the only place where the components of a struct string were defined:

#include <stdlib.h>
#include <string.h>

#include "myString.h"

struct string {
    int length;
    char *data;
};

struct string *
makeString(const char *s)
{
    struct string *s2;

    s2 = malloc(sizeof(struct string));
    if(s2 == 0) { return 0; }  /* let caller worry about malloc failures */

    s2->length = strlen(s);

    s2->data = malloc(s2->length);
    if(s2->data == 0) {
        free(s2);
        return 0;
    }

    strncpy(s2->data, s, s2->length);

    return s2;
}

void
destroyString(struct string *s)
{
    free(s->data);
    free(s);
}

int
stringLength(struct string *s)
{
    return s->length;
}

int
stringCharAt(struct string *s, int index)
{
    if(index < 0 || index >= s->length) {
        return -1;
    } else {
        return s->data[index];
    }
}

examples/myString/myString.c

In practice, we would probably go even further and replace all the struct string * types with a new name declared with typedef.
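Here is a rough sketch of what that might look like, with the header and implementation squeezed into one listing. The name String is an assumption chosen for illustration; the rest follows myString.c, except that this version copies with memcpy and allocates one extra byte so that an empty string never asks for malloc(0).

```c
#include <stdlib.h>
#include <string.h>

/* in the header: String is a hypothetical name for struct string * */
typedef struct string *String;

String makeString(const char *s);
void destroyString(String s);
int stringLength(String s);

/* in the implementation file: the only place the fields are visible */
struct string {
    int length;
    char *data;
};

String
makeString(const char *s)
{
    String s2;

    s2 = malloc(sizeof(struct string));
    if(s2 == 0) { return 0; }

    s2->length = strlen(s);

    /* the extra byte avoids calling malloc(0) for an empty string */
    s2->data = malloc(s2->length + 1);
    if(s2->data == 0) {
        free(s2);
        return 0;
    }

    memcpy(s2->data, s, s2->length);

    return s2;
}

void
destroyString(String s)
{
    free(s->data);
    free(s);
}

int
stringLength(String s)
{
    return s->length;
}
```

Code that uses this interface only ever mentions String, so it no longer shows that a struct or a pointer is involved at all.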

Operations on structs

What you can do to structs is pretty limited: you can look up or set individual components in a struct, you can pass structs to functions or as return values from functions (which makes a copy of the original struct), and you can assign the contents of one struct to another using s1 = s2 (which is equivalent to copying each component separately).

One thing that you can’t do is test two structs for equality using ==; this is because structs may contain extra space holding junk data. If you want to test for equality, you will need to do it component by component.
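For the struct string type from earlier, a hand-written comparison might look like the sketch below. The function name stringEqual is made up, and the code assumes that two strings should count as equal when they have the same length and the same characters; comparing the data pointers directly would instead test whether both structs share the same buffer.

```c
#include <string.h>

struct string {
    int length;
    char *data;
};

/* returns 1 if s1 and s2 hold the same characters, 0 otherwise */
int
stringEqual(const struct string *s1, const struct string *s2)
{
    if(s1->length != s2->length) {
        return 0;
    }

    /* compare what the data fields point to, not the pointers themselves */
    return memcmp(s1->data, s2->data, s1->length) == 0;
}
```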

Layout in memory

The C99 standard guarantees that the components of a struct are stored in memory in the same order that they are defined in: that is, later components are placed at higher addresses. This allows sneaky tricks like truncating a structure if you don’t use all of its components. Because of alignment restrictions, the compiler may add padding between components to put each component on its preferred alignment boundary.
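Both effects can be observed from inside a program. In the sketch below, struct padded and the two function names are made up for illustration. On many machines a double has to start at a multiple of 8 bytes, so the compiler will probably insert 7 bytes of padding after c, but the exact amount is up to the implementation.

```c
#include <stddef.h>

struct padded {
    char c;
    double d;
};

/* returns 1 if c is stored at a lower address than d */
int
fieldsInOrder(void)
{
    struct padded p;

    return (char *) &p.c < (char *) &p.d;
}

/* returns the number of bytes in a struct padded not used by its fields */
size_t
paddingBytes(void)
{
    return sizeof(struct padded) - (sizeof(char) + sizeof(double));
}
```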

You can find the position of a component within a struct using the offsetof macro, which is defined in stddef.h. This returns the number of bytes from the base of the struct that the component starts at, and can be used to do various terrifying non-semantic things with pointers.

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>

int
main(int argc, char **argv)
{
    struct foo {
        int i;
        char c;
        double d;
        float f;
        char *s;
    };

    /* offsetof yields a size_t, which is printed with %zu */
    printf("i is at %zu\n", offsetof(struct foo, i));
    printf("c is at %zu\n", offsetof(struct foo, c));
    printf("d is at %zu\n", offsetof(struct foo, d));
    printf("f is at %zu\n", offsetof(struct foo, f));
    printf("s is at %zu\n", offsetof(struct foo, s));

    return 0;
}

examples/structs/offsetof.c
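As a small taste of the pointer tricks that offsetof makes possible, the sketch below finds a component by stepping the right number of bytes from the base of the struct. The function name findD is made up, and in ordinary code you would just write &p->d.

```c
#include <stddef.h>

struct foo {
    int i;
    char c;
    double d;
};

/* given a pointer to a struct foo, return a pointer to its d component */
/* by moving offsetof(struct foo, d) bytes past the base of the struct */
double *
findD(struct foo *p)
{
    return (double *) ((char *) p + offsetof(struct foo, d));
}
```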

Bit fields

It is possible to specify the exact number of bits taken up by a member of a struct of integer type. This is seldom useful, but may in principle let you pack more information in less space. Bit fields are sometimes used to unpack data from an external source that uses this trick, but this is dangerous, because there is no guarantee that the compiler will lay out the bit fields in your struct in any particular order (at the very least, you will need to worry about endianness).

Example:

struct color {
    unsigned int red   : 2;
    unsigned int green : 2;
    unsigned int blue  : 2;
    unsigned int alpha : 2;
};

This defines a struct with four 2-bit fields, each of which can hold values in the range 0-3. The fields together need only one byte, although many compilers will round the size of the struct up to that of an unsigned int.
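Bit fields are read and written like any other component. The sketch below uses a made-up function setRed to show what happens when a value doesn't fit: because the field is unsigned, only the low 2 bits are kept.

```c
struct color {
    unsigned int red   : 2;
    unsigned int green : 2;
    unsigned int blue  : 2;
    unsigned int alpha : 2;
};

/* stores value in the red field and returns what the field now holds */
unsigned int
setRed(struct color *c, unsigned int value)
{
    c->red = value;     /* only the low 2 bits of value are kept */

    return c->red;
}
```

So setRed(&c, 3) returns 3, but setRed(&c, 5) returns 1, and neither call disturbs the other three fields.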

Unions

A union is just like a struct, except that instead of allocating space to store all the components, the compiler only allocates space to store the largest one, and makes all the components refer to the same address. This can be used to save space if you know that only one of several components will be meaningful for a particular object. An example might be a type representing an object in a LISP-like language like Scheme:

struct lispObject {
    int type;           /* type code */
    union {
        int     intVal;
        double  floatVal;
        char *  stringVal;
        struct {
            struct lispObject *car;
            struct lispObject *cdr;
        } consVal;
    } u;
};

Now if you wanted to make a struct lispObject that held an integer value, you might write

    struct lispObject o;

    o.type = TYPE_INT;
    o.u.intVal = 27;

Here TYPE_INT has presumably been defined somewhere. Note that nothing then prevents you from writing

    x = 2.7 * o.u.floatVal;        /* BAD */

The effects of this will be strange, since it’s likely that the bit pattern representing 27 as an int represents something very different as a double. Avoiding such mistakes is your responsibility, which is why most uses of union occur inside larger structs that contain enough information to figure out which variant of the union applies.
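One way to take on that responsibility is to go through a function that checks the type code before touching the union. The sketch below is only an illustration: the values of TYPE_INT and TYPE_FLOAT and the function name getIntVal are assumptions, and the struct is cut down to two variants.

```c
#define TYPE_INT    (0)     /* hypothetical type codes */
#define TYPE_FLOAT  (1)

struct lispObject {
    int type;           /* type code */
    union {
        int     intVal;
        double  floatVal;
    } u;
};

/* if o holds an int, store it in *result and return 1 */
/* otherwise leave *result alone and return 0 */
int
getIntVal(const struct lispObject *o, int *result)
{
    if(o->type != TYPE_INT) {
        return 0;
    }

    *result = o->u.intVal;

    return 1;
}
```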

Enums

C provides the enum construction for the special case where you want to have a sequence of named constants of type int, but you don’t care what their actual values are, as in

enum color { RED, BLUE, GREEN, MAUVE, TURQUOISE };

This will assign the value 0 to RED, 1 to BLUE, and so on. These values are effectively of type int, although you can declare variables, arguments, and return values as type enum color to indicate their intended interpretation.

Even if you declare a variable as enum color c (say), the compiler will still allow c to hold arbitrary values of type int. So the following ridiculous code works just fine:

#include <stdio.h>
#include <stdlib.h>

enum foo { FOO };
enum apple { MACINTOSH, CORTLAND, RED_DELICIOUS };
enum orange { NAVEL, CLEMENTINE, TANGERINE };

int
main(int argc, char **argv)
{
    enum foo x;

    if(argc != 1) {
        fprintf(stderr, "Usage: %s\n", argv[0]);
        return 1;
    }

    printf("FOO = %d\n", FOO);
    printf("sizeof(enum foo) = %zu\n", sizeof(enum foo));

    x = 127;

    printf("x = %d\n", x);

    /* note we can add apples and oranges */
    printf("%d\n", RED_DELICIOUS + TANGERINE);

    return 0;
}

examples/definitions/enumsAreInts.c

Specifying particular values

It is also possible to specify particular values for particular enumerated constants, as in

enum color { RED = 37, BLUE = 12, GREEN = 66, MAUVE = 5, TURQUOISE };

Anything that doesn’t get a value gets one plus the previous value; so the above definition would set TURQUOISE to 6. This may result in two names mapping to the same value.
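Both effects are easy to check. In the sketch below, enum fruit and the two function names are made up for illustration; PLUM is one plus PEAR, which happens to be the value already given to APPLE.

```c
enum color { RED = 37, BLUE = 12, GREEN = 66, MAUVE = 5, TURQUOISE };

/* hypothetical enum in which two names get the same value */
enum fruit { APPLE = 1, PEAR = 0, PLUM };

/* returns the value assigned to TURQUOISE, which is MAUVE + 1 */
int
turquoiseValue(void)
{
    return TURQUOISE;
}

/* returns 1 if APPLE and PLUM have the same value */
int
fruitNamesCollide(void)
{
    return APPLE == PLUM;
}
```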

What most people do

In practice, enums are seldom used, and you will more commonly see a stack of #defines:

#define RED     (0)
#define BLUE    (1)
#define GREEN   (2)
#define MAUVE   (3)
#define TURQUOISE (4)

The reason for this is partly historical (enum arrived late in the evolution of C) but partly practical: a table of #defines makes it much easier to figure out which color is represented by 3, without having to count through a list. But if you never plan to use the numerical values, enum may be a better choice, because as long as you don’t assign any values yourself, it guarantees that all the values will be distinct.

Using enum with union

A natural place to use an enum is to tag a union with the type being used. For example, a Lisp-like language might implement the following multi-purpose data type:

enum TypeCode { TYPE_INT, TYPE_DOUBLE, TYPE_STRING };

struct LispValue {
    enum TypeCode typeCode;
    union {
        int i;
        double d;
        char *s;
    } value;
};

Here we don’t care what the numeric values of TYPE_INT, TYPE_DOUBLE, and TYPE_STRING are, as long as we can apply switch to typeCode to figure out what to do with one of these things.
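A function that works on this type might look like the sketch below. The name formatLispValue and its buffer-based interface are assumptions made for illustration; the point is only that the switch on typeCode picks which field of the union is safe to read.

```c
#include <stdio.h>

enum TypeCode { TYPE_INT, TYPE_DOUBLE, TYPE_STRING };

struct LispValue {
    enum TypeCode typeCode;
    union {
        int i;
        double d;
        char *s;
    } value;
};

/* write a printable form of v into buf, which has room for n bytes */
/* returns whatever snprintf returns, or -1 for an unknown type code */
int
formatLispValue(const struct LispValue *v, char *buf, size_t n)
{
    switch(v->typeCode) {
        case TYPE_INT:
            return snprintf(buf, n, "%d", v->value.i);
        case TYPE_DOUBLE:
            return snprintf(buf, n, "%g", v->value.d);
        case TYPE_STRING:
            return snprintf(buf, n, "%s", v->value.s);
        default:
            /* typeCode can hold any int, so be ready for junk */
            return -1;
    }
}
```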

