Structured data types
C has two kinds of structured data types: structs and unions. A struct holds multiple values, called fields, in consecutive memory locations, and implements what in type theory is called a product type: the set of possible values is the Cartesian product of the sets of possible values for its fields. In contrast, a union has multiple fields but they are all stored in the same location: effectively, this means that only one field at a time can hold a value, making a union a sum type whose set of possible values is the union of the sets of possible values for each of its fields. Unlike what happens in more sensible programming languages, unions are not tagged: unless you keep track of this information somewhere else, you can’t tell which field in a union is being used, and you can store a value of one type in a union and try to read it back as a different type, and C won’t complain.
Structs
A struct is a way to define a type that consists of one or more other types pasted together. Here’s a typical struct definition:
struct string {
    int length;
    char *data;
};
This defines a new type struct string that can be used anywhere you would use a simple type like int or float. When you declare a variable with type struct string, the compiler allocates enough space to hold both an int and a char *.
Note that this declaration has a semicolon after the close brace. This can be confusing, since most close braces in C appear at the end of function bodies or compound statements, and are not followed by a semicolon. If you get strange error messages for lines following a struct definition, it’s worth checking to make sure that the semicolon is there.
You can get at the individual components of a struct using the . operator, like this:
#include <stdio.h>

struct string {
    int length;
    char *data;
};

int
main(int argc, char **argv)
{
    struct string s;

    s.length = 4;
    s.data = "this string is a lot longer than you think";

    puts(s.data);

    return 0;
}
examples/structs/structExample.c
Variables of type struct can be assigned to, passed into functions, and returned from functions, just like any other type. Each such operation is applied componentwise; for example, s1 = s2; is equivalent to s1.length = s2.length; s1.data = s2.data;.
These operations are not used as often as you might think: typically, instead of copying around entire structures, C programs pass around pointers, as is done with arrays. Pointers to structs are common enough in C that a special syntax is provided for dereferencing them. Suppose we have:
struct string s; /* a struct */
struct string *sp; /* a pointer to a struct */
s.length = 4;
s.data = "another overly long string";
sp = &s; /* sp now points to s */
We can then refer to elements of the struct string that sp points to (i.e. s) in either of two ways:
puts((*sp).data);
puts(sp->data);
The second is more common, since it involves typing fewer parentheses. It is an error to write *sp.data in this case; since . binds tighter than *, the compiler will attempt to evaluate sp.data first and generate an error, since sp doesn’t have a data field.
Pointers to structs are commonly used in defining abstract data types, since it is possible to declare that a function returns e.g. a struct string * without specifying the components of a struct string. (All pointers to structs in C have the same size and representation, so the compiler doesn’t need to know the components to pass around the address.) Hiding the components discourages code that shouldn’t look at them from doing so, and can be used, for example, to enforce consistency between fields.
For example, suppose we wanted to define a struct string * type that held counted strings that could only be accessed through a restricted interface that prevented (for example) the user from changing the string or its length. We might create a file myString.h that contained the declarations:
/* make a struct string * that holds a copy of s */
/* returns 0 if malloc fails */
struct string *makeString(const char *s);
/* destroy a struct string * */
void destroyString(struct string *);
/* return the length of a struct string * */
int stringLength(struct string *);
/* return the character at position index in the struct string * */
/* or returns -1 if index is out of bounds */
int stringCharAt(struct string *s, int index);
and then the actual implementation in myString.c would be the only place where the components of a struct string were defined:
#include <stdlib.h>
#include <string.h>
#include "myString.h"
struct string {
    int length;
    char *data;
};
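struct string *
makeString(const char *s)
{
    struct string *s2;

    s2 = malloc(sizeof(struct string));
    if(s2 == 0) { return 0; }  /* let caller worry about malloc failures */

    s2->length = strlen(s);

    /* allocate one extra byte for the terminating null, so that */
    /* an empty string never turns into a malloc(0) request, */
    /* which is allowed to return a null pointer */
    s2->data = malloc(s2->length + 1);
    if(s2->data == 0) {
        free(s2);
        return 0;
    }

    memcpy(s2->data, s, s2->length + 1);  /* copies the null too */

    return s2;
}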
void
destroyString(struct string *s)
{
    free(s->data);
    free(s);
}

int
stringLength(struct string *s)
{
    return s->length;
}

int
stringCharAt(struct string *s, int index)
{
    if(index < 0 || index >= s->length) {
        return -1;
    } else {
        return s->data[index];
    }
}
In practice, we would probably go even further and replace all the struct string * types with a new name declared with typedef.
Operations on structs
What you can do to structs is pretty limited: you can look up or set individual components in a struct, you can pass structs to functions or as return values from functions (which makes a copy of the original struct), and you can assign the contents of one struct to another using s1 = s2 (which is equivalent to copying each component separately).
One thing that you can’t do is test two structs for equality using ==; this is because structs may contain extra space holding junk data. If you want to test for equality, you will need to do it component by component.
Layout in memory
The C99 standard guarantees that the components of a struct are stored in memory in the same order that they are defined in: that is, later components are placed at higher addresses. This allows sneaky tricks like truncating a structure if you don’t use all of its components. Because of alignment restrictions, the compiler may add padding between components (and after the last one) to put each component on its preferred alignment boundary.
You can find the position of a component within a struct using the offsetof macro, which is defined in stddef.h. This returns the number of bytes from the base of the struct that the component starts at, and can be used to do various terrifying non-semantic things with pointers.
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>

int
main(int argc, char **argv)
{
    struct foo {
        int i;
        char c;
        double d;
        float f;
        char *s;
    };

    /* offsetof yields a size_t, so print it with %zu */
    printf("i is at %zu\n", offsetof(struct foo, i));
    printf("c is at %zu\n", offsetof(struct foo, c));
    printf("d is at %zu\n", offsetof(struct foo, d));
    printf("f is at %zu\n", offsetof(struct foo, f));
    printf("s is at %zu\n", offsetof(struct foo, s));

    return 0;
}
Bit fields
It is possible to specify the exact number of bits taken up by a member of a struct of integer type. This is seldom useful, but may in principle let you pack more information in less space. Bit fields are sometimes used to unpack data from an external source that uses this trick, but this is dangerous, because there is no guarantee that the compiler will lay out the bit fields in your struct in any particular order (at the very least, you will need to worry about endianness).
Example:
struct color {
    unsigned int red : 2;
    unsigned int green : 2;
    unsigned int blue : 2;
    unsigned int alpha : 2;
};
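This defines a struct with four 2-bit fields, each of which can hold values in the range 0-3. Together the fields need only one byte of storage, although the struct itself may still be padded out to the size of an unsigned int, depending on the compiler.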
Unions
A union is just like a struct, except that instead of allocating space to store all the components, the compiler only allocates space to store the largest one, and makes all the components refer to the same address. This can be used to save space if you know that only one of several components will be meaningful for a particular object. An example might be a type representing an object in a LISP-like language like Scheme:
struct lispObject {
    int type;           /* type code */
    union {
        int intVal;
        double floatVal;
        char *stringVal;
        struct {
            struct lispObject *car;
            struct lispObject *cdr;
        } consVal;
    } u;
};
Now if you wanted to make a struct lispObject that held an integer value, you might write
struct lispObject o;

o.type = TYPE_INT;
o.u.intVal = 27;
Here TYPE_INT has presumably been defined somewhere. Note that nothing then prevents you from writing
x = 2.7 * o.u.floatVal; /* BAD */
The effects of this will be strange, since it’s likely that the bit pattern representing 27 as an int represents something very different as a double. Avoiding such mistakes is your responsibility, which is why most uses of union occur inside larger structs that contain enough information to figure out which variant of the union applies.
Enums
C provides the enum construction for the special case where you want to have a sequence of named constants of type int, but you don’t care what their actual values are, as in
enum color { RED, BLUE, GREEN, MAUVE, TURQUOISE };
This will assign the value 0 to RED, 1 to BLUE, and so on. These values are effectively of type int, although you can declare variables, arguments, and return values as type enum color to indicate their intended interpretation.
Despite declaring a variable enum color c (say), the compiler will still allow c to hold arbitrary values of type int.
So the following ridiculous code works just fine:
#include <stdio.h>
#include <stdlib.h>

enum foo { FOO };
enum apple { MACINTOSH, CORTLAND, RED_DELICIOUS };
enum orange { NAVEL, CLEMENTINE, TANGERINE };

int
main(int argc, char **argv)
{
    enum foo x;

    if(argc != 1) {
        fprintf(stderr, "Usage: %s\n", argv[0]);
        return 1;
    }

    printf("FOO = %d\n", FOO);

    /* sizeof yields a size_t, so print it with %zu */
    printf("sizeof(enum foo) = %zu\n", sizeof(enum foo));

    x = 127;

    printf("x = %d\n", x);

    /* note we can add apples and oranges */
    printf("%d\n", RED_DELICIOUS + TANGERINE);

    return 0;
}
examples/definitions/enumsAreInts.c
Specifying particular values
It is also possible to specify particular values for particular enumerated constants, as in
enum color { RED = 37, BLUE = 12, GREEN = 66, MAUVE = 5, TURQUOISE };
Anything that doesn’t get a value is assigned one plus the previous value; so the above definition would set TURQUOISE to 6. This may result in two names mapping to the same value.
What most people do
In practice, enums are seldom used, and you will more commonly see a stack of #defines:
#define RED (0)
#define BLUE (1)
#define GREEN (2)
#define MAUVE (3)
#define TURQUOISE (4)
The reason for this is partly historical (enum arrived late in the evolution of C) but partly practical: a table of #defines makes it much easier to figure out which color is represented by 3, without having to count through a list. But if you never plan to use the numerical values, enum may be a better choice, because when you don’t specify any values yourself it guarantees that all the values will be distinct.
Using enum with union
A natural place to use an enum is to tag a union with the type being used. For example, a Lisp-like language might implement the following multi-purpose data type:
enum TypeCode { TYPE_INT, TYPE_DOUBLE, TYPE_STRING };

struct LispValue {
    enum TypeCode typeCode;
    union {
        int i;
        double d;
        char *s;
    } value;
};
Here we don’t care what the numeric values of TYPE_INT, TYPE_DOUBLE, and TYPE_STRING are, as long as we can apply switch to typeCode to figure out what to do with one of these things.