The C programming language

The C programming language was developed at Bell Laboratories in the early 1970s as the system programming language for Unix, based on the earlier and even more primitive languages BCPL and B. When originally developed, it was targeted at machines that were extremely limited by modern standards: the first Unix implementation (and the B compiler that supported it) ran on a DEC PDP-7 with only 8192 18-bit words of memory (Dennis M. Ritchie, The development of the C language, in Thomas J. Bergin, Jr., and Richard G. Gibson, Jr., History of Programming Languages-II ed., ACM Press, 1996). So using as few resources as possible, both in the compiler and in the resulting code, was a priority.

This priority is reflected in the features (and lack of features) of C, and is partly responsible for its success. Programs written in C place almost no demands on the system they run on and give the programmer nearly complete control over their execution: this allows programs that were previously written in assembly language, like operating system kernels and device drivers, to be implemented in C. So C is often the first language ported to any new architecture, and many higher-level languages are either executed using interpreters written in C or use C as in intermediate language in the compilation process.

Since its initial development, C has gone through four major versions:

  • The original K&R C defined in the 1978 first edition of Kernighan and Ritchie’s book The C Programming Language;
  • ANSI C, from 1988, which fixed some oddities in the syntax and which is documented in the 1988 second edition of The C Programming Language;
  • C99, from 1999, the ISO/IEC 9899:1999 standard for C, which added some features from C++ and many new features for high-performance numerical computing;
  • C11, from 2011, the ISO/IEC 9899:2011 standard for C, which relaxed some of the requirements of C99 that most compilers hadn’t bothered implementing and which added a few extra features.

Unfortunately, C99 and C11 both exemplify the uselessness of standards committees in general and the ISO in particular. Because the ISO has no power to enforce standards on compiler writers, and because they will charge you CHF 198 just to look at the C11 standard, many compiler writers have ignored much of C99 and C11. In particular, Microsoft pretty much gave up on adding any features after ANSI C, and support for C99 and C11 is spotty in gcc and clang, the two dominant open source C compilers. So if you want to write portable C code, it is safest to limit yourself to features in ANSI C.

For this class, we will permit you to use any feature of C that gcc supports, which includes all features of ANSI C and most features of later standards.

Structure of a C program

A C program consists of one or more files (which act a little bit like modules in more structured programming languages, each of which typically contains definitions of functions, each of which consists of statements, which are either compound statements like if, while, etc. or expressions that typically perform some sort of arithmetic or call other functions. Files may also include declarations of global variables (not recommended), and functions will often contain declarations of local variables that can only be used inside that function.

Here is a typical small C program that sums a range of integers. Since this is our first real program, it’s a little heavy on the comments (shown between /* and */).

#include <stdio.h>    /* This is needed to get the declarations of fprintf and printf */
#include <stdlib.h>   /* This is needed to get the declaration of atoi */

/* Return the sum of all integers i
 * such that start <= i and i < end. */
int
sumRange(int start, int end)
{
    int i;    /* loop variable */
    int sum;  /* sum of all values so far */

    /* a mathematician would use a formula for this,
     * but we are computer programmers! */
    sum = 0;

    /* The three parts of the header for this loop mean:
     * 1. Set i to start initially.
     * 2. Keep doing the loop as long as i is less than end.
     * 3. After each iteration, add 1 to i.
     */
    for(i = start; i < end; i++) {
        sum += i;  /* This adds i to sum */
    }

    /* This exits the function immediately,
     * sending the value of sum back to the caller. */
    return sum;
}

int
main(int argc, char **argv)
{
    int start;    /* initial value in range */
    int end;      /* one past the last value in the range */

    /* This tests for the wrong number of arguments.
     * The != operator returns true (1) if its arguments are not equal,
     * and false (0) otherwise.
     * Note that the program name itself counts as an argument
     * (which is why we want the argument count to be 3)
     * and appears in position 0 in the argument vector
     * (which means we can get it using argv[0]). */
    if(argc != 3) {
        fprintf(stderr, "Usage: %s\n start end", argv[0]);
        return 1;
    }

    /* Convert start and end positions from strings to ints */
    start = atoi(argv[1]);
    end = atoi(argv[2]);

    /* Call sumRange and print the result */
    printf("sumRange(%d, %d) = %d\n", start, end, sumRange(start, end));

    return 0;
}

This is what the program does if we compile and run it:

$ gcc -g -Wall -o sumRange sumRange.c
$ ./sumRange 1 100
sumRange(1, 100) = 4950

The sumRange.c program contains two functions, sumRange and main. The sumRange function does the actual work, while main is the main routine of the program that gets called with the command-line arguments when the program is run. Every C program must have a routine named main with these particular arguments.

In addition, main may call three library functions, fprintf (which in this case is used to generate error messages), printf (which generates ordinary output), and atoi (which is used to translate the command-line arguments into numerical values). These functions must all be declared before they can be used. In the case of sumRange, putting the definition of sumRange before the definition of main is enough. For the library routines, the include files stdio.h and stdlib.h contain declarations of the functions that contain enough information about there return types and arguments that the compiler knows how to generate machine code to call them. These files are included in sumRange.c by the C preprocessor, which pastes in the contents of any file specified by the #include command, strips out any comments (delimited by /* and */, or by // and the end of the line if you are using C99), and does some other tricks that allow you to muck with the source code before the actual compiler sees it (see Macros). You can see what the output of the preprocessor looks like by calling the C compiler with the -E option, as in gcc -E sumRange.c.

The body of each function consists of some variable declarations followed by a sequence of statements that tell the computer what to do. Unlike some languages, every variable used in a C program must be declared. A declaration specifies the type of a variable, which tells the compiler how much space to allocate for it and how to interpret some operations on its value. Statements may be compound statements like the if statement in main that executes its body only if the program is called with the wrong number of command-line arguments or the for loop in sumRange that executes its body as long as the test in its header remains true; or they may be simple statements that consist of a single expression followed by a semicolon.

An expression is usually either a bare function call whose value is discarded (for example, the calls to fprintf and printf in main), or an arithmetic expression (which may include function calls, like the calls to atoi or in main) whose value is assigned to some variable using the assignment operator = or sometimes variants like += (which is shorthand for adding a value to an existing variable: x += y is equivalent to x = x+y).

When you compile a C program, after running the preprocessor, the compiler generates assembly language code that is a human-readable description of the ultimate machine code for your target CPU. Assembly language strips out all the human-friendly features of your program and reduces it to simple instructions usually involving moving things from one place to the other or performing a single arithmetic operation. For example, the C line

    x = y + 1;  /* add 1 to y, store result in x */

gets translated into x86 assembly as

        movl    -24(%rbp), %edi
        addl    $1, %edi
        movl    %edi, -28(%rbp)

These three operations copy the value of y into the CPU register %edi, add 1 to the %edi register, and then copy the value back into x. This corresponds directly to what you would have to do to evaluate x = y + 1 if you could only do one very basic operation at a time and couldn’t do arithmetic operations on memory locations: fetch y, add 1, store x. Note that the CPU doesn’t know about the names y and x; instead, it computes their addresses by adding -24 and -28 respectively to the base pointer register %rbp. This is why it can be hard to debug compiled code unless you tell the compiler to keep around extra information.

For an arbitrary C program, if you are using gcc, you can see what your code looks like in assembly language using the -S option. For example, gcc -S sumRange.c will create a file sumRange.s that looks like this:

	.file	"sumRange.c"
	.text
	.globl	sumRange
	.type	sumRange, @function
sumRange:
.LFB0:
	.cfi_startproc
	pushl	%ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	movl	%esp, %ebp
	.cfi_def_cfa_register 5
	subl	$16, %esp
	movl	$0, -4(%ebp)
	movl	8(%ebp), %eax
	movl	%eax, -8(%ebp)
	jmp	.L2
.L3:
	movl	-8(%ebp), %eax
	addl	%eax, -4(%ebp)
	addl	$1, -8(%ebp)
.L2:
	movl	-8(%ebp), %eax
	cmpl	12(%ebp), %eax
	jl	.L3
	movl	-4(%ebp), %eax
	leave
	.cfi_restore 5
	.cfi_def_cfa 4, 4
	ret
	.cfi_endproc
.LFE0:
	.size	sumRange, .-sumRange
	.section	.rodata
.LC0:
	.string	"Usage: %s\n start end"
.LC1:
	.string	"sumRange(%d, %d) = %d\n"
	.text
	.globl	main
	.type	main, @function
main:
.LFB1:
	.cfi_startproc
	pushl	%ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	movl	%esp, %ebp
	.cfi_def_cfa_register 5
	andl	$-16, %esp
	subl	$32, %esp
	cmpl	$3, 8(%ebp)
	je	.L6
	movl	12(%ebp), %eax
	movl	(%eax), %edx
	movl	stderr, %eax
	movl	%edx, 8(%esp)
	movl	$.LC0, 4(%esp)
	movl	%eax, (%esp)
	call	fprintf
	movl	$1, %eax
	jmp	.L7
.L6:
	movl	12(%ebp), %eax
	addl	$4, %eax
	movl	(%eax), %eax
	movl	%eax, (%esp)
	call	atoi
	movl	%eax, 24(%esp)
	movl	12(%ebp), %eax
	addl	$8, %eax
	movl	(%eax), %eax
	movl	%eax, (%esp)
	call	atoi
	movl	%eax, 28(%esp)
	movl	28(%esp), %eax
	movl	%eax, 4(%esp)
	movl	24(%esp), %eax
	movl	%eax, (%esp)
	call	sumRange
	movl	%eax, 12(%esp)
	movl	28(%esp), %eax
	movl	%eax, 8(%esp)
	movl	24(%esp), %eax
	movl	%eax, 4(%esp)
	movl	$.LC1, (%esp)
	call	printf
	movl	$0, %eax
.L7:
	leave
	.cfi_restore 5
	.cfi_def_cfa 4, 4
	ret
	.cfi_endproc
.LFE1:
	.size	main, .-main
	.ident	"GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
	.section	.note.GNU-stack,"",@progbits

You usually don’t need to look at assembly language, but it can sometimes be enlightening to see what the compiler is doing with your code. One thing that I find interesting about this particular code (which is for the x86 architecture) is that most of the instructions are movl, the x86 instruction for copying a 32-bit quantity from one location to another. So most of what this program is doing is copying data into the places expected by the library functions it is calling. Also noteworthy is that the beautiful compound statements like if and for that so eloquently express the intent of the programmer get turned into a pile of jump (jmp) and conditional jump (jl, je) instructions, the machine code versions of the often dangerous and confusing goto statement. This is because CPUs are dumb: they don’t know how to carry out an if branch or a loop, and all they can do instead is be told to replace the value of their program counter register with some new value instead of just incrementing it as they usually do.

Assembly language is not the last stage in this process. The assembler (as) is a program that translates the assembly language in sumRange.s into machine code (which will be stored in sumRange.o if we aren’t compiling a single program all at once). Machine code is not human-readable, and is close to the raw stream of bytes that gets stored in the computer’s memory to represent a running program. The missing parts are that the addresses of each function and global variables are generally left unspecified, so that they can be moved around to make room for other functions and variables coming from other files and from system libraries. The job of stitching all of these pieces together, putting everything in the right place, filling in any placeholder addresses, and generating the executable file sumRange that we can actually run is given to the linker ld.

The whole process looks like this:

sumRange.c (source code)
   |
   v
[preprocessor (cpp)]
   |
   v
preprocessed version of sumRange.c
   |
   v
[compiler (gcc)]
   |
   v
sumRange.s (assembly code)
   |
   v
[assembler (as)]
   |
   v
sumRange.o (machine code)
   |
   v
[linker (ld)] <- system library (glibc.a)
   |
   v
sumRange (executable)

The good news is, you don’t actually have to run all of these steps yourself; instead, gcc will take care of everything for you, particularly for simple programs like sumRange.c that fit in a single file.

Numeric data types

All data stored inside a computer is ultimately represented as a sequence of bits, 0 or 1 values, typically organized into words consisting of several 8-bit bytes.

A typical desktop computer might have enough RAM to store 232 bytes (4 gigabytes). However, the address space of a process might be much larger: on a 64-bit machine, the address space is 264 bytes. There’s no way to store 264 different addresses in 235 bytes of RAM; instead, a memory mapper, typically built in to the CPU, translates the large addresses of the parts of the address space that are actually used into smaller addresses corresponding to actual RAM locations. In some cases, regions of memory that have not been used in a while will be swapped out to disk, leaving more RAM free for other parts of the process (or other processes). This technique is known as virtual memory and is usually invisible to the programmer. The use of virtual memory can increase the available space beyond the size of the RAM a little bit, but if you try to run a process that is actively using significantly more space that can be stored in RAM, it will slow down dramatically, because disk drives are roughly ten million times slower than memory.

The most basic kind of data represents integer values from some bounded range. C supports several integer data types, varying in their size (and thus range), and whether or not they are considered to be signed. These are described in more detail in Chapter 7.

For numerical computation, integer data types can be inconvenient. So C also supports floating-point types that consist of a fixed-size mantissa, which is essentially an integer, together with an exponent that is used to multiply the mantissa by 2x for some x. These allow very small or very large values to be represented with small relative error, but do not allow exact computation because of the limited precision of the mantissa. Floating-point types are also described in Chapter 8.

All other data is represented by converting it to either integer or floating-point numbers. For example, text characters in C are represented as small integer values, so that the character constant 'z' representation a lower-case “z” is exactly the same as the integer constant 122 (which is the ASCII code for “z”). A string like "hi there" is represented by a sequence of 8-bit ASCII characters, with a special 0 character to mark the end of the string. Strings that go beyond the English characters available in the ASCII encoding are typically represented using Unicode and encoded as sequences of bytes using a particular representation called UTF-8. The color of a pixel in an image might be represented as three 8-bit integers representing the intensity of red, green, and blue in the color, while an image itself might be a long sequence of such 3-byte RGB values. At the bottom, every operation applied to these more complex data types translates into a whole lot of copies and arithmetic operations on individual bytes and words.

From the CPU’s point of view, even much of this manipulation consists of operating on integers that happen to represent addresses instead of data. So when a C program writes a zero to the 19th entry in a sequence of 4-byte integers, somewhere in the implementation of this operation the CPU will be adding 4 ⋅ 19 to a base address for the sequence to computer where to write this value. Unlike many higher-level languages, C allows the program direct access to address computations via pointer types, which are tricky enough to get their own chapter. Indeed, most of the structured types that C provides for representing more complicated data can best be understood as a thin layer of abstraction on top of pointers. We will see examples of these in later chapters as well.

For now, we concentrate on integer and floating-point types, and on the operations that can be applied to them.


Licenses and Attributions


Speak Your Mind

-->