Compiling & Debugging C

The GNU C compiler gcc

A C program will typically consist of one or more files whose names end with .c. To compile foo.c, you can type gcc foo.c. Assuming foo.c contains no errors egregious enough to be detected by the extremely forgiving C compiler, this will produce a file named a.out that you can then execute by typing ./a.out.

If you want to debug your program using gdb or give it a different name, you will need to use a longer command line. Here’s one that compiles foo.c to foo (run it using ./foo) and includes the information that gdb needs: gcc -g3 -o foo foo.c

By default, gcc doesn’t check everything that might be wrong with your program. But if you give it a few extra arguments, it will warn you about many (but not all) potential problems: gcc -g3 -Wall -o foo foo.c.

Make

For complicated programs involving multiple source files, you are probably better off using make than calling gcc directly. Make is a “rule-based expert system” that figures out how to compile programs given a little bit of information about their components.

For example, if you have a file called foo.c, try typing make foo and see what happens.

In general you will probably want to write a Makefile, which is named Makefile or makefile and tells make how to compile programs in the same directory. Here’s a typical Makefile:

# Any line that starts with a sharp is a comment and is ignored
# by Make.

# These lines set variables that control make's default rules.
# We STRONGLY recommend putting "-g3 -Wall" in your CFLAGS.
CC=gcc
CFLAGS=-g3 -Wall

# The next line is a dependency line.
# It says that if somebody types "make all"
# make must first make "hello-world".
# By default the left-hand-side of the first dependency is what you
# get if you just type "make" with no arguments.
all: hello-world

# How do we make hello-world?
# The dependency line says you need to first make hello-world.o
# and hello-library.o
hello-world: hello-world.o hello-library.o
	# Subsequent lines starting with a TAB character give
	# commands to execute.
	# This command uses make built-in variables to avoid
	# retyping (and getting things wrong):
	#  $@ = target hello-world
	#  $^ = dependencies hello-world.o and hello-library.o
	$(CC) $(CFLAGS) -o $@ $^
	# You can put whatever commands you want.
	echo "I just built hello-world!  Hooray!"

# Here we are saying that hello-world.o and hello-library.o
#  should be rebuilt whenever their corresponding source file
#  or hello-library.h changes.
# There are no commands attached to these dependency lines, so
#  make will have to figure out how to do that somewhere else
#  (probably from the builtin .c -> .o rule).
hello-world.o: hello-world.c hello-library.h
hello-library.o: hello-library.c hello-library.h

# Command lines can do more than just build things.  For example,
# "make test" will rebuild hello-world (if necessary) and then run it.
test: hello-world
	./hello-world

# This lets you type "make clean" and get rid of anything you can
# rebuild.  The $(RM) variable is predefined to "rm -f"
clean:
	$(RM) hello-world *.o

Given a Makefile, make looks at each dependency line and asks: (a) does the target on the left hand side exist, and (b) is it older than the files it depends on. If so, it looks for a set of commands for rebuilding the target, after first rebuilding any of the files it depends on; the commands it runs will be underneath some dependency line where the target appears on the left-hand side. It has built-in rules for doing common tasks like building .o files (which contain machine code) from .c files (which contain C source code). If you have a fake target like all above, it will try to rebuild everything all depends on because there is no file named all (one hopes).

Make gotchas

Make really really cares that the command lines start with a TAB character. TAB looks like eight spaces in Emacs and other editors, but it isn’t the same thing. If you put eight spaces in (or a space and a TAB), Make will get horribly confused and give you an incomprehensible error message about a “missing separator”. This misfeature is so scary that I avoided using make for years because I didn’t understand what was going on. Don’t fall into that trap—make really is good for you, especially if you ever need to recompile a huge program when only a few source files have changed.

If you use GNU Make, note that beginning with version 3.78, GNU Make prints a message that hints at a possible SPACEs-vs-TAB problem, like this:

$ make
Makefile:23:*** missing separator (did you mean TAB instead of 8 spaces?).  Stop.

If you need to repair a Makefile that uses spaces, one way of converting leading spaces into TABs is to use the unexpand program:

$ mv Makefile Makefile.old
$ unexpand Makefile.old > Makefile

Debugging tools

The standard debugger is gdb. Also useful is the memory error checker valgrind. Below are some notes on debugging in general and using these programs in particular.

Debugging in general

Basic method of all debugging:

  1. Know what your program is supposed to do.
  2. Detect when it doesn’t.
  3. Fix it.

A tempting mistake is to skip step 1, and just try randomly tweaking things until the program works. Better is to see what the program is doing internally, so you can see exactly where and when it is going wrong. A second temptation is to attempt to intuit where things are going wrong by staring at the code or the program’s output. Avoid this temptation as well: let the computer tell you what it is really doing inside your program instead of guessing.

Assertions

Every non-trivial C program should include <assert.h>, which gives you the assert macro (see Appendix B6 of K&R). The assert macro tests if a condition is true and halts your program with an error message if it isn’t:

#include <assert.h>

int
main(int argc, char **argv)
{
    assert(2+2 == 5);
    return 0;
}

Compiling and running this program produces the following output:

$ gcc -o no no.c
$ ./no
no: no.c:6: main: Assertion `2+2 == 5' failed.

Line numbers and everything, even if you compile with the optimizer turned on. Much nicer than a mere segmentation fault, and if you run it under the debugger, the debugger will stop exactly on the line where the assert failed so you can poke around and see why.

The GNU debugger gdb

The standard debugger on Linux is called gdb. This lets you run your program under remote control, so that you can stop it and see what is going on inside.

You can also use ddd, which is a graphical front-end for gdb. There is an extensive tutorial available for ddd, so we will concentrate on the command-line interface to gdb here.

Warning: Though gdb is rock-solid when running on an actual Linux kernel, if you are running on a different underlying operating system like Windows (including Windows Subsystem for Linux) or OS X, it may not work as well, either missing errors that it should catch or in some cases not starting at all. For OS X, you might also have better results using the standard OS X debugger lldb, which is similar enough to gdb to do everything gdb can do while being different enough that you will need to learn its own set of commands. Most IDEs that support C also include debugging tools.

Getting back to gdb, we’ll look at a contrived example. Suppose you have the following program bogus.c:

#include <stdio.h>

/* Print the sum of the integers from 1 to 1000 */
int
main(int argc, char **argv)
{
    int i;
    int sum;

    sum = 0;
    for(i = 0; i -= 1000; i++) {
        sum += i;
    }
    printf("%d\n", sum);
    return 0;
}

Let’s compile and run it and see what happens. Note that we include the flag -g3 to tell the compiler to include debugging information. This allows gdb to translate machine addresses back into identifiers and line numbers in the original program for us.

$ gcc -g3 -o bogus bogus.c
$ ./bogus
-34394132

That doesn’t look like the sum of 1 to 1000. So what went wrong? If we were clever, we might notice that the test in the for loop is using the mysterious -= operator instead of the <= operator that we probably want. But let’s suppose we’re not so clever right now—it’s four in the morning, we’ve been working on bogus.c for twenty-nine straight hours, and there’s a -= up there because in our befuddled condition we know in our bones that it’s the right operator to use. We need somebody else to tell us that we are deluding ourselves, but nobody is around this time of night. So we’ll have to see what we can get the computer to tell us.

The first thing to do is fire up gdb, the debugger. This runs our program in stop-motion, letting us step through it a piece at a time and watch what it is actually doing. In the example below gdb is run from the command line. You can also run it directly from Emacs with M-x gdb, which lets Emacs track and show you where your program is in the source file with a little arrow, or by running ddd, which wraps gdb in a graphical user interface.

$ gdb bogus
GNU gdb 4.17.0.4 with Linux/x86 hardware watchpoint and FPU support
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
(gdb) run
Starting program: /home/accts/aspnes/tmp/bogus 
-34394132

Program exited normally.

So far we haven’t learned anything. To see our program in action, we need to slow it down a bit. We’ll stop it as soon as it enters main, and step through it one line at a time while having it print out the values of the variables.

(gdb) break main
Breakpoint 1 at 0x8048476: file bogus.c, line 9.
(gdb) run
Starting program: /home/accts/aspnes/tmp/bogus 

Breakpoint 1, main (argc=1, argv=0xbffff9ac) at bogus.c:9
9           sum = 0;
(gdb) display sum
1: sum = 1
(gdb) n
10          for(i = 0; i -= 1000; i++)
1: sum = 0
(gdb) display i
2: i = 0
(gdb) n
11              sum += i;
2: i = -1000
1: sum = 0
(gdb) n
10          for(i = 0; i -= 1000; i++)
2: i = -1000
1: sum = -1000
(gdb) n
11              sum += i;
2: i = -1999
1: sum = -1000
(gdb) n
10          for(i = 0; i -= 1000; i++)
2: i = -1999
1: sum = -2999
(gdb) quit
The program is running.  Exit anyway? (y or n) y
$

Here we are using break main to tell the program to stop as soon as it enters main, display to tell it to show us the value of the variables i and sum whenever it stops, and n (short for next) to execute the program one line at a time.

When stepping through a program, gdb displays the line it will execute next as well as any variables you’ve told it to display. This means that any changes you see in the variables are the result of the previous displayed line. Bearing this in mind, we see that i drops from 0 to -1000 the very first time we hit the top of the for loop and drops to -1999 the next time. So something bad is happening in the top of that for loop, and if we squint at it a while we might begin to suspect that i -= 1000 is not the nice simple test we might have hoped it was.

My favorite gdb commands

help

Get a description of gdb’s commands.

run

Runs your program. You can give it arguments that get passed in to your program just as if you had typed them to the shell. Also used to restart your program from the beginning if it is already running.

quit

Leave gdb, terminating your program if necessary.

break

Set a breakpoint, which is a place where gdb will automatically stop your program. Some examples: - break somefunction stops before executing the first line somefunction. - break 117 stops before executing line number 117.

list

Show part of your source file with line numbers (handy for figuring out where to put breakpoints). Examples: - list somefunc lists all lines of somefunc. - list 117-123 lists lines 117 through 123.

next

Execute the next line of the program, including completing any procedure calls in that line.

step

Execute the next step of the program, which is either the next line if it contains no procedure calls, or the entry into the called procedure.

finish

Continue until you get out of the current procedure (or hit a breakpoint). Useful for getting out of something you stepped into that you didn’t want to step into.

cont

(Or continue). Continue until (a) the end of the program, (b) a fatal error like a Segmentation Fault or Bus Error, or (c) a breakpoint. If you give it a numeric argument (e.g., cont 1000) it will skip over that many breakpoints before stopping.

print

Print the value of some expression, e.g. print i.

display

Like print, but runs automatically every time the program stops. Useful for watching values that change often.

backtrace

Show all the function calls on the stack, with arguments. Can be abbreviated as bt. Do bt full if you also want to see local variables in each function.

set disable-randomization off

Not something you will need every day, but you should try this before running your program if it is producing segmentation faults outside of gdb but not inside. Normally the Linux kernel randomizes the position of bits of your program before running it, to make its response to buffer overflow attacks less predictable. By default, gdb turns this off so that the behavior of your program is consistent from one execution to the next. But sometimes this means that a pointer that had been bad with address randomization (causing a segmentation fault) turns out not to be bad without. This option will restore the standard behavior even inside gdb and give you some hope of finding what went wrong.

Debugging strategies

In general, the idea behind debugging is that a bad program starts out OK, but after executing for a while it gets confused and starts misbehaving. If you can find the exact moment in its execution where it first starts acting up, you can see exactly what piece of code is causing the problem and have a reasonably good chance of being able to fix it. So a typical debugging strategy is to put in a breakpoint (using break) somewhere before the confusion hits, “instrument” the program (using display) so that you can watch it getting confused, and step through it (using next, step, or breakpoints and cont) until you find the point of failure. Sometimes this process requires restarting the program (using run) if you skip over this point without noticing it immediately.

For large or long-running programs, it often makes sense to do binary search to find the point of failure. Put in a breakpoint somewhere (say, on a function that is called many times or at the top of a major loop) and see what the state of the program is after going through the breakpoint 1000 times (using something like cont 1000). If it hasn’t gone off the rails yet, try restarting and going through 2000 times. Eventually you bracket the error as occurring (for example) somewhere between the 4000th and 8000th occurrence of the breakpoint. Now try stepping through 6000 times; if the program is looking good, you know the error occurs somewhere between the 6000th and 8000th breakpoint. Adding a dozen or so more experiments like this should be enough isolate the bug to a specific line of code.

The key to all debugging is knowing what your code is supposed to do. If you don’t know this, then you can’t tell the cases where it is working from the cases where it isn’t. If you’re confused about what your code is supposed to be doing, you need to figure out what exactly you want it to do. If you can figure that out, often it will be obvious what is going wrong. If it isn’t obvious, you can always go back to gdb.

Common applications of gdb

Here are some typical classes of bugs and how to squish them with gdb. (The same instructions usually work for ddd.)

Watching your program run
  1. Compile your program with the -g3 flag. You can still run gdb if you don’t do this, but it won’t be able to show you variable names or source lines.
  2. Run gdb with gdb programname.
  3. Type break main to stop at the start of the main routine.
  4. Run your program with run arguments. The run command stands in for the program name. You can also redirect input as in the shell with run arguments < filename.
  5. When the program stops, you can display variables in the current function or expressions involving these variables using display, as in display x, display a[i], display z+17. In ddd, double-clicking on a variable name will have the same effect. Use undisplay to get rid of any displays you don’t want.
  6. To step through your program, use next (always goes to next line in the current function, not dropping down into function calls), step (go to the next executed line, even if it is inside a called function), finish (run until the current function returns), and cont (run until the end of the program or the next breakpoint).

This can be handy if you don’t particularly know what is going on in your program and want to see.

Dealing with failed assertions

Run the program as described above. When you hit the bad assert, you will stop several functions deep from where it actually happened. Use up to get up to the function that has the call to assert then use print or display to figure out what is going on.

Example program:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int
main(int argc, char **argv)
{
    int x;

    x = 3;

    assert(x+x == 4);

    return 0;
}

With gdb in action:

$ gcc -g3 -o assertFailed assertFailed.c 
$ gdb assertFailed
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from assertFailed...done.
(gdb) run
Starting program: /home/aspnes/g/classes/223/notes/examples/debugging/assertFailed 
assertFailed: assertFailed.c:12: main: Assertion `x+x == 4' failed.

Program received signal SIGABRT, Aborted.
0xb7fdd416 in __kernel_vsyscall ()
(gdb) up
#1  0xb7e43577 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) up
#2  0xb7e469a3 in __GI_abort () at abort.c:89
89  abort.c: No such file or directory.
(gdb) up
#3  0xb7e3c6c7 in __assert_fail_base (fmt=0xb7f7a8b4 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x804850f "x+x == 4", file=file@entry=0x8048500 "assertFailed.c", 
    line=line@entry=12, function=function@entry=0x8048518 <__PRETTY_FUNCTION__.2355> "main") at assert.c:92
92  assert.c: No such file or directory.
(gdb) up
#4  0xb7e3c777 in __GI___assert_fail (assertion=0x804850f "x+x == 4", file=0x8048500 "assertFailed.c", line=12, 
    function=0x8048518 <__PRETTY_FUNCTION__.2355> "main") at assert.c:101
101 in assert.c
(gdb) up
#5  0x0804845d in main (argc=1, argv=0xbffff434) at assertFailed.c:12
12      assert(x+x == 4);
(gdb) print x
$1 = 3

Here we see that x has value 3, which may or may not be the right value, but certainly violates the assertion.

Dealing with segmentation faults

Very much like the previous case. Run gdb until the segmentation fault hits, then look around for something wrong.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int
main(int argc, char **argv)
{
    int a[1000];
    int i;

    i = -1771724;

    printf("%d\n", a[i]);

    return 0;
}


$ gcc -g3 -o segmentationFault segmentationFault.c 
$ gdb segmentationFault
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
[...]
Reading symbols from segmentationFault...done.
(gdb) run
Starting program: /home/aspnes/g/classes/223/notes/examples/debugging/segmentationFault 

Program received signal SIGSEGV, Segmentation fault.
0x08048435 in main (argc=1, argv=0xbffff434) at segmentationFault.c:13
13      printf("%d\n", a[i]);
(gdb) print a[i]
$1 = 0
(gdb) print i
$2 = -1771724

Curiously, gdb has no problem coming up with a value for a[i]. But i looks pretty suspicious.

Dealing with infinite loops

Run gdb, wait a while, then hit control-C. This will stop gdb wherever it is. If you have an infinite loop, it’s likely that you will be in it, and that the index variables will be doing something surprising. Use display to keep an eye on them and do next a few times.

$ gcc -g3 -o infiniteLoop infiniteLoop.c
$ gdb infiniteLoop
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
[...]
Reading symbols from infiniteLoop...done.
(gdb) run
Starting program: /home/aspnes/g/classes/223/notes/examples/debugging/infiniteLoop 
^C
Program received signal SIGINT, Interrupt.
main (argc=1, argv=0xbffff434) at infiniteLoop.c:11
11          i *= 37;
(gdb) display i
1: i = 0
(gdb) n
10      for(i = 0; i < 10; i += 0) {
1: i = 0
(gdb) n
11          i *= 37;
1: i = 0
(gdb) n
10      for(i = 0; i < 10; i += 0) {
1: i = 0
(gdb) n
11          i *= 37;
1: i = 0
(gdb) n
10      for(i = 0; i < 10; i += 0) {
1: i = 0
(gdb) n
11          i *= 37;
1: i = 0
Mysterious variable changes

Sometimes pointer botches don’t manifest as good, honest segmentation faults but instead as mysterious changes to seemingly unrelated variables. You can catch these in the act using conditional breakpoints. The downside is that you can only put conditional breakpoints on particular lines.

Here’s a program that violates array bounds (which C doesn’t detect):

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int
main(int argc, char **argv)
{
    int x;
    int a[10];
    int i;

    x = 5;

    for(i = -1; i < 11; i++) {
        a[i] = 37;
    }

    assert(x == 5);

    return 0;
}

In the debugging session below, it takes a couple of attempts to catch the change in x before hitting the failed assertion.

$ gcc -g3 -o mysteryChange mysteryChange.c 
$ gdb mysteryChange
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
[...]
Reading symbols from mysteryChange...done.
(gdb) run
Starting program: /home/aspnes/g/classes/223/notes/examples/debugging/mysteryChange 
mysteryChange: mysteryChange.c:18: main: Assertion `x == 5' failed.

Program received signal SIGABRT, Aborted.
0xb7fdd416 in __kernel_vsyscall ()
(gdb) list main
2   #include <stdlib.h>
3   #include <assert.h>
4   
5   int
6   main(int argc, char **argv)
7   {
8       int x;
9       int a[10];
10      int i;
11  
(gdb) list
12      x = 5;
13  
14      for(i = -1; i < 11; i++) {
15          a[i] = 37;
16      }
17  
18      assert(x == 5);
19  
20      return 0;
21  }
(gdb) break 14 if x != 5
Breakpoint 1 at 0x804842e: file mysteryChange.c, line 14.
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y

Starting program: /home/aspnes/g/classes/223/notes/examples/debugging/mysteryChange 
mysteryChange: mysteryChange.c:18: main: Assertion `x == 5' failed.

Program received signal SIGABRT, Aborted.
0xb7fdd416 in __kernel_vsyscall ()
(gdb) break 15 if x != 5
Breakpoint 2 at 0x8048438: file mysteryChange.c, line 15.
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y

Starting program: /home/aspnes/g/classes/223/notes/examples/debugging/mysteryChange 

Breakpoint 2, main (argc=1, argv=0xbffff434) at mysteryChange.c:15
15          a[i] = 37;
(gdb) print i
$1 = 0
(gdb) print a[0]
$2 = 134520832
(gdb) print a[-1]
$3 = 37
(gdb) print x
$4 = 37

One thing to note is that a breakpoint stops before the line it is on executes. So when we hit the breakpoint on line 15 (gdb having observed that x != 5 is true), i has the value 0, but the damage happened in the previous interation when i was -1. If we want to see exactly what happened then, we’d need to go back in time. We can’t do this, but we could set an earlier breakpoint and run the program again.

Valgrind

The valgrind program can be used to detect some (but not all) common errors in C programs that use pointers and dynamic storage allocation. You can run valgrind on your program by putting valgrind at the start of the command line:

valgrind ./my-program arg1 arg2 < test-input

This will run your program and produce a report of any allocations and de-allocations it did. It will also warn you about common errors like using unitialized memory, dereferencing pointers to strange places, writing off the end of blocks allocated using malloc, or failing to free blocks.

You can suppress all of the output except errors using the -q option, like this:

valgrind -q ./my-program arg1 arg2 < test-input

You can also turn on more tests, e.g.

valgrind -q --tool=memcheck --leak-check=yes ./my-program arg1 arg2 < test-input

See valgrind --help for more information about the (many) options, or look at the documentation at http://valgrind.org/ for detailed information about what the output means. For some common valgrind messages, see the examples section below.

If you want to run valgrind on your own machine, you may be able to find a version that works at http://valgrind.org. Unfortunately, this is only likely to work if you are running a Unix-like operating system. This does include Linux (either on its own or inside Windows Subsystem for Linux) and OSX, but it does not include stock Windows.

Compilation flags

You can run valgrind on any program (try valgrind ls); it does not require special compilation. However, the output of valgrind will be more informative if you compile your program with debugging information turned on using the -g or -g3 flags (this is also useful if you plan to watch your program running using gdb, ).

Automated testing

Unless otherwise specified, automated testing of your program will be done using the script in /c/cs223/bin/vg; this runs /c/cs223/bin/valgrind with the --tool=memcheck, --leak-check=yes, and -q options, throws away your program’s output, and replaces it with valgrind’s output. If you have a program named ./prog, running /c/cs223/bin/vg ./prog should produce no output.

Examples of some common valgrind errors

Here are some examples of valgrind output. In each case the example program is compiled with -g3 so that valgrind can report line numbers from the source code.

You may also find it helpful to play with this demo program.

Uninitialized values

Consider this unfortunate program, which attempts to compare two strings, one of which we forgot to ensure was null-terminated:

#include <stdio.h>

int
main(int argc, char **argv)
{
    char a[2];

    a[0] = 'a';

    if(!strcmp(a, "a")) {
        puts("a is \"a\"");
    }

    return 0;
}

Run without valgrind, we see no errors, because we got lucky and it turned out our hand-built string was null-terminated anyway:

$ ./uninitialized 
a is "a"

But valgrind is not fooled:

$ valgrind -q ./uninitialized 
==4745== Conditional jump or move depends on uninitialised value(s)
==4745==    at 0x4026663: strcmp (mc_replace_strmem.c:426)
==4745==    by 0x8048435: main (uninitialized.c:10)
==4745== 
==4745== Conditional jump or move depends on uninitialised value(s)
==4745==    at 0x402666C: strcmp (mc_replace_strmem.c:426)
==4745==    by 0x8048435: main (uninitialized.c:10)
==4745== 
==4745== Conditional jump or move depends on uninitialised value(s)
==4745==    at 0x8048438: main (uninitialized.c:10)
==4745== 

Here we get a lot of errors, but they are all complaining about the same call to strcmp. Since it’s unlikely that strcmp itself is buggy, we have to assume that we passed some uninitialized location into it that it is looking at. The fix is to add an assignment a[1] = '\0' so that no such location exists.

Bytes definitely lost

Here is a program that calls malloc but not free:

#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    char *s;

    s = malloc(26);

    return 0;
}

With no extra arguments, valgrind will not look for this error. But if we turn on --leak-check=yes, it will complain:

$ valgrind -q --leak-check=yes ./missing_free
==4776== 26 bytes in 1 blocks are definitely lost in loss record 1 of 1
==4776==    at 0x4024F20: malloc (vg_replace_malloc.c:236)
==4776==    by 0x80483F8: main (missing_free.c:9)
==4776== 

Here the stack trace in the output shows where the bad block was allocated: inside malloc (specifically the paranoid replacement malloc supplied by valgrind), which was in turn called by main in line 9 of missing_free.c. This lets us go back and look at what block was allocated in that line and try to trace forward to see why it wasn’t freed. Sometimes this is as simple as forgetting to include a free statement anywhere, but in more complicated cases it may be because I somehow lose the pointer to the block by overwriting the last variable that points to it or by embedding it in some larger structure whose components I forget to free individually.

Invalid write or read operations

These are usually operations that you do off the end of a block from malloc or on a block that has already been freed.

An example of the first case:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int
main(int argc, char **argv)
{
    char *s;

    s = malloc(1);
    s[0] = 'a';
    s[1] = '\0';

    puts(s);

    return 0;
}


==7141== Invalid write of size 1
==7141==    at 0x804843B: main (invalid_operations.c:12)
==7141==  Address 0x419a029 is 0 bytes after a block of size 1 alloc'd
==7141==    at 0x4024F20: malloc (vg_replace_malloc.c:236)
==7141==    by 0x8048428: main (invalid_operations.c:10)
==7141== 
==7141== Invalid read of size 1
==7141==    at 0x4026063: __GI_strlen (mc_replace_strmem.c:284)
==7141==    by 0x409BCE4: puts (ioputs.c:37)
==7141==    by 0x8048449: main (invalid_operations.c:14)
==7141==  Address 0x419a029 is 0 bytes after a block of size 1 alloc'd
==7141==    at 0x4024F20: malloc (vg_replace_malloc.c:236)
==7141==    by 0x8048428: main (invalid_operations.c:10)
==7141== 

An example of the second:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int
main(int argc, char **argv)
{
    char *s;

    s = malloc(2);
    free(s);

    s[0] = 'a';
    s[1] = '\0';

    puts(s);

    return 0;
}


==7144== Invalid write of size 1
==7144==    at 0x804846D: main (freed_block.c:13)
==7144==  Address 0x419a028 is 0 bytes inside a block of size 2 free'd
==7144==    at 0x4024B3A: free (vg_replace_malloc.c:366)
==7144==    by 0x8048468: main (freed_block.c:11)
==7144== 
==7144== Invalid write of size 1
==7144==    at 0x8048477: main (freed_block.c:14)
==7144==  Address 0x419a029 is 1 bytes inside a block of size 2 free'd
==7144==    at 0x4024B3A: free (vg_replace_malloc.c:366)
==7144==    by 0x8048468: main (freed_block.c:11)
==7144== 
==7144== Invalid read of size 1
==7144==    at 0x4026058: __GI_strlen (mc_replace_strmem.c:284)
==7144==    by 0x409BCE4: puts (ioputs.c:37)
==7144==    by 0x8048485: main (freed_block.c:16)
[... more lines of errors deleted ...]

In both cases the problem is that we are operating on memory that is not guaranteed to be allocated to us. For short programs like these, we might get lucky and have the program work anyway. But we still want to avoid bugs like this because we might not get lucky.

How do we know which case is which? If I write off the end of an existing block, I’ll see something like Address 0x419a029 is 0 bytes after a block of size 1 alloc'd, telling me that I am working on an address after a block that is still allocated. When I try to write to a freed block, the message changes to Address 0x419a029 is 1 bytes inside a block of size 2 free'd, where the free'd part tells me I freed something I probably shouldn’t have. Fixing the first class of bugs is usually just a matter of allocating a bigger block (but don’t just do this without figuring out why you need a bigger block, or you’ll just be introducing random mutations into your code that may cause other problems elsewhere). Fixing the second class of bugs usually involves figuring out why you freed this block prematurely. In some cases you may need to re-order what you are doing so that you don’t free a block until you are completely done with it.

A tempting but usually bad approach to debugging is to put lots of printf statements in your code to show what is going on. The problem with this compared to using assert is that there is no built-in test to see if the output is actually what you’d expect. The problem compared to gdb is that it’s not flexible: you can’t change your mind about what is getting printed out without editing the code. A third problem is that the output can be misleading: in particular, printf output is usually buffered, which means that if your program dies suddenly there may be output still in the buffer that is never flushed to stdout. This can be very confusing, and can lead you to believe that your program fails earlier than it actually does.

If you really need to use printf or something like it for debugging output, here are a few rules of thumb to follow to mitigate the worst effects:

  1. Use fprintf(stderr, ...) instead of printf(...); this allows you to redirect your program’s regular output somewhere that keeps it separate from the debugging output (but beware of misleading interleaving of the two streams—buffering may mean that output to stdout and stderr appears to arrive out of order). It also helps that output to stderr is usually unbuffered, avoiding the problem of lost output.
  2. If you must output to stdout, put fflush(stdout) after any output operation you suspect is getting lost in the buffer. The fflush function forces any buffered output to be emitted immediately.
  3. Keep all arguments passed to printf as simple as possible and beware of faults in your debugging code itself. If you write printf("a[key] == %d\n", a[key]) and key is some bizarre value, you will never see the result of this printf because your program will segfault while evaluating a[key]. Naturally, this is more likely to occur if the argument is a[key]->size[LEFTOVERS].cleanupFunction(a[key]) than if it’s just a[key], and if it happens it will be harder to figure out where in this complex chain of array indexing and pointer dereferencing the disaster happened. Better is to wait for your program to break in gdb, and use the print statement on increasingly large fragments of the offending expression to see where the bogus array index or surprising null pointer is hiding.
  4. Wrap your debugging output in an #ifdef so you can turn it on and off easily.

Bearing in mind that this is a bad idea, here is an example of how one might do it as well as possible:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

/* initialize the application */
void
init(void)
{
    int x;

    x = *((int *) 0xbad1dea); /* if we are lucky, maybe the optimizer will remove it? */
}

int
main(int argc, char **argv)
{
    init();

#ifdef DEBUGGING_OUTPUT
    /*
     * this type of debugging output is not recommended
     * but if you do it anyway:
     *
     * 1. Use stderr, which flushes automatically.
     * 2. Be wary of buffered data on stdout.
     * 3. Wrap your debugging statement in an #ifdef,
     *    so it is not active by default.
     */
    fputs("Returned from init() in main()\n", stderr);
#endif

    return 0;
}

Note that we get much more useful information if we run this under gdb (which will stop exactly on the bad line in init), but not seeing the result of the fputs at least tells us something.

Performance tuning

Chapter 7 of Kernighan and Pike, The Practice of Programming (Addison-Wesley, 1998) gives an excellent overview of performance tuning. This page will be limited to some Linux-specific details and an example.

Timing under Linux

Use time, e.g.

$ time wc /usr/share/dict/words
 45378  45378 408865 /usr/share/dict/words

real    0m0.010s
user    0m0.006s
sys     0m0.004s

This measures “real time” (what it sounds like), “user time” (the amount of time the program runs), and “system time” (the amount of time the operating system spends supporting your program, e.g. by loading it from disk and doing I/O). Real time need not be equal to the sum of user time and system time, since the operating system may be simultaneously running other programs.

Particularly for fast programs, times can vary from one execution to the next, e.g.

$ time wc /usr/share/dict/words
 45378  45378 408865 /usr/share/dict/words

real    0m0.009s
user    0m0.008s
sys     0m0.001s
$ time wc /usr/share/dict/words
 45378  45378 408865 /usr/share/dict/words

real    0m0.009s
user    0m0.007s
sys     0m0.002s

This arises because of measurement errors and variation in how long different operations take. But usually the variation will not be much.

Note also that time is often a builtin operation of your shell, so the output format may vary depending on what shell you use.

Profiling with valgrind

The problem with time is that it only tells you how much time your whole program took, but not where it spent its time. This is similar to looking at a program without a debugger: you can’t see what’s happening inside. If you want to see where your program is spending its time, you need to use a profiler.

The specific profiler we will use in this section is callgrind, a tool built into valgrind, which we’ve been using elsewhere to detect pointer disasters and storage leaks. Full documentation for callgrind can be found at http://valgrind.org/docs/manual/cl-manual.html, but we’ll give an example of typical use here.

Here is an example of a program that is unreasonably slow for what it is doing.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#include <string.h>

/* concatenate n copies of src into dest */
/* returns dest */
/* caller is responsible for making dest large enough */
char *
replicate(char *dest, const char *src, int n)
{
    /* truncate dest */
    dest[0] = '\0';

    /* BAD: each call to strcat requires walking across dest */
    for(int i = 0; i < n; i++) {
        strcat(dest, src);
    }

    return dest;
}

/* like strcpy, but only copies characters at indices 0, 2, 4, ...
   from src to dest */
char *
copyEvenCharacters(char *dest, const char *src)
{
    int i;
    int j;

    /* BAD: Calls strlen on every pass through the loop */
    for(i = 0, j = 0; i < strlen(src); i += 2, j++) {
        dest[j] = src[i];
    }

    dest[j] = '\0';

    return dest;
}

/* like puts, but stops after maxChars characters */
void
putsRestricted(const char *s, int maxChars)
{
    for(int i = 0; s[i]; i++) {
        if(i >= maxChars) {
            printf("[%zu more]\n", strlen(s) - i);
            return;
        } else {
            putchar(s[i]);
        }
    }

    putchar('\n');
}

#define SMALL (10)
#define BIG (100000)
#define PATTERN "abcd"

#define BUFFER_SIZE (BIG * strlen(PATTERN) + 1)

/* how many characters to print at once */
#define MAX_CHARS (40)

int
main(int argc, char **argv)
{
    char *buffer;
    char *half;

    buffer = malloc(BUFFER_SIZE);
    half = malloc(BUFFER_SIZE);

    putsRestricted(replicate(buffer, PATTERN, SMALL), MAX_CHARS);
    putsRestricted(copyEvenCharacters(half, buffer), MAX_CHARS);

    putsRestricted(replicate(buffer, PATTERN, BIG), MAX_CHARS);
    putsRestricted(copyEvenCharacters(half, buffer), MAX_CHARS);

    free(half);
    free(buffer);

    return 0;
}

This program defines several functions for processing null-terminated strings: replicate, which concatenates many copies of some string together, and copyEvenCharacters, which copies every other character in a string to a given buffer. Unfortunately, both functions contain a hidden inefficiency arising from their use of the standard C library string functions.

The runtime of the program is not terrible, but not as sprightly as we might expect given that we are working on less than half a megabyte of text:

$ time ./slow
abcdabcdabcdabcdabcdabcdabcdabcdabcdabcd
acacacacacacacacacac
abcdabcdabcdabcdabcdabcdabcdabcdabcdabcd[399960 more]
acacacacacacacacacacacacacacacacacacacac[199960 more]

real    0m3.171s
user    0m3.164s
sys     0m0.001s

So we’d like to make it faster.

In this particular case, the programmer was kind enough to identify the problems in the original code in comments, but we can’t always count on that. But we can use the callgrind tool built into valgrind to find out where our program is spending most of its time.

To run callgrind, call valgrind with the --tool=callgrind option, like this:

$ time valgrind --tool=callgrind ./slow
==5714== Callgrind, a call-graph generating cache profiler
==5714== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
==5714== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==5714== Command: ./slow
==5714== 
==5714== For interactive control, run 'callgrind_control -h'.
abcdabcdabcdabcdabcdabcdabcdabcdabcdabcd
acacacacacacacacacac
abcdabcdabcdabcdabcdabcdabcdabcdabcdabcd[399960 more]
acacacacacacacacacacacacacacacacacacacac[199960 more]
==5714== 
==5714== Events    : Ir
==5714== Collected : 15339385208
==5714== 
==5714== I   refs:      15,339,385,208

real    1m31.965s
user    1m31.515s
sys 0m0.037s

I’ve include time at the start of the command line to make it clear just how much of a slowdown you can expect from using valgrind for this purpose. Note that valgrind only prints a bit of summary data while executing. To get a full report, we use a separate program callgrind_annotate:

$ callgrind_annotate --auto=yes --inclusive=yes > slow.callgrind

Here I sent the output to a file slow.callgrind so I could look at it in more detail in my favorite text editor, since the actual report is pretty huge. The --auto=yes argument tells callgrind_annotate to show how many instructions were executed as part of each line of source code, and the --inclusive=yes argument tells use that in its report it should charge instructions executed in some function both to that function and to all functions responsible for calling it. This is usually what you want to figure out where things are going wrong.

The first thing to look at in slow.callgrind is the table showing which functions are doing most of the work:

--------------------------------------------------------------------------------
            Ir  file:function
--------------------------------------------------------------------------------
15,339,385,208  ???:0x0000000000000dd0 [/usr/lib64/ld-2.25.so]
15,339,274,304  ???:_start [/home/accts/aspnes/g/classes/223/notes/examples/profiling/slow]
15,339,274,293  /usr/src/debug/glibc-2.25-123-gedcf13e25c/csu/../csu/libc-start.c:(below main) [/usr/lib64/libc-2.25.so]
15,339,273,103  slow.c:main [/home/accts/aspnes/g/classes/223/notes/examples/profiling/slow]
15,339,273,103  /home/accts/aspnes/g/classes/223/notes/examples/profiling/slow.c:main
11,264,058,263  slow.c:copyEvenCharacters [/home/accts/aspnes/g/classes/223/notes/examples/profiling/slow]
11,260,141,740  /usr/src/debug/glibc-2.25-123-gedcf13e25c/string/../sysdeps/x86_64/strlen.S:strlen [/usr/lib64/ld-2.25.so]
 4,075,049,055  slow.c:replicate [/home/accts/aspnes/g/classes/223/notes/examples/profiling/slow]
 4,074,048,083  /usr/src/debug/glibc-2.25-123-gedcf13e25c/string/../sysdeps/x86_64/multiarch/strcat-ssse3.S:__strcat_ssse3 [/usr/lib64/libc-2.25.so]
       108,795  /usr/src/debug/glibc-2.25-123-gedcf13e25c/elf/rtld.c:_dl_start [/usr/lib64/ld-2.25.so]

Since each function is charged for work done by its children, the top of the list includes various setup functions included automatically by the C compiler, followed by main. Inside main, we see that the majority of the work is done in copyEvenCharacters, with a substantial chunk in replicate. The suspicious similarity in numbers suggests that most of these instructions in copyEvenCharacters are accounted for by calls to strlen and in replicate by calls to __strcat_sse3, which happens to be an assembly-language implementation of strcat (hence the .S in the source file name) that uses the special SSE instructions in the x86 instruction set to speed up copying.

We can confirm this suspicion by looking at later parts of the file, which annotate the source code with instruction counts.

The annotated version of slow.c includes this annotated version of replicate, showing roughly 4 billion instructions executed in __strcat_sse3:

        .  char *
        .  replicate(char *dest, const char *src, int n)
       12  {
        .      /* truncate dest */
        4      dest[0] = '\0';
        .  
        .      /* BAD: each call to strcat requires walking across dest */
  400,050      for(int i = 0; i < n; i++) {
  600,064          strcat(dest, src);
      836  => /usr/src/debug/glibc-2.25-123-gedcf13e25c/elf/../sysdeps/x86_64/dl-trampoline.h:_dl_runtime_resolve_xsave (1x)
4,074,048,083  => /usr/src/debug/glibc-2.25-123-gedcf13e25c/string/../sysdeps/x86_64/multiarch/strcat-ssse3.S:__strcat_ssse3 (100009x)
        .      }
        .  
        2      return dest;
        4  }

Similarly, the annotated version of copyEvenCharacters shows that 11 billion instructions were executed in strlen:

        .  char *
        .  copyEvenCharacters(char *dest, const char *src)
       12  {
        .      int i;
        .      int j;
        .  
        .      /* BAD: Calls strlen on every pass through the loop */
2,000,226      for(i = 0, j = 0; i < strlen(src); i += 2, j++) {
11,260,056,980  => /usr/src/debug/glibc-2.25-123-gedcf13e25c/string/../sysdeps/x86_64/strlen.S:strlen (200021x)
      825  => /usr/src/debug/glibc-2.25-123-gedcf13e25c/elf/../sysdeps/x86_64/dl-trampoline.h:_dl_runtime_resolve_xsave (1x)
2,000,200          dest[j] = src[i];
        .      }
        .  
       10      dest[j] = '\0';
        .  
        2      return dest;
        8  }

This gives a very strong hint for fixing the program: cut down on the cost of calling strlen and strcat.

Fixing copyEvenCharacters is trivial. Because the length of src doesn’t change, we can call strlen once and save the value in a variable:

char *
copyEvenCharacters(char *dest, const char *src)
{
    int i;
    int j;
    size_t len;

    /* GOOD: Calls strlen only once */
    len = strlen(src);
    for(i = 0, j = 0; i < len; i += 2, j++) {
        dest[j] = src[i];
    }

    dest[j] = '\0';

    return dest;
}

Fixing replicate is trickier. The trouble with using strcat is that every time we call strcat(dest, src), strcat has to scan down the entire dest string to find the end, which (a) gets more expensive as dest gets longer, and (b) involves passing over the same non-null initial characters over and over again each time we want to add a few more characters. The effect of this is that we turn what should be an O(n)-time process of generating a string of n characters into something that looks more like O(n2). We can fix this by using pointer arithmetic to keep track of the end of dest ourselves, which also allows us to replace strcat with memcpy, which is likely to be faster since it doesn’t have to check for nulls. Here’s the improved version:

char *
replicate(char *dest, const char *src, int n)
{
    size_t len = strlen(src);
    char *tail = dest;

    /* GOOD: each call to memcpy only copies n*strlen(src) bytes */
    for(int i = 0; i < n; i++, tail += len) {
        memcpy(tail, src, len);
    }

    /* tack on final null */
    *tail = '\0';

    return dest;
}

The result of applying both of these fixes can be found in fast.c. This runs much faster than slow:

abcdabcdabcdabcdabcdabcdabcdabcdabcdabcd
acacacacacacacacacac
abcdabcdabcdabcdabcdabcdabcdabcdabcdabcd[399960 more]
acacacacacacacacacacacacacacacacacacacac[199960 more]

real    0m0.003s
user    0m0.001s
sys     0m0.001s

Profiling with gprof

If you can’t use valgrind for profiling, don’t like the output you get from it, or are annoyed by the huge slowdown when profiling your program, you may be able to get similar results from an older program gprof, which is closely tied to the gcc compiler. Unlike valgrind, which simulates an x86 CPU one machine-code instruction at a time, gprof works by having gcc add extra code to your program to track function calls and do sampling at runtime to see where your program is spending its time. The cost of this approach is that you get a bit less accuracy. I have also found gprof to be tricky to get working right on some operating systems.

Here’s a short but slow program for calculating the number of primes less than some limit passed as argv[1]:

#include <stdio.h>
#include <stdlib.h>

/* return 1 if n is prime, 0 otherwise */
int
isPrime(int n)
{
    int factor;

    if(n < 2) return 0;
    /* else */
    for(factor = 2; factor < n; factor++) {
        if(n % factor == 0) return 0;
    }
    /* else */
    return 1;
}

/* return number of primes < n */
int
countPrimes(int n)
{
    int i;
    int count;

    count = 0;

    for(i = 0; i < n; i++) {
        if(isPrime(i)) count++;
    }

    return count;
}

int
main(int argc, char **argv)
{
    if(argc != 2) {
        fprintf(stderr, "Usage: %s n\n", argv[0]);
        return 1;
    }

    printf("%d\n", countPrimes(atoi(argv[1])));

    return 0;
}

And now we’ll time countPrimes 100000:

$ gcc -g3 -o countPrimes countPrimes.c 
$ time ./countPrimes 100000
9592

real    0m4.711s
user    0m4.608s
sys 0m0.004s

This shows that the program took just under five seconds of real time, of which most was spent in user mode and a very small fraction was spent in kernel (sys) mode. The user-mode part corresponds to the code we wrote and any library routines we call that don’t require special privileges from the operation system. The kernel-mode part will mostly be I/O (not much in this case). Real time is generally less useful than CPU time, because it depends on how loaded the CPU is. Also, none of these times are especially precise, because the program only gets charged for time on a context switch (when it switches between user and kernel mode or some other program takes over the CPU for a bit) or when the kernel decides to see what it is up to (typically every 10 milliseconds).

The overall cost is not too bad, but the reason I picked 100000 and not some bigger number was that it didn’t finish fast enough for larger inputs. We’d like to see why it is taking so long, to have some idea what to try to speed up. So we’ll compile it with the -pg option to gcc, which inserts profiling code that counts how many times each function is called and how long (on average) each call takes.

Because the profile is not very smart about shared libraries, we also including the --static option to force the resulting program to be statically linked. This means that all the code that is used by the program is baked into the executable instead of being linked in at run-time. (Normally we don’t do this because it makes for big executables and big running programs, since statically-linked libraries can’t be shared between more than one running program.)

$ gcc -pg --static -g3 -o countPrimes countPrimes.c 
$ time ./countPrimes 100000
9592

real    0m4.723s
user    0m4.668s
sys 0m0.000s

Hooray! We’ve made the program slightly slower. But we also just produced a file gmon.out that we can read with gprof. Note that we have to pass the name of the program so that gprof can figure out which executable generated gmon.out.

$ gprof countPrimes
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
100.00      4.66     4.66   100000     0.00     0.00  isPrime
  0.00      4.66     0.00        1     0.00     4.66  countPrimes
  0.00      4.66     0.00        1     0.00     4.66  main

[...much explanatory text deleted]

It looks like we are spending all of our time in isPrime, at least if we read the columns on the left. The per-call columns are not too helpful because of granularity: isPrime is too fast for the profiler to wake up and detect how long it runs for. The total columns are less suspicious because they are obtained by sampling: from time to time, the profiler looks and sees what function it’s in, and charges each function a fraction of the total CPU time proportional to how often it gets sampled. So we probable aren’t really spending zero time in countPrimes and main, but the amount of time we do spend is small enough not to be detected.

This is handy because it means we don’t need to bother trying to speed up the rest of the program. We have two things we can try:

  1. Call isPrime less.
  2. Make isPrime faster.

Let’s start by seeing if we can make isPrime faster.

What isPrime is doing is testing if a number n is prime by the most direct way possible: dividing by all numbers less than n until it finds a factor. That’s a lot of divisions: if n is indeed prime, it’s linear in n. Since division is a relatively expensive operation, the first thing to try is to get rid of some.

Here’s a revised version of isPrime:

/* return 1 if n is prime, 0 otherwise */
int
isPrime(int n)
{
    int factor;

    if(n < 2) { return 0; }
    if(n % 2 == 0) {
        /* special case for the only even prime */
        return n == 2;
    }
    /* else */
    for(factor = 3; factor < n; factor+=2) {
        if(n % factor == 0) return 0;
    }
    /* else */
    return 1;
}

The trick is to check first if n is divisible by 2, and only test odd potential factors thereafter. This requires some extra work to handle 2, but maybe the extra code complexity will be worth it.

Let’s see how the timing goes:

$ gcc -pg --static -g3 -o countPrimes ./countPrimesSkipEvenFactors.c 
$ time ./countPrimes 100000
9592

real    0m2.608s
user    0m2.400s
sys 0m0.004s
$ gprof countPrimes
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
100.00      2.29     2.29   100000     0.00     0.00  isPrime
  0.00      2.29     0.00        1     0.00     2.29  countPrimes
  0.00      2.29     0.00        1     0.00     2.29  main

[...]

Twice as fast! And the answer is still the same, too—this is important.

Can we test even fewer factors? Suppose n has a non-trivial factor x. Then n equals x*y for some y which is also nontrivial. One of x or y will be no bigger than the square root of n. So perhaps we can stop when we reach the square root of n,

Let’s try it:

#include <math.h>

/* return 1 if n is prime, 0 otherwise */
int
isPrime(int n)
{
    int factor;

    if(n < 2) { return 0; }
    if(n % 2 == 0) {
        /* special case for the only even prime */
        return n == 2;
    }
    /* else */
    for(factor = 3; factor < sqrt(n)+1; factor+=2) {
        if(n % factor == 0) return 0;
    }
    /* else */
    return 1;
}
}

I added +1 to the return value of sqrt both to allow for factor to be equal to the square root of n, and because the output of sqrt is not exact, and it would be embarrassing if I announced that 25 was prime because I stopped at 4.9999999997.

Using the math library not only requires including <math.h> but also requires compiling with the -lm flag after all .c or .o files, to link in the library routines:

$ gcc -pg --static -g3 -o countPrimes ./countPrimesSqrt.c -lm
$ time ./countPrimes 1000000
78498

real    0m1.008s
user    0m0.976s
sys 0m0.000s
$ gprof countPrimes
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 50.00      0.02     0.02   100000     0.00     0.00  isPrime
 50.00      0.04     0.02                             __sqrt_finite
  0.00      0.04     0.00        1     0.00    20.00  countPrimes
  0.00      0.04     0.00        1     0.00    20.00  main

[...]

Whoosh!

Can we optimize further? Let’s see what happens on a bigger input:

$ time ./countPrimes 1000000
78498

real    0m0.987s
user    0m0.960s
sys 0m0.000s
$ gprof countPrimes
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 51.04      0.49     0.49                             __sqrt_finite
 44.79      0.92     0.43  1000000     0.00     0.00  isPrime
  3.65      0.96     0.04                             sqrt
  0.52      0.96     0.01        1     5.00   435.00  main
  0.00      0.96     0.00        1     0.00   430.00  countPrimes

[...]

This is still very good, although we’re spending a lot of time in sqrt (more specifically, its internal helper routine __sqrt_finite). Can we do better?

Maybe moving the sqrt out of the loop in isPrime will make a difference:

/* return 1 if n is prime, 0 otherwise */
int
isPrime(int n)
{
    int factor;
    int sqrtValue;

    if(n < 2) { return 0; }
    if(n % 2 == 0) {
        /* special case for the only even prime */
        return n == 2;
    }
    /* else */
    sqrtValue = sqrt(n) + 1;
    for(factor = 3; factor < sqrtValue; factor+=2) {
        if(n % factor == 0) return 0;
    }
    /* else */
    return 1;
}


$ gcc -pg --static -g3 -o countPrimes ./countPrimesSqrtOutsideLoop.c -lm
$ time ./countPrimes 1000000
78498

real    0m0.413s
user    0m0.392s
sys 0m0.000s
$ gprof countPrimes
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 97.44      0.38     0.38  1000000     0.00     0.00  isPrime
  2.56      0.39     0.01        1    10.00   390.00  countPrimes
  0.00      0.39     0.00        1     0.00   390.00  main

[...]

This worked! We are now spending so little time in sqrt that the profiler doesn’t notice it.

What if we get rid of the call to sqrt and test if factor * factor <= n instead? This way we could dump the math library:

/* return 1 if n is prime, 0 otherwise */
int
isPrime(int n)
{
    int factor;

    if(n < 2) { return 0; }
    if(n % 2 == 0) {
        /* special case for the only even prime */
        return n == 2;
    }
    /* else */
    for(factor = 3; factor*factor <= n; factor+=2) {
        if(n % factor == 0) return 0;
    }
    /* else */
    return 1;
}

$ gcc -pg --static -g3 -o countPrimes ./countPrimesSquaring.c
$ time ./countPrimes 1000000
78498

real    0m0.450s
user    0m0.428s
sys 0m0.000s

This is slower, but not much slower. We might need to decide how much we care about avoiding floating-point computation in our program.

At this point we could decide that countPrimes is fast enough, or maybe we could look for further improvements, say, by testing out many small primes at the beginning instead of just 2, calling isPrime only on odd values of i, or reading a computational number theory textbook to find out how the cool kids do this.

Note that in some cases going all-out to improve performance may be more trouble than it’s worth. A reasonable strategy for code for your own use might be to start running one version and make improvements on a separate copy while it’s running. If the first version finishes before you are done writing new code, it’s probably fast enough.

Effect of optimization during compilation

We didn’t use any optimization flags for this example, because the optimizer can do a lot of rewriting that can make the output of the profiler confusing. For example, at high optimization levels, the compiler will often avoid function-call overhead by inserting the body of a helper function directly into its caller. But this can make a big difference in performance, so in real life you will want to compile with optimization turned on. Here’s how the performance of countPrimes 100000 is affected by optimization level:

Version No optimization With -O1 With -O2 With -O3
countPrimes.c 4.600 4.060 3.928 3.944
countPrimesSkipEvenFactors.c 2.260 1.948 1.964 1.984
countPrimesSqrt.c 0.036 0.028 0.028 0.028
countPrimesSqrtOutsideLoop.c 0.012 0.012 0.008 0.008
countPrimesSquaring.c 0.012 0.012 0.008 0.012

In each case, the reported time is the sum of user and system time in seconds.

For the smarter routines, more optimization doesn’t necessarily help, although some of this may be experimental error since I was too lazy to get a lot of samples by running each program more than once, and the times for the faster programs are so small that granularity is going to be an issue.

Here’s the same table using countPrimes 10000000 on the three fastest programs:

Version No optimization With -O1 With -O2 With -O3
countPrimesSqrt.c 24.236 18.840 18.720 18.564
countPrimesSqrtOutsideLoop.c 9.388 9.364 9.368 9.360
countPrimesSquaring.c 9.748 9.248 9.236 9.160

Again there are the usual caveats that I am a lazy person and should probably be doing more to deal with sampling and granularity issues, but if you believe these numbers, we actually win by going to countPrimesSquaring once the optimizer is turned on. I suspect that it is benefiting from strength reduction, which would generate the product factor*factor in isPrime incrementally using addition rather than multiplying from scratch each time.

It’s also worth noting that the optimizer works better if we leave a lot of easy optimization lying around. For countPrimesSqrt.c, my guess is that most of the initial gains are from avoiding function call overhead on sqrt by compiling it in-line. But even the optimizer is not smart enough to recognize that we are computing the same value over and over again, so we still win by pulling sqrt out of the loop in countPrimesSqrtOutsideLoop.c.

If I wanted to see if my guesses about the optimizer were correct, I could use gcc -S and look at the assembler code. But see earlier comments about laziness.


Licenses and Attributions


Speak Your Mind

-->