Integer types in C

Most variables in C programs tend to hold integer values, and indeed most variables in C programs tend to be the default-width integer type int. Declaring a variable to have a particular integer type controls how much space is used to store the variable (any values too big to fit will be truncated) and specifies that the arithmetic on the variable is done using integer operations.

Basic integer types

The standard C integer types are:

Name        Typical size   Signed by default?
char        8 bits         unspecified
short       16 bits        signed
int         32 bits        signed
long        32 bits        signed
long long   64 bits        signed

The typical size is for architectures like the Intel x86, which is the architecture used in most desktop and server machines. Some 64-bit machines might have 64-bit ints and longs, and some microcontrollers have 16-bit ints. Particularly bizarre architectures might have even wilder sizes, but you are not likely to see this unless you program vintage 1970s supercomputers. The general convention is that int is the most convenient size for whatever computer you are using and should be used by default.

Many compilers also support a long long type that is usually twice the length of a long (making it 64 bits on x86 machines). This type was not officially added to the C standard prior to C99, so it may or may not be available if you insist on following the ANSI specification strictly.

If you need to know the exact size of each type, you can use the sizeof operator, which returns the number of chars in a type. For example, on a typical machine, sizeof(int) will evaluate to 4, and sizeof(long long) will evaluate to 8. You can multiply by the constant CHAR_BIT, defined in limits.h (usually found at /usr/include/limits.h), to translate these numbers to bits. However, if you are looking for a type that holds a particular number of bits, you are better off using a C99 fixed-width type like int32_t.
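As a minimal sketch of the sizeof-times-CHAR_BIT calculation described above, the helper below converts a size in chars to a size in bits. The name bits_in is made up for this illustration and is not part of the standard library.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */

/* Convert a size measured in chars (as returned by sizeof) to bits. */
/* bits_in is a hypothetical helper name, not a standard function. */
size_t
bits_in(size_t chars)
{
    return chars * CHAR_BIT;
}
```

On a typical machine bits_in(sizeof(int)) would be 32, but the only values the C standard guarantees are minimums (at least 16 bits for int, at least 64 for long long).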

Each of these types comes in signed and unsigned variants.
This controls the interpretation of some operations (mostly comparisons and shifts) and determines the range of the type: for example, an unsigned char holds values in the range 0 through 255 while a signed char holds values in the range -128 through 127, and in general an unsigned n-bit type runs from 0 through $2^n - 1$ while the signed version runs from $-2^{n-1}$ through $2^{n-1} - 1$. The representation of signed integers uses two’s-complement notation, which means that a positive value x is represented as the unsigned value x while a negative value −x is represented as the unsigned value $2^n - x$. For example, if we had a peculiar implementation of C that used 3-bit ints, the binary values and their interpretation as int or unsigned int would look like this:

bits   as unsigned int   as int
000    0                 0
001    1                 1
010    2                 2
011    3                 3
100    4                 -4
101    5                 -3
110    6                 -2
111    7                 -1

The reason we get one extra negative value for a signed integer type is that this allows us to interpret the first bit as the sign, which makes life a little easier for whoever is implementing our CPU. Two useful features of this representation are:

  1. We can convert freely between signed and unsigned values as long as we are in the common range of both, and
  2. Addition and subtraction work exactly the same way for both signed and unsigned values. For example, on our hypothetical 3-bit machine, 1 + 5 represented as 001 + 101 = 110 gives the same answer as 1 + ( − 3) = 001 + 101 = 110. In the first case we interpret 110 as 6, while in the second we interpret it as  − 2, but both answers are right in their respective contexts.

Note that in order to make this work, we can’t detect overflow: when the CPU adds two 3-bit integers, it doesn’t know if we are adding 7 + 6 = 111 + 110 = 1101 = 13 or ( − 1) + ( − 2) = 111 + 110 = 101 = ( − 3). In both cases the result is truncated to 101, which gives the incorrect answer 5 when we are adding unsigned values.

This can often lead to surprising uncaught errors in C programs, although using more than 3 bits will make overflow less likely. It is usually a good idea to pick a size for a variable that is substantially larger than the largest value you expect the variable to hold (although most people just default to int), unless you are very short on space or time (larger values take longer to read and write to memory, and may make some arithmetic operations take longer).
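The 3-bit arithmetic above can be simulated on a real machine by masking off all but the low three bits. This is only a sketch: the names BITS3_MASK, add3, and as_signed3 are invented for the illustration.

```c
/* Simulate the hypothetical 3-bit machine from the text. */
/* BITS3_MASK, add3, and as_signed3 are illustrative names only. */

#define BITS3_MASK (0x7u)   /* keeps only the low 3 bits */

/* add two 3-bit patterns, discarding any carry out of the top bit */
unsigned int
add3(unsigned int a, unsigned int b)
{
    return (a + b) & BITS3_MASK;
}

/* interpret a 3-bit pattern as a two's-complement signed value */
int
as_signed3(unsigned int bits)
{
    bits &= BITS3_MASK;

    /* if the top (sign) bit is set, the value is bits - 2^3 */
    return (bits & 0x4u) ? (int) bits - 8 : (int) bits;
}
```

With these definitions, add3(7, 6) gives 5 (binary 101), which is the wrong answer for the unsigned sum 13, while as_signed3(5) gives -3, the right answer for (-1) + (-2). The adder itself cannot tell the two cases apart.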

Taking into account signed and unsigned versions, the full collection of integer types looks like this:

char signed char unsigned char
short unsigned short
int unsigned int
long unsigned long
long long unsigned long long

So these are all examples of declarations of integer variables:

    int i;
    char c;
    signed char temperature; /* degrees Celsius, only valid for Earth's surface */
    long netWorthInPennies;
    long long billGatesNetWorthInPennies;
    unsigned short shaveAndAHaircutTwoBytes;

For chars, whether the character is signed ( − 128…127) or unsigned (0…255) is at the whim of the compiler. If it matters, declare your variables as signed char or unsigned char. For storing actual 8-bit characters that you aren’t doing arithmetic on, it shouldn’t matter.

There is a slight gotcha with character processing with the input functions getchar and getc. These return the special value EOF (defined in stdio.h to be  − 1) to indicate end of file. But 255, which represents 'ÿ' in the ISO Latin-1 alphabet and in Unicode, and which may also appear quite often in binary files, will map to  − 1 if you put it in a char that happens to be signed (and if char happens to be unsigned, nothing stored in it will ever compare equal to EOF). So you should store the output of these functions in an int if you need to test for end of file. After you have done this test, it’s safe to store a non-end-of-file character in a char.

    /* right */
    int c;

    while((c = getchar()) != EOF) {
        putchar(c);
    }

    /* WRONG */
    char c;

    while((c = getchar()) != EOF) {  /* <- DON'T DO THIS! */
        putchar(c);
    }

Overflow and the C standards

So far we have been assuming that overflow implicitly applies a (mod $2^b$) operation, where b is the number of bits in our integer data type. This works on many machines, but as of the C11 standard, this is defined behavior only for unsigned integer types. For signed integer types, the effect of overflow is undefined. This means that the result of adding two very large signed ints could be arbitrary, and not only might depend on what CPU, compiler, and compiler options you are using, but might even vary from one execution of your program to another. In many cases this is not an issue, but undefined behavior is often exploited by compilers to speed up compiled code by omitting otherwise necessary instructions to force a particular outcome. This is especially true if you turn on the optimizer using the -O flag.

This means that you should not depend on reasonable behavior for overflow of signed types. Usually this is not a problem, because signed computations often represent real-world values where overflow will produce bad results anyway. For unsigned computations, the implicit modulo operation applied to overflow can be useful for some applications.
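If signed overflow does matter, one common approach (not the only one) is to test whether the sum would fit before doing the addition, so that the overflowing operation never happens. The sketch below assumes this approach; checked_add and wrapping_add are hypothetical names chosen for the example.

```c
#include <limits.h>   /* INT_MAX, INT_MIN */

/* Add a and b, storing the sum in *result. */
/* Returns 1 on success, or 0 if the sum would not fit in an int. */
/* The range test is done before adding, so no signed overflow occurs. */
int
checked_add(int a, int b, int *result)
{
    if((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return 0;
    }

    *result = a + b;
    return 1;
}

/* Unsigned overflow is well-defined: the sum is reduced mod 2^b, */
/* where b is the number of bits in an unsigned int. */
unsigned int
wrapping_add(unsigned int a, unsigned int b)
{
    return a + b;
}
```

For example, checked_add(INT_MAX, 1, &r) returns 0 and leaves r untouched, while wrapping_add(UINT_MAX, 1u) quietly returns 0.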

C99 fixed-width types

C99 provides a stdint.h header file that defines integer types with known size independent of the machine architecture. So in C99, you can use int8_t instead of signed char to guarantee a signed type that holds exactly 8 bits, or uint64_t instead of unsigned long long to get a 64-bit unsigned integer type. The full set of types typically defined are int8_t, int16_t, int32_t, and int64_t for signed integers and uint8_t, uint16_t, uint32_t, and uint64_t for unsigned integers. There are also types for the smallest integers that contain at least some minimum number of bits (e.g., int_least16_t is a signed type with at least 16 bits, chosen to minimize space) or that are the fastest type with at least the given number of bits (e.g., int_fast16_t is a signed type with at least 16 bits, chosen to minimize time). The stdint.h file also defines constants giving the minimum and maximum values of these types; for example, INT32_MIN and INT32_MAX give the smallest and largest values that can be stored in an int32_t, and UINT64_MAX gives the largest value that can be stored in a uint64_t. The corresponding constants for the standard integer types, such as INT_MIN and INT_MAX for int, are defined in limits.h.

All of these types are defined as aliases for standard integer types using typedef; the main advantage of using stdint.h over defining them yourself is that if somebody ports your code to a new architecture, stdint.h should take care of choosing the right types automatically. The main disadvantage is that, like many C99 features, stdint.h is not universally available on all C compilers. Also, because these fixed-width types are a late addition to the language, the built-in routines for printing and parsing integers, as well as the mechanisms for specifying the size of an integer constant, are not adapted to deal with them.

If you do need to print or parse types defined in stdint.h, the larger inttypes.h header defines macros that give the corresponding format strings for printf and scanf. The inttypes.h file includes stdint.h, so you do not need to include both. Below is an example of a program that uses the various features provided by inttypes.h and stdint.h.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#include <inttypes.h>

/* example of using fixed-width types */

/* largest value we can apply 3x+1 to without overflow */
#define MAX_VALUE ((UINT64_MAX - 1) / 3)
        
int
main(int argc, char **argv)
{
    uint64_t big;

    if(argc != 2) {
        fprintf(stderr, "Usage: %s number\n", argv[0]);
        return 1;
    }

    /* parse argv[1] as a uint64_t */
    /* SCNu64 expands to the format string for scanning uint64_t (without the %) */
    /* We then rely on C concatenating adjacent string constants. */
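    /* Reject input that does not parse, and 0, which would loop forever below. */
    if(sscanf(argv[1], "%" SCNu64, &big) != 1 || big == 0) {
        fprintf(stderr, "%s: expected a positive number\n", argv[0]);
        return 1;
    }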

    /* do some arithmetic on big */
    while(big != 1) {
        /* PRIu64 expands to the format string for printing uint64_t */
        printf("%" PRIu64 "\n", big);

        if(big % 2 == 0) {
            big /= 2;
        } else if(big <= MAX_VALUE) {
            big = 3*big + 1;
        } else {
            /* overflow! */
            puts("overflow");
            return 1;
        }
    }

    puts("Reached 1");
    return 0;
}

examples/integerTypes/fixedWidth.c

size_t and ptrdiff_t

The type aliases size_t and ptrdiff_t are provided in stddef.h to represent the return types of the sizeof operator and pointer subtraction. On a typical 32-bit architecture, size_t will be equivalent to the unsigned 32-bit integer type uint32_t (or just unsigned int) and ptrdiff_t will be equivalent to the signed 32-bit integer type int32_t (int). On a typical 64-bit architecture, size_t will be equivalent to uint64_t and ptrdiff_t will be equivalent to int64_t.

The place where you will most often see size_t is as an argument to malloc, where it gives the number of bytes to allocate.

Because stdlib.h includes stddef.h, it is often not necessary to include stddef.h explicitly.
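A short sketch of both types in use: size_t as the argument to malloc and as an array index, and ptrdiff_t as the result of subtracting two pointers. The function names make_zeroed and element_distance are invented for this example.

```c
#include <stddef.h>   /* size_t, ptrdiff_t */
#include <stdlib.h>   /* malloc */

/* Allocate an array of n ints, all set to 0; returns NULL on failure. */
/* make_zeroed is a hypothetical helper; the caller must free the result. */
int *
make_zeroed(size_t n)
{
    int *a = malloc(n * sizeof(int));   /* malloc takes a size_t */
    size_t i;

    if(a != NULL) {
        for(i = 0; i < n; i++) {
            a[i] = 0;
        }
    }

    return a;
}

/* Number of elements between two pointers into the same array. */
/* The result is negative if to comes before from. */
ptrdiff_t
element_distance(const int *from, const int *to)
{
    return to - from;
}
```

If you need to print these types, C99 printf provides the %zu format for size_t and %td for ptrdiff_t.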

Integer constants

Constant integer values in C can be written in any of four different ways:

  • In the usual decimal notation, e.g. 0, 1, -127, 9919291, 97.
  • In octal or base 8, when the leading digit is 0, e.g. 01 for 1, 010 for 8, 0777 for 511, 0141 for 97. Octal is not used much any more, but it is still conventional for representing Unix file permissions.
  • In hexadecimal or base 16, when prefixed with 0x. The letters a through f are used for the digits 10 through 15. For example, 0x61 is another way to write 97.
  • Using a character constant, which is a single ASCII character or an escape sequence inside single quotes. The value is the ASCII value of the character: 'a' is 97. Unlike languages with separate character types, C characters are identical to integers; you can (but shouldn’t) calculate $97^2$ by writing 'a'*'a'. You can also store a character in a location with any integer type.

Except for character constants, you can insist that an integer constant is unsigned or long by putting a u or l after it. So 1ul is an unsigned long version of 1. By default integer constants are (signed) ints. For long long constants, use ll, e.g., the unsigned long long constant 0xdeadbeef01234567ull. It is also permitted to write the l as L, which can be less confusing if the l looks too much like a 1.

Some examples:

'a'             int
97              int
97u             unsigned int
0xbea00d1ful    unsigned long, written in hexadecimal
0777            int, written in octal (there is no suffix for short)

A curious omission is that, prior to C23, there was no standard way to write a binary integer directly in C (although some compilers accept a 0b prefix as an extension). So if you want to write the bit pattern 00101101 portably, you will need to encode it in hexadecimal as 0x2d (or octal as 055). Another potential trap is that leading zeros matter: 012 is an octal value representing the number most people call 10.
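To make the notations concrete, here is a small sketch in which each function returns the same value written a different way. The function names are made up for the example, and the character constant assumes an ASCII-compatible character set.

```c
/* Four ways of writing 97. All function names are illustrative only. */
int decimal_97(void)   { return 97; }     /* decimal */
int octal_97(void)     { return 0141; }   /* octal: 1*64 + 4*8 + 1 */
int hex_97(void)       { return 0x61; }   /* hexadecimal: 6*16 + 1 */
int character_97(void) { return 'a'; }    /* assumes ASCII */

/* The bit pattern 00101101 from the text, in hexadecimal and octal. */
int pattern_hex(void)   { return 0x2d; }
int pattern_octal(void) { return 055; }

/* Leading zeros matter: this is octal 12, which is decimal 10. */
int not_twelve(void)    { return 012; }
```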

Naming constants

Having a lot of numeric constants in your program—particularly if the same constant shows up in more than one place—is usually a sign of bad programming. There are a few constants, like 0 and 1, that make sense on their own, but many constant values are either mostly arbitrary, or might change if the needs of the program change. It’s helpful to assign these constants names that explain their meaning, instead of requiring the user to guess why there is a 37 here or an 0x1badd00d there. This is particularly important if the constants might change in later versions of the program, since even though you could change every 37 in your program into a 38, this might catch other 37 values that have different intended meanings.

For example, suppose that you have a function (call it getchar) that needs to signal that sometimes it didn’t work. The usual way is to return a value that the function won’t normally return. Now, you could just tell the user what value that is:

/* get a character (as an `int` ASCII code) from `stdin` */
/* return -1 on end of file */
int getchar(void);

and now the user can write

    while((c = getchar()) != -1) {
        ...
    }

But then somebody reading the code has to remember that -1 means “end of file” and not “signed version of 0xff” or “computer room on fire, evacuate immediately.” It’s much better to define a constant EOF that happens to equal -1, because among other things if you change the special return value from getchar later then this code will still work (assuming you fixed the definition of EOF):

    while((c = getchar()) != EOF) {
        ...
    }

So how do you declare a constant in C? The traditional approach is to use the C preprocessor, the same tool that gets run before the compiler to expand out #include directives. To define EOF, the file /usr/include/stdio.h includes the text

#define EOF (-1)

What this means is that whenever the characters EOF appear in a C program as a separate word (e.g. in 1+EOF*3 but not in appurtenancesTherEOF), then the preprocessor will replace them with the characters (-1). The parentheses around the -1 are customary to ensure that the -1 gets treated as a separate constant and not as part of some larger expression. So from the compiler’s perspective, EOF really is -1, but from the programmer’s perspective, it’s end-of-file. This is a special case of the C preprocessor’s macro mechanism.

In general, any time you have a non-trivial constant in a program, it should be #defined. Examples are things like array dimensions, special tags or return values from functions, maximum or minimum values for some quantity, or standard mathematical constants (for example, /usr/include/math.h defines M_PI as the mathematical constant pi to umpteen digits). This allows you to write

    char buffer[MAX_FILENAME_LENGTH+1];
    
    area = M_PI*r*r;

    if(status == COMPUTER_ROOM_ON_FIRE) {
        evacuate();
    }

instead of

    char buffer[513];
    
    area = 3.141592319*r*r;   /* not the correct value of pi */

    if(status == 136) {
        evacuate();
    }

which is just an invitation to errors (including the one in the area computation).

Like typedefs, #defines that are intended to be globally visible are best done in header files; in large programs you will want to #include them in many source files. The usual convention is to write #defined names in all-caps to remind the user that they are macros and not real variables.

Integer operators

Arithmetic operators

The usual + (addition), - (negation or subtraction), and * (multiplication) operators work on integers pretty much the way you’d expect. The only caveat is that if the result lies outside of the range of whatever variable you are storing it in, it will be truncated instead of causing an error:

    unsigned char c;

    c = -1;             /* sets c = 255 */
    c = 255 + 255;      /* sets c = 254 */
    c = 256 * 1772717;  /* sets c = 0 */

This can be a source of subtle bugs if you aren’t careful. The usual giveaway is that values you thought should be large positive integers come back as random-looking negative integers.

Division (/) of two integers also truncates: 2/3 is 0, 5/3 is 1, etc. For positive integers it will always round down.

Prior to C99, if either the numerator or denominator was negative, the behavior was unpredictable and depended on what your processor chose to do. In practice this meant you should never use / if one or both arguments might be negative. The C99 standard specified that integer division always removes the fractional part, effectively rounding toward 0; so (-3)/2 is -1, 3/-2 is -1, and (-3)/-2 is 1.

There is also a remainder operator % with e.g. 2%3 = 2, 5%3 = 2, 27 % 2 = 1, etc. The sign of the modulus is ignored, so 2%-3 is also 2. The sign of the dividend carries over to the remainder: (-3)%2 and (-3)%(-2) are both -1. The reason for this rule is that it guarantees that y == x*(y/x) + y%x is always true.
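The C99 rules for / and % can be checked directly. The sketch below tests the identity from the text for a given pair of values and also shows one possible way to get a remainder that is never negative, which is sometimes what you actually want. Both function names are hypothetical.

```c
/* Returns 1 if y == x*(y/x) + y%x holds for this pair. */
/* Assumes x is nonzero and y/x does not overflow. */
int
division_identity_holds(int y, int x)
{
    return y == x * (y / x) + y % x;
}

/* Remainder of y divided by x, adjusted to lie in the range 0..x-1. */
/* Assumes x is positive. */
int
nonnegative_mod(int y, int x)
{
    int r = y % x;   /* has the sign of y under C99 rules */

    return (r < 0) ? r + x : r;
}
```

Under C99 rules (-3) % 2 is -1, so nonnegative_mod(-3, 2) adds 2 to give 1.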

Bitwise operators

In addition to the arithmetic operators, integer types support bitwise logical operators that apply some Boolean operation to all the bits of their arguments in parallel. What this means is that the i-th bit of the output is equal to some operation applied to the i-th bit(s) of the input(s). The bitwise logical operators are ~ (bitwise negation: used with one argument as in ~0 for the all-1’s binary value), & (bitwise AND), | (bitwise OR), and ^ (bitwise XOR, i.e. sum mod 2). These are mostly used for manipulating individual bits or small groups of bits inside larger words, as in the expression x & 0x0f, which extracts the bottom four bits stored in x and discards the rest.

Examples:

x y expression value
0011 0101 x&y 0001
0011 0101 x|y 0111
0011 0101 x^y 0110
0011 0101 ~x 1100

The shift operators << and >> shift the bit sequence left or right: x << y produces the value $x \cdot 2^y$ (ignoring overflow); this is equivalent to shifting every bit in x y positions to the left and filling in y zeros for the missing positions. In the other direction, x >> y produces the value $\lfloor x \cdot 2^{-y} \rfloor$ by shifting x y positions to the right. The behavior of the right shift operator depends on whether x is unsigned or signed; for unsigned values, it always shifts in zeros from the left end; for signed values, the result for negative x is implementation-defined, but most compilers shift in additional copies of the leftmost bit (the sign bit). This makes x >> y have the same sign as x if x is signed.

If y is negative, or is greater than or equal to the number of bits in the (promoted) type of x, the behavior of the shift operators is undefined.

Examples (unsigned char x):

x y x << y x >> y
00000001 1 00000010 00000000
11111111 3 11111000 00011111

Examples (signed char x):

x y x << y x >> y
00000001 1 00000010 00000000
11111111 3 11111000 11111111
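The unsigned char rows of the tables above can be reproduced with the sketch below. The function names are invented for the example, and both functions assume y is between 0 and 7. Only the unsigned case is shown, since its result is fully specified by the standard.

```c
/* Shift an 8-bit value left, keeping only the low 8 bits of the result. */
/* x is promoted to int before the shift, so the cast back to unsigned */
/* char is what discards the bits that moved past the top. */
unsigned char
shift_left_8(unsigned char x, int y)
{
    return (unsigned char) (x << y);
}

/* Shift an 8-bit unsigned value right; zeros are shifted in at the left. */
unsigned char
shift_right_8(unsigned char x, int y)
{
    return (unsigned char) (x >> y);
}
```

For example, shift_left_8(0xff, 3) is 0xf8 (11111000) and shift_right_8(0xff, 3) is 0x1f (00011111).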

Shift operators are often used with bitwise logical operators to set or extract individual bits in an integer value. The trick is that (1 << i) contains a 1 in the i-th least significant bit and zeros everywhere else. So x & (1<<i) is nonzero if and only if x has a 1 in the i-th place. This can be used to print out an integer in binary format (which standard printf won’t do).

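The following program gives an example of this technique. Because it prints every bit of an unsigned int, including leading zeros, calling it as ./testPrintBinary 123 on a machine with 32-bit unsigned ints will print 00000000000000000000000001111011 followed by a newline.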

#include <stdio.h>
#include <stdlib.h>

/* print out all bits of n */
void
print_binary(unsigned int n)
{
    unsigned int mask = 0;

    /* this grotesque hack creates a bit pattern 1000... */
    /* regardless of the size of an unsigned int */
    mask = ~mask ^ (~mask >> 1);

    for(; mask != 0; mask >>= 1) {
        putchar((n & mask) ? '1' : '0');
    }
}

int
main(int argc, char **argv)
{
    if(argc != 2) {
        fprintf(stderr, "Usage: %s n\n", argv[0]);
        return 1;
    }

    print_binary(atoi(argv[1]));
    putchar('\n');

    return 0;
}

examples/integerTypes/testPrintBinary.c

In the other direction, we can set the i-th bit of x to 1 by doing x | (1 << i) or to 0 by doing x & ~(1 << i). See the chapter on bit manipulation for applications of this to build arbitrarily-large bit vectors.
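The set, clear, and test operations can be packaged as small functions, as in the sketch below. The names are hypothetical, and each function assumes i is less than the number of bits in an unsigned int. The constant is written 1u rather than 1 so that the shift is done on an unsigned value, which avoids shifting a 1 into the sign bit of a signed int.

```c
/* return x with bit i set to 1 */
unsigned int
set_bit(unsigned int x, unsigned int i)
{
    return x | (1u << i);
}

/* return x with bit i set to 0 */
unsigned int
clear_bit(unsigned int x, unsigned int i)
{
    return x & ~(1u << i);
}

/* return 1 if bit i of x is 1, and 0 otherwise */
int
test_bit(unsigned int x, unsigned int i)
{
    return (x & (1u << i)) != 0;
}
```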

Logical operators

To add to the confusion, there are also three logical operators that work on the truth-values of integers, where 0 is defined to be false and anything else is defined to be true. These are && (logical AND), || (logical OR), and ! (logical NOT). The result of any of these operators is always 0 or 1 (so !!x, for example, is 0 if x is 0 and 1 if x is anything else). The && and || operators evaluate their arguments left-to-right and ignore the second argument if the first determines the answer (this is one of the few places in C where the order of evaluation of operands is specified); so

    0 && executeProgrammer();
    1 || executeProgrammer();

is in a very weak sense perfectly safe code to run.

Watch out for confusing & with &&. The expression 1 & 2 evaluates to 0, but 1 && 2 evaluates to 1. The statement 0 & executeProgrammer(); is also unlikely to do what you want.
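One way to see the difference is to use a function that records the fact that it was called. In the sketch below, noisy and calls are hypothetical names: && and || skip the call when the first operand already determines the answer, while & always evaluates both sides.

```c
static int calls = 0;   /* number of times noisy() has been called */

/* return value unchanged, but count the call as a side effect */
int
noisy(int value)
{
    calls++;
    return value;
}

/* 0 && ... never evaluates its second operand */
int
short_circuit_and(void)
{
    return 0 && noisy(1);
}

/* 1 || ... never evaluates its second operand */
int
short_circuit_or(void)
{
    return 1 || noisy(1);
}

/* bitwise & evaluates both operands, so noisy() is called here */
int
bitwise_and(void)
{
    return 0 & noisy(1);
}
```

After calling all three functions once, calls is 1: only bitwise_and actually ran noisy.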

Yet another logical operator is the ternary operator ?:, where x ? y : z equals the value of y if x is nonzero and z if x is zero. Like && and ||, it only evaluates the arguments it needs to:

    fileExists(badFile) ? deleteFile(badFile) : createFile(badFile);

Most uses of ?: are better done using an if-then-else statement.

The convention that Boolean values in C are represented by integers means that C traditionally did not have an explicit Boolean type. If you want to use explicit Boolean types, you can include the stdbool.h header file (added in C99) with #include <stdbool.h>. This doesn’t give you much: it makes bool an integer type that can hold Boolean values, and defines false and true to be constants 0 and 1. Since bool is just another integer type, nothing prevents you from writing x = 12 / true or similar insults to the type system. But having explicit bool, false, and true names might make the intent of your code more explicit than the older int/0/1 approach.

Relational operators

Logical operators usually operate on the results of relational operators or comparisons: these are == (equality), != (inequality), < (less than), > (greater than), <= (less than or equal to) and >= (greater than or equal to). So, for example,

    if(size >= MIN_SIZE && size <= MAX_SIZE) {
        puts("just right");
    }

tests if size is in the (inclusive) range [MIN_SIZE..MAX_SIZE].

Beware of confusing == with =. The code

    /* DANGER! DANGER! DANGER! */
    if(x = 5) {
        ...

is perfectly legal C, and will set x to 5 rather than testing if it’s equal to 5. Because 5 happens to be nonzero, the body of the if statement will always be executed. This error is so common and so dangerous that gcc will warn you about any tests that look like this if you use the -Wall option. Some programmers will go so far as to write the test as 5 == x just so that if their finger slips, they will get a syntax error on 5 = x even without special compiler support.

Converting to and from strings

To input or output integer values, you will need to convert them from or to strings. Converting from a string is easy using the atoi or atol functions declared in stdlib.h; these take a string as an argument and return an int or long, respectively. C99 also provides atoll for converting to long long. These routines have no ability to signal an error other than returning 0, so if you do atoi("Sweden"), 0 is what you will get.
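If you need to tell a bad input apart from a genuine 0, one alternative (not used in the example program in this section) is the standard strtol function from stdlib.h, which reports where parsing stopped and sets errno on overflow. The wrapper below is a sketch; parse_long is a hypothetical name.

```c
#include <errno.h>    /* errno, ERANGE */
#include <stdlib.h>   /* strtol */

/* Parse s as a decimal long. */
/* On success, stores the value in *out and returns 1. */
/* Returns 0 if s is empty, is not a number, has trailing junk, */
/* or is out of range for a long. */
int
parse_long(const char *s, long *out)
{
    char *end;     /* set by strtol to the first unparsed character */
    long value;

    errno = 0;
    value = strtol(s, &end, 10);

    if(end == s || *end != '\0' || errno == ERANGE) {
        return 0;
    }

    *out = value;
    return 1;
}
```

With this wrapper, parse_long("Sweden", &n) returns 0 instead of silently producing the value 0.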

Output is usually done using printf (or sprintf if you want to write to a string without producing output). Use the %d format specifier for ints, shorts, and chars that you want the numeric value of, %ld for longs, and %lld for long longs.

A contrived program that uses all of these features is given below:

#include <stdio.h>
#include <stdlib.h>

/* This program can be used to show how atoi etc. handle overflow. */
/* For example, try "overflow 1000000000000". */
int
main(int argc, char **argv)
{
    char c;
    int i;
    long l;
    long long ll;
    
    if(argc != 2) {
        fprintf(stderr, "Usage: %s n\n", argv[0]);
        return 1;
    }
    
    c = atoi(argv[1]);
    i = atoi(argv[1]);
    l = atol(argv[1]);
    ll = atoll(argv[1]);

    printf("char: %d  int: %d  long: %ld  long long: %lld\n", c, i, l, ll);

    return 0;
}

examples/integerTypes/overflow.c

