Strings

Processing strings of characters is one of the oldest application of mechanical computers, arguably predating numerical computation by at least fifty years. Assuming you’ve already solved the problem of how to represent characters in memory (e.g. as the C char type encoded in ASCII), there are two standard ways to represent strings:

  • As a delimited string, where the end of a string is marked by a special character. The advantages of this method are that only one extra byte is needed to indicate the length of an arbitrarily long string, that strings can be manipulated by simple pointer operations, and in some cases that common string operations that involve processing the entire string can be performed very quickly. The disadvantage is that the delimiter can’t appear inside any string, which limits what kind of data you can store in a string.
  • As a counted string, where the string data is prefixed or supplemented with an explicit count of the number of characters in the string. The advantage of this representation is that a string can hold arbitrary data (including delimiter characters) and that one can quickly jump to the end of the string without having to scan its entire length. The disadvantage is that maintaining a separate count typically requires more space than adding a one-byte delimiter (unless you limit your string length to 255 characters) and that more care needs to be taken to make sure that the count is correct.

C strings

Because delimited strings are simpler and take less space, C went for delimited strings. A string is a sequence of characters terminated by a null character '\0'. Looking back from almost half a century later, this choice may have been a mistake in the long run, but we are pretty much stuck with it.

Note that the null character is not the same as a null pointer, although both appear to have the value 0 when used in integer contexts. A string is represented by a variable of type char *, which points to the zeroth character of the string. The programmer is responsible for allocating and managing space to store strings, except for explicit string constants, which are stored in a special non-writable string space by the compiler.

If you want to use counted strings instead, you can build your own using a struct. Most scripting languages written in C (e.g. Perl, Python_programming_language, PHP, etc.) use this approach internally. (Tcl is an exception, which is one of many good reasons not to use Tcl).

String constants

A string constant in C is represented by a sequence of characters within double quotes. Standard C character escape sequences like \n (newline), \r (carriage return), \a (bell), \0x17 (character with hexadecimal code 0x17), \\ (backslash), and \" (double quote) can all be used inside string constants. The value of a string constant has type const char *, and can be assigned to variables and passed as function arguments or return values of this type.

Two string constants separated only by whitespace will be concatenated by the compiler as a single constant: "foo" "bar" is the same as "foobar". This feature is not much used in normal code, but shows up sometimes in macros.

String encodings

Standard C strings are assumed to be in ASCII, a 7-bit code developed in the 1960s to represent English-language text. If you want to write text that includes any letters not in the usual 26-letter Latin alphabet, you will need to use a different encoding. C does not provide very good support for this, but for fixed strings, you can often get away with using Unicode as long as both your text editor and your terminal are set to use the UTF-8 encoding.

The reason this works is that UTF-8 encodes each Unicode character as one or more 8-bit characters, and does this in a way that guarantees that you never accidentally create a null or any other standard ASCII character. So a C string containing UTF-8 characters looks like an ordinary C string to all the C library routines. This also works if you include a Unicode string with a UTF-8 encoding in a comment or printf format string, as illustrated in the file unicode.c. But this use of Unicode in C is very limited.

Some issues you will quickly run into if you are trying to do something more sophisticated:

  1. You cannot use non-ASCII letters anywhere outside a string constant or comment without confusing the C compiler. So variable names that use non-ASCII characters are forbidden.
  2. If you include a UTF-8 encoded string somewhere, even though both your text editor and terminal are likely to display the multi-byte characters correctly, they are still spread across multiple bytes in the encoded string. This can be trouble if you try to change some letter in a UTF-8-encoded string to something else whose encoding has a different width.
  3. You can’t generally put a multibyte character into a char variable, or write it as a char constant.
  4. You may find out that some other tools have their own ideas about what encodings to expect, causing Unicode characters to turn into gibberish or cause other errors.

There exists libraries for working with Unicode strings in C, but they are clunky. If you need to handle a lot of non-ASCII text, you may be better of working with a different language. However, even moving away from C is not always a panacea, and Unicode support in other tools may be hit-or-miss.

String buffers

The problem with string constants is that you can’t modify them. If you want to build strings on the fly, you will need to allocate space for them. The traditional approach is to use a buffer, an array of chars. Here is a particularly painful hello-world program that builds a string by hand:

#include <stdio.h>

int
main(int argc, char **argv)
{
    char hi[3];

    hi[0] = 'h';
    hi[1] = 'i';
    hi[2] = '\0';

    puts(hi);

    return 0;
}

examples/strings/hi.c

Note that the buffer needs to have size at least 3 in order to hold all three characters. A common error in programming with C strings is to forget to leave space for the null at the end (or to forget to add the null, which can have comical results depending on what you are using your surprisingly long string for).

String buffers and the perils of gets

Fixed-size buffers are a common source of errors in older C programs, particularly ones written with the library routine gets. The problem is that if you do something like

    strcpy(smallBuffer, bigString);

the strcpy function will happily keep copying characters across memory long after it has passed the end of smallBuffer. While you can avoid this to a certain extent when you control where bigString is coming from, the situation becomes particularly fraught if the string you are trying to store comes from the input, where it might be supplied by anybody, including somebody who is trying to execute a buffer overrun attack to seize control of your program.

If you do need to read a string from the input, you should allocate the receiving buffer using malloc and expand it using realloc as needed. Below is a program that shows how to do this, with some bad alternatives commented out:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define NAME_LENGTH (2)

#define INITIAL_LINE_LENGTH (2)

/* return a freshly-malloc'd line with next line of input from stdin */
char *
getLine(void)
{
    char *line;
    int size;   /* how much space do I have in line? */
    int length; /* how many characters have I used */
    int c;

    size = INITIAL_LINE_LENGTH;
    line = malloc(size);
    assert(line);

    length = 0;

    while((c = getchar()) != EOF && c != '\n') {
        if(length >= size-1) {
            /* need more space! */
            size *= 2;

            /* make length equal to new size */
            /* copy contents if necessary */
            line = realloc(line, size);
        }

        line[length++] = c;
    }

    line[length] = '\0';

    return line;
}

int
main(int argc, char **argv)
{
    int x = 12;
    /* char name[NAME_LENGTH]; */
    char *line;
    int y = 17;

    puts("What is your name?");

    /* gets(name); */                        /* may overrun buffer */
    /* scanf("%s\n", name); */               /* may overrun buffer */
    /* fgets(name, NAME_LENGTH, stdin); */   /* may truncate input */
    line = getLine();                        /* has none of these problems */
    
    printf("Hi %s!  Did you know that x == %d and y == %d?\n", line, x, y);

    free(line);  /* but we do have to free line when we are done with it */

    return 0;
}

examples/strings/getLine.c

Operations on strings

Unlike many programming languages, C provides only a rudimentary string-processing library. The reason is that many common string-processing tasks in C can be done very quickly by hand.

For example, suppose we want to copy a string from one buffer to another. The library function strcpy declared in string.h will do this for us (and is usually the right thing to use), but if it didn’t exist we could write something very close to it using a famous C idiom.

void
strcpy2(char *dest, const char *src)
{
    /* This line copies characters one at a time from *src to *dest. */
    /* The postincrements increment the pointers (++ binds tighter than *) */
    /*  to get to the next locations on the next iteration through the loop. */
    /* The loop terminates when *src == '\0' == 0. */
    /* There is no loop body because there is nothing to do there. */
    while(*dest++ = *src++);
}

The externally visible difference between strcpy2 and the original strcpy is that strcpy returns a char * equal to its first argument. It is also likely that any implementation of strcpy found in a recent C library takes advantage of the width of the memory data path to copy more than one character at a time.

Most C programmers will recognize the while(*dest++ = *src++); from having seen it before, although experienced C programmers will generally be able to figure out what such highly abbreviated constructions mean. Exposure to such constructions is arguably a form of hazing.

Because C pointers act exactly like array names, you can also write strcpy2 using explicit array indices. The result is longer but may be more readable if you aren’t a C fanatic.

char *
strcpy2a(char *dest, const char *src)
{
    int ;

    i = 0;
    for(i = 0; src[i] != '\0'; i++) {
        dest[i] = src[i];
    }

    /* note that the final null in src is not copied by the loop */
    dest[i] = '\0';

    return dest;
}

An advantage of using a separate index in strcpy2a is that we don’t trash dest, so we can return it just like strcpy does. (In fairness, strcpy2 could have saved a copy of the original location of dest and done the same thing.)

Note that nothing in strcpy2, strcpy2a, or the original strcpy will save you if dest points to a region of memory that isn’t big enough to hold the string at src, or if somebody forget to tack a null on the end of src (in which case strcpy will just keep going until it finds a null character somewhere). As elsewhere, it’s your job as a programmer to make sure there is enough room. Since the compiler has no idea what dest points to, this means that you have to remember how much room is available there yourself.

If you are worried about overrunning dest, you could use strncpy instead. The strncpy function takes a third argument that gives the maximum number of characters to copy; however, if src doesn’t contain a null character in this range, the resulting string in dest won’t either. Usually the only practical application to strncpy is to extract the first k characters of a string, as in

/* copy the substring of src consisting of characters at positions
    start..end-1 (inclusive) into dest */
/* If end-1 is past the end of src, copies only as many characters as 
    available. */
/* If start is past the end of src, the results are unpredictable. */
/* Returns a pointer to dest */
char *
copySubstring(char *dest, const char *src, int start, int end)
{
    /* copy the substring */
    strncpy(dest, src + start, end - start);

    /* add null since strncpy probably didn't */
    dest[end - start] = '\0';

    return dest;
}

Another quick and dirty way to extract a substring of a string you don’t care about (and can write to) is to just drop a null character in the middle of the sacrificial string. This is generally a bad idea unless you are certain you aren’t going to need the original string again, but it’s a surprisingly common practice among C programmers of a certain age.

A similar operation to strcpy is strcat. The difference is that strcat concatenates src on to the end of dest; so that if dest previous pointed to "abc" and src to "def", dest will now point to "abcdef". Like strcpy, strcat returns its first argument. A no-return-value version of strcat is given below.

void
strcat2(char *dest, const char *src)
{
    while(*dest) dest++;
    while(*dest++ = *src++);
}

Decoding this abomination is left as an exercise for the reader. There is also a function strncat which has the same relationship to strcat that strncpy has to strcpy.

As with strcpy, the actual implementation of strcat may be much more subtle, and is likely to be faster than rolling your own.

Finding the length of a string

Because the length of a string is of fundamental importance in C (e.g., when deciding if you can safely copy it somewhere else), the standard C library provides a function strlen that counts the number of non-null characters in a string. Note that if you are allocating space for a copy of a string, you will need to add one to the value returned by strlen to account for the null.

Here’s a possible implementation:

int
strlen(const char *s)
{
    int i;

    for(i = 0; *s; i++, s++);

    return i;
}

Note the use of the comma operator in the increment step. The comma operator applied to two expressions evaluates both of them and discards the value of the first; it is usually used only in for loops where you want to initialize or advance more than one variable at once.

Like the other string routines, using strlen requires including string.h.

The strlen tarpit

A common mistake is to put a call to strlen in the header of a loop; for example:

/* like strcpy, but only copies characters at indices 0, 2, 4, ...
   from src to dest */
char *
copyEvenCharactersBadVersion(char *dest, const char *src)
{
    int i;
    int j;

    /* BAD: Calls strlen on every pass through the loop */
    for(i = 0, j = 0; i < strlen(src); i += 2, j++) {
        dest[j] = src[i];
    }

    dest[j] = '\0';

    return dest;
}

The problem is that strlen has to scan all of src every time the test is done, which adds time proportional to the length of src to each iteration of the loop. So copyEvenCharactersBadVersion takes time proportional to the square of the length of src.

Here’s a faster version:

/* like strcpy, but only copies characters at indices 0, 2, 4, ...
   from src to dest */
char *
copyEvenCharacters(char *dest, const char *src)
{
    int i;
    int j;
    int len;    /* length of src */

    len = strlen(src);

    /* GOOD: uses cached value of strlen(src) */
    for(i = 0, j = 0; i < len; i += 2, j++) {
        dest[j] = src[i];
    }

    dest[j] = '\0';

    return dest;
}

Because it doesn’t call strlen all the time, this version of copyEvenCharacters will run much faster than the original even on small strings, and several million times faster if src is megabytes long.

Comparing strings

If you want to test if strings s1 and s2 contain the same characters, writing s1 == s2 won’t work, since this tests instead whether s1 and s2 point to the same address. Instead, you should use strcmp, declared in string.h. The strcmp function walks along both of its arguments until it either hits a null on both and returns 0, or hits two different characters, and returns a positive integer if the first string’s character is bigger and a negative integer if the second string’s character is bigger (a typical implementation will just subtract the two characters). A straightforward implementation might look like this:

int
strcmp(const char *s1, const char *s2)
{
    while(*s1 && *s2 && *s1 == *s2) {
        s1++;
        s2++;
    }

    return *s1 - *s2;
}

To use strcmp to test equality, test if the return value is 0:

    if(strcmp(s1, s2) == 0) {
        /* strings are equal */
        ...
    }

You may sometimes see this idiom instead:

    if(!strcmp(s1, s2)) {
        /* strings are equal */
        ...
    }

My own feeling is that the first version is more clear, since !strcmp always suggested to me that you were testing for the negation of some property (e.g. not equal). But if you think of strcmp as telling you when two strings are different rather than when they are equal, this may not be so confusing.

Formatted output to strings

You can write formatted output to a string buffer with sprintf just like you can write it to stdout with printf or to a file with fprintf. Make sure when you do so that there is enough room in the buffer you are writing to, or the usual bad things will happen.

Dynamic allocation of strings

When allocating space for a copy of a string s using malloc, the required space is strlen(s)+1. Don’t forget the +1, or bad things may happen.

Because allocating space for a copy of a string is such a common operation, many C libraries provide a strdup function that does exactly this. If you don’t have one (it’s not required by the C standard), you can write your own like this:

/* return a freshly-malloc'd copy of s */
/* or 0 if malloc fails */
/* It is the caller's responsibility to free the returned string when done. */
char *
strdup(const char *s)
{
    char *s2;

    s2 = malloc(strlen(s)+1);

    if(s2 != 0) {
        strcpy(s2, s);
    }

    return s2;
}

Exercise: Write a function strcatAlloc that returns a freshly-malloc’d string that concatenates its two arguments. Exactly how many bytes do you need to allocate?

Command-line arguments

Now that we know about strings, we can finally do something with argc and argv.

Recall that argv in main is declared as char **; this means that it is a pointer to a pointer to a char, or in this case the base address of an array of pointers to char, where each such pointer references a string. These strings correspond to the command-line arguments to your program, with the program name itself appearing in argv[0]

The count argc counts all arguments including argv[0]; it is 1 if your program is called with no arguments and larger otherwise.

Here is a program that prints its arguments. If you get confused about what argc and argv do, feel free to compile this and play with it:

#include <stdio.h>

int
main(int argc, char **argv)
{
    int i;

    printf("argc = %d\n\n", argc);

    for(i = 0; i < argc; i++) {
	printf("argv[%d] = %s\n", i, argv[i]);
    }

    return 0;
}

examples/strings/printArgs.c

Like strings, C terminates argv with a null: the value of argv[argc] is always 0 (a null pointer to char). In principle this allows you to recover argc if you lose it.


Licenses and Attributions


Speak Your Mind

-->