Missing Values

When working with numeric data, you often need to deal with missing values. Failing to take this requirement into account early in the development process can cause enormous problems later on.

Motivation

Suppose you’re writing a program that helps households manage their monthly budgets (in dollars and cents). Users of such a program have to enter their various expenditures every week. Unfortunately, people sometimes forget to do so. For example, someone might forget to enter their grocery expenditures for a particular week. When calculating their average expenditure on groceries, this missing value shouldn’t be treated as 0.00, because that would skew the result. However, it must be accounted for somehow.

To deal with problems of this kind you must think about two things. First, you have to think about how to represent missing values. Second, you have to think about how to incorporate them into calculations of various kinds.

Review

If you were given the task of writing such a budget program, you would almost certainly use a double to represent expenditures. Then, since expenditures must be non-negative, you would use a sentinel value like -1.00 to indicate that the expenditure is actually missing.

There are two shortcomings of this approach for general situations. The first, and most important, is that in many situations there is no double value that can be used reliably as a sentinel because every possible double value is valid. The second is that it is error-prone. Specifically, if at some point a programmer forgets to check whether a value is the sentinel, it will be used as if it were valid, producing incorrect results (and a defect that is very difficult to localize and correct).
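The second shortcoming can be illustrated with a small sketch. The class name, the method names, and the MISSING constant are all hypothetical; the point is only that nothing forces a programmer to perform the sentinel check.

```java
class SentinelPitfall {
    // Hypothetical sentinel marking a missing expenditure
    static final double MISSING = -1.00;

    // Defective: the programmer forgot to check for the sentinel,
    // so -1.00 is silently included in the total
    static double naiveMean(double[] data) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            total += data[i];
        }
        return total / data.length;
    }

    // Correct: elements equal to the sentinel are skipped
    // (assumes at least one element is not the sentinel)
    static double checkedMean(double[] data) {
        double total = 0.0;
        int n = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] != MISSING) {
                total += data[i];
                n++;
            }
        }
        return total / n;
    }

    public static void main(String[] args) {
        double[] groceries = {100.00, MISSING, 50.00};
        System.out.println(naiveMean(groceries));   // roughly 49.67, which is wrong
        System.out.println(checkedMean(groceries)); // prints 75.0
    }
}
```

Both methods compile and run without complaint, which is exactly why this kind of defect is hard to find.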

Thinking About The Problem

Ideally, every data type would have an associated sentinel. Unfortunately, this isn’t the case. Fortunately, however, all reference types do have an associated sentinel, the reference null.

This means that you have a natural way to indicate that something is missing for everything that is represented using a reference type. For example, if you don’t have the name of the grocery store where a purchase was made, you can indicate that by assigning null to the relevant variable.

The Pattern

This observation leads to a solution to the general problem. Specifically, as in Chapter 32 on outbound parameters, you can use wrapper objects to hold the numeric values. When a particular data point is missing, the variable that would otherwise refer to its wrapper object is null; otherwise, the wrapper object holds the value. Unlike in Chapter 32, there is no reason for the wrapper objects to be mutable, so you can use the built-in Double and/or Integer classes. Then, before performing any operation on the wrapped data, you check whether the reference is null, extract the value if it isn’t, and take the appropriate actions in either case.

This pattern can be summarized as follows. When collecting the data, you must:

1. Declare a wrapper variable of type Double or Integer, as appropriate.

2. If the information isn’t missing, use the static Double.valueOf() or Integer.valueOf() method to construct the wrapper object; otherwise, assign null to the variable.

Then, when processing the data, you must:

3. Determine if the wrapper object is null.

4a. If it is, take the appropriate actions for a missing value.

4b. If it isn’t, use the wrapper object’s doubleValue() or intValue() to retrieve the value and take the appropriate actions for a non-missing value.
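The steps above can be sketched as follows. The class and its two methods are hypothetical, as is the assumption that the data arrive as text and that a blank (or absent) entry means the value is missing.

```java
class WrapperPattern {
    // Steps 1 and 2: collect a data point (hypothetical helper)
    static Double collect(String input) {
        Double wrapper = null;                       // Step 1
        if (input != null && !input.trim().isEmpty()) {
            wrapper = Double.valueOf(input.trim());  // Step 2
        }
        return wrapper;
    }

    // Steps 3, 4a, and 4b: process a data point (hypothetical helper)
    static String describe(Double wrapper) {
        if (wrapper == null) {                       // Step 3
            return "missing";                        // Step 4a
        } else {
            double value = wrapper.doubleValue();    // Step 4b
            return "value: " + value;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(collect("42.50"))); // prints value: 42.5
        System.out.println(describe(collect("")));      // prints missing
    }
}
```

Note that Double.valueOf(String) throws a NumberFormatException for text that isn't a number; this sketch does not handle that case.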

Examples

As an example, consider situations in which you need to calculate the mean of an array of data points (using one or more accumulators as in Chapter 16). Each data point is represented as a Double object, as is the result of the calculation (i.e., the mean), so that it can be used in subsequent calculations (e.g., in the calculation of the variance). The situations vary in the way missing values are handled.

Using a Default Value

The first kind of situation is one in which a default value is used in place of any missing elements. This would be appropriate, for example, when calculating the mean exam grade in a course in which all of the exams are required and, hence, the defaultValue is 0.0, as in the following:

        total = 0.0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == null) {
                total += defaultValue; // Initialized elsewhere
            } else {
                total += data[i].doubleValue();
            }
        }
        average = total / (double) data.length;

All that is needed in this case is to increase the accumulator named total by the defaultValue when the element is missing or by the actual value when it isn’t.

Ignoring Missing Values

The next kind of situation is one in which missing values are ignored (i.e., each missing value is skipped). This approach might be used, for example, to calculate someone’s average weekly grocery bill when they might forget to enter the value for a particular week, as in the following:

        total = 0.0;
        n     = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] != null) {
                total += data[i].doubleValue();
                n++;
            }
        }
        average = total / (double) n;

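In this case, it is critical to ensure that the number of non-missing values is used when calculating the mean. A second accumulator, n, is used for this purpose. Note that if every value is missing, then n is 0 and the division yields NaN, so you may want to handle that case separately.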

Propagating the Missing Value

The final kind of situation is one in which missing values are propagated. In other words, any calculation involving a missing value results in a missing value. This might be appropriate, for example, when calculating the average state population in the United States. If the population for a particular state is missing, it can neither be ignored nor replaced with a default value. So, the average itself must be missing, as in the following:

        missing = false;
        total = 0.0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == null) {
                missing = true;
                break; // No reason to continue iterating
            } else {
                total += data[i].doubleValue();
            }
        }

        if (missing) {
            result = null;
        } else {
            result = Double.valueOf(total / (double) data.length);
        }

In this case, after the loop terminates, you need to know if there were any missing values. Again, a second accumulator (named missing) is used for this purpose. Note that, as soon as a missing value is encountered, the loop can be terminated.
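The fragments in this section omit their declarations. One way to package the three approaches as self-contained methods is sketched below. The class and method names are hypothetical, and returning null for data with nothing to average is an assumption of the sketch, not a requirement of the pattern.

```java
class MissingValueMeans {
    // Replace each missing element with defaultValue
    static Double meanWithDefault(Double[] data, double defaultValue) {
        if (data.length == 0) {
            return null; // Assumption: the mean of no data is itself missing
        }
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == null) {
                total += defaultValue;
            } else {
                total += data[i].doubleValue();
            }
        }
        return Double.valueOf(total / data.length);
    }

    // Skip missing elements and divide by the number of non-missing ones
    static Double meanIgnoringMissing(Double[] data) {
        double total = 0.0;
        int n = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] != null) {
                total += data[i].doubleValue();
                n++;
            }
        }
        if (n == 0) {
            return null; // Assumption: avoids returning NaN from 0.0 / 0
        }
        return Double.valueOf(total / n);
    }

    // Any missing element makes the result missing
    static Double meanPropagatingMissing(Double[] data) {
        if (data.length == 0) {
            return null; // Assumption: the mean of no data is itself missing
        }
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == null) {
                return null; // No reason to continue iterating
            }
            total += data[i].doubleValue();
        }
        return Double.valueOf(total / data.length);
    }

    public static void main(String[] args) {
        Double[] data = {Double.valueOf(10.0), null, Double.valueOf(20.0)};
        System.out.println(meanWithDefault(data, 0.0));   // prints 10.0
        System.out.println(meanIgnoringMissing(data));    // prints 15.0
        System.out.println(meanPropagatingMissing(data)); // prints null
    }
}
```

Because every method returns a Double, a caller can apply the same null check to the result that these methods apply to each element.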

A Warning

As a convenience, the Java compiler automatically boxes primitive values into wrapper objects and unboxes wrapper objects into primitive values. This means that, given the following declarations:

        double  value;
        Double  wrapper;

a statement like the following:

        wrapper = value;

is actually converted into the following:

        wrapper = Double.valueOf(value);

and then compiled.

Similarly, a statement like the following:

        value = wrapper;

is actually converted into the following:

        value = wrapper.doubleValue();

and then compiled.

It is very easy for beginning programmers to forget that this happens, and to make mistakes as a result. In particular, because unboxing is really a call to doubleValue(), unboxing a wrapper that is null (i.e., a missing value) throws a NullPointerException at run time. It is also very easy to think that the compiler will box/unbox things that it will not. For example, you cannot assign a double[] to a Double[] or vice versa. So, when first starting out, you should not rely on this “convenience”.
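Two mistakes of this kind are worth seeing in code. The first is unboxing a wrapper that is null, which throws a NullPointerException at run time. The second is expecting arrays to be boxed, which they never are, so each element has to be converted explicitly. The sketch below demonstrates both; the class and method names are hypothetical.

```java
class BoxingPitfalls {
    // Returns true if unboxing a null wrapper throws a NullPointerException
    static boolean unboxingNullThrows() {
        Double wrapper = null; // A missing value
        try {
            double value = wrapper; // Compiled as wrapper.doubleValue()
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // A double[] cannot be assigned to a Double[], so box element by element
    static Double[] box(double[] values) {
        Double[] result = new Double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = Double.valueOf(values[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(unboxingNullThrows()); // prints true

        Double[] boxed = box(new double[] {1.5, 2.5});
        System.out.println(boxed[1].doubleValue()); // prints 2.5
    }
}
```

The assignment inside the try block compiles without complaint; the problem only appears when the statement executes.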

Looking Ahead

When you learn about collections, you will learn about parameterized classes (i.e., type-safe, generic classes). Though they are usually first taught in the context of collections, parameterized classes have many other uses.

One example is the Optional class in the java.util package. It is a parameterized wrapper class that has methods like isPresent() and isEmpty() (the latter added in Java 11) that can be used to determine whether a value is supplied or missing. In addition, it has methods like orElse() that return the actual contents for non-missing data and a default for missing data.
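As a preview, the following sketch shows how Optional might be used with the default-value approach. The class and method names are hypothetical; Optional.ofNullable(), Optional.empty(), isPresent(), and orElse() are part of java.util.Optional. It uses isPresent() rather than isEmpty() so that it also compiles on Java 8.

```java
import java.util.Optional;

class OptionalSketch {
    // Convert a possibly-null wrapper into an Optional (hypothetical helper)
    static Optional<Double> wrap(Double wrapper) {
        return Optional.ofNullable(wrapper);
    }

    // Return the contents if present, or defaultValue if missing
    static double valueOrDefault(Optional<Double> data, double defaultValue) {
        return data.orElse(Double.valueOf(defaultValue)).doubleValue();
    }

    public static void main(String[] args) {
        Optional<Double> supplied = wrap(Double.valueOf(12.5));
        Optional<Double> missing = wrap(null);

        System.out.println(supplied.isPresent());          // prints true
        System.out.println(missing.isPresent());           // prints false
        System.out.println(valueOrDefault(supplied, 0.0)); // prints 12.5
        System.out.println(valueOrDefault(missing, 0.0));  // prints 0.0
    }
}
```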


