Missing Values
When working with numeric data, one often needs to deal with missing values. Failing to account for them early in the development process can cause enormous problems later on.
Motivation
Suppose you’re writing a program that helps households manage their monthly budgets (in dollars and cents). Users of such a program have to enter their various expenditures every week. Unfortunately, people sometimes forget to do so. For example, someone might forget to enter their grocery expenditures for a particular week. When calculating their average expenditure on groceries, this missing value shouldn’t be treated as 0.00, because that would skew the result. However, it must be accounted for somehow.
To deal with problems of this kind you must think about two things. First, you have to think about how to represent missing values. Second, you have to think about how to incorporate them into calculations of various kinds.
Review
If you were given the task of writing such a budget program, you would almost certainly use a double to represent expenditures. Then, since expenditures must be non-negative, you would use a sentinel value like -1.00 to indicate that the expenditure is actually missing.
There are two shortcomings of this approach for general situations. The first, and most important, is that in many situations there is no double value that can be used reliably as a sentinel because every possible double value is valid. The second is that it is error-prone. Specifically, if at some point a programmer forgets to check whether a value is a sentinel, it will be used as if it were valid, producing incorrect results (and a defect that is very difficult to localize and correct).
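The second shortcoming can be illustrated with a minimal sketch. The class name, the method names, the constant MISSING, and the sample data are all hypothetical and are used only for illustration.

```java
class SentinelPitfall {
    // Hypothetical sentinel marking a missing expenditure (an assumption of this sketch)
    static final double MISSING = -1.00;

    // Checks for the sentinel before accumulating
    static double meanWithCheck(double[] data) {
        double total = 0.0;
        int n = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] != MISSING) {
                total += data[i];
                n++;
            }
        }
        return total / n;
    }

    // Forgets the check, so the sentinel is treated as a real expenditure
    static double meanWithoutCheck(double[] data) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            total += data[i];
        }
        return total / data.length;
    }

    public static void main(String[] args) {
        double[] groceries = {100.00, MISSING, 80.00};
        System.out.println(meanWithCheck(groceries));    // 90.0
        System.out.println(meanWithoutCheck(groceries)); // about 59.67, silently wrong
    }
}
```

Nothing in the second method fails at compile time or at run time. It simply produces a plausible-looking number, which is what makes this kind of defect hard to find.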
Thinking About The Problem
Ideally, every data type would have an associated sentinel. Unfortunately, this isn’t the case. Fortunately, however, all reference types do have an associated sentinel, the reference null. This means that you have a natural way to indicate that something is missing for everything that is represented using a reference type. For example, if you don’t have the name of the grocery store where a purchase was made, you can indicate that by assigning null to the relevant variable.
The Pattern
This observation leads to a solution to the general problem. Specifically, as in Chapter 32 on outbound parameters, you can use wrapper objects to hold the numeric values. When a particular data point is missing, the wrapper object will be null; otherwise, the wrapper object will hold the value. Since there is no reason for the wrapper objects to be mutable, unlike in Chapter 32, you can use the built-in Double and/or Integer classes. Then, before performing any operation on the wrapped data, you just check to see if the wrapper is null, extract the value if it isn’t, and take the appropriate actions in either case.
This pattern can be summarized as follows. When collecting the data, you must:
1. Declare a wrapper object to be a Double or Integer as appropriate.
2. If the information isn’t missing, use the static Double.valueOf() or Integer.valueOf() method to construct the wrapper object.
Then, when processing the data, you must:
3. Determine if the wrapper object is null.
4a. If it is, take the appropriate actions for a missing value.
4b. If it isn’t, use the wrapper object’s doubleValue() or intValue() to retrieve the value and take the appropriate actions for a non-missing value.
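The steps above can be sketched as follows. The class and method names are hypothetical, and treating a null or blank input string as "missing" is an assumption made only for this illustration.

```java
class WrapperPattern {
    // Steps 1 and 2: declare the wrapper and construct it only when data was supplied.
    // Treating a null or blank string as missing is an assumption of this sketch.
    static Double collect(String input) {
        Double wrapper = null;
        if (input != null && !input.trim().isEmpty()) {
            wrapper = Double.valueOf(Double.parseDouble(input.trim()));
        }
        return wrapper;
    }

    // Steps 3, 4a, and 4b: check for null before extracting the value
    static String describe(Double wrapper) {
        if (wrapper == null) {
            return "missing";
        } else {
            double value = wrapper.doubleValue();
            return "spent " + value;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(collect("42.50"))); // spent 42.5
        System.out.println(describe(collect("")));      // missing
    }
}
```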
Examples
As an example, consider situations in which you need to calculate the mean of an array of data points (using one or more accumulators as in Chapter 16). Each data point is represented as a Double object, as is the result of the calculation (i.e., the mean), so that it can be used in subsequent calculations (e.g., in the calculation of the variance). The situations vary in the way missing values are handled.
Using a Default Value
The first kind of situation is one in which a default value is used in place of any missing elements. This would be appropriate, for example, when calculating the mean exam grade in a course in which all of the exams are required and, hence, the defaultValue is 0.0, as in the following:
total = 0.0;
for (int i = 0; i < data.length; i++) {
    if (data[i] == null) {
        total += defaultValue; // Initialized elsewhere
    } else {
        total += data[i].doubleValue();
    }
}
average = total / (double) data.length;
All that is needed in this case is to increase the accumulator named total by the defaultValue when the element is missing or by the actual value when it isn’t.
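The same approach carries over to Integer wrappers. The following is a minimal sketch, with a hypothetical class name and sample grades, that uses intValue() in place of doubleValue(). It assumes a non-empty array.

```java
class IntegerDefaultMean {
    // Mean of integer data in which a missing (null) element counts as defaultValue.
    // Assumes data is not empty.
    static double mean(Integer[] data, int defaultValue) {
        int total = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == null) {
                total += defaultValue;
            } else {
                total += data[i].intValue();
            }
        }
        // Cast before dividing so that integer division does not truncate the result
        return total / (double) data.length;
    }

    public static void main(String[] args) {
        Integer[] grades = {Integer.valueOf(90), null, Integer.valueOf(75)};
        System.out.println(mean(grades, 0)); // 55.0
    }
}
```

The cast matters more here than in the double version: without it, both operands of the division would be int values and the fractional part of the mean would be lost.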
Ignoring Missing Values
The next kind of situation is one in which missing values are ignored (i.e., each missing value is skipped). This approach might be used, for example, to calculate someone’s average weekly grocery bill when they might forget to enter the value for a particular week, as in the following:
total = 0.0;
n = 0;
for (int i = 0; i < data.length; i++) {
    if (data[i] != null) {
        total += data[i].doubleValue();
        n++;
    }
}
average = total / (double) n;
In this case it is critical to ensure that the number of non-missing values is used when calculating the mean. A second accumulator, n, is used for this purpose.
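One case the fragment above does not address is an array in which every element is missing. Then n is 0, and dividing 0.0 by 0.0 yields NaN rather than a meaningful mean. The following sketch, with a hypothetical class name, treats that case as a missing result by returning null. That choice is an assumption of this sketch, not the only reasonable one.

```java
class IgnoreMissingMean {
    // Returns the mean of the non-missing elements, or null if there are none.
    // Returning null in the all-missing case is a choice made for this sketch.
    static Double mean(Double[] data) {
        double total = 0.0;
        int n = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] != null) {
                total += data[i].doubleValue();
                n++;
            }
        }
        if (n == 0) {
            return null; // Avoids 0.0 / 0.0, which would yield NaN
        }
        return Double.valueOf(total / (double) n);
    }

    public static void main(String[] args) {
        Double[] groceries = {Double.valueOf(100.00), null, Double.valueOf(80.00)};
        System.out.println(mean(groceries));                 // 90.0
        System.out.println(mean(new Double[] {null, null})); // null
    }
}
```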
Propagating the Missing Value
The final kind of situation is one in which missing values are propagated. In other words, any calculation involving a missing value results in a missing value. This might be appropriate, for example, when calculating the average state population in the United States. If the population for a particular state is missing, it can neither be ignored nor replaced with a default value. So, the average itself must be missing, as in the following:
missing = false;
total = 0.0;
for (int i = 0; i < data.length; i++) {
    if (data[i] == null) {
        missing = true;
        break; // No reason to continue iterating
    } else {
        total += data[i].doubleValue();
    }
}
if (missing) {
    result = null;
} else {
    result = Double.valueOf(total / (double) data.length);
}
In this case, after the loop terminates, you need to know if there were any missing values. Again, a second accumulator (named missing) is used for this purpose. Note that, as soon as a missing value is encountered, the loop can be terminated.
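Because the result is itself a Double, a missing mean can propagate into later calculations. The following sketch illustrates this with a population variance that is missing whenever the mean is missing. The class and method names are hypothetical, and the sketch assumes a non-empty array.

```java
class PropagatedStatistics {
    // Returns the mean, or null if any element is missing. Assumes data is not empty.
    static Double mean(Double[] data) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == null) {
                return null; // Propagate the missing value immediately
            }
            total += data[i].doubleValue();
        }
        return Double.valueOf(total / (double) data.length);
    }

    // Returns the population variance, or null if the mean is missing
    static Double variance(Double[] data) {
        Double wrappedMean = mean(data);
        if (wrappedMean == null) {
            return null; // A missing mean makes the variance missing as well
        }
        double m = wrappedMean.doubleValue();
        double sumOfSquares = 0.0;
        for (int i = 0; i < data.length; i++) {
            // Safe to unbox here: a non-null mean implies that no element is null
            double deviation = data[i].doubleValue() - m;
            sumOfSquares += deviation * deviation;
        }
        return Double.valueOf(sumOfSquares / (double) data.length);
    }

    public static void main(String[] args) {
        Double[] complete = {1.0, 3.0, 5.0, 7.0};
        Double[] incomplete = {1.0, null, 5.0, 7.0};
        System.out.println(variance(complete));   // 5.0
        System.out.println(variance(incomplete)); // null
    }
}
```

Returning from inside the loop plays the same role as the missing flag and the break statement in the fragment above. The array initializers in main rely on the compiler boxing each literal, which is discussed in the next section.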
A Warning
As a convenience, the Java compiler automatically boxes primitive values and unboxes wrapper objects. This means that, given the following declarations:
double value;
Double wrapper;
a statement like the following:
wrapper = value;
is actually converted into the following:
wrapper = Double.valueOf(value);
and then compiled.
Similarly, a statement like the following:
value = wrapper;
is actually converted into the following:
value = wrapper.doubleValue();
and then compiled.
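It is very easy for beginning programmers to forget that this happens, and make mistakes as a result. In particular, because unboxing calls doubleValue() on the wrapper, a statement like value = wrapper; throws a NullPointerException when the wrapper is null (i.e., when the data point is missing). It is also very easy to think that the compiler will box/unbox things that it will not. For example, you cannot assign a double[] to a Double[] or vice versa. So, when first starting out, you should not rely on this “convenience”.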
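Both pitfalls can be seen in a minimal sketch. The class and method names are hypothetical. The first method shows an explicit, null-safe alternative to implicit unboxing, the second shows the element-by-element conversion that arrays require, and the third demonstrates the run-time failure.

```java
class BoxingPitfalls {
    // Explicit, null-safe unboxing: the caller decides what a missing value becomes
    static double unboxOrDefault(Double wrapper, double defaultValue) {
        if (wrapper == null) {
            return defaultValue;
        }
        return wrapper.doubleValue();
    }

    // Arrays are never boxed automatically, so each element is converted explicitly
    static Double[] box(double[] values) {
        Double[] result = new Double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = Double.valueOf(values[i]);
        }
        return result;
    }

    // Returns true if implicitly unboxing a null wrapper throws an exception
    static boolean implicitUnboxingFails() {
        Double wrapper = null;
        try {
            double value = wrapper; // Compiles, but calls doubleValue() on null
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(implicitUnboxingFails());            // true
        System.out.println(unboxOrDefault(null, 0.0));          // 0.0
        System.out.println(box(new double[] {1.5, 2.5}).length); // 2
    }
}
```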
Looking Ahead
When you learn about collections you will learn about parameterized classes (i.e., type-safe, generic classes). Though they are almost always taught originally in the context of collections, parameterized classes actually have many other uses.
One example is the Optional class in the java.util package. It is a wrapper class that has methods like isEmpty() and isPresent() that can be used to determine if a value is missing or supplied. In addition, it has methods like orElse() that return the actual contents for non-missing data and a default for missing data.
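As a preview, the following sketch shows how a possibly-null Double might be converted to an Optional and then queried. The class and method names are hypothetical. The sketch uses only isPresent() and orElse() because isEmpty() was added in Java 11 and is not available in earlier versions.

```java
import java.util.Optional;

class OptionalPreview {
    // Converts the null-based representation used in this chapter to an Optional
    static Optional<Double> toOptional(Double wrapper) {
        return Optional.ofNullable(wrapper);
    }

    public static void main(String[] args) {
        Optional<Double> supplied = toOptional(Double.valueOf(42.50));
        Optional<Double> missing = toOptional(null);

        System.out.println(supplied.isPresent()); // true
        System.out.println(missing.isPresent());  // false
        System.out.println(supplied.orElse(0.0)); // 42.5
        System.out.println(missing.orElse(0.0));  // 0.0
    }
}
```

Here orElse() plays the role of the default-value approach described earlier in this chapter.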