Detecting Duplicated Code

Fowler and Beck have ranked duplicated code as the first of the top ten code smells indicating the need to refactor a piece of software [FBB^+^99]. As they like to explain it, whenever you duplicate a piece of code, you are taking out a loan, in the sense that you are getting something now (an almost ready-made piece of software) that you will have to pay for later. There is nothing wrong with taking out a loan, but you have a choice between paying back a small amount now (by taking out the time to refactor your code to eliminate the duplication) or paying back a lot later (in terms of increased complexity and maintenance costs).

Data from empirical studies show that typically between 8% and 12% of industrial software consists of duplicated code [DRD99]. Although this may not seem like much, in fact it is difficult to achieve very high rates of duplication. (Imagine what it would take to have a duplication rate of even 50%!) Duplication rates of 15 to 20% are therefore considered to be severe.

It is important to identify duplicated code for the following reasons:

  • Duplicated code hampers the introduction of changes, since every implemented variant of a piece of functionality will have to be changed. Since it is easy to miss some variants, bugs are likely to pop up in other places.

  • Duplicated code replicates and scatters the logic of a system instead of grouping it into identifiable artifacts (classes, methods, packages). It leads to systems that are more difficult to understand and to change. Instead of just having to understand relationship between logical parts you will have first to identify them and then understand their relationships.

Duplicated code arises for a variety of reasons:

  • Whenever a programmer is implementing a piece of functionality that is remotely similar to something that has been done before, it is natural to use the existing code as a model for the new task. If it is a matter of recombining existing procedures, the task will be simple. But if the behavior is more complex, the easiest thing to do is to copy, paste and modify the old code to achieve the functionality. If both the old and new pieces of code belong to different applications, the harm is minimal. But if they are part of the same system, duplicated code has now been introduced.

  • Sometimes code is copied, pasted and modified between different applications, or different versions of the same application. When multiple versions must be maintained simultaneously, or when different applications or versions must be merged, you immediately have a duplicated code problem.

From a reengineering perspective, usually people know whether or not a system suffers from duplication. First, the development team, or the manager will tell you. Second, there are normally some clear signs that duplication has been practiced in a project: for example, two developers cannot develop four millions of line of code in less than eight months without copying and pasting existing code. While analyzing the system you will also identify duplicated code by accident. There is a major difference, however, between knowing that a system contains duplicated code, and knowing exactly which parts have been duplicated.

image

Figure 8.1: Two patterns to support Detecting Duplicated Code.

Overview

Detecting Duplicated Code consists of two patterns: Compare Code Mechanically, which describes how we can detect duplicated code, and Visualize Code as Dotplots, which shows how duplicated code can be better understood by simple matrix visualization.

Once you have detected and understood duplication in the system, you may decide on a variety of tactics. Various refactoring patterns, such as Extract Method [p. 291] may help you to eliminate the duplication. Duplication may be a sign of misplaced responsibilities, in which you should may decide to Move Behavior Close to Data [p. 221].

Complex conditional statements are also a form of duplication, and may indicate that multiple clients have to duplicate actions that should belong to the target class. The pattern cluster Transform Conditionals to Polymorphism can help you to resolve these problems.

8.1 Compare Code Mechanically

Intent Discover duplicated code by comparing all the source code files line-byline.

Problem

How do you discover which parts of an application code have been duplicated?

This problem is difficult because:

  • You may suspect that code has been duplicated but you do not have any a priori evidence where the duplication occurs. For example, you know that two programmers cannot have developed 4 million lines of Cobol in one year without having duplicated some code.

  • Browsing the code is not an effective way of discovering duplication; you will only find duplicated code by accident.

  • Programmers may have not only copied and pasted code but also modified variables or changed the shape of the programs.

Yet, solving this problem is feasible because:

  • Most duplicated code can be detected by mechanical procedures.

Solution

Textually compare each line of the software source code with all the other lines of code.

Steps

  • Normalize the lines of code by removing comments, tabs and blanks.

  • Remove lines that contain uninteresting code elements (e.g., just else or })

  • Compare each line with all the other lines. Reduce search complexity by hashing:

    • Preprocessing: Compute the hash value for each line

    • Actual Comparison: Compare all lines in the same hash bucket Variants

This approach may fail to identify some instances of duplicated code due to renaming of variables. By deleting all variable identifiers, or by mapping them to a common symbol, you can detect similar code patterns, while abstracting from the details of the specific identifiers. This variant, however, requires some syntactic processing of the code.

Tradeoffs

Pros

  • The approach is simple and gives good results while only requiring modest resources.

  • It is nearly language-independent in the sense that you only have to build a lexical analyzer and not a full parser. That’s why a simple perl script can be sufficient depending on the level of sophistication that you want.

  • Simple statistics and percentage rates are easily computed and may help you to gain credibility or more strength in discussions on resource allocation or hiring new people.

Cons

  • Code that has been heavily edited after copying may be hard to identify as duplicated code.

  • Systems containing a lot of duplicated code will generate a lot of data that can be difficult to analyze effectively.

Example

Consider the case of a system written in C++ where you suspect duplicated code. However, you didn’t write to code yourself so you don’t know where the actual duplication occurs. How can you detect where the duplicated code fragments are? Consistent with Keep It Simple [p. 31] you do the simplest thing that may possibly work: you write a little script that first normalizes the code to remove all white space from the code and afterwards compares each line of code against itself.

The normalization would change the following code

//...
// assign same fastid as container
fastid = NULL;
const char* fidptr = getFastid();
if(fidptr != NULL) {
    int l = strlen(fidptr);
    fastid = new char[l+1];
    char* tmp = (char*) fastid;
    for (int i =0;i<l;i++)
        tmp[i] = fidptr[i];
    tmp[l] = '\0';
}
//...

into

fastid=NULL;
const char* fidptr=getFastid();
if(fidptr!=NULL)
    int l=strlen(fidptr);
fastid=newchar[l+1];
char * tmp=(char*)fastid;
for(inti=0;i<l;i++)
    tmp[i]=fidptr[i];
tmp[l]='\0';

Known Uses

In the context of software reengineering, the pattern has been applied to detect duplicated code in FAMOOS case studies containing up to one million lines of C++. It also has been applied to detect duplicated code in a COBOL system of 4 million lines of code. DATRIX has investigated multiple versions of a large telecommunications system, wading through 89 million lines of code all in all [LPM</superscript>97].

8.2 Visualize Code as Dotplots

Intent _Gain insight into the nature of the duplication by studying the patterns in the dotplots.

Problem

How can you gain insight into the scope and nature of code duplication in a software system?

This problem is difficult because:

  • Just knowing where in the system duplicated code exists does not necessarily help you to understand its nature, or what should be done about it.

Yet, solving this problem is feasible because:

  • A picture is worth a thousand words.

Solution

Visualize the code as a matrix in which the two axes represent two source code files (possibly the same file), and dots in the matrix occur where source code lines are duplicated.

Steps

If you want to analyze two files A and B:

  • Normalize the contents of the two files to eliminate noise (white space etc.).

  • Let each axis of the matrix represent elements (e.g., the lines of code) of the normalized files.

  • Represent a match between two elements as a dot in the matrix.

  • Interpret the obtained pictures: a diagonal represents duplicated code between the two files.

To analyze the duplication inside a single file, plot the elements of that file on both axes.

image

Figure 8.2: Possible sequences of dot and their associated interpretations.

Interpretations

The interpretation of the obtained matrices are illustrated in Figure 8.2:

Some interesting configurations formed by the dots in the matrices are the following:

  • Exact Copies: diagonals of dots indicate copied sequences of source code.

  • Copies With Variations: sequences that have holes in them indicate that a portion of a copied sequences has been changed.

  • Inserts/Deletes: broken sequences with parts shifted to the right or left indicate that a portion of code has been inserted or deleted.

  • Repetitive Code Elements: rectangular configurations indicate periodic occurrences of the same code. An example is the break at the end of the individual cases of a C or C ++ switch statement, or recurring preprocessor commands like #ifdef SOME CONDITION.

Tradeoffs

Pros

  • The approach is largely language-independent, since only the code normalization depends on the language syntax.

  • The approach works well when reverse engineering large amounts of unknown code, because the dotplots attract your eye to certain parts of the code to be studied more closely.

  • The idea is simple yet works surprisingly well.

image

Figure 8.3: Code duplication before and after refactoring.

A simple version of the approach can be implemented by a good programmer using a appropriate tools in a couple of days. (One of our better students made a small dotplot browser in Delphi in two days.)

Cons

• Dotplots only present pairwise comparisons. They do not necessarily help you identify all instances of duplicated elements in the entire software system. Although the approach can easily be extended to present multiple files across each axis, the comparisons are still only pairwise.

Difficulties

  • A naive implementation of a dotplot visualizer may not scale well to large systems. Tuning and optimizing the approach for large data sets can compromise the simplicity of the approach.

  • The interpretation of the data may be more subtle than it appears at first glance. Indeed, while comparing multiple files the diagonals represent more duplication than is really in the system because we are comparing duplicated fragments with themselves over different files, as shown by Figure 8.3 and Figure 8.4.

  • The screen size limits the amount of information that can be visualized. Some success has been achieved with so-called “mural” visualization approaches [JS96]. However, these techniques are significantly more difficult to implement than simple dotplots and are not worth the extra effort.

Example

In Figure 8.3 we see a dotplot of two versions of a piece of software, before and after the duplication has been removed. The first version is compared to itself in the top left square. The line down the diagonal simply shows us that every line of code is being compared to itself. What is more interesting is that several other diagonal lines occur in the dotplot, which means that code has been duplicated within this file. A second version of the same file is compared to itself in the lower right square. Here we see no significant duplication aside from the main diagonal, which reflects the fact that all the duplicated code has been successfully refactored.

image Figure 8.4: A Python file A being compared to itself and to a second file B.

The bottom left and top right squares are mirror images of each other. They tell us how the before and after files have been reorganized. Since there is no strong diagonal, this tells us that significant reorganization has taken place. The diagonal stripes show us which parts of the old version have survived and where they appear in the new version. From the dot-

image

Figure 8.5: Dotplots produced by four switch statements. plot alone, we can guess that about half of the code has survived, and another half of the code has been significantly rewritten.

Dotplots are also useful to detect duplication across multiple files. Figure 8.4 shows a dotplot comparing two Python files. The comparison of A vs. A shows that there is essentially no internal duplication. Very likely there are some switch statements in the bottom have of the file, indicated by the matrix pattern.

When we compare file A to file B, however, we detect a staggering amount of duplication. It looks very much like file B is just a copy of file A that has been extended in various ways. Closer investigation showed this to be the case. In fact, file A was just an older version of file B that had inadvertently been left in the release.

Dotplots can also be useful to detect other problems. Figure 8.5 presents four clones that represent a switch statement over a type variable that is used to call individual construction code. The duplicated code could perhaps be eliminated by applying Transform Conditionals to Polymorphism.

Known Uses

The pattern has been applied in biological research to detect DNA sequences [PK82]. The Dotplot tool [Hel95] has been used to detect similarities in manual pages, literary texts and names from file systems. In the FAMOOS project, the pattern has been applied to build Duploc, a tool for identifying duplication in software source code [DRD99]. The Dup tool [Bak92] has been used to investigated the source code of the X-Window system and uses a dotplot matrix graphical representation.

Once you have detected duplicated code, numerous refactoring patterns may apply, in particular Extract Method [p. 291].

Very often duplicated code arises because clients assume too many responsibilities. In that case, Move Behavior Close to Data [p. 221] will help you to eliminate the duplication.

Dotplots also help to detect large conditional constructs. You should probably Transform Conditionals to Polymorphism to eliminate these conditionals and thereby achieve a more flexible design.


Licenses and Attributions


Speak Your Mind