###### Books / Analysis of Algorithms / Chapter 13

# Dynamic Programming - Longest Common Subsequence

To further illustrate using dynamic programming, another problem that can be solved with dynamic programming is finding the *longest common subsequence* between two sequences.

## Sequences

Given a sequence

X= <x_{1},x_{2}, …,x_{n}>

a sequence of length *k*

Z= <z_{1},z_{2}, …,z_{k}>

is a *subsequence* if there exists a *strictly increasing* set of indicies <*i*_{1}, *i*_{2}, …, *i*_{k}> such that

In other words, all the elements of *Z* appear *in the same order* as they appear in *X* (but not necessarily consecutively).

For example, consider the seqence *X* = <A, B, C, B, D, A>. One subsequence is *Z* = <B, C, D, A> with indices *i* = <2, 3, 5, 6>.

## Longest Common Subsequence

**Problem**

Given two sequences *X* of length *m* and *Y* of length *n* as

X= <x_{1},x_{2}, …,x_{m}>

Y= <y_{1},y_{2}, …,y_{n}>

find the *longest* common subsequence (LCS).

**Solution**

*Step 1: Characterize optimality*

The brute force procedure would involve enumerating all 2^{m} subsequences of *X* (again simply consider all binary strings of length *m*) and check if they are also subsequences of *Y* keeping track of the longest one. Clearly this produces exponential run time and does not take advantage of the optimal substructure of the solution.

Define the *i*^{th} *prefix* of a sequence as the first *i* elements

X_{i}= <x_{1},x_{2}, …,x_{i}>

with *X*_{0} representing the empty sequence.

If we assume that *Z* = <*z*_{1}, *z*_{2}, …, *z*_{k}> is a LCS (with length *k*) of *X* and *Y* then one of the following three cases must hold:

- If
x_{m}=y_{n}, thenz_{k}=x_{m}=y_{n}andZ_{k-1}is a LCS ofX_{m-1},Y_{n-1}. Basically if the last elements of both sequences are the same then it must be the last element of the LCS and thek-1prefix of the LCS must be a LCS of them-1andn-1prefixes of the original sequences.- If
x_{m}≠y_{n}, then ifz_{k}≠x_{m}Zis a LCS ofX_{m-1},Y. Basically if the last element of the LCS isnotthe same as the last element ofXthen it must be a LCS of the prefix ofXwithout the last element.- If
x_{m}≠y_{n}, then ifz_{k}≠y_{n}Zis a LCS ofX,Y_{n-1}. Basically if the last element of the LCS isnotthe same as the last element ofYthen it must be a LCS of the prefix ofYwithout the last element.

In all three cases we see that the LCS of the original two sequences contains a LCS of *prefixes* of the two sequences (smaller versions of the original problem) ⇒ *optimal substructure problem*.

*Step 2: Define the recursive solution (top-down)*

Case 1 reduces to the *single* subproblem of finding a LCS of *X*_{m-1}, *Y*_{n-1} and adding *x*_{m} = *y*_{n}to the end of *Z*.

Cases 2 and 3 reduces to *two* subproblems of finding a LCS of *X*_{m-1}, *Y* and *X*, *Y*_{n-1}and selecting the longer of the two (note both of these subproblems involve also solving the subproblem of Case 1).

Hence if we let *c[m,n]* be the length of a LCS for *X* and *Y* we can write the recursion described by the above cases as

Since each step of the recursion removes at least one element from one of the sequences, there are only Θ(*mn*) subproblems to consider. Hence we can solve it by creating two tables - *C* an *m* x *n* table storing the LCS lengths and *B* an *m* x *n* table for reconstructing the LCS.

Note that not all subproblems are considered depending on which recursive branch is selected.

*Step 3: Compute the length of the LCS (bottom-up)*

Replacing *m* by *i* and *n* by *j* and noting that *c[i,0] = c[0,j] = 0*, we can rewrite the recursive equation bottom up as follows

When the procedure is complete, the optimal length of the LCS will be stored in *c[m,n]*. Thus since we fill in the entire table, the procedure will take O(*mn*).

*Step 4: Construct an optimal LCS*

Start at *any* entry containing the max-length (for example *c[m,n]*) and follow the arrows through the table adding elements in reverse order whenever a ↖ occurs. At worst we move up or left at each step giving a run time of O(*m* + *n*).

Alternatively we could avoid the *B* matrix (saving some space) and reconstruct the LCS from *C* at each step in O(1) time (using only the surrounding table cells), however it does not provide any improvement in the asymptotic run time.

**Example**

Consider the two sequences

X= <A, B, C, B, A>

Y= <B, D, C, A, B>

We will fill in the table row-wise starting in the upper left corner using the following formulas

The completed table is given by

i,j0 1 2 3 4 5 YBDCAB0 X0 0 0 0 0 0 1 A0 0↑ 0↑ 0↑ 1↖ 1← 2 B0 1↖ 1← 1← 1↑ 2↖ 3 C0 1↑ 1↑ 2↖ 2← 2↑ 4 B0 1↖ 1↑ 2↑ 2↑ 3↖ 5 A0 1↑ 1↑ 2↑ 3↖ 3↑

Thus the optimal LCS length is *c[m,n]* = 3.

Constructing an optimal LCS starting at *c[5,5]* we get *Z* = <B, C, B> (added at elements *c[4,5]*, *c[3,3]*, and *c[2,1]*). Alternatively we could start at *c[5,4]* which would produce *Z* = <B, C, A>. Note that the LCS *is not unique* but the optimal length of the LCS *is*.