RM: More dynamic programming

Longest common subsequence:

xm: A B C B D A B
yn: B D C A B A

The LCS is B C B A

For arbitrary inputs, it is NP-hard. For constant inputs, DP gives the solution in polynomial time ex: O(n^k). Also, there can be more than one LCSs and the solution for that is exponential time ex: O(k^n).

Here we maintain an LCS matrix of m*n size. Initially, for i= 0 or j = 0, LCSij = 0. Then we use the following optimal substructure.

LCSij = 0 for i = 0 or j = 0
          = LCSi-1j-1 + 1                  for xi = yj
          = max(LCSi-1j, LCSij-1)   for xi != yj

I thought of the following on my try which also works:

LCSij = 0 for i = 0 or j = 0
          = max(LCSi-1j-1 + Aij, LCSi-1j, LCSij-1)
where
Aij = 1 for xi = yj
     = 0 for xi != yj

It works as LCSi-1j-1 would be lesser than LCSi-1j or LCSij-1 by not more than 1. If xi = yj then we consume characters from both strings (LCSi-1j-1 + 1) and if not then we consume character from either one (max(LCSi-1j, LCSij-1)). "The rationale for this recurrence is that, if the last characters of two sequences are equal, they must be part of the LCS. You could never get a larger LCS by matching x_m to y_j where j < n, and vice versa." (from wiki)

incorrect:

  ... A

. x x x

. 1 0 2

A 1 2 1

corrected:

  ... A      OR       ... A

. x x x             . x x x

. 1 1 2             . 1 1 2

A 1 2 2             B 1 2 2

Below is an example from wiki:



    | 0 1 2 3 4 5 6 7

    |   M Z J A W X U

----|-----------------

0   | 0 0 0 0 0 0 0 0

1  X | 0 0 0 0 0 0 1 1

2  M | 0 1 1 1 1 1 1 1

3  J | 0 1 1 2 2 2 2 2

4  Y | 0 1 1 2 2 2 2 2

5  A | 0 1 1 2 3 3 3 3

6  U | 0 1 1 2 3 3 3 4

7  Z | 0 1 2 2 3 3 3 4

getLongestCommonSubsequence(x, y):
    lcs[m,n]
    for i = 0 to m:
        lcs[i,0] = 0
    for j = 0 to n:
        lcs[0,j] = 0
    for i = 1 to m:
        for j = 1 to n:
            if x[i] = y[j]:
                lcs[i,j] = lcs[i-1,j-1] + 1
            else: 
                lcs[i,j] = max(lcs[i-1,j], lcs[i,j-1])

After LCS matrix is calculated in O(n*m) time, we just need to trace back from LCS[n,m] location.

printLCS(lcs, x, y, i, j):
    if i = 0 or j = 0: 
        return ""
    else:
        if x[i] = y[j]:
            return printLCS(lcs, x, y, i-1, j-1) + x[i]
        else:
            if lcs[i,j-1] > lcs[i-1,j]: 
                return printLCS(lcs, x, y, i, j-1)
            else:
                return printLCS(lcs, x, y, i-1, j)

There can be more than one LCS of the same length. The above would print one of them in O(n+m) time. So by DP, we get one LCS in O(n*m) time.

Edit-Distance / String Mutations:

Distance between two strings is the minimum mutations required to transform one string to another (aka Levenshtein distance). In below case, to transform x to y, it is 1 (delete first B from x to get y). Applications are spell-checking, plagiarism detection etc.

xm: ABAB
yn: AAB

d('' , '') = 0
d(s , '') = d('' , s) = len(s)
d( s1 + ch1, s2 + ch2 )
= min ( d(s1, s2) + d(ch1, ch2) , d(s1 + ch1) + 1, d(s2 + ch2) + 1) d(ch1, ch2) = 1 when ch1 != ch2 else 0

( from here )

Optimal substructure is below:

mutij
= i for i >= 0, j = 0 and j for j >= 0, i = 0
= min( muti-1j-1 + (if xi = yj then 0 else 1), muti-1j + 1 , mutij-1 + 1 )

And mut[m,n] gives the mutations / distance for xm and yn.

    | 0 1 2 3

    |   A B C x

----|---------

0   | 0 1 2 3

1  A | 1 0 1 2

2  A | 2 1 1 2

3  B | 3 2 1 2

4  B | 4 3 2 2

y

getDistance(x, y):
    mut[m,n]
    for i = 0 to m:
        mut[i,0] = i
    for j = 0 to n:
        mut[0,j] = j
    for i = 1 to m:
        for j = 1 to n:
            mut[i,j] = min( 
                        mut[i-1, j-1] + (x[i] = y[j] ? 0 : 1)
                        , mut[i-1, j] + 1
                        , mut[i, j-1] + 1 )

Once we have the mut matrix in O(n*m) time, below function prints the mutations by backtracking from mut[m,n] to mut[0,0] with respect to sequence x in O(m+n) time. There are 3 types of mutations (update in x y, delete in x, delete in y). When backtracking, we check which of the three values - muti-1j-1 (update in x y or no mutation if xi = yj), mutij-1 (delete in y), muti-1j (delete in x) - is same as mutij and print accordingly.

printMutations(mut, x, y, i, j):
    if i = 0 and j = 0:
        return ""
    else:
        if mut[i-1,j-1] + (x[i] = y[j] ? 0 : 1) = mut[i,j]: 
            if x[i] = y[j]:
                return printMutations(mut, x, y, i-1, j-1)
            else:
                return printMutations(mut, x, y, i-1, j-1) 
                    + print update x[i] at i in x
        else: 
            if mut[i,j-1] +1 = mut[i,j]:
                return printMutations(mut, x, y, i, j-1) 
                    + print delete y[j] at j in y

            else: //mut[i-1,j] +1 = mut[i,j]
                return printMutations(mut, x, y, i-1, j) 
                    + print delete x[i] at i in x

For x = AABB and y = ABC, it would print:

delete A at 1 in x

update B at 4 in x

Again, there can be more than one ways to mutate the sequences.

Sunday, July 20, 2008

More dynamic programming

No comments:

Post a Comment

Contributors

List