Sunday, July 20, 2008

More dynamic programming

Longest common subsequence:

xm: A B C B D A B
yn: B D C A B A

The LCS is B C B A

For arbitrary inputs, it is NP-hard. For constant inputs, DP gives the solution in polynomial time ex: O(n^k). Also, there can be more than one LCSs and the solution for that is exponential time ex: O(k^n).

Here we maintain an LCS matrix of m*n size. Initially, for i= 0 or j = 0, LCSij = 0. Then we use the following optimal substructure.

LCSij = 0 for i = 0 or j = 0
          = LCSi-1j-1 + 1                  for xi = yj
          = max(LCSi-1j, LCSij-1)   for xi != yj

I thought of the following on my try which also works:

LCSij = 0 for i = 0 or j = 0
          = max(LCSi-1j-1 + Aij, LCSi-1j, LCSij-1)
where
Aij =  1 for xi = yj
     =  0 for xi != yj


It works as LCSi-1j-1 would be lesser than LCSi-1j or LCSij-1 by not more than 1. If xi = yj then we consume characters from both strings (LCSi-1j-1 + 1) and if not then we consume character from either one (max(LCSi-1j, LCSij-1)). "The rationale for this recurrence is that, if the last characters of two sequences are equal, they must be part of the LCS. You could never get a larger LCS by matching xm to yj where j < n, and vice versa." (from wiki)

incorrect:

  ... A
. x x x
. 1 0 2
A 1 2 1


corrected:

  ... A      OR       ... A
. x x x             . x x x
. 1 1 2             . 1 1 2
A 1 2 2             B 1 2 2


Below is an example from wiki:

    | 0 1 2 3 4 5 6 7
    |   M Z J A W X U
----|-----------------
0   | 0 0 0 0 0 0 0 0
1 X | 0 0 0 0 0 0 1 1
2 M | 0 1 1 1 1 1 1 1
3 J | 0 1 1 2 2 2 2 2
4 Y | 0 1 1 2 2 2 2 2
5 A | 0 1 1 2 3 3 3 3
6 U | 0 1 1 2 3 3 3 4
7 Z | 0 1 2 2 3 3 3 4


getLongestCommonSubsequence(x, y):
    lcs[m,n]
    for i = 0 to m:
        lcs[i,0] = 0
    for j = 0 to n:
        lcs[0,j] = 0
    for i = 1 to m:
        for j = 1 to n:
            if x[i] = y[j]:
                lcs[i,j] = lcs[i-1,j-1] + 1
            else: 
                lcs[i,j] = max(lcs[i-1,j], lcs[i,j-1])

After LCS matrix is calculated in O(n*m) time, we just need to trace back from LCS[n,m] location.

printLCS(lcs, x, y, i, j):
    if i = 0 or j = 0: 
        return ""
    else:
        if x[i] = y[j]:
            return printLCS(lcs, x, y, i-1, j-1) + x[i]
        else:
            if lcs[i,j-1] > lcs[i-1,j]: 
                return printLCS(lcs, x, y, i, j-1)
            else:
                return printLCS(lcs, x, y, i-1, j)

There can be more than one LCS of the same length. The above would print one of them in O(n+m) time. So by DP, we get one LCS in O(n*m) time.


Edit-Distance / String Mutations:

Distance between two strings is the minimum mutations required to transform one string to another (aka Levenshtein distance). In below case, to transform x to y, it is 1 (delete first B from x to get y). Applications are spell-checking, plagiarism detection etc.

xm: ABAB
yn: AAB

d('' , '') = 0
d(s , '') = d('' , s) = len(s)
d( s1 + ch1, s2 + ch2 )
 = min ( d(s1, s2) + d(ch1, ch2) , d(s1 + ch1) + 1, d(s2 + ch2) + 1)   d(ch1, ch2) = 1 when ch1 != ch2 else 0

( from here )

Optimal substructure is below:

mutij
= i for i >= 0, j = 0 and j for j >= 0, i = 0
= min( muti-1j-1 + (if xi = ythen 0 else 1), muti-1j + 1 , mutij-1 + 1 )

And mut[m,n] gives the mutations / distance for xm and yn.

    | 0 1 2 3
    |   A B C x
----|---------
0   | 0 1 2 3
1 A | 1 0 1 2
2 A | 2 1 1 2
3 B | 3 2 1 2
4 B | 4 3 2 2

  y

getDistance(x, y):
    mut[m,n]
    for i = 0 to m:
        mut[i,0] = i
    for j = 0 to n:
        mut[0,j] = j
    for i = 1 to m:
        for j = 1 to n:
            mut[i,j] = min( 
                        mut[i-1, j-1] + (x[i] = y[j] ? 0 : 1)
                        , mut[i-1, j] + 1
                        , mut[i, j-1] + 1 )

Once we have the mut matrix in O(n*m) time, below function prints the mutations by backtracking from mut[m,n] to mut[0,0] with respect to sequence x in O(m+n) time. There are 3 types of mutations (update in x y, delete in x, delete in y). When backtracking, we check which of the three values - muti-1j-1 (update in x y or no mutation if xi = yj), mutij-1 (delete in y), muti-1j (delete in x) - is same as mutij and print accordingly.

printMutations(mut, x, y, i, j):
    if i = 0 and j = 0:
        return ""
    else:
        if mut[i-1,j-1] + (x[i] = y[j] ? 0 : 1) = mut[i,j]: 
            if x[i] = y[j]:
                return printMutations(mut, x, y, i-1, j-1)
            else:
                return printMutations(mut, x, y, i-1, j-1) 
                    + print update x[i] at i in x
        else: 
            if mut[i,j-1] +1 = mut[i,j]:
                return printMutations(mut, x, y, i, j-1) 
                    + print delete y[j] at j in y

            else: //mut[i-1,j] +1 = mut[i,j]
                return printMutations(mut, x, y, i-1, j) 
                    + print delete x[i] at i in x

For x = AABB and y = ABC, it would print:
delete A at 1 in x
update B at 4 in x


Again, there can be more than one ways to mutate the sequences.

No comments:

Post a Comment