xm: A B C B D A B
yn: B D C A B A
The LCS is B C B A
For arbitrary inputs, it is NP-hard. For constant inputs, DP gives the solution in polynomial time ex: O(n^k). Also, there can be more than one LCSs and the solution for that is exponential time ex: O(k^n).
Here we maintain an LCS matrix of m*n size. Initially, for i= 0 or j = 0, LCSij = 0. Then we use the following optimal substructure.
LCSij = 0 for i = 0 or j = 0
= LCSi-1j-1 + 1 for xi = yj
= max(LCSi-1j, LCSij-1) for xi != yj
I thought of the following on my try which also works:
LCSij = 0 for i = 0 or j = 0
= max(LCSi-1j-1 + Aij, LCSi-1j, LCSij-1)
where
Aij = 1 for xi = yj
= 0 for xi != yj
It works as LCSi-1j-1 would be lesser than LCSi-1j or LCSij-1 by not more than 1. If xi = yj then we consume characters from both strings (LCSi-1j-1 + 1) and if not then we consume character from either one (max(LCSi-1j, LCSij-1)). "The rationale for this recurrence is that, if the last characters of two sequences are equal, they must be part of the LCS. You could never get a larger LCS by matching xm to yj where j < n, and vice versa." (from wiki)
incorrect:
... A
. x x x
. 1 0 2
A 1 2 1
corrected:
... A OR ... A
. x x x . x x x
. 1 1 2 . 1 1 2
A 1 2 2 B 1 2 2
Below is an example from wiki:
| 0 1 2 3 4 5 6 7
| M Z J A W X U
----|-----------------
0 | 0 0 0 0 0 0 0 0
1 X | 0 0 0 0 0 0 1 1
2 M | 0 1 1 1 1 1 1 1
3 J | 0 1 1 2 2 2 2 2
4 Y | 0 1 1 2 2 2 2 2
5 A | 0 1 1 2 3 3 3 3
6 U | 0 1 1 2 3 3 3 4
7 Z | 0 1 2 2 3 3 3 4
getLongestCommonSubsequence(x, y):
lcs[m,n]
for i = 0 to m:
lcs[i,0] = 0
for j = 0 to n:
lcs[0,j] = 0
for i = 1 to m:
for j = 1 to n:
if x[i] = y[j]:
lcs[i,j] = lcs[i-1,j-1] + 1
else:
lcs[i,j] = max(lcs[i-1,j], lcs[i,j-1])
After LCS matrix is calculated in O(n*m) time, we just need to trace back from LCS[n,m] location.
printLCS(lcs, x, y, i, j):
if i = 0 or j = 0:
return ""
else:
if x[i] = y[j]:
return printLCS(lcs, x, y, i-1, j-1) + x[i]
else:
if lcs[i,j-1] > lcs[i-1,j]:
return printLCS(lcs, x, y, i, j-1)
else:
return printLCS(lcs, x, y, i-1, j)
There can be more than one LCS of the same length. The above would print one of them in O(n+m) time. So by DP, we get one LCS in O(n*m) time.
Edit-Distance / String Mutations:
Distance between two strings is the minimum mutations required to transform one string to another (aka Levenshtein distance). In below case, to transform x to y, it is 1 (delete first B from x to get y). Applications are spell-checking, plagiarism detection etc.
xm: A
yn: AAB
d('' , '') = 0
d(s , '') = d('' , s) = len(s)
d( s1 + ch1, s2 + ch2 )
= min ( d(s1, s2) + d(ch1, ch2) , d(s1 + ch1) + 1, d(s2 + ch2) + 1) d(ch1, ch2) = 1 when ch1 != ch2 else 0
( from here )
Optimal substructure is below:
mutij
= i for i >= 0, j = 0 and j for j >= 0, i = 0
= min( muti-1j-1 + (if xi = yj then 0 else 1), muti-1j + 1 , mutij-1 + 1 )
And mut[m,n] gives the mutations / distance for xm and yn.
| 0 1 2 3
| A B C x
----|---------
0 | 0 1 2 3
1 A | 1 0 1 2
2 A | 2 1 1 2
3 B | 3 2 1 2
4 B | 4 3 2 2
y
getDistance(x, y):
mut[m,n]
for i = 0 to m:
mut[i,0] = i
for j = 0 to n:
mut[0,j] = j
for i = 1 to m:
for j = 1 to n:
mut[i,j] = min(
mut[i-1, j-1] + (x[i] = y[j] ? 0 : 1)
, mut[i-1, j] + 1
, mut[i, j-1] + 1 )
Once we have the mut matrix in O(n*m) time, below function prints the mutations by backtracking from mut[m,n] to mut[0,0] with respect to sequence x in O(m+n) time. There are 3 types of mutations (update in x y, delete in x, delete in y). When backtracking, we check which of the three values - muti-1j-1 (update in x y or no mutation if xi = yj), mutij-1 (delete in y), muti-1j (delete in x) - is same as mutij and print accordingly.
printMutations(mut, x, y, i, j):
if i = 0 and j = 0:
return ""
else:
if mut[i-1,j-1] + (x[i] = y[j] ? 0 : 1) = mut[i,j]:
if x[i] = y[j]:
return printMutations(mut, x, y, i-1, j-1)
else:
return printMutations(mut, x, y, i-1, j-1)
+ print update x[i] at i in x
else:
if mut[i,j-1] +1 = mut[i,j]:
return printMutations(mut, x, y, i, j-1)
+ print delete y[j] at j in y
else: //mut[i-1,j] +1 = mut[i,j]
return printMutations(mut, x, y, i-1, j)
+ print delete x[i] at i in x
For x = AABB and y = ABC, it would print:
delete A at 1 in x
update B at 4 in x
Again, there can be more than one ways to mutate the sequences.
No comments:
Post a Comment