Sunday, February 01, 2009

Find the distinct words from a text file

I'm repeating this post from here, but I think it deserves it because of its elegance.

--

The data structure of the trie and its addWord() method remains the same.

class trieNode:
    trieNode[26] children
    char c
    int countPrefix // words with prefix from root to current node
    int countWord // complete words for this current node 

void addWord(trieNode node, string word): 
        //end of the word
        if (word = ""):
            node.countWord = node.countWord +1
            return

        //do stuff with current node
        node.countPrefix = node.countPrefix + 1
        
        //go to next node 
        char first = firstChar(word)
        if (node.children[first] == null):
            node.children[first] = new trieNode()
    
        addWord(node.children[first], firstCharRemoved(word))


Another example is to find the distinct words from a text file. At first, I thought of a Hashmap which acts like a boolean array for the words as index. In which case, the time required is O(n) and space required is that of the distinct words. With Trie, the solution is better and elegant in terms of space required, time required is O(length). In case of words 'anon', 'an', 'a', the Trie formed (''-a-n-o-n) would just need space for a-n-o-n. Thus, the advantages of Trie would show over a Hashmap when many words have overlapping prefixes. Once the Trie is formed, a DFS would print the distinct words as follows.

//do DFS on the trie
StringBuilder sb = new StringBuilder()
dfs(root, sb)


void dfs(trieNode node, StringBuilder sb):
    sb.append(node.c)
    
    if (node.countWord > 0):
        print(sb)
    
    for (trieNode child : node.children)
        if (child != null):
            dfs(child, sb)
    
    sb.length = sb.length -1


[Hat tip to B.]

No comments:

Post a Comment