Pages

Friday, October 26, 2012

[InterviewStreet Problem] Find Strings

Update
The mistake was in counting the number of children. I have now passed all but two test cases, which probably require that I implement the Suffix Tree using lists instead of arrays.
----------------------------------------------------------------------------------------------------------------------------

So I have been trying to solve problems under the "Strings" section in InterviewStreet. Find Strings is one of those questions and I am here to discuss a possible strategy to solve the problem. I'd like to make it clear from the beginning that my own solution has not been accepted yet, but I am unable to see the flaw in the approach and am expecting someone to point it out.

So, the question requires us to find the number of unique substrings in a given set of strings. At first I thought as the question unfolds and since I use C++ for most problems, I was thinking in terms of the available data structures in STL. So, at first I thought of storing the unique substrings (of the order of the square of the length of the string) in a 'set'. The set takes care of the uniqueness, but the problem arises if I try to find the 'k-th' smallest string. This would take really long, for each query.

So, the things that led me to another solution were the following realizations :-
  • First solution takes too long inserting each string
  • Insertion of the substrings of a string like "aab", will require steps to insert each of {a, aa, aab, ab, b}, i.e. 5 different insertion steps. Here, I thought if I could simply mark {a,aa} as unique substrings as well when inserting aab
  • Searching for a string in a set (when checking for uniqueness) requires comparison of almost the entire string even if a similar string has already been compared with. For instance, in a set of {abc, abe}, I am checking if "abd" exists, then I would need to carry out 6 comparisons (between characters). This wastes the earlier learnt lesson that the first two characters are 'ab'.
These things led me to think in terms of tries and consequently I related that it is possible find substrings of a string in its Suffix Tree. A suffix tree is, basically, a trie of suffixes of a string. So, if I create one suffix tree as the set of the unique substrings, I have the following advantage :-
  •  I have to insert only the m suffixes of a string of length m. This reduces the number of insertions to linear from quadratic in the previous algorithm.
  • While checking if the string to be inserted is already present in the set, I need only O(d) comparisons, where d is the length of the string to be inserted.
  • While checking for the k-th smallest string (lexicographically), I can make use of the information about the number of children of a node (not the number of leaves it leads). The number of children of a node can be related to the number of unique "substrings" (not suffixes, which is given by the number of leaves). 

So, eventually, I ended up coding up a SuffixTree and the solution I submitted can be found below (or follow this link). But again, this solution has been able to pass only one of the test cases. So, either there's some serious flaw with the approach or I've failed to code what I had in mind. 


I'd be glad to hear suggestions for improvements in the algorithm or the code.