Update

The mistake was in counting the number of children. I have now passed all but two test cases, which probably require that I implement the Suffix Tree using lists instead of arrays.

----------------------------------------------------------------------------------------------------------------------------

So I have been trying to solve problems under the "Strings" section in InterviewStreet. Find Strings is one of those questions and I am here to discuss a possible strategy to solve the problem.~~I'd like to make it clear from the beginning that my own solution has not been accepted yet, but I am unable to see the flaw in the approach and am expecting someone to point it out.~~

So, the question requires us to find the number of unique substrings in a given set of strings. At first I thought as the question unfolds and since I use C++ for most problems, I was thinking in terms of the available data structures in STL. So, at first I thought of storing the unique substrings (of the order of the square of the length of the string) in a 'set'. The set takes care of the uniqueness, but the problem arises if I try to find the 'k-th' smallest string. This would take really long, for each query.

So, the things that led me to another solution were the following realizations :-

So, eventually, I ended up coding up a SuffixTree and the solution I submitted can be found below (or follow this link).~~But again, this solution has been able to pass only one of the test cases. So, either there's some serious flaw with the approach or I've failed to code what I had in mind. ~~

I'd be glad to hear suggestions for improvements in the algorithm or the code.

The mistake was in counting the number of children. I have now passed all but two test cases, which probably require that I implement the Suffix Tree using lists instead of arrays.

----------------------------------------------------------------------------------------------------------------------------

So I have been trying to solve problems under the "Strings" section in InterviewStreet. Find Strings is one of those questions and I am here to discuss a possible strategy to solve the problem.

So, the question requires us to find the number of unique substrings in a given set of strings. At first I thought as the question unfolds and since I use C++ for most problems, I was thinking in terms of the available data structures in STL. So, at first I thought of storing the unique substrings (of the order of the square of the length of the string) in a 'set'. The set takes care of the uniqueness, but the problem arises if I try to find the 'k-th' smallest string. This would take really long, for each query.

So, the things that led me to another solution were the following realizations :-

- First solution takes too long inserting each string
- Insertion of the substrings of a string like "aab", will require steps to insert each of {a, aa, aab, ab, b}, i.e. 5 different insertion steps. Here, I thought if I could simply mark {a,aa} as unique substrings as well when inserting aab
- Searching for a string in a set (when checking for uniqueness) requires comparison of almost the entire string even if a similar string has already been compared with. For instance, in a set of {abc, abe}, I am checking if "abd" exists, then I would need to carry out 6 comparisons (between characters). This wastes the earlier learnt lesson that the first two characters are 'ab'.

- I have to insert only the m suffixes of a string of length
*m*. This reduces the number of insertions to linear from quadratic in the previous algorithm. - While checking if the string to be inserted is already present in the set, I need only O(d) comparisons, where
*d*is the length of the string to be inserted. - While checking for the k-th smallest string (lexicographically), I can make use of the information about the number of children of a node (not the number of leaves it leads). The number of children of a node can be related to the number of unique "substrings" (not suffixes, which is given by the number of leaves).

So, eventually, I ended up coding up a SuffixTree and the solution I submitted can be found below (or follow this link).

## No comments:

## Post a Comment