Monday, December 10, 2012

[Quora Challenges] Typeahead Search - I

This is the first post amongst a few that would follow in an attempt to solve the Quora challenge Typeahead search. This problem particularly interests me. Why? Because during my internship at (Directi), I worked on a similar problem. Having spent an entire summer working on an equivalent of Google's "instant search", I realized how complex it is. "Bigger dataset, lesser query time", this was all that was on my mind for most of my summer.

Having done a lot of research earlier, I referred back to the following to see if I can apply any ideas directly
  1. A very fast approach to auto complete (or search suggestions)
  2. Efficient auto-complete with a ternary search tree
  3. Trie examples 
  4. What is the best autocomplete/suggest algorithm,datastructure 
As is evident, most of these point towards the use of a trie to solve the problem. Considering that I developed a fascination towards Tries during the internship, I might even end up using the same structure here. However, I'm considering exploring other methods that might suit this problem better. For instance, tries are exceptionally good to reduce the time taken to search for a word, it requires a lot of space. Also, since there would be at most 40k insertions, other methods such as the 4th Approach here. I also need to check how much memory usage is allowed to solve this problem and how much pre-processing time would be acceptable.

[More discussion in follow-up posts]

Friday, October 26, 2012

[InterviewStreet Problem] Find Strings

The mistake was in counting the number of children. I have now passed all but two test cases, which probably require that I implement the Suffix Tree using lists instead of arrays.

So I have been trying to solve problems under the "Strings" section in InterviewStreet. Find Strings is one of those questions and I am here to discuss a possible strategy to solve the problem. I'd like to make it clear from the beginning that my own solution has not been accepted yet, but I am unable to see the flaw in the approach and am expecting someone to point it out.

So, the question requires us to find the number of unique substrings in a given set of strings. At first I thought as the question unfolds and since I use C++ for most problems, I was thinking in terms of the available data structures in STL. So, at first I thought of storing the unique substrings (of the order of the square of the length of the string) in a 'set'. The set takes care of the uniqueness, but the problem arises if I try to find the 'k-th' smallest string. This would take really long, for each query.

So, the things that led me to another solution were the following realizations :-
  • First solution takes too long inserting each string
  • Insertion of the substrings of a string like "aab", will require steps to insert each of {a, aa, aab, ab, b}, i.e. 5 different insertion steps. Here, I thought if I could simply mark {a,aa} as unique substrings as well when inserting aab
  • Searching for a string in a set (when checking for uniqueness) requires comparison of almost the entire string even if a similar string has already been compared with. For instance, in a set of {abc, abe}, I am checking if "abd" exists, then I would need to carry out 6 comparisons (between characters). This wastes the earlier learnt lesson that the first two characters are 'ab'.
These things led me to think in terms of tries and consequently I related that it is possible find substrings of a string in its Suffix Tree. A suffix tree is, basically, a trie of suffixes of a string. So, if I create one suffix tree as the set of the unique substrings, I have the following advantage :-
  •  I have to insert only the m suffixes of a string of length m. This reduces the number of insertions to linear from quadratic in the previous algorithm.
  • While checking if the string to be inserted is already present in the set, I need only O(d) comparisons, where d is the length of the string to be inserted.
  • While checking for the k-th smallest string (lexicographically), I can make use of the information about the number of children of a node (not the number of leaves it leads). The number of children of a node can be related to the number of unique "substrings" (not suffixes, which is given by the number of leaves). 

So, eventually, I ended up coding up a SuffixTree and the solution I submitted can be found below (or follow this link). But again, this solution has been able to pass only one of the test cases. So, either there's some serious flaw with the approach or I've failed to code what I had in mind. 

I'd be glad to hear suggestions for improvements in the algorithm or the code.

Tuesday, July 3, 2012

Converting a file descriptor to an I/O Stream in Linux [C++]

Recently, if one can call something that happened over 2 months ago 'recent', I decided to do a Networks assignment entirely in C++ and using streams. Why, you ask ? Well, the simplest answer is, "I was bored." I had done another network assignment in C, through which I had seen quite a lot of functions that are required to connect, send/receive messages, etc. So, this was something like a challenge.

And that's exactly what it turned out to be, a challenge. Four hours into the assignment, which was obviously just 2 days before the submission, I was struggling to understand how could I convert all the file descriptors that were being returned from various C based functions into C++ streams.

As it turns out, due to difference in uses and implications of file descriptors in different operating systems, there was no standard function to convert a File Descriptor to a stream.  That was what I learnt after hours of scouring over StackOverflow and running into answers that explained "How to get the underlying FILE* from fstream?" until I finally found out a solution that works on G++.

This may definitely not be the best approach, but a hack.

I used it here and built upon it. Besides the problem of handling Unicode characters (which I understand was because I was using string and not wstring ), I did not face any problem with streams; so I would assume that the solution worked well.

If you are aware of a smarter, quicker and cleaner method of doing this, please share it in comments.

Monday, January 9, 2012

Alias a URL on Apache

Many a times I have needed to share a page using Apache, without copying the page into the '/var/www' folder. Discussed below is one of the solutions that I often use.

Lets say you have a folder '/home/user/images' that you would like to share over Apache but you might not want to put it over into '/var/www' maybe because you have other scripts performing on this folder or taking backup from time to time.

In such a scenario, Apache's mod_alias comes to our rescue. This module allows us to provide an alias URL for the required directory. To achieve the same, insert the following lines in the '/etc/apache2/httpd.conf'

Alias /image /ftp/pub/image

<Directory /ftp/pub/image>
  Order allow,deny
  Allow from all
Care must be taken regarding the trailing slash ('/') in the directory's name. Among the options available, I suggest not using the trailing slash in any path (neither the alias, nor the actual path).

This is not the end to the process though. One important step is to ensure that the path is 'executable' by other users as well. Each of the parent directory should be also be executable by other users.

Use 'chmod 711' on each of the parent directories.

Voila, now you are done. Just restart apache and visit the URL

Friday, January 6, 2012

Chrome becomes the most popular browser in India

TechRaga reported in September, "Chrome Beats Internet Explorer And Becomes #2 Browser in India". The latest statistics from StatCounter show that Chrome is now the most popular browser in India, closely followed by Firefox.

Source: StatCounter Global Stats - Browser Market Share

The above image depicts the share of browsers in the last 10 weeks, and it clearly shows that Firefox and Chrome are fighting for the top spot.

Despite the 3rd spot in India, IE enjoys the first spot worldwide, while Chrome stands at the 2nd spot after recently overtaking Firefox globally.

Source: StatCounter Global Stats - Browser Market Share

According to StatCounter, among various browser versions Chrome 15.0 has already taken over the global market followed by IE 8.0.

Source: StatCounter Global Stats - Browser Version Market Share

These browser wars remind us of the IE-Netscape war which was won by IE, but it looks like Open Source Browsers are becoming more popular, even among ordinary users.