Wednesday, 29 March 2006

Interview Programming Problems

Another Saturday done, another interviewing round finished. Thought I would put down into words what I look for when reviewing the programming problems done by candidates. I don't really care if candidates know what I look for - if they can do it in an interview, they can do it in their daily work. Especially so when code reviewers are likely to be at least as watchful as I am in an interview.

As an illustration, I'm using a problem we previously used in our written tests. We replaced it recently because everyone answered it in almost the same way, making it useless as a differentiator between candidates. The problem is as follows:

Implement a function "intersection". The function takes two ASCII strings and returns an ASCII string containing only those characters, that are simultaneously present in both arguments. The result should be as short as possible. For example:
intersection("abde", "bexy") may return "be" or "eb"
intersection("exoweb", "candidate") should return "e"

Almost all solutions (that work) are variants of the same form. Below was the minimum acceptable code to pass first round screening in Exoweb, in a prettied up python format:

`def intersection(a, b):   intersections = ""   for char1 in a:       for char2 in b:           if char1 == char2:               intersections = intersections + char1   return intersections`

(trivia - usage of the "in" keyword is up to 3 times faster than a str.find() call on my laptop)

(trivia #2 - something like 50% of candidates who make it to the written test are unable to write even the above snippet. After this test, 90% of all candidates who have submitted their resume have been eliminated)

There are two problems with the code above, one pretty obvious (not reading the requirements) and the other a lot less so (performance problem). The first is repeated characters in the return statement and the second is an algorithm that does not run in linear time.

The first problem is easy. Given "aaabbb" and "bbbccc", the algorithm above returns "bbbbbbbbb". The problem specification says "The result should be as short as possible." Failure to read the spec or forgetting to check for this is bad, but not fatal as long as one spots this quickly.

The second problem is one that less than 1% of the candidates manage to avoid - algorithmic complexity. If strings a and b were of length n, the double for loops in the algorithm result in a O(n^2) algorithm. For every character added to the length of string n, the computer can end up doing up to n+1 times more computations. This quickly becomes impossible.

On my laptop, with a data set tweaked for the worst case scenario, I get the following execution times:

`(n=1,000) 0.00372 seconds(n=10,000) 0.29497 seconds(n=100,000) 30.20992 seconds`

For every time I increase n by an order of magnitude (*10), the execution time increases by roughly two orders of magnitude (*100). Following this progression, a value of n=1,000,000 would take around 3,000 seconds or 50 minutes!

This problem is relatively easily solved and there are multiple solutions. For those languages without rich libraries, one solution is to build a 128 char length array (the problem specifies ASCII, which is only 128 values) and to run through each string once, putting a value into the array to specify that the character was found. Once complete, it's a matter of scanning all 128 elements to see what was found in both strings. All these operations are in linear time. This has also the advantage of ensuring that the returned result has no duplicates.

For languages with richer libraries or built ins, you can also use hashed containers or even set data types. We disallow using Sets in python because it would simply be too trivial. In Python 2.4, Set data types are built in and the code would look like this:

`def intersection(a, b):   return ''.join(set(a).intersection(set(b)))`

Sets in Java aren't quite so feature rich, lacking the intersection() method, so we still allow it in Java. A non-set method in Python, using just the standard built-in data types might look like this:

`def intersection(a, b):   char_seen = {}   intersections = {}   for char in a:       char_seen[char] = True   for char in b:       if char_seen.has_key(char)           intersections[char] = True   return ''.join(intersections.keys())`

With a n=1,000,000 string size, this takes 1.7 seconds on my laptop, much faster than the expected 50 minutes required by the inefficient, O(n^2) algorithm. With the n = 100,000 string size, the algorithm takes 0.18 seconds, the expected roughly linear decrease in time.

The algorithm above can certainly be optimized further, for different focus areas. Using two dictionaries does waste a bit of memory, and there are probably faster ways of doing this. There are probably readability tweaks too.

In our interviews, it does not matter if the code has flaws on the first try (it must work though), as long as the interviewee can understand the problem when pointed out and fix them. No one is perfect and mistakes are to be expected. We just try to minimize them and fix them as soon as possible.