Anagrams in an Array of Strings

Question: You are given an array of strings and you are asked to display all the anagrams within the array. For those who don’t know, two words are anagrams if they contain the same characters.

Answer: The brute force solution would be to take each string and compare it to every other string in the array one character at a time ( O(n^4) solution, assuming there are n strings with maximum n characters each). This will surely give you the answer but almost certainly not the job unless you improve it. If you have no idea what the O(?) notation is, this might be a good time to read up on it (wikipedia).

Let’s use this example to improve on the solution {“abc”, “albert”, “cat”, “gate”, “cab”, “grow”, “act”}

One way we can quicken the string comparison process is by pre-sorting each string alphabetically (any sorting would be fine actually). Pre-sorting helps you avoid comparing each character in a string to each character in another string. You only need to compare every character to the character in the corresponding position of the other string. This reduces the order of the solution ( O(n^2 log n (sorting n strings with n characters each) + n^3 (n^2 comparisons of n characters each)) = O(n^3)) assuming the sorting algorithm is O(n log n)). However, this increases memory usage since you now need to store the original strings so you can print them out later.

Continue Reading Below

After sorting the characters in the string, the array would look like this {“abc”, “abelrt”, “act”, “aegt”, “abc”, “gorw”, “act”}.

Currently, we are comparing each string with every other string. That seems inefficient. How about in addition to sorting the characters in each string, we sort each string in the array alphabetically (again any other sorting will do). Sorting the strings in the array means you do not have to compare each string to every other string, you only have to compare it to the next string in line. If they happen to be the same (i.e. found an anagram), then you can compare with the one after that. Whichever the case is, the order of comparison in this case will be of the order O(n) instead of O(n^2). This sorting will reduce the order to ( O(n^2 log n (sorting characters) + n^2 log n (sorting strings) + n^2 (n comparisons of n characters each)) = O(n^2 log n) assuming the sorting algorithm is O(n log n)). Make sure you also move the strings from the original array in order to maintain the mapping of the sorted array to the original array.

After sorting the strings, the array would look like {“abc”, “abc”, “abelrt”, “act”, “act”, “aegt”, “gorw”}.

There is an alternate solution using hashing. You can first sort the characters in the string ( O(n log n) ) and then put each string in a hash table ( O(n) ). If the spot in the hash table is already taken, then that is an anagram. The order for this solution is ( O(n^2 log n (sorting n strings with n characters each) + n^2 (hashing n strings) + n (n insertions into hash table) = O(n^2 log n) assuming hashing each string is O(n) ).

As you can see, both the solutions have the same order of execution. Do you prefer one over the other? Why? Post as a comment.

As usual, if there is something wrong with this solution, feel free to ping me. Any improvements are always welcome. Cheers!

9 Responses

  1. Tao says:

    How about using trie?
    Steps:
    1, sort all the words in alphabetical order;
    2. build a trie with all the words, during this step, if one leaf node already exists, then an anagram is found.

  2. Timothy L. says:

    There can be n^2 solution.

    The first observation is that two words are anagram iff their character frequencies are the same.

    Then we can just generate n hash tables each representing the character frequency/count of each word. This process takes n^2 time, assuming the hash table access takes O(1) time (since it is hashing a character, not a string).

    Then sort the n hash tables. Each comparison of two hash tables takes a constant time, since the number of different alphabet is a constant number. This process takes n*log(n)*c = O(nlogn).

    Last compare the hash tables in the linear way. Again, each comparison takes a constant time. This process is O(n).

    In all, the running time is O(n^2)+O(nlogn)+O(n) = O(n^2)

  3. Nick says:

    To Kishore – very neat idea, but impractical:

    The product of the first 16 prime numbers is > 64bit uint. Thus, using a bitset would be less memory and computationally intensive than multiplying prime numbers. And a bitset would be O(n) just for the bitset for each string.

  4. i suggest somewhat different approach:
    if these two conditions met then two string are anagram of each other:
    1. both have equal length
    2. their sum of characters in terms of integer are equal like ‘cat’ and ‘act’ would
    like
    sum_act = int(a) + int(c) + int(t);
    sum_cat = int(c) + int(a) + int(t);

    so sum_act == sum_cat

  5. Kishore says:

    Another alternative is to come up with hashing function which is independent of position of character. one simple solution is represent each character by a prime number and hash will be product of the numbers. This way there is no need to sort the strings.

  6. <<>>

    There is one problem with my solution above – are “abc” and “accb” anagrams? if it is, then it’s not working, we need to remove all duplicates before the hash, which can be done in O(n), and does not change the order of execution. E.g., for each string, use a temp array in the size of all possible characters, run over all letters in the string, write “true” in the array for each char you find, and then create a new temp string from all the characters in the array. Same as counting sort but without the counting :)
    This temp string should be hashed and continue with the solution above.

    LJ

  7. Hi!

    Two comments regarding the hash solution:

    1. If 2 strings are hashed into the same spot, it might be an anagram but not surely, because 2 different strings can have the same hash value

    2. Improvement to the hash solution: don’t sort! just run a different hash function – if we use a standard hash on “abc” and “acb” then we get 2 different values, which is not good for us. But, if we use, let’s say, the sum of all characters as the hash function, then both “abc” and “acb” have the same hash value and will hit the same spot. Therefore the order of execution of this solution is O(n^2).

    Obviously, I like the hash solution better.

    LJ

  8. Well , the first solution is definitely better if you’re concerned with deep optimization. This is because collisions can happen. Then collision resolution will take some extra steps when using probing for resolution. if using chaining (linked list solution) if a poor hash function is used, the resolution alone would take O(N) which makes the hashing effectively O(N^2). Even if the hash function is good and only a few collisions take place, we would be faced by the problem of locality of reference which causes many cache misses and causing slower time.

  9. Nate says:

    I prefer the second, because I can already picture implementing it with a single fold.

Leave a Reply

Using Gravatars in the comments - get your own and be recognized!

XHTML: These are some of the tags you can use: <a href=""> <b> <blockquote> <code> <em> <i> <strike> <strong>