Friday, February 22, 2008

Using LinQ to find largest group of anagrams in a given set of words

A friend of mine, Paweł R., who is an academic lecturer once gave following task to his students:

Two words are anagrams if they contain the exactly same number of the same letters. Write a program which, for given a set of words, will find a largest group of anagrams from this set. If more than one set exists with the same number of words, write the contents of each set.

A large set of words to play with this problem can be downloaded from my friend's page.

The task is not difficult at all - you just have to group words which are anagrams and then find the largest group. The easiest way to group is to split each word, sort letters ascending to get the hash key that uniquely identifies the word and then store the word in a hashtable using the key.

For example, having "leaf", "flea", "shore" and "horse", you split each word, sort letters and you get two distinct hash keys "aefl" and "ehors" and each hash group containst exactly two words. Let's call such hash a norm of a word.

It is easy to write a program in C# just few dozens of lines long but if are you a LinQ guy you would probably love to try to solve this task using LinQ.

I strongly recommend to stop reading now. Try to solve this task by yourself. You can start with sligtly simplified version: do not find all largest groups of anagrams, find just any largest group.

Are you ready:

// from the collection of words
( from slowo in File.ReadAllLines( "slowa.txt", Encoding.GetEncoding( "iso-8859-2" ) )
      group slowo by
          // group by the anagram "norm"  
          new string(
                // split the word into letters
              ( from letter in slowo.ToLower().ToCharArray()
                // and sort letters
                orderby letter
                select letter ).ToArray() )
          // so we have groups indexed by "norm"
          into anagramy
      // order this groups so the largest group is the first  
      orderby anagramy.Count() descending
      select anagramy.ToArray()
    ).First()

You probably came up with something similar to this. When I test it on the large set of words from the link given above, I get a group of seven words, just foreach the result and you will see them.


Now something more complicated - let's find all largest groups, not just the first one. To accomplish this we just group these groups of words by the size of the group and take the first groupped group of words. Sounds funny? Well, take a look:



foreach ( string[] tablicaanagramow in
    // tablica is a list of groups of words
    ( from tablica in
        // return list of groups
        ( from slowo in File.ReadAllLines(
                        "slowa.txt", Encoding.GetEncoding( "iso-8859-2" ) )
          group slowo by
              new string(
                  ( from letter in slowo.ToLower().ToCharArray()
                    orderby letter
                    select letter ).ToArray() )
              into anagramy
              // we do not have to orderby anagramy.Count() descending
              // since groups of anagrams are groupped
              select anagramy.ToArray()
        )
     // group by the group size
     group tablica by tablica.Count() into tablicatablic
     // and order by the size of any group from the group :)
     orderby tablicatablic.First().Count() descending
     select tablicatablic.ToArray()
     ).First() )
 {
        foreach ( string slowo in tablicaanagramow )
            Console.Write( slowo + "," );
        Console.WriteLine();
 }

Isn't LinQ amazing?

2 comments:

Anonymous said...

Console.Write( string.Join(", ", tablicaanagramow));

:)

Wiktor Zychla said...

true, I wrote the exact advise two months later:

http://netpl.blogspot.com/2008/04/c-code-snippets-list.html

(at the end of the entry)