Performance of Finding Duplicates in LINQ

Today I saw about a dozen lines of code in a project that I was working on simply dedicated to determining whether or not there are duplicates within a stringlist (List). This made me start to question whether or not there was a better way of doing this, to which the little light bulb in my head went off and directed me towards LINQ. A quick search on the web for “Finding duplicates with LINQ” landed me at this chunk of code:

myList.GroupBy(i => i).Where(g => g.Count() > 1).Select(g => g.Key).ToList()

This works beautifully and returns the non-unique items in myList. This not only reduces the line count in applications needing to do similar operations the harder way, but to me it is much more readable. These days, with the complex applications that are getting built, readability is a major importance in code.

However, even though this is the new .kewl. way for me to identify non-unique elements in lists, I still question whether or not it is the .efficient. way to do it. The first thing I looked into is what the code looks like after it is compiled. This is what I found:

private static List<string> LinqNonUnique(List<string> items)
{
	return items.GroupBy<string, string>(delegate (string i) {
		return i;
	}).Where<IGrouping<string, string>>(delegate (IGrouping<string, string> g) {
		return (g.Count<string>() > 1);
	}).Select<IGrouping<string, string>, string>(delegate (IGrouping<string, string> g) {
		return g.Key;
	}).ToList<string>();
}

This makes sense given that what I know about LINQ is that the expressions mainly get turned into inline delegates. So, what does the logic within LINQ.s .GroupBy and .Where look like . back to reflector I go.

// GroupBy() Method
public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
	return new GroupedEnumerable<TSource, TKey, TSource>(source, keySelector, IdentityFunction<TSource>.Instance, null);
}

// GroupedEnumerable Class Definition
internal class GroupedEnumerable<TSource, TKey, TElement> : IEnumerable<IGrouping<TKey, TElement>>, IEnumerable
{
	// Fields
	private IEqualityComparer<TKey> comparer;
	private Func<TSource, TElement> elementSelector;
	private Func<TSource, TKey> keySelector;
	private IEnumerable<TSource> source;

	// Methods
	public GroupedEnumerable(IEnumerable<TSource> source, Func<TSource, TKey> keySelector, Func<TSource, TElement> elementSelector, IEqualityComparer<TKey> comparer);
	public IEnumerator<IGrouping<TKey, TElement>> GetEnumerator();
	IEnumerator IEnumerable.GetEnumerator();
}

// GroupedEnumerable CTOR
public GroupedEnumerable(IEnumerable<TSource> source, Func<TSource, TKey> keySelector, Func<TSource, TElement> elementSelector, IEqualityComparer<TKey> comparer)
{
	if (source == null)
	{
		throw Error.ArgumentNull("source");
	}
	if (keySelector == null)
	{
		throw Error.ArgumentNull("keySelector");
	}
	if (elementSelector == null)
	{
		throw Error.ArgumentNull("elementSelector");
	}
	this.source = source;
	this.keySelector = keySelector;
	this.elementSelector = elementSelector;
	this.comparer = comparer;
}

// GroupedEnumerable Enumerator
public IEnumerator<IGrouping<TKey, TElement>> GetEnumerator()
{
	return Lookup<TKey, TElement>.Create<TSource>(this.source, this.keySelector, this.elementSelector, this.comparer).GetEnumerator();
}

So, it seems that the actual groupping operations are getting handed off to the enumerator to do the work with a comparison. This seems pretty good but I.m not really sold yet. I want to know just how much it power it takes to perform the .GroupBy and .Where operations.

private static List<string> LinqNonUnique(List<string> items)
{
	return items.GroupBy(i => i).Where(g => g.Count() > 1).Select(g => g.Key).ToList();
}

private static List<string> ForeachNonUnique(List<string> items)
{
	List<string> retItems = new List<string>();

	for (int i = 0; i < items.Count; i++)
	{
		if (items.LastIndexOf(items[i]) != i)
		{
			retItems.Add(items[i]);
		}
	}

	return retItems;
}

I profiled the above two methods and found that using the foreach is about twice as slow as the linq method. I.m sure there are better ways of performing a uniqueness operation but I am basing this on what I most commonly see in code.

 

All-in-all I feel more confident in LINQ after investigating this.

 

Leave a Reply

Your email address will not be published.

Humanity Verification *Captcha loading...