Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/12/21/160740.aspx
We all make dumb mistakes which are not interesting to share. But sometimes people make interesting mistakes. Below is a simple helper class that makes it very easy to combine several delegates with a logical AND.
static class Constraints<T>
{
    /// <summary>
    /// Logical AND of a list of predicates.
    /// </summary>
    public static Predicate<T> And(params Predicate<T>[] ands)
    {
        // Return a new delegate which checks whether all delegates in the array
        // returned true for a given item.
        return item => ands.All(x => x(item));
    }
}
You can then use it like this:
class Program
{
    static void Main(string[] args)
    {
        int[] data = new int[20 * 1000 * 1000]; // 20 million ints
        Predicate<int> CheckForZero = Constraints<int>.And(i => i > -1, i => i < 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < data.Length; i++)
        {
            if (CheckForZero(data[i]) == false)
            {
                Console.WriteLine("Violation found at {0}", i);
            }
        }
        sw.Stop();
        Console.WriteLine("Did take: {0:F2}s", sw.Elapsed.TotalSeconds);
    }
}
Now you have a nice, reusable way to combine separate checks. The only issue is that with many calls to CheckForZero the running time is quite high.
Did take: 0,90s
If we compare this to the "normal" way:
for (int i = 0; i < data.Length; i++)
{
    if (!(data[i] > -1 && data[i] < 1))
    {
        Console.WriteLine("Violation found at {0}", i);
    }
}
Did take: 0,01s
Whoa. The delegate version is 90 times slower! I found this issue by looking at the Mount Everests of CPU consumption. What is causing so much overhead?
Well, you just woke up a sleeping dragon with your LINQ delegate version: the Garbage Collector! Here is the allocation report of our slow sample:
We have an 80 MB int array with 20 million elements, but we allocate 2.5 GB of closure classes! Our Constraints.And method needs one closure to capture the list of delegates.
That closure is only needed once, since we cache the returned delegate, so it will not show up in the profiler at all.
Things get interesting when we invoke the combined delegate returned by And. On each call we allocate one additional closure class that gives the inner lambda access to the current item to check (the enclosing closure class that gives us access to the ands array was already allocated by And). On top of that we need to create a new Func<Predicate<T>, bool> delegate instance as argument to the All LINQ call.
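Roughly, the compiler lowers the And method into something like the sketch below. All class and method names here are made up for illustration; the real types are compiler generated with mangled names such as <>c__DisplayClass4.

using System;
using System.Linq;

// Illustrative sketch of the state the compiler generates for
//     return item => ands.All(x => x(item));

// Allocated once per call to And: holds the captured ands array.
sealed class AndClosure<T>
{
    public Predicate<T>[] ands;

    // The outer lambda: item => ands.All(x => x(item))
    public bool CombinedPredicate(T item)
    {
        // Allocated on EVERY invocation of the combined predicate:
        var itemClosure = new ItemClosure<T> { item = item };     // closure class
        Func<Predicate<T>, bool> check = itemClosure.InnerLambda; // new delegate instance
        // All() additionally asks the array for an IEnumerator<Predicate<T>>,
        // which allocates the SZGenericArrayEnumerator helper.
        return ands.All(check);
    }
}

// Holds the captured current item for the inner lambda x => x(item).
sealed class ItemClosure<T>
{
    public T item;

    public bool InnerLambda(Predicate<T> x)
    {
        return x(item);
    }
}

Every call of the combined predicate therefore allocates the item closure, the delegate and the enumerator, which matches the list below.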
To sum it up, on each iteration we allocate
- 1 <>c__DisplayClass4 closure instance: 32 bytes
- 1 Func<Predicate<T>, bool> delegate instance: 64 bytes
- 1 SZArrayHelper+SZGenericArrayEnumerator: 32 bytes
- Where does the enumerator come from? It is caused by the All LINQ call, which iterates over an IEnumerable. When you foreach over an IEnumerable you need to get an enumerator from the object you enumerate. Since an array has none out of the box, the CLR allocates a helper enumerator, which is where this strange SZArrayHelper class comes from. If you foreach directly over an array this helper is NOT used; the compiler generates more efficient index-based code instead (see the small example after this list).
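To make the difference concrete, here is a minimal example (my own, not from the original post) of iterating the same array directly versus through its IEnumerable<int> interface, which is the route LINQ methods such as All take internally:

using System;
using System.Collections.Generic;

class EnumeratorDemo
{
    static void Main()
    {
        int[] numbers = { 1, 2, 3 };

        // Direct foreach over the array: the compiler emits an index-based loop,
        // no enumerator object is allocated.
        foreach (int n in numbers)
        {
            Console.WriteLine(n);
        }

        // foreach over the same array through IEnumerable<int>: GetEnumerator()
        // goes through SZArrayHelper and returns the SZGenericArrayEnumerator
        // helper, i.e. one small heap allocation per loop.
        IEnumerable<int> asEnumerable = numbers;
        foreach (int n in asEnumerable)
        {
            Console.WriteLine(n);
        }
    }
}

Since Enumerable.All always takes the IEnumerable<T> route, it pays the enumerator allocation on every call of the combined predicate.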
In total we allocate 128 bytes for each integer we want to check (4 bytes), which is quite some overhead: 128 * 20,000,000 = 2,560,000,000, which is exactly what the profiler reports. Since I used PerfView's ETW .NET Alloc method, I get a correct total sum of the allocated bytes back, but the types reported as responsible for the allocations, and especially the counts, can be misleading because the numbers are sampled. If you want exact allocation numbers you need to check the .NET Alloc checkbox instead, which injects a profiling dll into all newly started processes so you can get exact allocation reports. The downside is that it records every single allocation, so the use case no longer runs in 1 s but takes more than 2 minutes, which is not something I really like to do. Unfortunately this mode does not seem to work with the old and the current PerfView (1.7) version, where I see anything but my own allocations.
When in doubt I would always use ETW .NET Alloc to get reliable, although sampled, numbers. This also works for JITed x64 code on Windows 7 machines, because for each allocation event a managed stack walk is performed which gives you the full allocation stack. Regular ETW stack walking does not work for JITed x64 code on Win7, except if you precompile the code with NGen as a workaround. On Win 8 and later this is no longer an issue, except if you try to use API hooking: there the x64 stack walker still stops walking the stack when the method interception is done in a form it cannot recognize.
This sounds like some complex stuff. Can we fix it? How about adding an overload to our class?
public static Predicate<T> And(Predicate<T> p1, Predicate<T> p2)
{
    return item => p1(item) && p2(item);
}
This little overload spares us pretty much all of the overhead. We do not allocate a single byte while iterating over our array. This time we get:
Did take: 0,14s
That is over six times faster, which is quite good for such a simple overload; the compiler prefers it over the params version whenever it is the best match. Now we only pay for three delegate invocations, which are unavoidable if we want to keep the flexibility here.
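If you regularly need to combine three checks you could extend the pattern in the same spirit. The following overload is my own sketch, not part of the original class, and would go into Constraints<T> next to the others:

// Hypothetical third overload in the same style as the two-predicate one above.
public static Predicate<T> And(Predicate<T> p1, Predicate<T> p2, Predicate<T> p3)
{
    return item => p1(item) && p2(item) && p3(item);
}

With two or three lambdas the compiler then binds to these allocation-free overloads; only with more predicates does overload resolution fall back to the params Predicate<T>[] version that uses LINQ.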
Now go and check your LINQ statements for hidden dragons. Allocating 2.5 GB with a single line is quite easy to achieve with delegates and LINQ. These subtle delegate issues are the main reason why LINQ is banned from the Roslyn project. It is kind of ironic that the inventors of LINQ do not want to use their own invention in their C# compiler backend for performance reasons.