Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/03/22/161982.aspx
Since ETW know-how is still considered an arcane science by most people, I get the chance to look at a lot of "easy" things which affected performance a lot but went by unnoticed until I looked into them with ETW tracing. While checking some GC related stuff to reduce memory allocations I noticed that one process had object[] allocations on the Large Object Heap. This is nothing unusual at all, but the allocation stack was interesting.
The allocations were coming from Delegate.Remove. What does it tell me when I see large object allocations in this method? Inside MulticastDelegate you can find the allocation with e.g. ILSpy:
private object[] DeleteFromInvocationList(object[] invocationList, int invocationCount, int deleteIndex, int deleteCount)
{
    object[] array = this._invocationList as object[];
    int num = array.Length;
    while (num / 2 >= invocationCount - deleteCount)   // Try to use a much smaller array (/2 or /4 or even less) if the remaining delegates can fit into it
    {
        num /= 2;
    }
    object[] array2 = new object[num];                 // Here is the delegate list allocation
    for (int i = 0; i < deleteIndex; i++)              // Copy all delegates until we get to the deletion index
    {
        array2[i] = invocationList[i];
    }
    for (int j = deleteIndex + deleteCount; j < invocationCount; j++)  // Continue filling up the array with the remaining delegates after the deletion index and deletion count
    {
        array2[j - deleteCount] = invocationList[j];
    }
    return array2;
}
This code basically tries to allocate a smaller array (half of the original size or less) if possible. Otherwise an array of the same size is allocated. The for loops then copy the remaining delegates into the new invocation array.
Ok now we know that we have allocated a new array for the delegate invocation list. What is the error here?
Nope not yet.
Still not.
I strongly suspected that we had an event handler leak here. Why? Delegate.Remove is mainly called by event handler remove methods. Since the allocated object array goes to the LOH it must be bigger than 85000 bytes. How many object references fit into an 85000 byte array under x64?
85000/8 = 10625 Delegates
We therefore know that the event which comes later in the call stack had more than 10625 subscribers. So many subscribers are usually the sign of an event handler leak, and sure enough it was one. It was a static event which was the source of a quite large memory leak that was clearly visible in some memory dumps taken later. Thanks to ETW allocation profiling I was able to see the root cause of a memory leak without using a debugger.
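To illustrate the pattern, here is a minimal, hypothetical sketch (class and handler names are made up) of a static event that accumulates subscribers which are never removed. Once the invocation list holds more than roughly 10625 delegates on x64, removing a handler forces Delegate.Remove to copy the list into a new object[] which is bigger than 85000 bytes and therefore lands on the Large Object Heap:

using System;

class LeakSource
{
    // A static event roots every subscriber for the lifetime of the process
    public static event EventHandler SomethingHappened;
}

class View
{
    public View()
    {
        LeakSource.SomethingHappened += OnSomethingHappened; // subscribed here ...
    }

    void OnSomethingHappened(object sender, EventArgs e) { }

    // ... but never unsubscribed: every View instance stays alive via the
    // static event and the invocation list keeps growing
}

class Program
{
    static void Main()
    {
        for (int i = 0; i < 20000; i++)
        {
            new View(); // 20000 handlers pile up on the invocation list
        }

        EventHandler extra = (s, e) => { };
        LeakSource.SomethingHappened += extra;
        // Delegate.Remove now copies the remaining >10625 entries into a new
        // object[] which is bigger than 85000 bytes -> LOH allocation
        LeakSource.SomethingHappened -= extra;
    }
}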
If you are not using PerfView for some reason but prefer xperf or tracelog to configure ETW tracing, you can enable .NET allocation sample profiling with
xperf -on proc_thread+loader+profile -stackwalk profile -f c:\temp\kernel.etl -start clr -on Microsoft-Windows-DotNETRuntime:0x40200095 -f c:\temp\clr.etl
xperf -stop -stop clr -d c:\temp\traces.etl
The magic value 0x40200095 comes from
C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man
Keyword | Name | Description |
0x0000000000000001 | GCKeyword | GC events |
0x0000000000000004 | FusionKeyword | Binder (Log assembly loading attempts from various locations) |
0x0000000000000008 | LoaderKeyword | Loader (Assembly Load events). Also necessary to get the module name in JITed call stacks. |
0x0000000000008000 | ExceptionKeyword | Exception |
0x0000000000200000 | GCSampledObjectAllocationHighKeyword | When enabled, limits the rate of allocation events to 100 events/s. |
0x0000000002000000 | GCSampledObjectAllocationLowKeyword | When enabled, limits the rate of allocation events to 5 events/s. |
0x0000000040000000 | StackKeyword | When enabled, a CLR stack walk is performed for every GCSampledObjectAllocation event. This is how PerfView gets its call stacks even for non-NGened x64 code on Win7 machines where the ETW stack walks would skip JITed code. |
0x0000000000000010 | JitKeyword | Needed when you want to get call stacks of an application. If the application was already started you need to force the enumeration of all running processes with an extra CLR Rundown provider which gets all required data. |
0x0000000000000080 | EndEnumerationKeyword | Needed to enumerate JITed methods upon process exit which were not recorded by the JitKeyword |
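If you want to check how such a magic keyword mask is composed, you can simply OR the values from the table together. Here is a small sketch which reproduces the 0x40200095 value used in the xperf command above:

using System;

class ClrKeywordMask
{
    // Keyword values as defined in CLR-ETW.man (see the table above)
    const ulong GCKeyword                            = 0x0000000000000001;
    const ulong FusionKeyword                        = 0x0000000000000004;
    const ulong JitKeyword                           = 0x0000000000000010;
    const ulong EndEnumerationKeyword                = 0x0000000000000080;
    const ulong GCSampledObjectAllocationHighKeyword = 0x0000000000200000;
    const ulong StackKeyword                         = 0x0000000040000000;

    static void Main()
    {
        ulong mask = GCKeyword | FusionKeyword | JitKeyword | EndEnumerationKeyword |
                     GCSampledObjectAllocationHighKeyword | StackKeyword;
        Console.WriteLine("0x{0:X8}", mask); // prints 0x40200095
    }
}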
The xperf command line above is all you need if the process starts after your ETW session is already running and terminates while the session is still running. If not, you need to force the CLR to enumerate all methods and modules which were JITed. If e.g. your process was already running and is still running when you stop profiling, you need a second session to enumerate all already JITed methods. This is called a CLR rundown and is done with another provider.
xperf -start clrrundown -on Microsoft-Windows-DotNETRuntimeRundown:0x118 -f c:\temp\rundown.etl
To merge it with the previous recording you can use
xperf -merge c:\temp\traces.etl c:\temp\rundown.etl c:\temp\tracesMerged.etl
The value 0x118 is the combination of LoaderRundownKeyword (0x8), JitRundownKeyword (0x10) and EndRundownKeyword (0x100). The other values which are responsible for NGen stuff are not needed anymore because we can generate pdbs from the native images with xperf and PerfView while stopping an ETW recording, which is the fastest approach.
Since CoreCLR is open sourced we can also get our hands on all the internals (finally!). It turns out that we are not limited to 5 (GCSampledObjectAllocationLowKeyword) or 100 (GCSampledObjectAllocationHighKeyword) allocation events/s. If we want to get them all we can set the environment variable
set COM_PLUS_UNSUPPORTED_ETW_ObjectAllocationEventsPerTypePerSec=10000
to the number of allocation events/s we want. If you set it to 10000 you get nearly all allocations at the cost of a slower system, but if you need exact numbers this can help in some situations. If you want to see the other hidden COMPLUS flags, have a look at https://github.com/dotnet/coreclr/blob/master/src/inc/clrconfigvalues.h. There are some pretty interesting tweaks inside it. E.g. you can configure the spinning behavior of CLR locks (aka Monitor) and other stuff which could come in handy to tweak or debug an application.
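If you do not want to set the variable machine-wide, a small sketch like the following starts only the target process with the variable set (YourApp.exe is a placeholder for the process you actually want to profile):

using System.Diagnostics;

class StartWithAllocEvents
{
    static void Main()
    {
        var psi = new ProcessStartInfo("YourApp.exe")
        {
            UseShellExecute = false // required so that EnvironmentVariables takes effect
        };

        // Raise the sampled allocation event rate only for this process
        psi.EnvironmentVariables["COM_PLUS_UNSUPPORTED_ETW_ObjectAllocationEventsPerTypePerSec"] = "10000";

        Process.Start(psi);
    }
}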
PerfView knows about other .NET private and stress providers, but until now there were no manifests for them. This is no longer true. Since CoreCLR is basically a clone of the desktop CLR we can take a look at ClrEtwAll.man which contains the definition of all CLR providers. The only missing thing is the resource dll to register it. I have compiled CoreCLR and put the resulting clretwrc.dll along with the manifest for Microsoft-Windows-DotNETRuntimeStress (CC2BCBBA-16B6-4cf3-8990-D74C2E8AF500) and Microsoft-Windows-DotNETRuntimePrivate (763FD754-7086-4dfe-95EB-C01A46FAF4CA) here: https://drive.google.com/folderview?id=0BxlmobpeaahAbEdvWHhQb1NCcFU&usp=sharing&tid=0BxlmobpeaahAa2dMVEMyVG9nRnc.
You need to put the resource dll into D:\privclr or edit the path in ClrEtwAll.man to point to your location of the dll to make the registration with
wevtutil im ClrEtwAll.man
successful. If you really want to look into these logs you can now do it with WPA as well. Some of the events can be interesting for troubleshooting.
The xperf calls above were only examples where -BufferSize, -MinBuffers, -MaxBuffers and -MaxFile were not supplied to keep them uncluttered. In reality you usually want to record either until the recording reaches a specific size (-MaxFile, e.g. 4 GB, which is already quite huge) or into a ring buffer. For the latter you can add -FileMode Circular, which lets you record for a very long time until something interesting happens and you stop the recording. Stopping can be triggered e.g. with performance counter triggers which execute a command when a performance counter goes above or below a threshold.
A more typical xperf command line to record into a ring buffer with 2 GB of buffer size which records
- Context Switches with call stacks
- Enables Profiling with 1ms sample rate with call stacks
- Process lifetimes
- Process memory information
is
xperf -on proc_thread+loader+meminfo+disk_io+filename+cswitch+dispatcher+profile -stackwalk profile+cswitch+readythread -f C:\kernelCircular.etl -MinBuffers 100 -BufferSize 512 -MaxBuffers 2000 -MaxFile 2000 -FileMode Circular
WPR always uses a 10% physical memory ring buffer (1.6 GB on my 16 GB box) which is allocated right from the start. The xperf trace buffers on the other hand can grow up to 2 GB in 512 KB chunks. This is better because you usually need two trace sessions (one user and one kernel session) and you do not want to allocate the maximum size, which is potentially never reached, upfront for each session.
That's it for today. Happy troubleshooting and try out your new tools. If you had good or bad experiences with these tools I would love to hear about it.