Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/03/22/161982.aspx
Since ETW know-how is still considered an arcane science by most people, I get the chance to look at a lot of "easy" things which affected performance a lot but went by unnoticed until I looked into them with ETW tracing. While checking some GC related stuff to reduce memory allocations I noticed that one process had object[] allocations on the Large Object Heap. This is nothing unusual at all, but the allocation stack was interesting.
The allocations were coming from Delegate.Remove. What does it tell me when I see large object allocations in this method? Inside MulticastDelegate you can find the allocation with e.g. ILSpy:
private object[] DeleteFromInvocationList(object[] invocationList, int invocationCount, int deleteIndex, int deleteCount)
{
    object[] array = this._invocationList as object[];
    int num = array.Length;
    while (num / 2 >= invocationCount - deleteCount)   // Try to use a much smaller array (/2 or /4 or even less) if the remaining delegates can fit into it
    {
        num /= 2;
    }
    object[] array2 = new object[num];                 // Here is the delegate list allocation
    for (int i = 0; i < deleteIndex; i++)              // Copy all delegates until we get to the deletion index
    {
        array2[i] = invocationList[i];
    }
    for (int j = deleteIndex + deleteCount; j < invocationCount; j++)  // Continue filling up the array with the remaining delegates after the deletion index and deletion count
    {
        array2[j - deleteCount] = invocationList[j];
    }
    return array2;
}
This code basically tries to allocate a smaller array (half of the original size or less) if possible. Otherwise an array of the same size is allocated. The for loops then copy the remaining delegates into the new invocation array.
Ok now we know that we have allocated a new array for the delegate invocation list. What is the error here?
Nope not yet.
Still not.
I strongly suspected that we had an event handler leak here. Why? Delegate.Remove is mainly called by event handler remove methods. Since the allocated object array goes to the LOH it must be bigger than 85000 bytes. How many object references fit into an 85000 byte array under x64?
85000/8 = 10625 Delegates
We therefore know that the event which comes later in the call stack had more than 10625 subscribers. So many subscribers are usually the sign of an event handler leak, and sure enough it was one. It was a static event which was the source of a quite large memory leak that was clearly visible in some memory dumps taken later. Thanks to ETW allocation profiling I was able to see the root cause of a memory leak without using a debugger.
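To illustrate the pattern, here is a minimal, hypothetical sketch (class and handler names are made up) of a static event that accumulates subscribers which are never removed. Once the invocation list holds more than roughly 10625 delegates on x64, removing a handler forces Delegate.Remove to copy the list into a new object[] which is bigger than 85000 bytes and therefore lands on the Large Object Heap:

using System;

class LeakSource
{
    // A static event roots every subscriber for the lifetime of the process
    public static event EventHandler SomethingHappened;
}

class View
{
    public View()
    {
        LeakSource.SomethingHappened += OnSomethingHappened; // subscribed here ...
    }

    void OnSomethingHappened(object sender, EventArgs e) { }

    // ... but never unsubscribed: every View instance stays alive via the
    // static event and the invocation list keeps growing
}

class Program
{
    static void Main()
    {
        for (int i = 0; i < 20000; i++)
        {
            new View(); // 20000 handlers pile up on the invocation list
        }

        EventHandler extra = (s, e) => { };
        LeakSource.SomethingHappened += extra;
        // Delegate.Remove now copies the remaining >10625 entries into a new
        // object[] which is bigger than 85000 bytes -> LOH allocation
        LeakSource.SomethingHappened -= extra;
    }
}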
If you are not using PerfView for some reason but prefer xperf or tracelog to configure ETW tracing, you can enable .NET allocation sample profiling with
xperf -on proc_thread+loader+profile -stackwalk profile -f c:\temp\kernel.etl -start clr -on Microsoft-Windows-DotNETRuntime:0x40200095 -f c:\temp\clr.etl
xperf -stop -stop clr -d c:\temp\traces.etl
The magic value 0x40200095 comes from
C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man
Keyword | Name | Description |
0x0000000000000001 | GCKeyword | GC events |
0x0000000000000004 | FusionKeyword | Binder (Log assembly loading attempts from various locations) |
0x0000000000000008 | LoaderKeyword | Loader (Assembly Load events). Also necessary to get the module name in JITed call stacks. |
0x0000000000008000 | ExceptionKeyword | Exception |
0x0000000000200000 | GCSampledObjectAllocationHighKeyword | When enabled, limits the rate of allocation events to 100 events/s. |
0x0000000002000000 | GCSampledObjectAllocationLowKeyword | When enabled, limits the rate of allocation events to 5 events/s. |
0x0000000040000000 | StackKeyword | When enabled, a CLR stack walk is performed for every GCSampledObjectAllocation event. This is how PerfView gets its call stacks even for non-NGened x64 code on Win7 machines where the ETW stack walks would skip JITed code. |
0x0000000000000010 | JitKeyword | Needed when you want to get call stacks of an application. If the application was already started you need to force the enumeration of all running processes with an extra CLR Rundown provider which gets all required data. |
0x0000000000000080 | EndEnumerationKeyword | Needed to enumerate JITed methods upon process exit which were not recorded by the JitKeyword |
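If you want to check how such a magic keyword mask is composed, you can simply OR the values from the table together. Here is a small sketch which reproduces the 0x40200095 value used in the xperf command above:

using System;

class ClrKeywordMask
{
    // Keyword values as defined in CLR-ETW.man (see the table above)
    const ulong GCKeyword                            = 0x0000000000000001;
    const ulong FusionKeyword                        = 0x0000000000000004;
    const ulong JitKeyword                           = 0x0000000000000010;
    const ulong EndEnumerationKeyword                = 0x0000000000000080;
    const ulong GCSampledObjectAllocationHighKeyword = 0x0000000000200000;
    const ulong StackKeyword                         = 0x0000000040000000;

    static void Main()
    {
        ulong mask = GCKeyword | FusionKeyword | JitKeyword | EndEnumerationKeyword |
                     GCSampledObjectAllocationHighKeyword | StackKeyword;
        Console.WriteLine("0x{0:X8}", mask); // prints 0x40200095
    }
}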
The xperf command line above is all you need if the process starts after your ETW session is already running and terminates while the session is still running. If not, you need to force the CLR to enumerate all methods and modules which were JITed. If e.g. your process was already running and is still running when you stop profiling, you need a second session to enumerate all already JITed methods. This is called a CLR rundown and is done with another provider.
xperf -start clrrundown -on Microsoft-Windows-DotNETRuntimeRundown:0x118 -f c:\temp\rundown.etl
To merge it with the previous recording you can use
xperf -merge c:\temp\traces.etl c:\temp\rundown.etl c:\temp\tracesMerged.etl
The value 0x118 is the combination of LoaderRundownKeyword (0x8), JitRundownKeyword (0x10) and EndRundownKeyword (0x100). The other values which are responsible for NGen stuff are not needed anymore because we can generate pdbs from the native images with xperf and PerfView while stopping an ETW recording, which is the fastest approach.
Since CoreCLR is open sourced we can also get our hands on all the internals (finally!). It turns out that we are not limited to 5 (GCSampledObjectAllocationLowKeyword) or 100 (GCSampledObjectAllocationHighKeyword) allocation events/s. If we want to get them all we can set the environment variable
set COM_PLUS_UNSUPPORTED_ETW_ObjectAllocationEventsPerTypePerSec=10000
to the number of allocation events/s we want. If you set it to 10000 you get nearly all allocations at the cost of a slower system, but if you need exact numbers this can help in some situations. If you want to see the other hidden COMPLUS flags, have a look at https://github.com/dotnet/coreclr/blob/master/src/inc/clrconfigvalues.h. There are some pretty interesting tweaks inside it. E.g. you can configure the spinning behavior of CLR locks (aka Monitor) and other stuff which could come in handy to tweak or debug an application.
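If you do not want to set the variable machine-wide, a small sketch like the following starts only the target process with the variable set (YourApp.exe is a placeholder for the process you actually want to profile):

using System.Diagnostics;

class StartWithAllocEvents
{
    static void Main()
    {
        var psi = new ProcessStartInfo("YourApp.exe")
        {
            UseShellExecute = false // required so that EnvironmentVariables takes effect
        };

        // Raise the sampled allocation event rate only for this process
        psi.EnvironmentVariables["COM_PLUS_UNSUPPORTED_ETW_ObjectAllocationEventsPerTypePerSec"] = "10000";

        Process.Start(psi);
    }
}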
PerfView knows about other .NET private and stress providers, but until now there were no manifests for them. This is no longer true. Since CoreCLR is basically a clone of the desktop CLR we can take a look at ClrEtwAll.man which contains the definition of all CLR providers. The only missing thing is the resource dll to register it. I have compiled CoreCLR and put the resulting clretwrc.dll along with the manifest for Microsoft-Windows-DotNETRuntimeStress (CC2BCBBA-16B6-4cf3-8990-D74C2E8AF500) and Microsoft-Windows-DotNETRuntimePrivate (763FD754-7086-4dfe-95EB-C01A46FAF4CA) here: https://drive.google.com/folderview?id=0BxlmobpeaahAbEdvWHhQb1NCcFU&usp=sharing&tid=0BxlmobpeaahAa2dMVEMyVG9nRnc.
You need to put the resource dll into D:\privclr or edit the path in ClrEtwAll.man to point to your location of the dll to make the registration with
wevtutil im ClrEtwAll.man
successful. If you really want to look into these logs you can now do it with WPA as well. Some of the events can be interesting for troubleshooting.
The xperf calls above were only examples where -BufferSize, -MinBuffers, -MaxBuffers and -MaxFile were not supplied to keep them uncluttered. In reality you usually want to record either until the recording reaches a specific size (-MaxFile, e.g. 4 GB, which is already quite huge) or into a ring buffer. For the latter you can add -FileMode Circular, which lets you record for a very long time until something interesting happens and you stop the recording. Stopping can be triggered e.g. with performance counter triggers which execute a command when a performance counter goes above or below a threshold.
A more typical xperf command line to record into a ring buffer with 2 GB of buffer size which records
- Context Switches with call stacks
- Enables Profiling with 1ms sample rate with call stacks
- Process lifetimes
- Process memory information
is
xperf -on proc_thread+loader+meminfo+disk_io+filename+cswitch+dispatcher+profile -stackwalk profile+cswitch+readythread -f C:\kernelCircular.etl -MinBuffers 100 -BufferSize 512 -MaxBuffers 2000 -MaxFile 2000 -FileMode Circular
WPR always uses a 10% physical memory ring buffer (1.6 GB on my 16 GB box) which is allocated right from the start. The xperf trace buffers on the other hand can grow up to 2 GB in 512 KB chunks. This is better because you usually need two trace sessions (one user and one kernel session) and you do not want to allocate the maximum size, which is potentially never reached, upfront for each session.
That's it for today. Happy troubleshooting and try out your new tools. If you had good or bad experiences with these tools I would love to hear about it.