Originally posted on: http://brustblog.com/akraus1/archive/2015/08/16/166270.aspx
Did you ever see a nice tool with a fancy UI and think: Hey, that thing is powerful, I will try it! But later you were left scratching your head, wondering why others can use this tool and get great results while you just don't get it?
Then I have news for you: I have created a WPA profile which is aimed at user mode and not kernel mode developers. Many columns were removed from the context menus so that what remains gives you all the power you need to find issues in your application
where no one else has found the root cause.
The ETW profile can be downloaded from here as a zip file. Unpack it to a directory and you are ready to go. So what is inside it?
| File | Description |
| --- | --- |
| Simple.wpaProfile | The main WPA profile you can use now. |
| JIT.xml | WPA region file referenced by Simple.wpaProfile to get JIT times of your application like PerfView. |
| GC.xml | WPA region file referenced by Simple.wpaProfile to get Garbage Collection metrics like in PerfView only better! |
| HookRegions.xml | WPA region file referenced by Simple.wpaProfile to see mouse clicks in your ETL file when ETWControler is running and capturing your keyboard and mouse events. |
| default.stacktags | Stacktag file which serves as base for your application to find common .NET issues like event handle leaks, too much regex usage, monitor contention and much more. |
Besides this, I have also added some WPR profiles which enable recording of specific .NET events like exceptions together with GC events into a larger 1 GB buffer.
The WPR of Windows 10 offers .NET Activity under Scenario Analysis, but it uses a very small 160 MB ring buffer which is way too small for my needs. It might be good enough
for you, but I have added extra WPR profiles for GC and JIT as well.
To show you the difference I have created a small PerformanceIssueGenerator.exe application. It generates various issues which you can analyze with WPA.
I have already recorded the data and put the 7z file here. To extract it you need to download 7-Zip from the official site. To view the ETL file
you need a machine running Windows 8 or later and the Windows Performance Toolkit from the Windows 10 SDK.
When you have downloaded the profile and the sample ETL file you can apply the profile under Profiles - Apply…
Then you get two tabs. The first one contains .NET specific analysis views like GC and JIT. The second tab can be used for the usual bottleneck analysis regarding
CPU, disk and memory usage as well as wait chain analysis. You do not need to load the profile every time. You can save it as your default profile
by clicking on Save Startup Profile so that all of your ETL files open in this view from now on.
Normally I use PerfView for GC issues to check out the GC Stats of an application to see if anything significant is going on. If you have ever used PerfView then
you will have noticed that it is a great tool combined with a crappy UI. At least for some key scenarios we can now use WPA with hand crafted region files instead of PerfView.
GC Views
You now get a GC view like this
You can visualize each GC generation type and the time it took to execute. The Count column also tells you how many GCs you had. This helps a lot if you
want to get GC metrics only for a specific time region in your application. Now it is easy to see how much time of your use case was available to your application
and how much time your threads had to wait for the garbage collector. My main issue with PerfView and its GCStats view was that it is
calculated for the complete duration of the profiling run. Most of the time I want GC metrics only for specific parts of my application because
I am usually optimizing only a single place at a time.
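To make the idea of "GC metrics only for a specific time region" concrete, here is a minimal sketch of the arithmetic. It is not how WPA computes its numbers internally. The function name `gc_pause_in_window` and the sample pause intervals are made up for illustration, and the sketch assumes that the pauses do not overlap each other (as is the case for blocking GCs).

```python
from typing import Iterable, Tuple

Interval = Tuple[float, float]  # (start, stop) in seconds since trace start


def gc_pause_in_window(
    pauses: Iterable[Interval], window_start: float, window_stop: float
) -> Tuple[float, float]:
    """Return (paused_seconds, paused_fraction) for one time window.

    Only the part of each pause that lies inside the window is counted.
    Assumption: pauses do not overlap each other.
    """
    paused = 0.0
    for start, stop in pauses:
        # Clip the pause to the window; a non-positive value means no overlap.
        overlap = min(stop, window_stop) - max(start, window_start)
        if overlap > 0:
            paused += overlap
    length = window_stop - window_start
    fraction = paused / length if length > 0 else 0.0
    return paused, fraction


# Hypothetical GC pauses for the whole profiling run.
pauses = [(0.5, 1.0), (2.0, 2.5), (4.75, 5.5)]

# Metrics for the use case between second 2 and second 5 only.
paused, fraction = gc_pause_in_window(pauses, 2.0, 5.0)
print(f"paused {paused:.2f} s, {fraction:.0%} of the window")
```

Whatever is left of the window after subtracting the paused time is what was available to the application threads.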
Here is how you get the GC statistics in PerfView:
The numbers in GC Rollup By Generation match the region file pretty well. The GC pause times also correlate quite well with the distribution of generations. The timings
are not 100% the same, but the ratios are a good fit.
Since PerfView and WPA use the same ETW data, you can rightfully ask why there is a difference at all. The answer is that WPA does not simply sum up all GC regions
by their duration. A WPA region is defined by a start and a stop event, which is then displayed in the UI like above. But if regions run in parallel on several threads, WPA uses as the sum
only the time during which at least one region was active. This diagram illustrates how WPA region summation works:
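In code, the summation rule described above might look like the following minimal sketch. It only illustrates the "time where at least one region was active" idea as a union of intervals; it is not taken from WPA, and the names `naive_sum`, `union_duration` and the sample regions are hypothetical.

```python
from typing import Iterable, Optional, Tuple

Interval = Tuple[float, float]  # (start, stop) of one region in seconds


def naive_sum(regions: Iterable[Interval]) -> float:
    """Plain sum of all region durations, counting overlapping time twice."""
    return sum(stop - start for start, stop in regions)


def union_duration(regions: Iterable[Interval]) -> float:
    """Time during which at least one region was active."""
    total = 0.0
    cur_start: Optional[float] = None
    cur_stop: Optional[float] = None
    for start, stop in sorted(regions):
        if cur_stop is None or start > cur_stop:
            # Gap before this region: close the merged block collected so far.
            if cur_stop is not None:
                total += cur_stop - cur_start
            cur_start, cur_stop = start, stop
        else:
            # Region overlaps the current block: extend the block if needed.
            cur_stop = max(cur_stop, stop)
    if cur_stop is not None:
        total += cur_stop - cur_start
    return total


# Three regions running in parallel on different threads plus one later region.
regions = [(0.0, 4.0), (1.0, 3.0), (2.0, 6.0), (8.0, 9.0)]
print(naive_sum(regions))       # 11.0
print(union_duration(regions))  # 7.0
```

The more regions overlap, the further the two numbers drift apart.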
This detail is not so relevant for GC, but it will be very important when we look at JIT statistics. Since a foreground GC blocks your complete application, there is little
concurrency going on. We can also visualize when GCs of which type are happening. The view Generic Events GC Generations shows you how many GCs of which type happened.
It is collapsed initially, so you have to expand it. This view takes a little while to build up since it needs to parse all ETW events for GC Start events, which contain the generation number as a data point that is
visualized here.
This is the greatest thing about WPA of Windows 10: you can now visualize the data of your custom traces. With a little WPA graph configuration you can create key graphs for your own application
which show e.g. frames per second, number of DB transactions, … Who needs performance counters when you can visualize your trace data much better now?
Marker Events
Marks are a key aid for knowing where to start looking for an issue. In WPRUI you can press Ctrl+Win+x to open a window which allows you to write a marker event to your trace file
while you are recording data. Unfortunately this functionality is not exposed to the outside world, so you cannot set marks programmatically. If you fall back to xperf you can set marks programmatically with xperf -m
if you wish to. But since marks are so useful to navigate in an ETL file, I really wonder why the method which xperf calls is not documented at all. If you want to know how real power users
are using WPA, call wpa /?
I guess at Microsoft they let their tests run with profiling enabled while saving screenshots as ETW events. If something happens, the ETL file is downloaded
from a central repository and opened with a test specific profile. The view is then zoomed into the relevant part of the test, which is identified by markers
or regions from a region file. At least some people at Microsoft use this tool so often that it makes sense to automate it even further, since the controls of WPA are UI automatable, which lets you
script nearly your complete analysis. Now THAT is what I call good performance regression testing.
A little reverse engineering of which APIs xperf -m calls finally turned up the method EtwSetMark, which is exported by ntdll.dll. I am sure the readers of my blog can figure out the rest as well.
But I really wish it would become an officially supported API since it is so useful. Sure, you can define your own marker events as well, but since the support in WPA is already built in
it would really help. It would also be nice if xperf emitted the mark event not only to the "NT Kernel Logger" ETW session but to all active kernel sessions, so that you could also mark the WPR kernel session,
which is currently not possible.
JIT View
PerfView can also show you how much time each method took to compile. This is useful if you are compiling too much code on the fly for your WCF proxies or your serializers. Sure, it is done only once,
but if this time is a significant part of your application startup you should rethink how your application works. Since .NET 4.5 will NGen all assemblies on the fly if they are used often enough, you do not need to
consider using NGen explicitly for your application. But if you have much dynamic code generation going on you can still suffer from long JIT times.
You can get a similar view by switching from Garbage Collection to JIT Time per Thread
As I have shown you in the GC section, the summation of regions is not a simple sum of the duration of all JIT events. The JIT compiler can compile code on many different threads concurrently.
The Duration sum over all threads reported by WPA is therefore largely irrelevant if you want to compare two different runs of your application. Instead you need to look at the JIT times
of each thread. You can copy the duration column with the context menu "Copy Column Selection" into Excel and sum it up there,
which gives us the exact same JIT time as PerfView. Now I do not need PerfView for JIT analysis anymore. I am more and more moving away from programming towards configuring WPA to give me just the view I need.
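If you prefer a script over a spreadsheet, the same per-thread summation can be sketched like this. The row layout (thread id, JIT duration in milliseconds) and all values are hypothetical stand-ins for whatever you copied out of WPA; the sketch does not read ETL files.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def jit_time_per_thread(rows: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Sum the JIT durations (in ms) of each thread separately."""
    totals: Dict[int, float] = defaultdict(float)
    for thread_id, duration_ms in rows:
        totals[thread_id] += duration_ms
    return dict(totals)


# Hypothetical (thread id, JIT duration in ms) pairs copied from the duration column.
rows = [(4120, 12.5), (4120, 3.5), (5312, 8.0), (5312, 1.0), (6004, 2.0)]

per_thread = jit_time_per_thread(rows)
for thread_id, total_ms in sorted(per_thread.items()):
    print(f"thread {thread_id}: {total_ms:.1f} ms")

# Adding up the per-thread totals counts concurrent JIT time on every thread,
# unlike a region summation that counts overlapping time only once.
print(f"total JIT time: {sum(per_thread.values()):.1f} ms")
```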
If you expand JITs you get at least the namespace of each JIT event. This is the best I could come up with since WPA does not support concatenating strings of different fields into a region name.
But you can still open the customized view _Activity by Provider Task Process to see the "raw" data if you need more information. If you drop e.g. Field 5 to the left of the yellow bar you get a nice summary of how
many methods in this namespace were compiled.
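The grouping that WPA does when you drag a field to the left of the yellow bar boils down to a count per distinct value. As a rough sketch, assuming you have exported the namespace of every JIT event into a list (the function name and the namespace values below are invented for illustration):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def methods_per_namespace(namespaces: Iterable[str]) -> List[Tuple[str, int]]:
    """Count JIT events per namespace, most frequently compiled first."""
    return Counter(namespaces).most_common()


# Hypothetical namespace values, one entry per compiled method.
jit_events = [
    "System.Xml.Serialization",
    "System.Xml.Serialization",
    "System.ServiceModel",
    "MyApp.Core",
    "System.Xml.Serialization",
    "System.ServiceModel",
]

for namespace, count in methods_per_namespace(jit_events):
    print(f"{count:3d}  {namespace}")
```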
As you can guess, there are lots more goodies inside the now published profiles to make WPA really usable for non kernel developers. I have streamlined nearly every context menu by removing all useless or kernel only columns.
You now have a much easier time concentrating on the important things. With this profile WPA is nearly a new application and a central analysis station for managed and unmanaged performance analysis.
In the next posts I will walk you through the problems of PerformanceIssueGenerator and how you can visualize them in WPA efficiently.