Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/07/31/166002.aspx
It has been a long wait to finally get the new version of the new Windows Performance Toolkit. I had not much time to test the betas since until now managed call stack resolution was not working.
WPT for Windows 10 makes some nice progress which you can download here. Click on the Download Standalone SDK and run it. Then uncheck everything except Windows Performance Toolkit and
it will install on your machine within minutes. If you download the complete Windows 10 SDK you are not downloading 170MB but several GB.
Management Summary
- Graphs are more readable.
- Graphs with short spikes are much better rendered without the need to zoom into to see them.
- Colors for graphs are changeable (finally).
- Filtering can be undone now in the context menu.
- Quick Search in all tables.
- Symbol load dialog was improved.
- You can configure your own stack tag files in Trace Properties.
- No new Graph Types (at least with the default providers I enable).
- 5 Added columns to CPU Usage Sampled
- Two could be extremely useful: Source File Name and Source Line Number
- 12 Added columns to CPU Usage Precise.
- Two are to user mode developers useful: ProcessOutOfMemory and CPU Usage (in view).
- Occasional crashes are also featured now.
CPU Usage Sampled
The biggest improvement is certainly the possibility to see source file and line numbers. You can find out this way where most CPU is spent on which line if the CPU consumption happens in your code.
A quick recap for the newbies: The CPU Usage Sampled graph is generated by taking the stack trace of all running threads 1000 times per second (this is the default). These call stacks are then added together.
A count in your Main method of 1000 means that the method has used one second of CPU time.
If the call stack ends not in your e.g. Main method when the stack trace is taken then your source file will not be displayed there. As usual you have to take a sharp look at the numbers. Your method
may be causing a high CPU consumption but it might never show up with a source and line number because the stack always ends in an external library call (printf in my case) for which I did not have the pdb loaded.
It would be nice to have file and line numbers for managed code as well but this feature is of limited use as it is now.
Ideally I want to see this stuff while I am drilling into a call stack all my way down and not only if the call stack ends in a method which I did compile.
For Reference here is the complete list of columns you can choose from. Green are unchanged columns, Bold are new ones. I never have found out how to give a thread a name in WPT.
If anyone knows more about the mystical thread names and how they can be set I would be very interested in it.
WPT 8.1 | WPT 10 |
% Weight | % Weight |
Address | Address |
All Count | All Count |
Annotation | Annotation |
Compiler Optimization | Compiler Optimization |
Count | Count |
CPU | CPU |
Display Name | Display Name |
DPC/ISR | DPC/ISR |
Function | Function |
Image RVA | |
Inlined Functions | Inlined Functions |
Is PGO'ed | Is PGO'ed |
Module | Module |
PGO Counts | PGO Counts |
PGO Dynamic Instruction Count | PGO Dynamic Instruction Count |
Priority | Priority |
Process | Process |
Process Name | Process Name |
Rank | Rank |
Section Name | |
Section RVA | |
Source File Name | |
Source Line Number | |
Stack | Stack |
Table | Table |
Thread Activity Tag | Thread Activity Tag |
Thread ID | Thread ID |
Thread Name | Thread Name |
Thread Start Function | Thread Start Function |
Thread Start Module | Thread Start Module |
TimeStamp | TimeStamp |
Trace # | Trace # |
Weight | Weight |
Weight (in view) | Weight (in view) |
CPU Usage Precise
This is by far the most complex table. With WPT 10 we have 66 columns to choose from to find out how our threads interact with each other.
I am not sure how WPA detects that a process was out of memory but I think the kernel knows quite well if something did go wrong. The ProcessOutOfMemory column could be handy
to check stress tests when the machine was under heavy load which could explain subsequent application failures.
CPU Usage (in view) seems to be an attempt to display a more true thread running time. I am not sure how this value is calculated but it seems that if all cores are in use and you
get significant Ready times then it differs quite a lot. Otherwise the differences are marginal.
There have been quite some renaming of columns which might render your custom filter and enablement conditions useless and WPA will show you an error.
If your old preset did enable by default all processes except the Idle process which clutters up the view it was set to:
([Series Name]:="NewProcess" AND NOT ([NewProcess]:="Idle (0)"))
With WPT 10 you need to change it to:
([Series Name]:="New Process" AND NOT ([New Process]:="Idle (0)"))
This happens to all users of WPT 8.1 which have saved a custom default profile which contains this enablement rule.
With an empty enablement rule your view will look like
and with the rule
you will see no difference when you apply it. Why? Because this rule only enables matching stuff but it does not disable the non matching
processes! I am with you that this is very counter intuitive but it is the way it is. If you save this as new default profile and then open a new file
you will no longer get a view where the Idle process is enabled by default.
For reference I list here all available columns with the old and new names so you have an easier time to update your custom filters, enablement
and expansion rules.
Green lines are unchanged, yellow are renamed columns and bold are new columns in WPT 10.
WPT 8.1 | WPT 10 |
% CPU Usage | % CPU Usage |
AdjustIncrement | |
AdjustReason | |
Annotation | Annotation |
Count | Count |
Count:Waits | Count:Waits |
Cpu | Cpu |
CPU Usage (ms) | CPU Usage |
CPU Usage (in view) | |
DirectSwitchAttempt | |
IdealCpu | Ideal Cpu |
KernelStackNotResident | |
LastSwitchOutTime (s) | Last Switch-Out Time |
NewInPri | New In Pri |
NewOutPri | New Out Pri |
NewPrevOutPri | New Prev Out Pri |
NewPrevState | New Prev State |
NewPrevWaitMode | New Prev Wait Mode |
NewPrevWaitReason | New Prev Wait Reason |
NewPriDecr | New Pri Decr |
NewProcess | New Process |
NewProcess Name | New Process Name |
NewQnt | New Qnt |
NewState | New State |
NewInSwitchTime (us) | New Switch-In Time |
NewThreadId | New Thread Id |
NewThreadStack | New Thread Stack |
NewThreadStartFunction | New Thread Start Function |
NewThreadStartModule | New Thread Start Module |
NewWaitMode | New Wait Mode |
NewWaitReason | New Wait Reason |
NewThreadPrevRemainingQuantum | |
NewThreadRank | |
NewThreadRemainingQuantum | |
NextSwitchOutTime (s) | Next Switch-Out Time |
OldOutPri | Old Out Pri |
OldProcess | Old Process |
OldProcess Name | Old Process Name |
OldQnt | Old Qnt |
OldState | Old State |
OldInSwitchTime (us) | Old Switch-In Time |
OldThreadId | Old Thread Id |
OldThreadStartFunction | Old Thread Start Function |
OldThreadStartModule | Old Thread Start Module |
OldWaitMode | Old Wait Mode |
OldWaitReason | Old Wait Reason |
OldThreadRank | |
OldThreadRemainingQuantum | |
PrevCState | Prev CState |
ProcessOutOfMemory | |
Ready (us) | Ready |
ReadyThreadStack | Ready Thread Stack |
ReadyTime (s) | Ready Time |
ReadyingProcess | Readying Process |
ReadyingProcess Name | Readying Process Name |
ReadyingThreadId | Readying Thread Id |
ReadyingThreadStartFunction | Readying Thread Start Function |
ReadyingThreadStartModule | Readying Thread Start Module |
ReadyThreadInDPC | |
SwitchInTime (s) | Switch-In Time |
Table | Table |
Thread Activity Tag | Thread Activity Tag |
Thread Name | Thread Name |
TimeSinceLast (us) | Time Since Last |
Trace # | Trace # |
Waits (us) | Waits |
File IO View
The File IO view has now an additional column named File Path which acts like in Path Tree in Disc IO which is extremely useful if you do differential IO analysis. It seems that my
complaints about the not complete feature parity between Disc and File view have helped a bit.
This view has still a bug. If I zoom into it all graphs are disabled and I have to reenable them after every zoom operation. If I zoom out all is ok.
Generic Events
Besides some minor changes the biggest change is that finally it has got a DateTime (local) column. It shows the clock time and not the time since trace start.
This can be very useful for advanced scenarios when someone did report an issue and he did care to write down the time when it did happen.
Process View
I did not know it until it was removed. Process Tree is no longer with us. It was not really useful anyway.
Process Tree was added to WPT. That was the reason I never saw it before. After using it for some time I find it is easier to navigate in the tree if
you have a parent process which creates many childs.
Disc IO View
The Disc IO view has got only one additional column with the name Boosted which is for my disc IO always false. Not sure if this is a Windows 10 only feature. If someone knows more about it please comment.
Windows Performance Recorder
The recorder has not changed much. It has got some additional profiles like ones for .NET, Edge Browser, … The buffer size for .NET Activity is hard coded
to a 160 MB ring buffer which might not be enough for machines were many .NET apps are running. I still do not buy into this easy to use UI since the xml configs
need hard coding the values into it. It can be a hard value or a percentage of the system memory. Both choices are bad if I want to automate performance regression testing with
very specific workloads and tracing requirements. This GUI and the command line tool (wpr.exe) are just too inflexible.
Yes I know the complete xml has inheritance built into it so I could create derived profiles which override only the specific settings I care about. But this is still not dynamic enough
if I want to create a configuration for each problem and set them depending on the use case at hand.
- xperf is still a good tool to stop a profiling session because it does a full CLR rundown.
- But PerfView is much faster at creating the managed pdbs by not trying to recreate them over and over again with NGen.
- Tracelog from the Windows SDK is still the only tool to start a trace session where you can configure the Win 8.1 enhancement to the ETW infrastructure.
- E.g. filter events from a specific process, or disable events or stackwalking for user mode providers.
Summary
The new WPA tool has got some nice new features but they still need some polishing to become truly awesome. In my daily work I miss the automatic addition
of the etl.NGenPdb folder when present for the generated pdbs of an etl file. This is a useful convention and the old WPA was so nice to automatically include it to my symbol path.
WPA is very sophisticated. To use it right is an art which needs nearly daily practice. That could explain why this tool although it has a great UI has still got not so
much traction. It is like Windbg in disguise which will quickly cause information overload to the untrained brain.
But with some basic tricks which I will show in the near future you can crack otherwise unsolvable performance nuts with a few mouse clicks.