Channel: Geeks with Blogs

SpinWait Is Dangerous


Originally posted on: http://bobgoedkoop.nl/akraus1/archive/2015/06/18/165219.aspx

Some bugs are easy, some are hard. If you are dealing with a massive multithreaded distributed application consisting of many different processes talking to each other, performance issues are pretty much impossible to detect with standard profilers. But with WPT the word impossible can be replaced with possible, although much work is still needed. Recently I had an interesting regression issue where a use case became over one second slower, which is quite a big difference. So it should be easy to spot, right? Two parts were easy to find, but one fraction (60%) was still hiding in the data. Since the title of this post is already very clear, here is a small test application which shows the issue:

class Program
{
    static Stopwatch StartTime = Stopwatch.StartNew();
    static TimeSpan WaitTime = TimeSpan.FromMilliseconds(2);

    static bool ReturnTrueAfter2ms()
    {
        if (StartTime.Elapsed.Ticks > WaitTime.Ticks)
        {
            StartTime.Restart();
            return true;
        }
        return false;
    }

    static void Main(string[] args)
    {
        while (true)
        {
            SpinWait.SpinUntil(ReturnTrueAfter2ms);
        }
    }
}
In the Main method we are continuously spin-waiting for an operation which usually takes some milliseconds (e.g. network IO) to complete. The goal of SpinWait is to spare you context switches in favor of burning CPU cycles. We are talking about sparing some microseconds instead of paying for a context switch. SpinWait will therefore by default spin for a few microseconds until it gives up and tries to yield to other threads with increasing persistence. If you look at the code of SpinWait

 

image

you find that every 20th run it will sleep for 1 ms. That in itself is not a big deal if your method takes longer to execute. Just by looking at the code all looks sound and reasonable. What could possibly go wrong? Let's have a look at the costs for a context switch with Thread.Yield, Thread.Sleep(0) and Thread.Sleep(1). Thread.Yield is quite cheap if the other thread does not have much work to do. The cost looks like 6 microseconds for doing only a context switch, and up to 70 microseconds if the other thread does some work until it waits again for something.

image

Thread.Sleep(0) on the other hand lets the thread run forever if no other threads want to run on this CPU. The scheduler checks on the thread ca. every 250 ms, where a non-switching context switch event is logged, but apart from that nothing stops my thread.

Things become interesting for Thread.Sleep(1) where we forcefully give up our thread quantum.

 

image

Remember, the goal of SpinWait is to spare you an expensive context switch. It spins for about 20-80 microseconds. Then it tries from time to time a Thread.Yield, a Thread.Sleep(0) and, every 20th time, a Thread.Sleep(1). In the comments of SpinWait it is noted that a thread can wait for a long time. In fact a Thread.Sleep(1) can sleep up to 15 ms, because you give up your current thread quantum and the scheduler looks at your thread again only during the next clock tick, 15.6 ms later, to check if it is ready to run again. SpinWait is therefore only useful if your operation completes within a few microseconds in 99.9% of all cases. But if your operation frequently or nearly always takes much longer than a typical context switch, then this class will not make you any faster. Instead of sparing a few microseconds you are potentially wasting up to 15 milliseconds every time you call SpinWait.SpinUntil!
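To see how badly this backoff interacts with an operation that takes around 2 ms, here is a toy simulation of the described behavior. The cost values and the spin/yield thresholds are illustrative assumptions roughly matching the measurements in this post, not the real SpinWait implementation:

```python
def spin_until(condition_ready_at_us, spin_cost_us=1, yield_cost_us=6,
               sleep1_cost_us=15625, spin_count_before_yield=10):
    """Simplified model of SpinWait.SpinUntil backoff.

    Returns the simulated time (in microseconds) until the condition
    is observed as true. A Sleep(1) is charged a full 15.625 ms
    scheduler tick (the default 64 Hz timer), as discussed above.
    """
    t = 0          # simulated elapsed microseconds
    iteration = 0
    while t < condition_ready_at_us:
        if iteration < spin_count_before_yield:
            t += spin_cost_us            # busy spin phase
        elif (iteration - spin_count_before_yield) % 20 == 19:
            t += sleep1_cost_us          # every 20th backoff: Thread.Sleep(1)
        else:
            t += yield_cost_us           # Thread.Yield / Thread.Sleep(0)
        iteration += 1
    return t

# A condition that becomes true after 5 us is caught while spinning:
print(spin_until(5))
# A condition that becomes true after 2 ms runs into a Sleep(1) and
# pays a full timer tick:
print(spin_until(2000))
```

With these assumed costs the 2 ms case returns well over 15000 microseconds: the spins and yields cover only the first ~130 microseconds, and the very first Sleep(1) then overshoots the target by almost a full tick.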

If you have many call stacks in your application which cause SpinWait to sleep for a thread quantum of 15ms then you are using SpinWait wrongly!

image

This one was very hard to find because I usually search for long pauses when I am after a one second delay. But this time 600 ms were scattered across dozens of calls, almost evenly distributed throughout the use case, because SpinWait was used in a central low level component which was frequently called.

If you looked at the Excel chart above and wondered where the difference between "with" and "without Visual Studio" for Thread.Sleep(1) comes from, I can tell you now. By default the Windows timer ticks every 15.6 ms, which then triggers the thread scheduler to check on its running threads. The strange 15.6 ms value comes from 64 ticks/s = 15.6 ms/tick. If someone uses a multimedia resolution timer or changes the system wide timer tick rate, your threads will be checked by the thread scheduler more often. That is the reason why Thread.Sleep(1) can wait for "only" 2 ms or 15 ms depending on which applications are running, since they may alter the system wide clock rate. Another great source of many more details can be found at https://randomascii.wordpress.com/2013/07/08/windows-timer-resolution-megawatts-wasted/.
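The timer arithmetic is easy to verify. A quick sketch using the two rates discussed in this post (the 64 Hz default and the 1 kHz multimedia rate):

```python
# Default Windows timer: 64 interrupts per second.
default_tick_ms = 1000 / 64
print(default_tick_ms)   # 15.625 -> the "15.6 ms" quantum

# A multimedia timer can raise the rate to 1 kHz:
mm_tick_ms = 1000 / 1000
print(mm_tick_ms)        # 1.0 -> Thread.Sleep(1) wakes after roughly 1-2 ms

# Thread.Sleep(1) resumes at the next timer tick, so the real wait
# lies between the requested 1 ms and one full tick period:
print(max(1.0, default_tick_ms))  # up to 15.625 ms with the default rate
```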

You can use the clockres tool from SysInternals to check your current timer resolution. The dynamic Windows timer tick rate makes it especially tricky to find performance degradations with unit tests on machines where Visual Studio or another WPF GUI is running, because there the tick rate is 1 kHz. But on an installed server the clock tick rate is 64 Hz, which makes a big difference. Luckily the performance regression tests did show the issue long before any customer would get this version. It really pays off to run performance regression tests. It is even better if you not only execute performance regression tests, but also use xperf to record profiling data for all performance tests and store that profiling data somewhere. Otherwise, when a test breaks later, you need to go to the testing guys and ask them to install the current baseline which has the issue and an old baseline which was good. Then you need to profile both use cases before you can even start analyzing. What a waste of time!

It is much easier to let your central performance tests run with ETW profiling, which will affect even heavy use cases by less than 10%. But you get all the data you could wish for to find the root cause much faster. If you train people how to use WPA to analyze issues, and how to use PerfView, Windbg, VMMap and memory dumps, you can create software which has significantly fewer issues than your competitors because you really understand your system. The usual thinking of treating system libraries and the operating system as black boxes where miracles happen has its place, so that you are not distracted by the many things you need to consider before you write one line of software. But you should not stop at the black boxes. Instead you should develop the ability to look into the black boxes when necessary, and write a nice blog post about the found issues so other people can learn how things (not) work.


Windows 10 SDK With New Windows Performance Toolkit Available!


Originally posted on: http://bobgoedkoop.nl/akraus1/archive/2015/07/31/166002.aspx

It has been a long wait to finally get the new version of the Windows Performance Toolkit. I did not have much time to test the betas since, until now, managed call stack resolution was not working.

WPT for Windows 10 makes some nice progress; you can download it here. Click on Download Standalone SDK and run it. Then uncheck everything except Windows Performance Toolkit and it will install on your machine within minutes. If you download the complete Windows 10 SDK you are downloading not 170 MB but several GB.

 

Management Summary

  • Graphs are more readable.
  • Graphs with short spikes are much better rendered without the need to zoom into to see them.
  • Colors for graphs are changeable (finally).
  • Filtering can be undone now in the context menu.
  • Quick Search in all tables.
  • Symbol load dialog was improved.
  • You can configure your own stack tag files in Trace Properties.
  • No new Graph Types (at least with the default providers I enable).
  • 5 added columns in CPU Usage Sampled
    • Two could be extremely useful: Source File Name and Source Line Number
  • 12 added columns in CPU Usage Precise
    • Two are useful to user mode developers: ProcessOutOfMemory and CPU Usage (in view).
  • Occasional crashes are also featured now.

CPU Usage Sampled

image

The biggest improvement is certainly the possibility to see source files and line numbers. This way you can find out on which line most CPU is spent, provided the CPU consumption happens in your code.

A quick recap for the newbies: The CPU Usage Sampled graph is generated by taking the stack trace of all running threads 1000 times per second (this is the default). These call stacks are then added together.

A count in your Main method of 1000 means that the method has used one second of CPU time.
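A quick sketch of that sampling arithmetic (the 1 kHz rate is the WPA default mentioned above):

```python
SAMPLE_RATE_HZ = 1000  # WPA default: one stack sample per running thread per ms

def samples_to_cpu_seconds(sample_count, rate_hz=SAMPLE_RATE_HZ):
    """Convert a count from the CPU Usage (Sampled) view into CPU seconds."""
    return sample_count / rate_hz

print(samples_to_cpu_seconds(1000))  # 1.0 -> one second of CPU time
print(samples_to_cpu_seconds(250))   # 0.25
```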

If the call stack does not end in your code (e.g. in your Main method) when the stack trace is taken, then your source file will not be displayed there. As usual you have to take a sharp look at the numbers. Your method may be causing high CPU consumption, but it might never show up with a source file and line number because the stack always ends in an external library call (printf in my case) for which I did not have the pdb loaded.

It would be nice to have file and line numbers for managed code as well, but as it is now this feature is of limited use. Ideally I want to see this information while I am drilling all the way down into a call stack, and not only when the call stack ends in a method which I did compile.

 

For reference, here is the complete list of columns you can choose from. Columns with an empty WPT 8.1 cell are new in WPT 10. I have never found out how to give a thread a name in WPT. If anyone knows more about the mystical thread names and how they can be set I would be very interested in it.

WPT 8.1 | WPT 10
% Weight | % Weight
Address | Address
All Count | All Count
Annotation | Annotation
Compiler Optimization | Compiler Optimization
Count | Count
CPU | CPU
Display Name | Display Name
DPC/ISR | DPC/ISR
Function | Function
 | Image RVA
Inlined Functions | Inlined Functions
Is PGO'ed | Is PGO'ed
Module | Module
PGO Counts | PGO Counts
PGO Dynamic Instruction Count | PGO Dynamic Instruction Count
Priority | Priority
Process | Process
Process Name | Process Name
Rank | Rank
 | Section Name
 | Section RVA
 | Source File Name
 | Source Line Number
Stack | Stack
Table | Table
Thread Activity Tag | Thread Activity Tag
Thread ID | Thread ID
Thread Name | Thread Name
Thread Start Function | Thread Start Function
Thread Start Module | Thread Start Module
TimeStamp | TimeStamp
Trace # | Trace #
Weight | Weight
Weight (in view) | Weight (in view)

 

CPU Usage Precise

This is by far the most complex table. With WPT 10 we have 66 columns to choose from to find out how our threads interact with each other.

I am not sure how WPA detects that a process was out of memory, but I think the kernel knows quite well if something went wrong. The ProcessOutOfMemory column could be handy for checking stress tests where the machine was under heavy load, which could explain subsequent application failures.

CPU Usage (in view) seems to be an attempt to display a more accurate thread running time. I am not sure how this value is calculated, but if all cores are in use and you get significant Ready times it differs quite a lot. Otherwise the differences are marginal.

 

Quite some columns have been renamed, which might render your custom filter and enablement conditions useless; WPA will then show you an error.

image

If your old preset enabled by default all processes except the Idle process, which clutters up the view, it was set to:

([Series Name]:="NewProcess" AND NOT ([NewProcess]:="Idle (0)"))

With WPT 10 you need to change it to:

([Series Name]:="New Process" AND NOT ([New Process]:="Idle (0)"))

This happens to all users of WPT 8.1 who have saved a custom default profile which contains this enablement rule.

With an empty enablement rule your view will look like

image

and with the rule

image

you will see no difference when you apply it. Why? Because this rule only enables matching processes, but it does not disable the non-matching ones! I agree that this is very counterintuitive, but it is the way it is. If you save this as your new default profile and then open a new file, you will no longer get a view where the Idle process is enabled by default.

For reference I list here all available columns with the old and new names, so you have an easier time updating your custom filter, enablement and expansion rules.

Columns with different names on the two sides were renamed; columns with an empty WPT 8.1 cell are new in WPT 10.

WPT 8.1 | WPT 10
% CPU Usage | % CPU Usage
 | AdjustIncrement
 | AdjustReason
Annotation | Annotation
Count | Count
Count:Waits | Count:Waits
Cpu | Cpu
CPU Usage (ms) | CPU Usage
 | CPU Usage (in view)
 | DirectSwitchAttempt
IdealCpu | Ideal Cpu
 | KernelStackNotResident
LastSwitchOutTime (s) | Last Switch-Out Time
NewInPri | New In Pri
NewOutPri | New Out Pri
NewPrevOutPri | New Prev Out Pri
NewPrevState | New Prev State
NewPrevWaitMode | New Prev Wait Mode
NewPrevWaitReason | New Prev Wait Reason
NewPriDecr | New Pri Decr
NewProcess | New Process
NewProcess Name | New Process Name
NewQnt | New Qnt
NewState | New State
NewInSwitchTime (us) | New Switch-In Time
NewThreadId | New Thread Id
NewThreadStack | New Thread Stack
NewThreadStartFunction | New Thread Start Function
NewThreadStartModule | New Thread Start Module
NewWaitMode | New Wait Mode
NewWaitReason | New Wait Reason
 | NewThreadPrevRemainingQuantum
 | NewThreadRank
 | NewThreadRemainingQuantum
NextSwitchOutTime (s) | Next Switch-Out Time
OldOutPri | Old Out Pri
OldProcess | Old Process
OldProcess Name | Old Process Name
OldQnt | Old Qnt
OldState | Old State
OldInSwitchTime (us) | Old Switch-In Time
OldThreadId | Old Thread Id
OldThreadStartFunction | Old Thread Start Function
OldThreadStartModule | Old Thread Start Module
OldWaitMode | Old Wait Mode
OldWaitReason | Old Wait Reason
 | OldThreadRank
 | OldThreadRemainingQuantum
PrevCState | Prev CState
 | ProcessOutOfMemory
Ready (us) | Ready
ReadyThreadStack | Ready Thread Stack
ReadyTime (s) | Ready Time
ReadyingProcess | Readying Process
ReadyingProcess Name | Readying Process Name
ReadyingThreadId | Readying Thread Id
ReadyingThreadStartFunction | Readying Thread Start Function
ReadyingThreadStartModule | Readying Thread Start Module
 | ReadyThreadInDPC
SwitchInTime (s) | Switch-In Time
Table | Table
Thread Activity Tag | Thread Activity Tag
Thread Name | Thread Name
TimeSinceLast (us) | Time Since Last
Trace # | Trace #
Waits (us) | Waits

File IO View

The File IO view now has an additional column named File Path, which acts like the Path Tree in Disc IO. This is extremely useful if you do differential IO analysis. It seems that my complaints about the incomplete feature parity between the Disc and File views have helped a bit.

image

This view still has a bug: if I zoom into it, all graphs are disabled and I have to re-enable them after every zoom operation. If I zoom out all is ok.

Generic Events

Besides some minor changes, the biggest one is that it has finally got a DateTime (local) column. It shows wall clock time and not the time since trace start. This can be very useful for advanced scenarios when someone reports an issue and cares to write down the time when it happened.

image

Process View

I first thought Process Tree had been removed, but it was in fact newly added to WPT. That was the reason I never saw it before. After using it for some time I find it easier to navigate in the tree if you have a parent process which creates many children.

 

Disc IO View

The Disc IO view has got only one additional column, named Boosted, which is always false for my disc IO. I am not sure if this is a Windows 10 only feature. If someone knows more about it please comment.

 

Windows Performance Recorder

The recorder has not changed much. It has got some additional profiles, like ones for .NET, the Edge browser, … The buffer size for .NET Activity is hard coded to a 160 MB ring buffer, which might not be enough for machines where many .NET apps are running. I still do not buy into this easy-to-use UI, since the xml configs need the values hard coded into them. A setting can be a fixed value or a percentage of system memory. Both choices are bad if I want to automate performance regression testing with very specific workloads and tracing requirements. This GUI and the command line tool (wpr.exe) are just too inflexible.

Yes, I know the complete xml has inheritance built in, so I could create derived profiles which override only the specific settings I care about. But this is still not dynamic enough if I want to create a configuration for each problem and choose among them depending on the use case at hand.

image

  • xperf is still a good tool to stop a profiling session because it does a full CLR rundown.
  • But PerfView is much faster at creating the managed pdbs by not trying to recreate them over and over again with NGen.
  • Tracelog from the Windows SDK is still the only tool to start a trace session where you can configure the Win 8.1 enhancement to the ETW infrastructure.
    • E.g. filter events from a specific process, or disable events or stackwalking for user mode providers.

Summary

The new WPA tool has got some nice new features, but they still need some polishing to become truly awesome. In my daily work I miss the automatic addition of the etl.NGenPdb folder (when present) for the generated pdbs of an etl file. This is a useful convention, and the old WPA was nice enough to automatically include it in my symbol path.

WPA is very sophisticated. Using it right is an art which needs nearly daily practice. That could explain why this tool, although it has a great UI, still has not got much traction. It is Windbg in disguise, which will quickly cause information overload in the untrained brain. But with some basic tricks, which I will show in the near future, you can crack otherwise unsolvable performance nuts with a few mouse clicks.

Make WPA Simple - Garbage Collection and JIT Times


Originally posted on: http://bobgoedkoop.nl/akraus1/archive/2015/08/16/166270.aspx

Did you ever see a nice tool with a fancy UI and think: hey, that thing is powerful, I will try it! But later you were left scratching your head wondering why others can use this tool and get great results but you just don't get it?

Then I have news for you: I have created a WPA profile which aims at user mode and not kernel mode developers. Many columns were removed from the context menus to give you all the power you need to find issues in your application where no one else has found the root cause.

The ETW profile can be downloaded from here as a zip file. Unpack it to a directory and you are ready to go. So what is inside it?

 

File | Description
Simple.wpaProfile | The main WPA profile you can use now.
JIT.xml | WPA region file referenced by Simple.wpaProfile to get JIT times of your application like PerfView.
GC.xml | WPA region file referenced by Simple.wpaProfile to get Garbage Collection metrics like in PerfView, only better!
HookRegions.xml | WPA region file referenced by Simple.wpaProfile to see mouse clicks in your ETL file when ETWControler is running and capturing your keyboard and mouse events.
default.stacktags | Stacktag file which serves as a base for your application to find common .NET issues like event handle leaks, too much regex usage, monitor contention and much more.

 

Besides this there are also some WPR profiles added to enable recording of specific .NET events, like exceptions together with GC events, into a larger 1 GB buffer. WPR of Windows 10 has, under Scenario Analysis, a .NET Activity profile with a very small 160 MB ring buffer, which is way too small for my needs. It might be good for you, but I have added extra WPR profiles for GC and JIT as well.

To show you the difference I have created a small PerformanceIssueGenerator.exe application. This generates various issues which you can analyze with WPA. I have recorded the data already and put the 7z file here. If you want to extract it you need to download 7z from the official site. To view the ETL file you need a >= Windows 8 machine and the Windows Performance Toolkit from the Windows 10 SDK.

 

When you have downloaded the profile and the sample etl file you can apply the profile under Profiles - Apply…

image

Then you get two tabs. The first one contains .NET specific analysis stuff like GC and JIT. The second tab can be used for the usual bottleneck analysis regarding CPU, disk and memory usage, as well as wait chain analysis. You do not need to load the profile every time. You can save it as your default profile by clicking Save Startup Profile, to open all of your ETL files in this view from now on.

image

Normally I use PerfView for GC issues: I check out the GC Stats of an application to see if anything significant is going on. If you have ever used PerfView then you will have noticed that it is a great tool combined with a crappy UI. At least for some key scenarios we can now use WPA with hand crafted region files instead of PerfView.

 

GC Views

You now get a GC view like this

image

You can visualize each GC generation type and the time it took to execute. The Count column also tells you how many GCs you had. This helps a lot if you want GC metrics only for a specific time region in your application. Now it is easy to see how much time of your use case was available to your application and how much time your threads had to wait for the garbage collector. This was my main issue with PerfView and its GCStats view: it is calculated for the complete duration of the profiling run. Most of the time I want GC metrics only for specific parts of my application, because I am mostly optimizing only a single place at a time.

Here is how you get the GC statistics in PerfView:

 

image

 

The numbers in GC Rollup By Generation match the region file pretty well. The GC pause times also correlate quite well with the distribution of generations; the timings are not 100% the same, but the ratios are a good fit.

image

Since PerfView and ETW use the same data, you can rightfully ask why there is a difference at all. The answer is that WPA does not sum up all GC regions by their duration. A WPA region is defined by a start and a stop event, which is displayed in the UI as above. But if regions happen in parallel in several threads, WPA will sum only the time where at least one region was active. This diagram illustrates how WPA region summation works:

image
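The rule can be modeled as the union of time intervals rather than the sum of their durations. This is only a sketch of the summation rule described above, not WPA's actual implementation:

```python
def union_duration(regions):
    """Time covered by at least one region (WPA-style summation).

    regions is a list of (start, stop) tuples, possibly overlapping
    because they come from different threads.
    """
    total = 0.0
    last_end = float("-inf")
    for start, stop in sorted(regions):
        if stop <= last_end:
            continue                      # fully covered by an earlier region
        total += stop - max(start, last_end)
        last_end = stop
    return total

def plain_sum(regions):
    """Naive per-thread sum of all region durations."""
    return sum(stop - start for start, stop in regions)

# Two regions running concurrently on two threads:
regions = [(0.0, 3.0), (1.0, 4.0)]
print(plain_sum(regions))       # 6.0 -> naive duration sum
print(union_duration(regions))  # 4.0 -> what WPA reports
```

For non-overlapping regions both functions agree; the difference only shows up with concurrency, which is why it matters much more for JIT regions than for (mostly blocking) GC regions.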

This detail is not so relevant for GC, but it will be very important when we look at JIT statistics. Since a foreground GC blocks your complete application, there is little concurrency going on. We can also visualize when GCs of which type are happening. The view Generic Events GC Generations will show you how many GCs of which type were happening. Initially it is collapsed, so you have to open it. This view takes a little while to build up, since it needs to parse all ETW events for GC Start events, which contain as data point the generation number that is visualized here.

image

This is the greatest thing about WPA of Windows 10: now you can visualize the data of your custom traces. With a little WPA graph configuration you can create key graphs for your own application which show e.g. frames per second, # of db transactions, … Who needs performance counters when you can visualize your trace data much better now?

Marker Events

A key aid for knowing where to start looking for an issue are marks. In WPRUI you can press Ctrl+Win+x to open a window which allows you to write a marker event to your trace file while you are recording data. Unfortunately this functionality is not exposed to the outside world to set marks programmatically. If you revert to xperf you can set marks programmatically with xperf -m if you wish. But since marks are so useful for navigating in an ETL file, I really wonder why the method which xperf calls is not documented at all. If you want to know how real power users are using WPA then you need to call wpa /?

image

I guess at Microsoft they let their tests run with profiling enabled while saving screenshots as ETW events. If something happens, the ETL file is downloaded from a central repository and opened with a test specific profile, then zoomed into the relevant test part, which is identified by markers or regions from a region file. At least some people at Microsoft use this tool so often that it makes sense to automate it even further, since the controls of WPA are UI automatable and you can script nearly your complete analysis. Now THAT is what I call good performance regression testing.

A little reverse engineering of which APIs xperf -m calls finally turned up the method EtwSetMark, which is exported by ntdll.dll. I am sure the readers of my blog can figure out the rest as well. But I really wish it would become an officially supported API, since it is so useful. Sure, you can define your own marker events as well, but since the support in WPA is already built in it would really help. It would also be nice if xperf would emit the mark event not only to the "NT Kernel Logger" ETW session but to all active kernel sessions, so you could also mark the WPA kernel session, which is currently not possible.

 

JIT View

PerfView can also show you how much time each method took to compile. This is useful if you are compiling too much code on the fly, e.g. for your WCF proxies or your serializers. Sure, it is done only once, but if this time plays a significant part in your application startup you should rethink how your application works. Since .NET 4.5 will NGen all assemblies on the fly if they are used often enough, you do not need to consider using NGen explicitly for your application. But if you have much dynamic code generation going on you can still suffer from long JIT times.

 

image

You can get a similar view by switching from Garbage Collection to JIT Time per Thread

image

As I have shown you in the GC section, the summation of regions is not a simple sum of the durations of all JIT events. The JIT compiler can compile code in many different threads concurrently. The Duration sum over all threads of JIT time reported by WPA is therefore largely irrelevant if you want to compare two different runs of your application. Instead you need to look at the JIT times of each thread. You can copy the duration column with the context menu "Copy Column Selection" into Excel,

image

image

which gives us the exact same JIT time as PerfView. Now I do not need PerfView for JIT analysis anymore. I am more and more moving away from programming towards configuring WPA to give me just the view I need.

If you expand JITs you get at least the namespace of each JIT event. This is the best I could come up with, since WPA does not support concatenating strings of different fields into a region name.

image

But you can still open the also customized view _Activity by Provider Task Process to view the "raw" data if you need more information. If you drop e.g. Field 5 to the left of the yellow bar, you get a nice summary of how many methods in this namespace were compiled.

image

As you can guess, there are lots more goodies inside the now published profiles to make WPA really usable for non-kernel developers. I have streamlined nearly every context menu and removed all useless or kernel-only columns. You now have a much easier time concentrating on the important things. With this profile WPA is nearly a new application and a central analysis station for managed and unmanaged performance analysis.

In the next posts I will walk you through the problems of PerformanceIssuesGenerator and how you can visualize them in WPA efficiently.

Make WPA Simple - Symbol Loading Is Not Slow


Originally posted on: http://bobgoedkoop.nl/akraus1/archive/2015/08/17/166298.aspx

When you open an etl file you normally want to see which methods were executed most often. To decode the recorded stack pointers WPA needs your symbols. By default WPA adds the Microsoft symbol server to the list. If you click on Load Symbols

image

then you will experience some slowness, which is very nicely explained by Sasha here: http://blogs.microsoft.co.il/sasha/2014/12/02/diagnosing-native-memory-leaks-etw-wpa/. It is ok to fetch all relevant symbols from Microsoft the first time, but afterwards make sure you disable the Microsoft server in the symbol search list!

image

The symbol download folder is by default C:\Symbols. The directory after the srv* is the pdb download folder, and after the second * the symbol server address is appended.

If you wonder why you need a SymCache folder:

image

The pdb format generated by the compilers is so arcane, bloated and parser unfriendly that the people doing the pdb parsing within dbghelp.dll (part of Windbg) decided to convert the bloated pdb format into something much less bloated and much easier to parse. The SymCache folder therefore contains not the pdbs but the converted pdb files.

Here is the comparison:

image

And here the original pdb

image

Now you know why the SymCache folder is a good thing. You can safely delete it if it gets too big; as long as you still have the original pdbs around, nothing is lost. The vigilant reader has certainly noticed that I have added a path to C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache. This is the path where WPR and xperf store the generated managed pdbs while saving an ETL file to disk. By default WPR will create, besides your xxx.etl file, an extra folder named xxx.etl.NGENPDB which contains all managed pdbs from NGenned binaries. Since WPA of Windows 10 does not follow its own convention to expand the symbol search path of the loaded etl file with its accompanying pdb directory, I go directly to the source where WPR copies the generated pdbs from.

If you follow this advice and disable the MS symbol servers, you will load the pdbs not within the next 20 minutes but in 3.6 s for my sample ETL file. Kudos to the Symbol Cache and its inventors. While speaking of performance: PerfView has a much better (read: faster) managed pdb generation strategy. WPA and xperf will call ngen.exe createpdb xxxx.ni.dll for each and every native image in the trace recording. This is not necessary, since you can read the pdb age and guid directly from the dll and then check in the pdb cache folder whether the pdb already exists. There is no need to call ngen.exe all the time! Finally, it is not necessary to copy the generated pdbs to the xxx.etl.NGENPDB directory. My own trace collection tool simply creates a directory link to the NGenPdbs folder, and I have all pdbs I could need without any file copies.

If you find that the default paths of WPA, xperf, … on your crowded C drive leave no room for pdbs, you can configure WPA to use a different storage folder. Or you leave everything as is and create directory links with mklink /D from your C drive to a drive which has enough space. I prefer the latter approach since, in my experience, tools tend to rely on the defaults and at some point they will come back to them (e.g. Windbg …).

Why Is The First Start Of An Application Slow?


Originally posted on: http://bobgoedkoop.nl/akraus1/archive/2015/08/25/166493.aspx

All of us have experienced it, but very few of us have gone into the details to really find the reasons for it. First of all, why should I care? After the first time, the application starts much faster. Well, you should care if you want to give your customer not only a good second startup time but also a decent first startup experience. To do that you need to do repeated tests. The first startup is commonly referred to as cold startup, and all subsequent runs are warm startup scenarios. To test a cold startup scenario you would need to reboot your computer every time and then start your application. Not very practical, and I have never seen anyone go down that road. The difference between warm and cold startup time is that the first time the dlls and data of the application need to be loaded from disc, which can take quite some time. All subsequent runs can read the data not from disk but from the in-memory file system cache which is managed by the operating system.

What Is The File System Cache?

Windows uses the memory of your system in some clever way. All memory that is used by your applications is shown in the Task Manager. The rest of the memory is free and not used by anything. Really? Of course not. The free memory contains the data of previously exited applications which was in use some time ago. This "free" memory is called the Standby List which is where the file system cache of Windows is really living. It is therefore not a good idea to max out your memory usage because if you do you are flushing the file system cache which will result in a slow system. This cache is the reason why your second and all further startup times are much better. No disc accesses at all compared to the first run if all goes well.
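If you want to keep an eye on the Standby List size during your tests you can read it from the Memory performance counters. Below is a sketch which assumes the standby cache counters are present (they are on Windows 7 and later):

```csharp
using System;
using System.Diagnostics;

class StandbyListWatcher
{
    static void Main()
    {
        // The Standby List is split into several priority buckets. These three
        // counters together approximate the size of the file system cache.
        string[] counters =
        {
            "Standby Cache Core Bytes",
            "Standby Cache Normal Priority Bytes",
            "Standby Cache Reserve Bytes"
        };

        foreach (var name in counters)
        {
            using (var counter = new PerformanceCounter("Memory", name))
            {
                Console.WriteLine("{0}: {1:N0} MB", name, counter.NextValue() / (1024 * 1024));
            }
        }
    }
}
```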

How can you flush it?

If you want to measure cold startup times you need to clear the file system cache somehow. The easiest way is to use RAMMap from SysInternals. It has an option to flush the Standby List.

image

That is nice but unfortunately there is no command line tool available to do this in a test script to create repeatable tests. But not all is lost. Process Hacker is a better clone of Process Explorer which comes with full source code. There you can find the undocumented Windows API calls to flush the OS file system cache directly. It is basically

SYSTEM_MEMORY_LIST_COMMAND command = MemoryPurgeStandbyList;
// Must run elevated: purging the standby list needs the SeProfileSingleProcessPrivilege.
NtSetSystemInformation(SystemMemoryListInformation, &command, sizeof(SYSTEM_MEMORY_LIST_COMMAND));

and you are done. There is also another trick I want to share: You want to clear the file system cache for a specific file? Windows will flush the cached data of a file if it was opened with the FILE_FLAG_NO_BUFFERING flag. Here is a little C# class that employs this trick:

class FileSystemTools
{
    const int FILE_FLAG_NO_BUFFERING = 0x20000000;
    const FileOptions Unbuffered = (FileOptions)FILE_FLAG_NO_BUFFERING;

    /// <summary>
    /// Flush the file system cache for this file. This ensures that all subsequent operations on the file really cause disc
    /// access and you can measure the real disc access time.
    /// </summary>
    /// <param name="file">full path to file.</param>
    public static void FlushFSCacheForFile(string file)
    {
        using (FileStream fileStream = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 4096, FileOptions.None | Unbuffered))
        {
        }
    }
}
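A possible way to use this helper is to compare a warm and a truly cold read of the same file. The file name below is only an example, and the usual usings (System, System.IO, System.Diagnostics) are assumed:

```csharp
string file = @"C:\Windows\System32\notepad.exe"; // any large enough file will do

var sw = Stopwatch.StartNew();
File.ReadAllBytes(file);
Console.WriteLine("Warm read: {0} ms", sw.ElapsedMilliseconds);

FileSystemTools.FlushFSCacheForFile(file); // evict the cached pages of this file

sw.Restart();
File.ReadAllBytes(file);
Console.WriteLine("Cold read: {0} ms", sw.ElapsedMilliseconds);
```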

Hidden Dragons

If the first test run is slow and all others are fast we know that the file system cache is to blame. If you see a pattern like this

image

you know that the first run was cold and all others were warm. If we now flush the file system cache we should get rock solid cold times with no other influencers except virus scanner, indexing service and other pesky applications making our test unpredictable. If we measure again - this time with flushing the file system cache - we get this:

image

 

This is not warm and not cold. What is it then? The answer is that it was mostly a warm startup because another instance of the application was still running. If a process is alive and its working set was not trimmed then the dll data is NOT in the cache since it still belongs to the working set of a running process. You therefore need to make sure that you close all applications which keep your dlls in memory (e.g. Visual Studio with its designers …) to test a real cold startup scenario. There are other pitfalls as well. If you kill a process without waiting until it has terminated you can get random good runs because a process kill is an asynchronous operation. You have to wait on the process handle until it becomes signaled to be sure that it is really gone. If you are using .NET you need to call Process.WaitForExit() to be sure.
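In .NET that boils down to a few lines. The process name is of course only a placeholder:

```csharp
using System.Diagnostics;

static void KillAllInstancesAndWait(string processName) // e.g. "PerformanceIssuesGenerator"
{
    foreach (Process proc in Process.GetProcessesByName(processName))
    {
        proc.Kill();        // asynchronous! The process is not gone yet.
        proc.WaitForExit(); // block until the process handle becomes signaled
        proc.Dispose();
    }
}
```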

After we have fixed that we get this graph:

image

What the hell is going on here? This was quite hard to find out since there is very little information on the net about how to reliably test cold startup scenarios. The hidden thing is Superfetch.

Doesn´t ring a bell? Besides speeding up the boot it also tries to prefetch often used application dlls into memory before the application is started, based on intelligent heuristics. This gets in our way if we flush the file system cache. After that the Superfetch system service named SysMain, which is hosted in a generic svchost.exe process, tries to prefetch often used dlls at low IO priority. This additional IO collides with our test where we want to load our dlls as well. To get consistent IO numbers it is therefore best to stop the Superfetch service.

sc stop SysMain
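If you prefer to do this from your test code instead of calling sc.exe you can use the ServiceController class (needs a reference to System.ServiceProcess.dll and administrative rights):

```csharp
using System;
using System.ServiceProcess;

static void StopSuperfetch()
{
    using (var sysMain = new ServiceController("SysMain"))
    {
        if (sysMain.Status == ServiceControllerStatus.Running)
        {
            sysMain.Stop();
            sysMain.WaitForStatus(ServiceControllerStatus.Stopped, TimeSpan.FromSeconds(30));
        }
    }
}
```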

After doing that the graph still looks the same. What is interesting is that if our application reads about 30MB more data it starts nearly 10s faster. This is a mystery which only xperf can solve. Here is a Disc IO graph of a slow 23s run

image

And here is a fast 14s run.

image

We read more data but the Disc Service Time is less! The average Disc Queue Depth is very high which should indicate an overwhelmed disc. What stands out is that we have about 70% less disc read operations on the assembly folder where the native images of .NET applications are stored. Sometimes we read the disc in a slow and sometimes in a fast way. To answer the question why we read from the disc in such different ways we need to check out the Disc IO Init call stacks. These will tell us who did initiate the read operations and, most importantly, when. First comes the slow run

image

It is no surprise that our slow cold startup causes many hard page faults which triggers the kernel to read the missing pages from disc. This is known to be slow. We can do little about it except to execute less code during startup. The graph shows a nice scattered pattern across the complete startup time. Now lets have a look at the fast run

image

That looks radically different. We load the same dlls but we have no hard faults? All dll data is read before the application main thread starts running. In our process the Prefetcher reads its prefetch database which is 1,3MB in size (see PfSnPrefetchMetaData in 687ms) which allows it to read the dll data (PfSnPrefetchSections, 151MB in 3,1s) mostly with sequential reads in about half of the time. That is the real reason why we have so wildly varying startup times. The prefetcher will look at ca. the first 12s of application startup and will then update the prefetch database for your executable. Next time the startup is a little better and it will update it again causing it to prefetch a little more and so on. This is a nice self tuning system which not many people are aware of. From time to time the prefetcher seems to do nothing to check (that is my guess) if the prefetch database is helping at all. That explains the mostly good and sometimes very slow numbers where no prefetching is active.

If we aim at stable repeatable cold tests we need to disable the Superfetch service because this process is responsible for writing and updating the prefetch database. The prefetch database file is located at

C:\Windows\Prefetch\ExeName-FileHashPrefetchFlag.pf

We need to delete that file as well and now we have real control over our IO during application startup. If you want to calculate the file name of your application prefetch database you need to take the full path of your executable and run it through a hash function which is described at http://forensicswiki.org/wiki/Prefetch. There are still some sites out there which tell you to delete the prefetch folder to prevent caching of dlls to make your system faster. If these guys had really measured their startup time in a controlled and systematic manner they would not have written such misleading guidelines. As you can see you need to take care of a lot of details to really understand your system. But it is worth it. If you have looked closely at the .pf file name there is a PrefetchFlag at the end. That is a number which you can pass to generic container processes to generate different prefetch databases for the same executable. This can be useful if you have generic container processes with wildly varying dll sets inside them. How do you tell the OS loader that you want to use a different prefetch database to speed up cold startup times?

Simple: Pass it as command line parameter! If you create e.g. an executable with the name prefetchTest.exe and pass /prefetch:dd as command line argument to it you get different prefetch files for each scenario:

prefetchTest.exe
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B1.pf 
prefetchTest.exe /prefetch:1
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B2.pf
prefetchTest.exe /prefetch:2
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B3.pf
That is a nice find. Perhaps you can make your own customers happier by including a cold startup test in your test suite to make the pain public. Most test reports I have seen include only warm numbers because it is quite difficult to create repeatable and stable cold startup tests. Here are my tips as short bullets for repeatable cold startup measurements:

Test Machine Preparation

  • Stop/Disable the Superfetch (SysMain) service.
  • Disable Windows Search (WSearch) service.
  • Disable Virus scanners.
  • Disable scheduled tasks and other system management software.
  • Delete the prefetch database for your executable.

Test Execution

  • Kill all applications which keep your dlls loaded.
  • Flush the file system cache before the test.

When you follow these tips you will easily get repeatable and stable cold startup numbers. If for some reason much more code is executed during startup which causes many page faults you can easily spot it. The warm startup time might not be affected at all because the actual run time of the additional code might be low but the new disc access pattern could be pricey in terms of cold startup time.
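The test execution steps can be put together into a small measurement loop. WaitForInputIdle is only a rough approximation for "startup done" and works for GUI applications only, so treat this as a sketch:

```csharp
using System;
using System.Diagnostics;

static TimeSpan MeasureStartup(string exe)
{
    var sw = Stopwatch.StartNew();
    using (Process p = Process.Start(exe))
    {
        p.WaitForInputIdle(); // returns when the UI message loop is first idle
        sw.Stop();
        p.Kill();
        p.WaitForExit();      // a kill is asynchronous, wait until it is really gone
    }
    return sw.Elapsed;
}
```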

Making WPA Simple - CPU/Disk/Memory and Wait Analysis


Originally posted on: http://bobgoedkoop.nl/akraus1/archive/2015/09/26/167117.aspx

Last time I have shown how to have a look at GC issues with my simplified WPA profile. But there is much more in this profile which will make it a lot easier to analyze standard issues. In general when you are dealing with performance you need to search for

  • CPU
  • Disk
  • Wait
  • Memory

issues. A special case is network which falls into the wait issue category where we wait for some external resource.

When you download my simplified profile and apply it to the provided sample ETL file you can analyze any of the above issues within much less time. Here is a screen shot of the default tab you will see when you open the ETL file.

Stack Tags are Important

image

The first and most important graph is CPU Usage Sampled with "Utilization by Process And StackTags" which is a customized view. It is usable for C++ as well as for .NET applications. If you ever did wonder what stack tags are good for you can see it for yourself. I have added a stack tag named Regular Expression which is set for all calls to the .NET Regex class like this:

<TagName=".NET"><TagName="Regular Expression"><EntrypointModule="System.ni.dll"Method="*System.Text.RegularExpressions.Regex*"/><EntrypointModule="System.dll"Method="*System.Text.RegularExpressions.Regex*"/></Tag></Tag>

If more than one tag matches, the tag which matches the deepest method of the call stack is used. This is the reason why the default stack tag file is pretty much useless. If you add tags for your application they will never match because the low level tags will match long before the tags in your own code could ever match. You have to uncomment all predefined stuff in it. If you use my stack tag file you need to remove the WPA provided default.stacktags under Trace - Trace Properties - Stack Tags Definitions. In practice I overwrite the default file to get rid of it. If you leave it you will get e.g. very cheap CPU times for a GC heavy application because all call stacks of your application which trigger GC stuff are added to the GC stack tag. This makes your application specific stack tags look much slimmer than they really are because your application methods come earlier in the call stack and are therefore not used to tag your stacks.

Why would I need a custom stack tag file? It makes it much easier to tag expensive high level operations of your application so you can see how much time you were spending with e.g. loading game data, decompressing things, rendering, … This makes it easy to detect patterns of what your application was doing at a high level. Besides that, if you find a CPU bottleneck in your application you can add it under e.g. a "Problems" node so you can document already known issues which are then easy to spot.
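Such a "Problems" node could look like this in your stack tag file. The module and method names are of course made up and need to be replaced with your own ones:

```xml
<Tag Name="Problems">
  <Tag Name="Slow Game Data Loading">
    <Entrypoint Module="MyGame.exe" Method="*LevelLoader::LoadGameData*"/>
  </Tag>
  <Tag Name="Expensive Decompression">
    <Entrypoint Module="MyGame.exe" Method="*Decompressor::Inflate*"/>
  </Tag>
</Tag>
```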

For our problematic application PerformanceIssuesGenerator.exe we see that it was doing Regex stuff for 17,5s of CPU time (Weight in View is ms in time units). To see how long the actual running time was we need to add the Thread ID column since currently we sum up the CPU time of all threads which is not the actual clock time we spent waiting for completion.

image

 

The context menu is actually customized. It is much shorter and contains the most relevant columns I find useful. If you want more of the old columns then you can simply drag and drop columns from the View Editor menu which is part of all WPA tables. If you want to remove additional columns you can also drag and drop columns back to the left again. This way you can streamline all of your column selection context menus which is especially useful for the CPU Usage Precise context menu which is huge.

Select A Time Range to Analyze

Now we see that we have two large Regex CPU consumers with a large time gap in between. But what was the application actually doing? This is where marker events from your own application come in handy so you know what high level operation e.g. the user did trigger and how long it did take. This can be achieved with a custom ETW event provider or the special ETW marker events which WPA can display in the Marks graph if any of them are present in your ETL file. To be able to use them to navigate in the ETL file your application must write them at interesting high level time points which indicate for example the start and stop of a user initiated action. For .NET applications the EventSource class is perfectly suited for this task. Marker events can be written with a custom PInvoke call to the undocumented EtwSetMark function.
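A minimal EventSource could look like the sketch below. The provider and event names are made up; you only need to record this provider during your trace session to see the events in WPA:

```csharp
using System.Diagnostics.Tracing; // available since .NET 4.5

[EventSource(Name = "MyCompany-MyApp-Markers")]
sealed class AppEvents : EventSource
{
    public static readonly AppEvents Log = new AppEvents();

    // Bracket the interesting high level operations with start/stop events.
    [Event(1)]
    public void RegexProcessingStart() { WriteEvent(1); }

    [Event(2)]
    public void RegexProcessingStop() { WriteEvent(2); }
}
```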

Here is the code to write a marker event in C# which shows up in all kernel sessions. The TraceSession.GetKernelSessionHandles is adapted from the TraceEvent library. If you have "NT Kernel Logger" sessions only (e.g. if you use xperf) then you can use 0 as session handle to write to it. 

/// <summary>
/// Write to Windows Event Tracing a Marker string which does show up in the Marker tab when
/// you view the trace. This makes navigation in ETL files much easier.
/// </summary>
public static class EtwMarker
{
    #region Fields
    const int FixedBufferSize = 512;

    [StructLayout(LayoutKind.Sequential, Pack = 1, CharSet = CharSet.Ansi)]
    unsafe struct ETWString
    {
        public Int32 Prefix;
        public unsafe fixed byte String[FixedBufferSize];
    }
    #endregion

    /// <summary>Write a marker string to ETW to enable easier navigation in ETL files in the viewer.</summary>
    /// <param name="fmt">Format string to write</param>
    /// <param name="args">Format arguments</param>
    /// <exception cref="Win32Exception">When string could not be written.</exception>
    public static unsafe void WriteMark(string fmt, params object[] args)
    {
        String str = String.Format(fmt, args);
        ETWString etwStr = new ETWString();
        int len = CreateFromString(str, ref etwStr);
        var sessionHandles = TraceSession.GetKernelSessionHandles();
        // Write to all enabled kernel sessions so this marker shows up everywhere.
        foreach (var handle in sessionHandles)
        {
            int ret = EtwSetMark(handle, &etwStr, len);
            if (ret != 0)
            {
                throw new Win32Exception(ret);
            }
        }
    }

    /// <summary>
    /// Create an ETWString from a .NET string. The struct is fixed size.
    /// Only the first 511 bytes are copied from the ASCII string.
    /// </summary>
    /// <param name="str">String to convert</param>
    /// <param name="data">Reference to ETWString struct</param>
    /// <returns>Length of ETWString</returns>
    static unsafe int CreateFromString(string str, ref ETWString data)
    {
        var bytes = Encoding.ASCII.GetBytes(str);
        int copyLen = bytes.Length < FixedBufferSize ? bytes.Length : FixedBufferSize - 1;
        fixed (void* p = &data)
        {
            IntPtr dest = IntPtr.Add(new IntPtr(p), 4); // pointer to String buffer
            // copy at most Length-1 bytes of string to keep the terminating null char
            Marshal.Copy(bytes, 0, dest, copyLen);
            int length = 4 + copyLen + 1; // Prefix DWORD + string length + closing \0 char
            return length;
        }
    }

    /// <summary>
    /// Undocumented method of the kernel to write an ETW marker which does show
    /// up in the WPA viewer under the Marks tab. This makes navigation in trace files
    /// much easier to find important time marks.
    /// This is the same as xperf -m "string" which writes only to the "NT Kernel Logger" session.
    /// </summary>
    /// <param name="sessionHandle">if 0 the NT Kernel Logger Session is written to. Otherwise you need to supply a handle to a kernel ETW session.</param>
    /// <param name="pStr">Ascii encoded string with a DWORD prefix</param>
    /// <param name="length">Length of complete string including the prefix and terminating \0 char</param>
    /// <returns>0 on success, otherwise a WIN32 error code</returns>
    [DllImport("ntdll.dll")]
    extern unsafe static int EtwSetMark(ulong sessionHandle, ETWString* pStr, int length);
}
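Writing a mark then is a one liner which you sprinkle over the interesting high level operations of your application. DoRegexProcessing stands here for whatever operation you want to see bracketed in WPA:

```csharp
EtwMarker.WriteMark("RegexProcessing_Start");
DoRegexProcessing(); // the expensive operation under test
EtwMarker.WriteMark("RegexProcessing_Stop");
```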

Now that we have our marks we can use them to navigate to key time points in our trace session:

image

We see that the first CPU spike comes from RegexProcessing which did take 3,1s. The second regex block was active between the Hang_Start/Stop events which did take 2,7s. This looks like we have some real problems in our PerformanceIssueGenerator code. Since we have according to our ETW marks regex processing, many small objects, many large objects and a hang with simultaneous regex processing we need to select one problem after the other so we can look at each issue in isolation. That is the power which custom ETW providers or ETW Marks can give you. Normally you are lost if you know that you have several issues to follow up. But with application specific context marker events you can navigate to the first regex processing issue. To do that select the first event and then hold down the Ctrl key while clicking the stop event to multi select events. Then you can right click on the graph to zoom into the region defined by the first and last event.

image

Analyze CPU Issue

When you now look at the CPU consumption of the busiest threads we find 2,7s of CPU time. At the bottom WPA displays the selected duration which is 3,158s which matches quite well the reported timing of 3,178s. But the reported thread time of 2,7s is not quite the observed duration. In the graph you see some drops of the CPU graph which indicates that for some short time the thread was not active possibly waiting for something else.

image

Wait Chain Analysis

That calls for a Wait Chain Analysis. If you scroll down you will find a second CPU graph with the name CPU Usage (Precise) Waits. This customized graph is perfectly suited to find not only how much CPU was consumed but also how long any thread was waiting for something. Please note that this graph does not replace the CPU Usage Sampled graph. I have explained the difference between both CPU graphs already earlier. The column selection context menu of this graph has been massively thinned out to keep only the most relevant columns. Otherwise you would have to choose from over 50 items and the context menu even has scroll bars! Now we have only process, thread id, thread stack as groupings. Next comes a list of fixed columns which are always visible because they are so important. There we see how long each thread did wait for something (WaitForSingleObject …) as total and maximum wait time and the CPU usage in milliseconds. If we sum up for the most expensive thread the wait time of 0,385s and 2,726s of CPU time we get 3,111s which is within a small error margin exactly the time we did get by measuring the start and stop of our regex processing operation.

image

Is this conclusion correct? Not so fast: since a thread can only be running or waiting (it can also wait in the ready queue which is only relevant if you have more threads running than cores) the sum of CPU and wait times for each thread will always add up to 3,1s because this is the time range you did zoom into. To actually prove that this time was really spent waiting while we were regex processing we have to sort by the wait column and then drill down for the thread 10764.

image

When we do this we see that all waiting did occur while the DoRegexProcessing delegate was called. It was waiting on the regular expression JIT compilation to complete. Now we have proven that the wait time is really spent while executing the regex processing stuff. If we would like to optimize the total running time we have two options: Either we use more threads to parallelize even further or we tune our regular expression or replace it by something else. Before going down that route you should always check if this string processing is necessary at all. Perhaps strings are not the best data structure and you should think about your data structures. If you need to use strings you still should verify that the regex processing was really necessary at this point in time. Perhaps you do not need the results of this regex processing stuff right now.
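The observed JIT compilation wait is the price of compiled regular expressions. Whether that price is worth paying is a trade off you can make explicit in your code (the pattern below is just an example):

```csharp
using System.Text.RegularExpressions;

// Compiled: high one time cost (the compilation we were waiting on above),
// but faster matching afterwards. Without the flag the regex is interpreted:
// cheap to create, slower to run.
static readonly Regex NumberParser = new Regex(@"\d+", RegexOptions.Compiled);
```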

A Garbage Collection Issue?

In the list of marker events we see that the first regex issue overlaps with a GenerateManySmallObjects operation. Lets zoom into that one and check out what we see under CPU usage. There we see that we are consuming a significant amount of CPU in the Other stack tag which categorizes unnamed stacks into its own node. If we drill into it we find the allocating method PerformanceIssueGenerator.exe!PerformanceIssueGenerator.IssueGenerator::GenerateManySmallObjects. That is interesting. Was it called on more than one thread?

image

To answer that question it is beneficial to select the method in question and open from the context menu View Callers - By Function

image

This will let you drill up to all callers of that method which is useful to find the total cost of a commonly used method (e.g. garbage_collect …). This is still the total sum for all threads. Now we need to bring back our Thread IDs to see on which threads this method was called.

image

If the Thread ID column has more than one node beneath it, it will be expandable like I have shown above. This proves that only one thread was calling this method and it did use 1,3s of CPU time. But there are still 0,4s of wait time missing. Where was the rest spent? We could use the Wait graph again to see where the wait operation was hanging around, but since we know that it was allocating things we can open the Region for Garbage Collections

image

And we just have found that during that time the garbage collector did block our application for 0,4s which explains the missing time in our CPU graph. We also know that we had 75 GCs during that time which is a rather high number. We can even see how many GCs of which generation we had during that time by using my (also hand rolled) Generic Events GC Generations graph:

image

That's all about the small object allocation issue. Now lets go back by doing "Undo Zoom" in the graph to get our previous time range back where we can do the same analysis for the Generate Many Large Objects issue. This is not terribly interesting, so I leave it to you as an exercise.

UI Hang (2,727s) Analysis

Now lets investigate why our UI did hang. The hang operation has nice markers which we can use to zoom into the region of interest.

image

We are having a lot of CPU intensive threads here. If the UI was hung it must either have used much CPU on the UI thread or the UI thread was blocked for some reason. If you know that the UI thread was started in your Main method you can search for CorExeMain which is the CLR method which calls into your main method. Or you search for a typical window method like user32.dll!DispatchMessageWorker. With that we quickly find thread 11972 as the thread which was hung in:

image

Obviously it was waiting for a task to complete. If you call Task.Wait on your UI thread you will block further UI message processing and your UI will not be able to redraw anymore. At least that was the visible observation. The hang did take 2,727s which exactly matches up with the total summation of all wait operations of 2,724s. If that were one single wait we should see the same time as maximum wait time but there we have only 1,383s. When using WPA you can look deep into the inner workings of Windows and .NET. Lets be a little curious and check out why there was not one wait operation but 30 waits while we were blocked in Task.Wait. We observe that Task.Wait calls into the current Dispatcher to delegate the wait to its SynchronizationContext. This in turn calls into the current CLR thread to DoAppropriateWait which in turn calls on the UI thread MsgWaitForMultipleObjectsEx. This method can block a UI thread but let e.g. mouse events or COM messages through. Depending on which parameters you call it with it can even pump your UI thread with messages which is sometimes necessary. This can lead to unintended side effects where you execute code while you are waiting for a blocking operation. I have found such issues in Can You Execute Code While Waiting For A Lock? and Unexpected Window Message Pumping Part 2.
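Conceptually the hang looked like the first handler below; the async version keeps the message loop running. This is only a sketch with made up method names, not the actual code of the sample application:

```csharp
// Blocks the dispatcher: no redraws until the task is finished.
private void OnStart_Click(object sender, RoutedEventArgs e)
{
    Task t = Task.Run(() => DoExpensiveRegexWork()); // hypothetical worker method
    t.Wait(); // waits via MsgWaitForMultipleObjectsEx as seen in the trace
}

// Keeps the UI responsive: control returns to the message loop at the await.
private async void OnStartAsync_Click(object sender, RoutedEventArgs e)
{
    await Task.Run(() => DoExpensiveRegexWork());
}
```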

Digging Deeper

We know that MsgWaitForMultipleObjects can pump messages with the right flags. So what was happening in these 30 wait operations? If you open the column context menu you can add the Ready Thread Stack which gives you all call stacks which did cause MsgWaitForMultipleObjects or its child methods to wake up.

image

Poor other tools. They will never be able to tell you that MsgWaitForMultipleObjects was woken up 28 times by dwm.exe which is the Desktop Window Manager to tell our wait method: Here you have 28 mouse events which you can process or not depending on how your wait flags were configured. The event which did finally end our Task.Wait operation came from the last thread that did finish processing. This is the call where our wait operation did wake up our blocked message loop.

 

image

You can ask: How do you know that? Well the Switch-In Time column of our event in the table which is to the right of the yellow bar with 14,425 tells me this. This is where our wait operation did end and our thread was woken up again. This is also the time where the CPU consumption has a sharp drop which is no coincidence. If you have sharp eyes you will notice the blue line in the graph at the right. This is when our thread was woken up. Every thread wake up is displayed in the graph which makes it much easier to spot regular patterns of activity. It is this wealth of information which makes WPA so powerful. But the default profile is mainly aimed at kernel developers which is a pity because it is an excellent system wide profiling tool which gives you deep insights into how Windows really works and sometimes how Windows or your application breaks.

If you have counted you will still miss one ready event. This one was Visual Studio which did send us a window message from vslog.dll!VSResponsiveness::Detours::DetourPeekMessageW. I am not sure why Visual Studio does this but if you profile your system more often you will notice that the Visual Studio responsiveness stuff gets from time to time into a loop and all Visual Studio instances will send each other messages (still the case with VS2015). This does not show up as a surge of CPU consumption. But it will increase your ETL file size considerably because you will get millions of context switches. This is not a big problem except for battery powered devices which have a higher power consumption than they need to have and for people profiling their machines who get huge files.

 

.NET Exception Analysis

The previous analysis was a bit complex. Now for something nice and simple. You can easily check if your .NET application throws exceptions if you enable the .NET provider. With WPRUI you need to check ".NET Activity" to get these events and many others. For a short list of interesting .NET events you can check out the ETW manifest of .NET for yourself or you read Which Keywords Has An ETW Provider Enabled? which covers the most interesting .NET events as well. Zoom into the Do Failing Things region which is visible as ETW Marker event in the Marks graph. Then you need to add a Generic Events graph and change the graph type from "Activity by Provider, Task, Opcode" to .NET Exceptions which is only part of my WPA profile.

image

This gives you a nice view of all thrown exceptions in all .NET applications. You can play with the groupings and group e.g. by exception type and message to see how many different ones you have. To do that you only need to Drag the "Exception Message (Field 2)" and drop it to the right of ExceptionType. Now you have Process - Exception Type - ExceptionMessage in a nice cohesive view. You just have to remember that all columns to the left of the yellow bar are the columns which are grouped together by their values.

Starting with .NET 4.6 you can also use the .NET Exceptions Caught  view which shows you all thrown exceptions and in which method these exceptions were caught.

image

This can be interesting for automated regression tests if something breaks because an exception which was always happening is now caught in a different method which might be too late in the call chain. Having an exception is nice but the most interesting part of it is its call stack. From this graph we know the thread on which it was happening. We do not need the process id because each thread id is unique across all processes. Then you can zoom to a specific exception and investigate in CPU Usage (Precise or Sampled) the call stacks if something related to exception handling shows up. Searching for the method KernelBase.dll!RaiseException is usually a good way to find it. If that did not help you need to enable stack walking for the .NET provider. I blogged about how to do that a long time ago for a custom provider with xperf in Semantic Tracing for .NET 4.0. If you did record the provider "Microsoft-Windows-DotNETRuntime" with stack walking enabled you can add in the View Editor from the column context menu

image

the Stack column to display the call stack of all thrown exceptions. An alternative is to enable the StackKeyword+ExceptionKeyword of the "Microsoft-Windows-DotNETRuntime" ETW provider. This will cause .NET to write an additional managed stackwalk event for every .NET event. This event is not human readable but PerfView can decode it. This is one of the things where PerfView is still the best tool to investigate managed application issues.

Conclusions

Although this is already a rather long post you still have only seen a small fraction of what you can do with ETW profiling and the WPA tool. I hope that I have convinced a few more people out there to try it out because it can help you to find the problems you never knew that they existed. But these problems might be ones your customers are always complaining about.

Can You Execute Code While Waiting For A Lock?


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/10/07/159558.aspx

Threading can sometimes lead to impossible bugs where you have a hard time finding out how you got into this state. Lets suppose you have a nice WPF application which you dutifully test via the UIAutomation provider which can alter some state in your application.

/// <summary>
/// Interaction logic for TestWindow.xaml
/// </summary>
public partial class TestWindow : Window
{
    int Data;

    public TestWindow()
    {
        InitializeComponent();
    }

    public void AlterData(int newValue)
    {
        Debug.Print("Executing AlterData {0}", Environment.StackTrace);
        Data = newValue;
    }

    protected override AutomationPeer OnCreateAutomationPeer()
    {
        var lret = new OwnPeer(this);
        return lret;
    }

    class OwnPeer : WindowAutomationPeer
    {
        public OwnPeer(TestWindow owner)
            : base(owner)
        { }

        protected override HostedWindowWrapper GetHostRawElementProviderCore()
        {
            Debug.Print("GetHostRawElementProviderCore called");
            ((TestWindow)base.Owner).AlterData(500);
            return base.GetHostRawElementProviderCore();
        }
    }
}

When you look at the call stacks all is normal:

Executing AlterData    at System.Environment.GetStackTrace(Exception e, Boolean needFileInfo)
   at System.Environment.get_StackTrace()   at ThreadingIsHard.TestWindow.AlterData(Int32 newValue) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 52
   at ThreadingIsHard.TestWindow.OwnPeer.GetHostRawElementProviderCore() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 69
   at MS.Internal.Automation.ElementProxy.InContextGetHostRawElementProvider(Object unused)
   at MS.Internal.Automation.ElementUtil.<>c__DisplayClass1.<Invoke>b__0(Object unused)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.DispatcherOperation.InvokeImpl()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Windows.Threading.DispatcherOperation.Invoke()
   at System.Windows.Threading.Dispatcher.ProcessQueue()
   at System.Windows.Threading.Dispatcher.WndProcHook(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at System.Windows.Threading.Dispatcher.PushFrameImpl(DispatcherFrame frame)
   at System.Windows.Application.RunInternal(Window window)
   at System.Windows.Application.Run()
   at ThreadingIsHard.App.Main() in d:\Source\UIAutomationSample\UIAutomationTest\obj\Debug\App.g.cs:line 0

So far so expected. Now let's make our well-behaved application a little more unexpected and take a lock for some time. To make it simple I deadlock the main UI thread by letting it try to enter a lock which is owned by another thread like this:

private void Button_Click(object sender, RoutedEventArgs e)
{
    Deadlock();
}

public void Deadlock()
{
    Barrier b = new Barrier(2);
    Thread t = new Thread(() =>
    {
        lock (this)
        {
            b.SignalAndWait();              // let main UI thread run into lock
            Thread.Sleep(Timeout.Infinite); // never leave the lock
        }
    });
    t.Start();

    b.SignalAndWait();
    lock (this) // deadlock which never releases the lock
    {
    }
}
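If Barrier is new to you, here is a minimal standalone sketch (my own example, not from the sample above) showing that both participants are held in SignalAndWait until each has signaled, and are then released together:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;

class BarrierDemo
{
    // Both threads block in SignalAndWait until the other one arrives,
    // then both are released at the same time.
    public static string[] Run()
    {
        var log = new ConcurrentQueue<string>();
        var b = new Barrier(2);

        var t = new Thread(() =>
        {
            log.Enqueue("worker: before barrier");
            b.SignalAndWait(); // released together with the main thread
            log.Enqueue("worker: after barrier");
        });
        t.Start();

        log.Enqueue("main: before barrier");
        b.SignalAndWait();
        log.Enqueue("main: after barrier");
        t.Join();
        return log.ToArray();
    }
}
```

Whichever thread arrives first, both "before" entries are always logged before either "after" entry, because neither thread gets past the barrier until both have signaled.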

If you have never seen Barrier: it is a class introduced with .NET 4.0 which is very handy for synchronizing threads to a specific point in time. Only when both threads have signaled are they released, at the same time. In our case this ensures a nice deadlock. Here is the quiz:

Can something bad happen while the main UI thread is deadlocked?

As it turns out it can. Let's have a look at this impossible call stack:

Executing AlterData    at System.Environment.GetStackTrace(Exception e, Boolean needFileInfo)
   at System.Environment.get_StackTrace()   at ThreadingIsHard.TestWindow.AlterData(Int32 newValue) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 25
   at ThreadingIsHard.TestWindow.OwnPeer.GetHostRawElementProviderCore() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 44
   at MS.Internal.Automation.ElementProxy.InContextGetHostRawElementProvider(Object unused)
   at MS.Internal.Automation.ElementUtil.<>c__DisplayClass1.<Invoke>b__0(Object unused)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Internal.Automation.ElementUtil.Invoke(AutomationPeer peer, DispatcherOperationCallback work, Object arg)
   at MS.Internal.Automation.ElementProxy.get_HostRawElementProvider()
   at MS.Internal.Automation.UiaCoreProviderApi.RawUiaReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at MS.Internal.Automation.UiaCoreProviderApi.RawUiaReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at System.Windows.Automation.Provider.AutomationInteropProvider.ReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at System.Windows.Interop.HwndTarget.CriticalHandleWMGetobject(IntPtr wparam, IntPtr lparam, Visual root, IntPtr handle)
   at System.Windows.Interop.HwndTarget.HandleMessage(WindowMessage msg, IntPtr wparam, IntPtr lparam)
   at System.Windows.Interop.HwndSource.HwndTargetFilterMessage(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at System.Threading.SynchronizationContext.WaitHelper(IntPtr[] waitHandles, Boolean waitAll, Int32 millisecondsTimeout)
   at System.Windows.Threading.DispatcherSynchronizationContext.Wait(IntPtr[] waitHandles, Boolean waitAll, Int32 millisecondsTimeout)   at System.Threading.Monitor.Enter(Object obj)   at ThreadingIsHard.TestWindow.Deadlock() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 69
   at ThreadingIsHard.TestWindow.Button_Click(Object sender, RoutedEventArgs e) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 52
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.RaiseEventImpl(DependencyObject sender, RoutedEventArgs args)
   at System.Windows.Controls.Button.OnClick()
   at System.Windows.Controls.Primitives.ButtonBase.OnMouseLeftButtonUp(MouseButtonEventArgs e)
   at System.Windows.RoutedEventArgs.InvokeHandler(Delegate handler, Object target)
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.ReRaiseEventAs(DependencyObject sender, RoutedEventArgs args, RoutedEvent newEvent)
   at System.Windows.RoutedEventArgs.InvokeHandler(Delegate handler, Object target)
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.RaiseEventImpl(DependencyObject sender, RoutedEventArgs args)
   at System.Windows.UIElement.RaiseTrustedEvent(RoutedEventArgs args)
   at System.Windows.Input.InputManager.ProcessStagingArea()
   at System.Windows.Input.InputProviderSite.ReportInput(InputReport inputReport)
   at System.Windows.Interop.HwndMouseInputProvider.ReportInput(IntPtr hwnd, InputMode mode, Int32 timestamp, RawMouseActions actions, Int32 x, Int32 y, Int32 wheel)
   at System.Windows.Interop.HwndMouseInputProvider.FilterMessage(IntPtr hwnd, WindowMessage msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at System.Windows.Interop.HwndSource.InputFilterMessage(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at System.Windows.Threading.Dispatcher.PushFrameImpl(DispatcherFrame frame)
   at System.Windows.Application.RunInternal(Window window)
   at System.Windows.Application.Run()
   at ThreadingIsHard.App.Main() in d:\Source\UIAutomationSample\UIAutomationTest\obj\Debug\App.g.cs:line 0

Hm, the lock is still held, but how can AlterData be called while we are trying to enter the lock which was clearly not released? We even see the pending Monitor.Enter in the call stack (green line). To understand this you need to take a deep look back into the history of STA threads and window message pumping. Monitor.Enter is not a replacement for WaitForSingleObject on an STA thread. Instead it calls into CoWaitForMultipleHandles which ultimately calls MsgWaitForMultipleObjectsEx. Here is the (simplified) call stack:

USER32!MsgWaitForMultipleObjectsEx
combase!CCliModalLoop::SetPointerInputModeAsAppropriateForQueueAttachmentConditions
combase!CCliModalLoop::BlockFn
combase!ClientCallRetryContext::StartTimer
combase!ClassicSTAThreadWaitForHandles
combase!CoWaitForMultipleHandles
System.Windows.Threading.DispatcherSynchronizationContext.Wait(IntPtr[], Boolean, Int32)
System.Threading.Monitor.Enter(System.Object) 
ThreadingIsHard.TestWindow.Deadlock() [d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs @ 69] 
ThreadingIsHard.TestWindow.Button_Click(System.Object, System.Windows.RoutedEventArgs) [d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs @ 52] 
System.Windows.EventRoute.InvokeHandlersImpl(System.Object, System.Windows.RoutedEventArgs, Boolean)

MsgWaitForMultipleObjectsEx is just like WaitForMultipleObjects with the important exception that the wait can also return when specific window messages are put into the window message queue. The official docs are silent about which window messages are pumped. I have tried

  • Keyboard press
  • Mouse move
  • Mouse button
  • WM_PAINT
  • BeginInvoke
  • Invoke
  • Dispatcher.Invoke

NONE of these window messages are pumped while the thread is blocked on the lock. Chris Brumme says about this in his blog:

On lower operating systems, the CLR uses MsgWaitForMultipleHandles / PeekMessage / MsgWaitForMultipleHandlesEx and whatever else is available on that version of the operating system to try to mirror the behavior of CoWaitForMultipleHandles. The net effect is that we will always pump COM calls waiting to get into your STA. And any SendMessages to any windows will be serviced. But most PostMessages will be delayed until you have finished blocking.

Although the blog post is over 10 years old the information is still relevant. From what I have seen so far, all window messages are blocked except for COM calls, as is the case for UIAutomation calls which use COM as transport vehicle to marshal your UI automation calls across processes. If you have a custom UI Automation provider you should check twice whether you are calling into state-altering methods of your UI objects, which can be reentrantly called while your application was trying to get access to a lock. While you were waiting for the lock your window could already be disposed! Bad things can and will happen.

UIAutomation is not only used for integration testing but also by screen readers and tools for people who have problems reading a computer screen. It is therefore possible that these issues show up not only during tests but also on customer machines. If you get mysterious crashes at places where you would not have expected them, check the call stack for AutomationInteropProvider calls. If there are some you most likely have hit a problem due to unexpected "continuations" while taking a lock.

Unfortunately there is no really good solution known to me except to pay vigilant attention to your UI Automation provider objects, which must ensure not to call any methods which could alter the state of objects. If you have Begin/Invoke calls pending in your message queue these messages will still not get pumped. Only the direct COM calls and their side effects can tunnel through blocking calls. Perhaps I am telling you things you already knew but this one was quite surprising. Perhaps it would be best to stop pumping COM messages at all so your calls nicely line up with the other window messages as they should. But I guess there is still too much legacy code out there which would deadlock if no message pumping occurred anymore. The current Windows 10 build still pumps messages as always.
Perhaps we could get a flag to switch over to the "new" behavior to prevent such nasty problems.
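One defensive pattern is to make the provider callback record the intent instead of mutating state directly, and apply the queued changes later at a known safe point on the dispatcher, outside any lock. This is my own sketch with hypothetical names, not code from the post:

```csharp
using System;
using System.Collections.Generic;

// Sketch: a provider that cannot corrupt state when its callback is invoked
// reentrantly while the UI thread is blocked inside Monitor.Enter.
class SafeProvider
{
    readonly Queue<Action> _deferred = new Queue<Action>();
    public int Data { get; private set; }

    // Potentially called reentrantly from a UIA COM callback:
    // do not touch state here, only remember what should happen.
    public void RequestAlterData(int newValue)
    {
        _deferred.Enqueue(() => Data = newValue);
    }

    // Called at a known safe point (e.g. a normal dispatcher operation),
    // when no lock is being acquired on this thread.
    public void ApplyPendingChanges()
    {
        while (_deferred.Count > 0)
            _deferred.Dequeue()();
    }
}
```

The reentrant call then only queues work; the rest of the application can never observe a half-mutated state while it is blocked in Monitor.Enter.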

Unexpected Window Message Pumping Part 2


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/10/12/159645.aspx

Although there was no part one, my other post "Can You Execute Code While Waiting For A Lock?" can be considered part one. I hope this will not become a larger series where code is executed in the wrong order. Suppose you want to execute the commands

  1. Init
  2. Action
  3. OtherAction
  4. Dispose
  5. Shutdown

in this order on the UI thread of an application from another thread. Then you would do something like this:

using System;
using System.Diagnostics;
using System.Threading;
using System.Windows.Forms;

namespace MessagePumping
{
    public partial class Form1 : Form
    {
        public SynchronizationContext WindowsFormsSynchronizationContext;

        public Form1()
        {
            InitializeComponent();
            var tmp = this.Handle;

            WindowsFormsSynchronizationContext = System.Windows.Forms.WindowsFormsSynchronizationContext.Current;
            BlockFor();
        }

        void BlockFor()
        {
            Barrier b = new Barrier(2);
            Thread t = new Thread(() =>
            {
                lock (this)
                {
                    b.SignalAndWait(); // let main UI thread run into lock
                    Thread.Sleep(300); // enter the lock on the UI thread

                    Action otherAction = () => Debug.Print("OtherAction");
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Init"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Action"), null);
                    WindowsFormsSynchronizationContext.Post((o) => otherAction(), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Dispose"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Shutdown"), null);
                }
            });
            t.Start();

            b.SignalAndWait();
            lock (this) // deadlock which never releases the lock
            {
            }
        }
    }
}
This will print the messages in order just as you would expect:

Init
Action
OtherAction
Dispose
Shutdown

Now let's slightly change the code of BlockFor

 

void BlockFor()
{
    Barrier b = new Barrier(2);
    Thread t = new Thread(() =>
    {
        lock (this)
        {
            b.SignalAndWait(); // let main UI thread run into lock
            Thread.Sleep(300); // enter the lock on the UI thread

            Action otherAction = () => Debug.Print("OtherAction");
            WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Init"), null);
            WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Action"), null);
            WindowsFormsSynchronizationContext.Post((o) =>
            {
                WindowsFormsSynchronizationContext.Send((o2) => otherAction(), null);
            }, null);
            WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Dispose"), null);
            WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Shutdown"), null);
        }
    });
    t.Start();

    b.SignalAndWait();
    lock (this) // deadlock which never releases the lock
    {
    }
}

and check whether, during a Post, a Send operation works as expected:

Init
Action
OtherAction
Dispose
Shutdown

The output is still the same. So far so good. Now keep the code but change something else and we get:

Init
Action
Dispose
Shutdown
OtherAction

What has changed?

Still not it.

 

The SynchronizationContext! You were thinking you were using a WindowsFormsSynchronizationContext? No you weren't. WindowsFormsSynchronizationContext.Current does not return a WindowsFormsSynchronizationContext but a SynchronizationContext instance. The reason is that the static Current property comes from SynchronizationContext. You were using the current SynchronizationContext, whatever it was. In the first run it was actually a WPF DispatcherSynchronizationContext instance which worked just as expected. But if we actually force the usage of a WindowsFormsSynchronizationContext by setting it explicitly

public Form1()
{
    InitializeComponent();
    var tmp = this.Handle;

    SynchronizationContext.SetSynchronizationContext(new WindowsFormsSynchronizationContext());
    WindowsFormsSynchronizationContext = System.Windows.Forms.WindowsFormsSynchronizationContext.Current;
    BlockFor();
}

Then we get the faulty behavior that Shutdown is executed before our OtherAction, which was not our intention. The first question is: why are we getting this? When using window messages you are adding messages to a queue. When you post a message you add it at the end of the queue. The same happens if you send a message, with the only exception that the call blocks until all pending messages and your message have been processed. Our OtherAction message is special since it calls Send to the message pump while it is being processed. The logical picture you have in mind is

image

 

but what actually happens is that while your wrapper call is executed it enqueues OtherAction at the end of the message queue and waits until OtherAction has been called. But since the Dispose and Shutdown messages come first our OtherAction message will be called far too late.

image
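The reordering can be reproduced with a toy model of the pump (my own sketch; the real pump is a window message loop, not a Queue&lt;Action&gt;): a Send issued from the pump thread appends to the tail of the queue and then drains everything sitting in front of its own message.

```csharp
using System;
using System.Collections.Generic;

// Minimal model of WinForms-style pumping: Post appends to the queue,
// Send from the pump thread also appends and then drains until its own
// message has run - so earlier posted messages execute first.
class PumpModel
{
    readonly Queue<Action> _queue = new Queue<Action>();
    public List<string> Log = new List<string>();

    public void Post(Action a) => _queue.Enqueue(a);

    public void Send(Action a)
    {
        bool done = false;
        _queue.Enqueue(() => { a(); done = true; });
        while (!done) RunOne(); // drains Dispose/Shutdown before OtherAction
    }

    void RunOne() => _queue.Dequeue()();
    public void DrainAll() { while (_queue.Count > 0) RunOne(); }

    public static List<string> Demo()
    {
        var p = new PumpModel();
        p.Post(() => p.Log.Add("Init"));
        p.Post(() => p.Log.Add("Action"));
        p.Post(() => p.Send(() => p.Log.Add("OtherAction")));
        p.Post(() => p.Log.Add("Dispose"));
        p.Post(() => p.Log.Add("Shutdown"));
        p.DrainAll();
        return p.Log; // Init, Action, Dispose, Shutdown, OtherAction
    }
}
```

Running Demo yields exactly the faulty order from above: OtherAction ends up behind Dispose and Shutdown because the nested Send had to wait for every message posted before it.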

Ok that explains why WinForms sucks. But why did it work with the WPF Dispatcher? The WPF Dispatcher uses a little optimization: if you try to enqueue a message on the thread that runs the message pump then it will NOT be enqueued but directly executed. The WPF Dispatcher model is therefore much saner than WinForms and prevents you from shooting yourself in the foot. If you do not use the WinForms synchronization context but call Control.Begin/Invoke directly, you can fix this faulty model by checking InvokeRequired

Action otherAction = () => Debug.Print("OtherAction");
this.BeginInvoke((Action)(() => Debug.Print("Init")));
this.BeginInvoke((Action)(() => Debug.Print("Action")));
this.BeginInvoke((Action)(() =>
{
    if (this.InvokeRequired) // only enqueue to the message queue if we are being called from a different thread
    {
        this.Invoke((Action)(() => otherAction()));
    }
    else
    {
        otherAction(); // otherwise execute the callback directly
    }
}));
this.BeginInvoke((Action)(() => Debug.Print("Dispose")));
this.BeginInvoke((Action)(() => Debug.Print("Shutdown")));

The rule to check InvokeRequired before actually enqueuing messages works well and helps to alleviate the problem of unintended message reordering. Unfortunately this does not work with WindowsFormsSynchronizationContext since it has no InvokeRequired method to do the check yourself. Until then your best bet is to abandon the usage of WindowsFormsSynchronizationContext because of this, or to simply use the WPF DispatcherSynchronizationContext which works well. But you should be aware that you then have up to three message queues in parallel. First there is the window message queue of your window. Then comes the WinForms message queue which uses its own window message to start working on queued messages. And as the last queue you have the WPF message queue which also has its own private message queue and window message.

If you interleave WPF and WinForms messages in some calling code like this

public WindowsFormsSynchronizationContext WinFormsContext;
public DispatcherSynchronizationContext WPFContext;

public Form1()
{
    InitializeComponent();
    var tmp = this.Handle;

    WinFormsContext = new System.Windows.Forms.WindowsFormsSynchronizationContext();
    WPFContext = new DispatcherSynchronizationContext();
    BlockFor();
}

void BlockFor()
{
    Barrier b = new Barrier(2);
    Thread t = new Thread(() =>
    {
        lock (this)
        {
            b.SignalAndWait(); // let main UI thread run into lock
            Thread.Sleep(300); // enter the lock on the UI thread

            Action otherAction = () => Debug.Print("OtherAction");
            WPFContext.Post((o) => Debug.Print("Init"), null);
            WinFormsContext.Post((o) => Debug.Print("WinForms Init"), null);
            WPFContext.Post((o) => Debug.Print("Action"), null);
            WinFormsContext.Post((o) => Debug.Print("WinForms Action"), null);
            WPFContext.Post((o) =>
                WPFContext.Send((o2) => otherAction(), null), null);
            WPFContext.Post((o) => Debug.Print("Dispose"), null);
            WinFormsContext.Post((o) => Debug.Print("WinForms Dispose"), null);
            WPFContext.Post((o) => Debug.Print("Shutdown"), null);
            WinFormsContext.Post((o) => Debug.Print("WinForms Shutdown"), null);
        }
    });
    t.Start();

    b.SignalAndWait();
    lock (this) // deadlock which never releases the lock
    {
    }
}

then you do not get interleaved messages as the calling code suggests but this:

Init
WinForms Init
WinForms Action
WinForms Dispose
WinForms Shutdown

Action
OtherAction
Dispose
Shutdown

This is another subtle difference between WPF and Windows Forms message pumping. WPF will execute only one posted message per pumped window message whereas WinForms will execute all pending posted messages at once until there are none left. If you have a mixed WPF/WinForms application and you want to get rid of Invoke in favor of BeginInvoke because message pumping is evil, you are just about to find the next trap door late during your system integration test. If you can, you should stick to only one message queue and prefer Invoke/Send where you can. If you are using WinForms BeginInvoke you not only do not know when it will run (perhaps too late), but if you need coordination with WPF window messages you are out of luck in this scenario. I am sure you will not design a system this way but some day someone will add a hidden dependency between both message pumps, leaving you little chance to design yourself out of this mess.
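The drain difference can also be modeled with a toy (a sketch under these assumptions: each Post puts one wake-up token into the shared window message queue while the UI thread is still blocked; a WPF token executes one pending item, a WinForms token drains its whole queue; the nested Send is modeled as inline, the .NET 4.5 WPF behavior):

```csharp
using System;
using System.Collections.Generic;

// Toy model: two private queues sharing one window message queue of wake-up
// tokens. WPF handles one item per token, WinForms drains its whole queue.
class TwoPumpModel
{
    public static List<string> Demo()
    {
        var log = new List<string>();
        var wpf = new Queue<string>(new[]
            { "Init", "Action", "OtherAction", "Dispose", "Shutdown" });
        var winforms = new Queue<string>(new[]
            { "WinForms Init", "WinForms Action", "WinForms Dispose", "WinForms Shutdown" });

        // Wake-up tokens in the order the messages were posted above.
        var tokens = new[] { "WPF", "WF", "WPF", "WF", "WPF", "WPF", "WF", "WPF", "WF" };

        foreach (var token in tokens)
        {
            if (token == "WPF")
            {
                if (wpf.Count > 0) log.Add(wpf.Dequeue()); // one item per token
            }
            else
            {
                while (winforms.Count > 0) log.Add(winforms.Dequeue()); // drain everything
            }
        }
        return log;
    }
}
```

The first WinForms wake-up drains all four WinForms messages at once, which reproduces the non-interleaved output shown above.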

I hope that MS will address these (obvious?) deficiencies in Winforms message pumping so we do not have to worry about stuff we never wanted to know in such great detail in the first place.

Update 1

No, I will not make a different post out of it. The last example has an interesting twist if you are using the WPF Dispatcher: the behavior has changed between .NET Framework 4.0 and 4.5. Depending on the target platform you are compiling for (.NET 4.0 or 4.5+) you will get different behavior. The .NET 4.0 Dispatcher WILL pump messages just like WinForms. This means the code above which posts messages to the WPF context returns this order only for .NET 4.5:

Init
Action
OtherAction
Dispose
Shutdown

But if you compile for .NET 4.0

image

you get

Init
Action
Dispose
Shutdown
OtherAction

just like in the bad case of Windows Forms. The magic happens in the DispatcherSynchronizationContext class at:

public override void Send(SendOrPostCallback d, object state)
{
    if (BaseCompatibilityPreferences.GetInlineDispatcherSynchronizationContextSend() && this._dispatcher.CheckAccess())
    {
        this._dispatcher.Invoke(DispatcherPriority.Send, d, state);
        return;
    }
    this._dispatcher.Invoke(this._priority, d, state);
}

This returns true or false depending on what is set in the TargetFramework attribute of your currently executing assembly.

[assembly: TargetFramework(".NETFramework,Version=v4.5.2", FrameworkDisplayName = ".NET Framework 4.5.2")]

Interesting. You can find the future .NET Framework version numbers (but not their release dates) in the BinaryCompatibilityMap which contains the promising method name AddQuirksForFramework.

 

if (buildAgainstVersion >= 50000)
{
    this.TargetsAtLeast_Desktop_V5_0 = true;
}
if (buildAgainstVersion >= 40504)
{
    this.TargetsAtLeast_Desktop_V4_5_4 = true;
}
if (buildAgainstVersion >= 40503)
{
    this.TargetsAtLeast_Desktop_V4_5_3 = true;
}
if (buildAgainstVersion >= 40502)
{
    this.TargetsAtLeast_Desktop_V4_5_2 = true;
}
if (buildAgainstVersion >= 40501)
{
    this.TargetsAtLeast_Desktop_V4_5_1 = true;
}
if (buildAgainstVersion >= 40500)
{
    this.TargetsAtLeast_Desktop_V4_5 = true;
    this.AddQuirksForFramework(TargetFrameworkId.Phone, 70100);
    this.AddQuirksForFramework(TargetFrameworkId.Silverlight, 50000);
    return;
}
…

So we can expect a .NET 4.5.3, 4.5.4 and .NET 5.0 in the future with some confidence now.
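The constants appear to follow a simple encoding, inferred from the values above rather than any documented API: major*10000 + minor*100 + build, so "v4.5.2" becomes 40502. A sketch:

```csharp
using System;

static class FrameworkVersion
{
    // Inferred encoding of the buildAgainstVersion constants:
    // 40500 = v4.5, 40502 = v4.5.2, 50000 = v5.0.
    public static int ToBuildAgainstVersion(string version)
    {
        var parts = version.TrimStart('v').Split('.');
        int major = int.Parse(parts[0]);
        int minor = parts.Length > 1 ? int.Parse(parts[1]) : 0;
        int build = parts.Length > 2 ? int.Parse(parts[2]) : 0;
        return major * 10000 + minor * 100 + build;
    }
}
```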


Windows Draws Some Unicode Characters Wrong (Windows 7-8.1)


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/12/01/160433.aspx

Recently I got an interesting drawing issue which looks like a Windows bug. Let's suppose you want to draw a Chinese string in a Windows Forms or a WPF application. There you have two options.

  1. TextRenderer.DrawText
  2. Graphics.DrawString

The problem is that both methods lead to crappily rendered text when the used font is bold. The "old" method Graphics.DrawString renders some Chinese glyphs wrong. If you try a new shiny WPF application you will find out that WPF uses the equivalent of Graphics.DrawString since the text looks the same. Graphics.DrawString and TextRenderer draw the following string with an Arial font like this:

image

Did you notice that the last three characters are slightly shifted upwards and rendered in some bold font? This is normally a sign that the used font (Arial in this case) did not contain all Unicode characters of the string to draw. Before drawing empty rectangles Windows uses the next suitable font, which looks like Microsoft YaHei in this case.

To fix this you need to tweak the registry and add a font link key for Arial so a different preferred font is used to render the complete string with the same font. With the following setting you make SimSun the first preferred font for Arial (the Arial key did not even exist on my 8.1 machine).

image

After making this modification you need to log out and log in again for the change to take effect. The only issue is that Graphics.DrawString draws the string just like before. No change here. Either it ignores font linking in favor of some awkward switch like if (char > dddd) "Microsoft YaHei" else "Arial", or it has a plain bug in it. The usual suspects on Stack Overflow recommend switching to TextRenderer which works much better. But it breaks if you try to render bold fonts, which look quite bad if you do not explicitly set a background color. Here are some pictures of how the text changes as the font size grows:

FontRendering

In my application I cannot set a background color since I am drawing onto a bitmap which is overlaid with other graphics which must remain visible behind the rendered string. The other option is to use a background color such as magenta and make it the transparent color. Unfortunately this doesn't work due to the anti-aliasing which leaves some purple shadows on the drawn text, which looks really ugly. So far it looks like Windows leaves you with two bad options if you want to make a professional-looking, localizable graphical application:

  • Either the Chinese characters are drawn with two different fonts (TextRenderer).
  • Or you end up with barely readable text (Graphics.DrawString).

If you have solved the issue without setting an explicit background color I would love to hear about your solution. The code to draw the text in both picture boxes looks like this:

const string combined = "测试标记好不好This is a aber long";

private void cFontSize_ValueChanged(object sender, EventArgs e)
{
    Bitmap bmp = new Bitmap(750, 140);
    Font f = new Font("Arial", (float)cFontSize.Value, IsBold ? FontStyle.Bold : FontStyle.Regular);
    using (var g = Graphics.FromImage(bmp))
    {
        g.CompositingMode = System.Drawing.Drawing2D.CompositingMode.SourceOver;
        // Rendering with a background color solves the issue but
        // this would produce boxes of white or black color in the displayed image which looks even worse
        TextRenderer.DrawText(g, combined, f, new Point(0, 0), FontColor);
    }
    cPictureBox.Image = bmp;

    Bitmap bmp2 = new Bitmap(750, 140);
    using (var g = Graphics.FromImage(bmp2))
    {
        g.CompositingMode = System.Drawing.Drawing2D.CompositingMode.SourceOver;
        g.DrawString(combined, f, FontBrush, 0, 0);
    }

    cPicture2.Image = bmp2;
}

Pretty straightforward code I would think. WPF is also not the solution since it draws the text in the same way as Windows Forms. I wonder what market share Chinese users currently have that this issue is still not fixed even in Windows 8.1.

Update 1

After adding Arial Bold as Font Link with the settings
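The screenshot with the settings did not survive the archive. For reference, font linking is configured in the registry under the SystemLink key, where a multi-string value named after the base font lists fallback fonts as "fontfile,facename" pairs. A sketch of such an entry (the concrete linked font below is an illustrative assumption, not necessarily my original setting):

```
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink
    "Arial Bold" (REG_MULTI_SZ) =
        SIMSUN.TTC,SimSun
```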

It looks like Graphics.DrawString is the winner with the best rendered string again.

image

WPF still does not love my Font Linking settings. Whatever. Graphics.DrawString looks good now in all languages which is just what I need for now.

image

Update 2

It turns out that a German Windows 8.1 behaves differently than an English Windows 8.1 although I have installed the Chinese language pack for it.

  1. You do not need to log out and log in again to see the changes in effect, and the Arial Bold setting was not required at all. It is sufficient to have one Font Linking setting in place for a German Windows 8.1.
  2. On an English Windows 8.1 the bold fonts are still rendered wrong. Windows 7, ditto. This is not a solution.

The only real solution I came up with was to disable bold fonts for Chinese UI languages and live with non bold fonts. The Old New Thing has an "old" post on it called The danger of using boldface for Chinese characters which also points in the direction that there is no way to render small bold Chinese fonts which remain readable.

What Is The Difference Between FileFlags.Random (FILE_FLAG_RANDOM_ACCESS) And FileFlags.SequentialScan (FILE_FLAG_SEQUENTIAL_SCAN)?


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/12/14/160652.aspx

When you want fast and efficient File IO in Windows you can specify a myriad of flags to CreateFile. But which of these many flags help in which scenarios? MSDN says that some read ahead buffering is enabled if Sequential Scan is done. First a little look into the docs can help:

 

FILE_FLAG_SEQUENTIAL_SCAN  0x08000000

Access is intended to be sequential from beginning to end. The system can use this as a hint to optimize file caching.

This flag should not be used if read-behind (that is, reverse scans) will be used.

This flag has no effect if the file system does not support cached I/O and FILE_FLAG_NO_BUFFERING.

For more information, see the Caching Behavior section of this topic.

Under Caching Behavior is more information:

Specifying the FILE_FLAG_SEQUENTIAL_SCAN flag can increase performance for applications that read large files using sequential access. Performance gains can be even more noticeable for applications that read large files mostly sequentially, but occasionally skip forward over small ranges of bytes. If an application moves the file pointer for random access, optimum caching performance most likely will not occur. However, correct operation is still guaranteed.

FILE_FLAG_RANDOM_ACCESS   0x10000000

Access is intended to be random. The system can use this as a hint to optimize file caching.

This flag has no effect if the file system does not support cached I/O and FILE_FLAG_NO_BUFFERING.

For more information, see the Caching Behavior section of this topic.

 

The Old New Thing clears up the confusion by adding that if you specify neither flag the OS is free to use read ahead or not depending on the observed usage pattern. An interesting note is that if you specify RANDOM_ACCESS then the read pages are evicted from the file system cache with the usual LRU policy. This ensures that random reads do not pollute the file system cache. But if you read the same stuff several times shortly after the first read you still get the cached pages back. Let's try it out by reading the first 10KB from 1000 files with two 5000 byte reads for each file:

static void ReadOnly(string dir)
{
    string[] files = Directory.GetFiles(dir, "*.bin");
    var sw = Stopwatch.StartNew();
    int bytesRead = 0;
    foreach (var file in files)
    {
        Parser.FlushFileBufferOfFileByOpeningItUnbuffered(file);
        FileStream stream = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.None, 1, FileOptions.RandomAccess);
        byte[] buffer = new byte[5000];

        bytesRead += stream.Read(buffer, 0, buffer.Length);
        bytesRead += stream.Read(buffer, 0, buffer.Length);
    }
    sw.Stop();
    Console.WriteLine("Did read {0:N0} bytes from {1} Files in {2:F2}s", bytesRead, files.Length, sw.Elapsed.TotalSeconds);
}

To see the actual disc access every time we execute this code we need to delete the file system cache for this file. It turns out that you can do this quite easily by opening the file once for unbuffered IO like this:

const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

public static void FlushFileBufferOfFileByOpeningItUnbuffered(string fileName)
{
    var tmp = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 1, FileOptions.SequentialScan | FileFlagNoBuffering);
    tmp.Dispose();
}

FileOptions.RandomAccess

Now the file data is always read from the physical disc through the File IO layer. Since the file system cluster size is usually 4KiB, the first 5000 byte read costs 8192 bytes and the rest another 4096 bytes to read 10000 bytes in total. After enough data has been read from the device you can see the two actual 5000 byte reads at offsets 0 and 5000 as you would expect. From the hard disc 12288 bytes are actually read. From an IO perspective it therefore makes no difference whether you read 10000 bytes or 12KiB from a disc with a 4KiB cluster size.
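The rounding behavior is easy to reproduce; a small sketch (the helper name is made up for illustration, not code from the measured application):

```csharp
using System;

static class ClusterMath
{
    // Round a requested byte count up to the next multiple of the cluster size
    public static long RoundUpToCluster(long bytes, long clusterSize)
    {
        return ((bytes + clusterSize - 1) / clusterSize) * clusterSize;
    }

    static void Main()
    {
        const long Cluster = 4096; // typical NTFS cluster size
        Console.WriteLine(RoundUpToCluster(5000, Cluster));  // 8192: cost of the first 5000 byte read
        Console.WriteLine(RoundUpToCluster(10000, Cluster)); // 12288: clusters touched by reading 10000 bytes
    }
}
```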

image

If you want to find the true read sizes requested by the actual reads you need to filter for the File IO Init stacks (checked with Win 8.1):

fltmgr.sys!FltpPerformPreCallbacks

fltmgr.sys!FltpPassThrough

whereas the other stacks deal with read ahead caching and the translated calls to the actual device where the read size is adjusted to the sector size the device actually supports.

image

You might ask what the IRP column is for. An IRP is an I/O request packet which is similar to an opaque file handle. Each device driver can create its own IRP where it issues IO requests to other drivers. That way you can follow how different drivers split up their work. So far I have not found a useful grouping for it.

It looks like FileOptions.RandomAccess behaves largely like unbuffered IO where you read exactly as much data as you need, in the granularity of the cluster size of the actual device.

FileOptions.SequentialScan

Now we can try it out a second time using FileOptions.SequentialScan. The first read is 8KiB and the next one is 184KiB which comes from the read ahead prefetching mechanism. The assumption that we read much more than the first 10KB is quite optimistic, which leads in this case to over 18 times more hard disc reads than needed for this use case.

image

FileOptions.SequentialScan is certainly a bad choice for small buffered reads. If we look at the timings we see that as well.

Scenario: 1K Files with 2x5000 bytes reads    Time in s
FileOptions.RandomAccess                      3,9
FileOptions.SequentialScan                    6,5

FileOptions.None

Last but not least, what happens when we do not specify any option (FileOptions.None)? At least in this case it behaves exactly like SequentialScan where also 196MB of data is fetched from the disc although we only wanted to read 10MB of file headers.

Conclusions

That sounds as if you should never use SequentialScan. But in most scenarios where you read a file, parse a little bit and then read the next chunk this is exactly what you want. In theory you can use async IO to read the file asynchronously while parsing the stuff on another thread. But if you are lucky you can do it without extra threading on your side while the OS is kind enough to read ahead of your application logic while you are still reading and parsing the file single threaded. You get the benefits of multithreading without using extra application threads for it!

The read ahead logic is not optimal in all cases but helps where otherwise much more complex code would be necessary to read from a hard disc at full speed (60-300MB/s). What is the moral of this story? Never use FileOptions.SequentialScan or FileOptions.None if you only want to read the file headers! Otherwise you are consuming much more IO bandwidth than necessary and you are left scratching your head why switching from unbuffered IO to buffered IO has made reading your files so much slower.

A Single Line Performance Error


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/12/21/160740.aspx

We all make dumb mistakes which are not interesting to share. But sometimes people make interesting mistakes. Below is a simple helper class that makes combining several delegates with a logical and very easy.

 

static class Constraints<T>
{
    /// <summary>
    /// Logical and of a list of predicates.
    /// </summary>
    public static Predicate<T> And(params Predicate<T>[] ands)
    {
        // Return a new delegate which checks if all delegates in the array did return
        // true for a given item
        return item => ands.All(x => x(item));
    }
}

You can use it then like this:

class Program
{
    static void Main(string[] args)
    {
        int[] data = new int[20 * 1000 * 1000]; // 20 million ints

        Predicate<int> CheckForZero = Constraints<int>.And(i => i > -1,
                                                           i => i < 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < data.Length; i++)
        {
            if (CheckForZero(data[i]) == false)
            {
                Console.WriteLine("Violation found at {0}", i);
            }
        }
        sw.Stop();
        Console.WriteLine("Did take: {0:F2}s", sw.Elapsed.TotalSeconds);
    }
}
Now you have a nice way to combine separate checks reusably. The only issue is that for many calls to CheckForZero the running time is quite high.
Did take: 0,90s

If we compare this to the "normal" way

for (int i = 0; i < data.Length; i++)
{
    if (!(data[i] > -1 && data[i] < 1))
    {
        Console.WriteLine("Violation found at {0}", i);
    }
}

Did take: 0,01s

Whoa. The delegate version is 90 times slower! I found this issue by looking at the Mount Everests in terms of CPU consumption. What is causing so much overhead?

Well you just did wake up a sleeping dragon

image

with your LINQ delegate version. It is the Garbage Collector! Here is the allocation report of our slow sample:

image

We have an 80MB int array with 20 million elements. But we allocate 2,5GB of closure classes! Our Constraints.And method needs one closure to capture the list of delegates.

This closure is only needed once since we cache the returned delegate, so it will not show up in the profiler at all.

image

 

Things get interesting when we invoke the And method of the captured class. In this case we are allocating one additional closure to give access to the current item to check; the enclosing capturing class which gives us access to the ands array needs to be allocated as well. On top of that we need to create a new Func<Predicate<T>, bool> delegate instance as argument to the All LINQ statement.

image

To sum it up we allocate on each iteration

  • 1 c_DisplayClass4 closure: 32 bytes
  • 1 Func<Predicate<T>, bool> delegate instance: 64 bytes
  • 1 SZArrayHelper+SZGenericArrayEnumerator: 32 bytes
    • Where does this come from? It is caused by the All LINQ statement which iterates over an IEnumerable. To foreach over an IEnumerable the runtime needs to get an enumerator from the object. Since an array has none out of the box the CLR allocates a helper enumerator which is this strange SZArrayHelper class. If you directly foreach over an array this is NOT used. Instead more efficient code is generated.
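The per-item cost sums up quickly; a trivial sketch of the arithmetic (not profiler code):

```csharp
using System;

static class AllocationMath
{
    // Bytes allocated per checked integer, from the profiler report above
    public static long BytesPerCheck()
    {
        const long Closure = 32;          // c_DisplayClass4
        const long DelegateInstance = 64; // Func<Predicate<T>, bool>
        const long Enumerator = 32;       // SZArrayHelper enumerator
        return Closure + DelegateInstance + Enumerator; // 128 bytes
    }

    static void Main()
    {
        const long Iterations = 20 * 1000 * 1000; // 20 million checks
        Console.WriteLine(BytesPerCheck() * Iterations); // 2560000000 bytes, i.e. ~2,5 GB
    }
}
```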

 

In total we allocate 128 bytes for each integer (4 bytes) we want to check, which is quite some overhead. 128*20.000.000 = 2.560.000.000, which is exactly what the profiler reports. Since I did use the ETW .NET Alloc method from PerfView

image 

I get a correct total sum of the allocated bytes back. But the types which are responsible for the allocations and especially the counts can be misleading. If you use .NET Alloc you get all allocations, but the use case will then not run 1s but >2 minutes, which is not something I really like to do. If you want exact allocation numbers you need to check the .NET Alloc checkbox which injects a profiling dll into all newly started processes to produce exact allocation reports. Unfortunately this does not seem to work with the old and the current PerfView (1.7) version where I see anything but my new allocations.

image

When in doubt I would always use ETW .NET Alloc to get reliable, although sampled, numbers. This works also for JITed x64 code on Windows 7 machines because for each allocation event a managed stackwalk is performed which gives you the full allocation stack. Plain ETW stackwalks do not work for JITed x64 code on Win7 except if you precompile it with NGen as a workaround. On Win 8 and beyond this is no longer an issue except if you try to use API hooking. There the x64 stack walker still stops walking the stack when the method interception is done in a form it cannot recognize.

This sounds like some complex stuff. Can we fix it? How about providing an overload to our class?

public static Predicate<T> And(Predicate<T> p1, Predicate<T> p2)
{
    return item => p1(item) && p2(item);
}
 

This little overload spares us pretty much all of the overhead. We do not allocate a single byte while iterating over our array. This time we get

Did take: 0,14s

Which is more than six times faster. This is quite good for such a simple overload which is preferred by the compiler when it finds it as the best match. Now we only pay for 3 delegate invocations which are not avoidable if we want to keep the flexibility here.
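A self-contained sketch of the overload in use (compiled as its own little program for clarity):

```csharp
using System;

static class Constraints<T>
{
    // Two-predicate overload: no params array, no LINQ, no per-call allocations
    public static Predicate<T> And(Predicate<T> p1, Predicate<T> p2)
    {
        return item => p1(item) && p2(item);
    }
}

class Demo
{
    static void Main()
    {
        Predicate<int> checkForZero = Constraints<int>.And(i => i > -1, i => i < 1);
        Console.WriteLine(checkForZero(0)); // True
        Console.WriteLine(checkForZero(5)); // False
    }
}
```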

Now go and check your LINQ statements for hidden dragons. 2,5GB of allocations from one line is quite easy to achieve with delegates and LINQ. This is the main reason why LINQ is banned from the Roslyn project because of these subtle delegate issues. It is kind of ironic that the inventors of LINQ do not want to use their own invention in their C# compiler backend for performance reasons.

ConcurrentCollections Can Be Prohibitively Expensive


Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/01/26/161276.aspx

Since most of us use multiple threads we get all sorts of race conditions which we can solve with proper locking and concurrent data structures. A simple issue is if you have concurrency on an object which exposes a Dictionary<string,string> from where you get exceptions back if concurrent writers add the same data to it. You would then either need to coordinate the writers or use a different dictionary which allows concurrent adds. As it turns out ConcurrentDictionary is just the collection you were searching for and your exceptions are gone. You let the tests run. All green, check in and submit it to the main branch.

A long time later, at the end of the project, the until now ignored performance issues tend to get identified and fixed. Someone calls you that your checkin from a long time ago is the root cause of a major performance issue and that it shows up as Mount Everest in the VirtualAlloc reports.

 

image

How can that be? The problem with the concurrent collections, which claim to be largely lock free, is that to achieve uncontended writes they allocate their key data structures per core. This means you are allocating memory proportional to the number of cores. That is not a problem if you have relatively few instances around, but if you put such a dictionary into heavily allocated key data structures you will get much slower. You can test it for yourself with a small sample app like this:

class DataObject // Original version
{
    public IDictionary<string, string> Data { get; private set; }

    public DataObject()
    {
        Data = new Dictionary<string, string>();
    }
}

class FixedDataObject // Fixed version with ConcurrentDictionary
{
    public IDictionary<string, string> Data { get; private set; }

    public FixedDataObject()
    {
        Data = new ConcurrentDictionary<string, string>();
    }
}

class Program
{
    static void Main(string[] args)
    {
        List<DataObject> dataList = new List<DataObject>();
        List<FixedDataObject> dataListFixed = new List<FixedDataObject>();

        var sw = Stopwatch.StartNew();
        const int Runs = 300 * 1000;
        for (int i = 0; i < Runs; i++)
        {
            if (args.Length == 0) // No args: use old version
            {
                dataList.Add(new DataObject());
            }
            else // Some args passed: use fixed version
            {
                dataListFixed.Add(new FixedDataObject());
            }
        }
        sw.Stop();
        Console.WriteLine("{0:N0} did take {1:F2}s", Runs, sw.Elapsed.TotalSeconds);

        GC.KeepAlive(dataList);
        GC.KeepAlive(dataListFixed);
    }
}

Old Version

300.000 did take 0,14s

Fixed Version with Concurrent Dictionary

300.000 did take 1,08s

Update

On a 24 core server it takes 3,3s and consumes 1,4GB of memory. Data structures which allocate proportionally to the number of cores for the sake of thread performance do not sound like a good idea to me.

This is over 7 times slower and introduces tons of GCs in your application, limiting what you can do in parallel even further. The worst part is that the bigger the server you let it run on, the slower it will get (more cores mean more memory to allocate!). On a slow dev machine with 2 or 4 cores you might not notice it much. But if you deploy this on real server hardware with 40 cores you have a really bad performance problem at hand. That is the reason why you need to do performance testing on virtually every hardware configuration you support.

The fix was to go back to a plain dictionary with a lock. ConcurrentDictionary has a concurrency level parameter which controls how many internal arrays for isolated writes it should allocate, but even then it was still a factor of two slower.
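A minimal sketch of such a lock-protected dictionary (the class and method names are illustrative, not the original code). The per-instance footprint stays small while concurrent adds no longer throw:

```csharp
using System.Collections.Generic;

class LockedDataObject
{
    readonly Dictionary<string, string> _data = new Dictionary<string, string>();
    readonly object _lock = new object();

    // TryAdd semantics: returns false instead of throwing when the key already exists
    public bool TryAdd(string key, string value)
    {
        lock (_lock)
        {
            if (_data.ContainsKey(key))
            {
                return false;
            }
            _data.Add(key, value);
            return true;
        }
    }

    public bool TryGet(string key, out string value)
    {
        lock (_lock)
        {
            return _data.TryGetValue(key, out value);
        }
    }
}
```

The lock is only contended during the short add/lookup, which for most workloads is far cheaper than paying per-core memory on every instance.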

So beware: by simply replacing non-concurrent data structures with their lock free but much more memory hungry counterparts you might make your system much slower. In effect we were loading data slower the bigger the server got. This is definitively not what the customer has paid for.

ProfBugging - How to find leaks with allocation profiling


Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/03/22/161982.aspx

Since ETW know-how is still considered an arcane science by most people I get the chance to look at a lot of "easy" things which affected performance a lot but went by unnoticed until I looked into them with ETW tracing. While checking some GC related stuff to reduce memory allocations I noticed that one process had object[] allocations on the Large Object Heap. This is nothing unusual at all but the allocation stack was interesting.

image

The allocations were coming from Delegate.Remove. What does this tell me when I see large object allocations in this method? Inside MulticastDelegate you can find the allocation with e.g. ILSpy at

private object[] DeleteFromInvocationList(object[] invocationList, int invocationCount, int deleteIndex, int deleteCount)
{
    object[] array = this._invocationList as object[];
    int num = array.Length;
    // Try to use a much smaller array (/2 or /4 or even less) if the remaining delegates can fit into it
    while (num / 2 >= invocationCount - deleteCount)
    {
        num /= 2;
    }
    object[] array2 = new object[num]; // Here is the delegate list allocation
    for (int i = 0; i < deleteIndex; i++) // Copy all delegates until we get to the deletion index
    {
        array2[i] = invocationList[i];
    }
    // Continue filling up the array with the remaining delegates after the deletion index and deletion count
    for (int j = deleteIndex + deleteCount; j < invocationCount; j++)
    {
        array2[j - deleteCount] = invocationList[j];
    }
    return array2;
}

This code basically tries to allocate a smaller array (half of the original size or less) if possible. Otherwise an array with the same size is allocated. The for loops copy the remaining delegates into the new invocation array.

Ok now we know that we have allocated a new array for the delegate invocation list. What is the error here?

Nope not yet.

Still not.

I strongly suspect that we have an event handler leak here. Why? Delegate.Remove is called mainly by event handler remove methods. Since the allocated object array goes to the LOH it must be bigger than 85000 bytes. How many object references fit into an 85k array under x64?

85000/8 = 10625 Delegates

We know therefore that the event which comes later in the call stack had more than 10625 subscribers. So many subscribers are usually the sign of an event handler leak, and sure enough it was one. It was a static event which was the source of a quite huge memory leak which was very visible in some memory dumps taken later. Thanks to ETW allocation profiling I was able to see the root cause of a memory leak without using a debugger.
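If you have a live object at hand you can also count the subscribers of a delegate directly via GetInvocationList; a sketch with a made-up event:

```csharp
using System;

class Publisher
{
    public event EventHandler SomethingHappened; // hypothetical event

    public int SubscriberCount
    {
        get
        {
            // GetInvocationList returns one entry per subscribed handler
            EventHandler handler = SomethingHappened;
            return handler == null ? 0 : handler.GetInvocationList().Length;
        }
    }
}
```

Counts in the thousands for a long-lived (especially static) event are a strong hint at a handler leak.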

If you are not using PerfView for some reason but prefer xperf or tracelog to configure ETW tracing you can enable .NET Allocation Sample profiling with

xperf -on proc_thread+loader+profile -stackwalk profile -f c:\temp\kernel.etl -start clr -on Microsoft-Windows-DotNETRuntime:0x40200095 -f c:\temp\clr.etl

xperf -stop -stop clr -d c:\temp\traces.etl

The magic value 0x40200095 comes from

C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man

Keyword              Name                                   Description
0x0000000000000001   GCKeyword                              GC events
0x0000000000000004   FusionKeyword                          Binder (log assembly loading attempts from various locations)
0x0000000000000008   LoaderKeyword                          Loader (assembly load events). Also necessary to get the module name in JITed call stacks.
0x0000000000008000   ExceptionKeyword                       Exception
0x0000000000200000   GCSampledObjectAllocationHighKeyword   When enabled, limits the rate of allocation events to 100 events/s.
0x0000000002000000   GCSampledObjectAllocationLowKeyword    When enabled, limits the rate of allocation events to 5 events/s.
0x0000000040000000   StackKeyword                           When enabled a CLR stackwalk is performed for every GCSampledObjectAllocation event. This is how PerfView gets its call stacks even for non NGenned x64 code on Win7 machines where the ETW stackwalks would skip JITed code.
0x0000000000000010   JitKeyword                             Needed when you want to get call stacks of an application. If the application was already started you need to force the enumeration of all running processes with an extra CLR rundown provider which gets all required data.
0x0000000000000080   EndEnumerationKeyword                  Needed to enumerate JITed methods upon process exit which were not recorded by the JitKeyword.

The xperf line above is all you need when the process starts after your ETW session is already running and terminates while the session is still running. If not you need to force the CLR to enumerate all methods and modules which were JITed. If e.g. your process was already running and still runs when you stop profiling you need a second session to enumerate all already JITed methods. This is called CLR rundown and uses another provider.

xperf -start clrrundown -on Microsoft-Windows-DotNETRuntimeRundown:0x118 -f c:\temp\rundown.etl

To merge it with the previous recording you can use

xperf -merge c:\temp\traces.etl c:\temp\rundown.etl c:\temp\tracesMerged.etl

The value 0x118 is the combination of LoaderRundownKeyword (0x8), JitRundownKeyword (0x10) and EndRundownKeyword (0x100). The other values which are responsible for NGen stuff are not needed anymore because we can generate from the native images pdbs with xperf and PerfView while stopping an ETW recording which is the fastest approach.

Since coreclr is open sourced we can also get our hands on all the internals (finally!). It turns out that we are not limited to 5 (GCSampledObjectAllocationLowKeyword) or 100 (GCSampledObjectAllocationHighKeyword) allocation events/s. If we want to get them all we can set the environment variable

set COM_PLUS_UNSUPPORTED_ETW_ObjectAllocationEventsPerTypePerSec=10000

to the number of allocation events/s we want. If you set it to 10000 you get nearly all allocations at the cost of a slower system, but if you want exact numbers this can help in some situations. If you want to see the other hidden COMPLUS flags have a look at https://github.com/dotnet/coreclr/blob/master/src/inc/clrconfigvalues.h. There are some pretty interesting tweaks inside it. E.g. you can configure the spinning behavior of CLR locks aka Monitor and other stuff which could come in handy to tweak or debug an application.

PerfView knows other .NET private and stress providers, but there used to be no manifests for them. This is no longer true. Since CoreClr is basically a clone of the desktop CLR we can take a look at ClrEtwAll.man which contains the definition of all CLR providers. The only missing thing is the resource dll to register it. I have compiled CoreClr and put the resulting clretwrc.dll along with the manifest for the Microsoft-Windows-DotNETRuntimeStress (CC2BCBBA-16B6-4cf3-8990-D74C2E8AF500) and Microsoft-Windows-DotNETRuntimePrivate (763FD754-7086-4dfe-95EB-C01A46FAF4CA) providers here: https://drive.google.com/folderview?id=0BxlmobpeaahAbEdvWHhQb1NCcFU&usp=sharing&tid=0BxlmobpeaahAa2dMVEMyVG9nRnc.

You need to put the resource dll into D:\privclr or edit the path in ClrEtwAll.man to point to your location of the dll to make the registration with

wevtutil im ClrEtwAll.man

successful. If you really want to look into these logs you can do it with WPA now as well. Some events could be interesting for troubleshooting:

image

The xperf calls above were only examples where -BufferSize, -MinBuffers, -MaxBuffers and -MaxFile were not supplied to keep them readable. In reality you want to record either until a specific size of the recording is reached (-MaxFile; e.g. 4 GB is a quite huge size), or you can additionally configure -FileMode Circular to record into a ring buffer, which enables you to record for a very long time until something interesting happens at which point you usually stop recording. This can be achieved e.g. with performance counter triggers which can execute a command when a performance counter was above or below a threshold.

A more typical xperf command line to record into a ring buffer with 2 GB of buffer size which records

  • Context Switches with call stacks
  • Enables Profiling with 1ms sample rate with call stacks
  • Process lifetimes
  • Process memory information

is

xperf -on proc_thread+loader+meminfo+disk_io+filename+cswitch+dispatcher+profile -stackwalk profile+cswitch+readythread -f C:\kernelCircular.etl -MinBuffers 100 -BufferSize 512 -MaxBuffers 2000 -MaxFile 2000 -FileMode Circular

WPR always uses a 10% physical memory ring buffer (1,6GB on my 16 GB box) which is allocated right from the start. The xperf trace buffers on the other hand can grow up to 2GB in 512KB chunks. This is better because you usually need two trace sessions (one user and one kernel session) and you do not want to allocate upfront for each session a maximum size which is potentially never reached.

That's it for today. Happy troubleshooting and try out your new tools. If you had good or bad experiences with these tools I would love to hear about it.

Which Keywords Has An ETW Provider Enabled?


Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/03/25/162046.aspx

Today's little program has a simple task. It should decode a hex bitmask into the original bits, as hex values, from which it was composed. As a bonus it can also parse an ETW manifest and print the corresponding keywords. Since most low level tools use hex numbers which have specific bits turned on I often want to know which things were actually enabled. This is useful e.g. to see which providers PerfView enables for .NET specific stuff.

If you check the .NET Alloc checkbox in PerfView, what exactly gets enabled?

xperf -Loggers | findstr /i Microsoft-Windows-DotNETRuntime

".NET Common Language Runtime":0xc14fccbd

BitmaskDecoder.exe 0xc14fccbd C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man

0x1             GCKeyword
0x4             FusionKeyword
0x8             LoaderKeyword
0x10            JitKeyword
0x20            NGenKeyword
0x80            EndEnumerationKeyword
0x400           SecurityKeyword
0x800           AppDomainResourceManagementKeyword
0x4000          ContentionKeyword
0x8000          ExceptionKeyword
0x10000         ThreadingKeyword
0x20000         JittedMethodILToNativeMapKeyword
0x40000         OverrideAndSuppressNGenEventsKeyword
0x80000         TypeKeyword
0x400000        GCHeapSurvivalAndMovementKeyword
0x1000000       GCHeapAndTypeNamesKeyword
0x40000000      StackKeyword
0x80000000      ThreadTransferKeyword

"Microsoft-Windows-DotNETRuntimePrivate":0x4002000b

BitmaskDecoder.exe 0x4002000b d:\privclr\ClrEtwAll.man


0x1             undocumented
0x2             undocumented
0x8             undocumented
0x20000         MulticoreJitPrivateKeyword
0x40000000      StackKeyword

 

If you on the other hand enable ETW .NET Alloc you get

BitmaskDecoder.exe 0xc16fccbd C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man

0x1             GCKeyword
0x4             FusionKeyword
0x8             LoaderKeyword
0x10            JitKeyword
0x20            NGenKeyword
0x80            EndEnumerationKeyword
0x400           SecurityKeyword
0x800           AppDomainResourceManagementKeyword
0x4000          ContentionKeyword
0x8000          ExceptionKeyword
0x10000         ThreadingKeyword
0x20000         JittedMethodILToNativeMapKeyword
0x40000         OverrideAndSuppressNGenEventsKeyword
0x80000         TypeKeyword
0x200000        GCSampledObjectAllocationHighKeyword
0x400000        GCHeapSurvivalAndMovementKeyword
0x1000000       GCHeapAndTypeNamesKeyword
0x40000000      StackKeyword
0x80000000      ThreadTransferKeyword

The only difference between .NET Alloc and ETW .NET Alloc is that PerfView additionally enables the high sampling rate (max 100 alloc events/s) with GCSampledObjectAllocationHighKeyword. If you want to record longer with fewer events you can use this mask and replace GCSampledObjectAllocationHighKeyword with GCSampledObjectAllocationLowKeyword to get 20 times fewer allocation events (max 5 alloc events/s).
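Swapping the high-rate keyword for the low-rate one is plain bit arithmetic; a sketch using the mask values from the keyword table above:

```csharp
using System;

class KeywordSwap
{
    const ulong GCSampledObjectAllocationHighKeyword = 0x200000;
    const ulong GCSampledObjectAllocationLowKeyword  = 0x2000000;

    static void Main()
    {
        ulong etwDotNetAllocMask = 0xc16fccbd; // mask PerfView uses for ETW .NET Alloc
        // Clear the high-rate sampling bit and set the low-rate one instead
        ulong lowRateMask = (etwDotNetAllocMask & ~GCSampledObjectAllocationHighKeyword)
                            | GCSampledObjectAllocationLowKeyword;
        Console.WriteLine("0x{0:x}", lowRateMask); // 0xc34fccbd
    }
}
```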

This makes it easy to find out what other people think about good ETW settings if you have the manifest for the provider. If you have found the list of hex values you want to combine you can call

BitmaskDecoder.exe -encode 0x1 0x2 0x4 0x80000

Encoded value: 0x80007

to get the corresponding hex mask. You can use it of course also for other bit masks if you omit the ETW manifest file name.

Here is the code for this simple tool:

Update: Now it also supports keywords. Thanks for the suggestion Andre.

You can now also write

BitmaskDecoder.exe -encode C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man  JittedMethodILToNativeMapKeyword GCHeapAndTypeNamesKeyword

The executable can also be found here.

using System;
using System.Globalization;
using System.IO;
using System.Linq;

namespace BitMaskDecoder
{
    class Program
    {
        static char[] LeadingHex = new char[] { '0', 'x' };

        static void Main(string[] args)
        {
            try
            {
                if (args.Length == 0)
                {
                    Help();
                    return;
                }

                string[] lines = null;
                if (args[0] == "-encode")
                {
                    if (args.Length < 2)
                    {
                        Help();
                        return;
                    }
                    int skip = 1;
                    if (File.Exists(args[1]))
                    {
                        lines = File.ReadAllLines(args[1]);
                        skip++;
                    }
                    ulong res = args.Skip(skip).Select(x => x.TrimStart(LeadingHex))
                                               .Select(str => DecodeKeyword(str, lines))
                                               .Aggregate((x, y) => x | y);
                    Console.WriteLine("Encoded value: 0x{0:X}", res);
                    return;
                }

                if (args.Length > 1)
                {
                    lines = File.ReadAllLines(args[1]);
                }

                string hex = args[0].TrimStart(LeadingHex);
                ulong hexNum = ulong.Parse(hex, NumberStyles.AllowHexSpecifier);
                for (int i = 0; i < 64; i++)
                {
                    ulong mask = 1ul << i;
                    ulong hexMask = mask & hexNum;
                    if (hexMask != 0)
                    {
                        string value = String.Format("0x{0:X}", hexMask);
                        string searchValue = value + "\"";
                        bool bFound = false;
                        if (lines != null)
                        {
                            string curProvider = null;
                            for (int line = 0; line < lines.Length; line++)
                            {
                                string lineStr = lines[line];
                                if (lineStr.Contains("<provider"))
                                {
                                    curProvider = ExtractName(lineStr);
                                }
                                // we are after lines like <keyword name="GCKeyword" mask="0x1" message="$(string.RuntimePublisher.GCKeywordMessage)" symbol="CLR_GC_KEYWORD"/>
                                if (lineStr.Contains(searchValue) && lineStr.Contains("keyword"))
                                {
                                    Console.WriteLine("{0,-8}\t{1,-50}\t{2}", value, ExtractName(lineStr), curProvider);
                                    bFound = true;
                                }
                            }
                        }
                        if (!bFound)
                        {
                            Console.WriteLine("{0,-8}\t{1}", value, "undocumented");
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error: {0} {1}", ex.Message, ex.StackTrace);
            }
        }

        static void Help()
        {
            Console.WriteLine("BitmaskDecoder 0xddddd [ETWManifest] or -encode [manifestfile] 0xddd 0xddddd keyword1 keyword2 ....");
            Console.WriteLine("\tDecode a hex value into its bits and print the bits as hex values.");
            Console.WriteLine("\tWhen ETWManifest file is present the file is parsed and the matching lines of the manifest with the keywords are displayed");
        }

        static ulong DecodeKeyword(string hexOrKeyword, string[] manifest)
        {
            ulong lret = 0;
            if (!ulong.TryParse(hexOrKeyword, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out lret))
            {
                lret = ulong.Parse(ExtractMaskValue(hexOrKeyword, manifest), NumberStyles.AllowHexSpecifier);
            }
            return lret;
        }

        static string ExtractName(string line)
        {
            return ExtractAttribute(line, "name");
        }

        static string ExtractMaskValue(string keyword, string[] lines)
        {
            string hex = lines.Where(line => line.Contains("<keyword") && line.Contains(keyword))
                              .Select(line => ExtractAttribute(line, "mask"))
                              .FirstOrDefault();
            if (hex == null)
            {
                throw new NotSupportedException(String.Format("The keyword {0} was not found in the manifest", keyword));
            }
            return hex.TrimStart(LeadingHex);
        }

        static string ExtractAttribute(string line, string attribute)
        {
            return new string(line.Substring(line.IndexOf(attribute))
                                  .SkipWhile(c => c != '"') // search first "
                                  .Skip(1)                  // skip "
                                  .TakeWhile(c => c != '"') // take all until next "
                                  .ToArray());
        }
    }
}

 

 


                       

Windows Performance Toolkit for Windows 10


Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/04/21/163342.aspx

All are talking about Windows 10 but what about the developer tools? Since 4/20/2015 you can also download the beta of the upcoming Visual Studio 2015 and the Windows / Phone SDKs. This also includes a new version of the (at least for me) long awaited Windows Performance Analyzer. You can get the ISO image here:

http://dev.windows.com/en-us/windows-10-developer-preview-tools?ocid=WIP_r_Mar_Body_SDK

From there you can download the iso file for the tools. You do not need to install the Visual Studio CTP for this. Now mount the iso file and install for x64 Windows versions (an x86 version is also there)

\packages\OneSDK\Windows App Certification Kit x64-x86_en-us.msi

This will put the new Windows Performance Toolkit under

C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit

You do not need Windows 10 to be able to install the toolkit beta. You can install the kit at least on Win 8.1 and (not tested) also on Win7 machines.

Update 1

You can download the released pieces with the Windows 10 SDK installer here: https://dev.windows.com/en-us/downloads/windows-10-sdk

To download just the SDK run the installer and uncheck everything except the Windows Performance Toolkit. If you want to install it offline you can choose a download directory from which you can copy the msi files to machines with e.g. no internet connection. From the download directory you need only the files

WPTx64-x86_en-us.msi (64 bit Windows versions since Windows 7)

WPTx86-x86_en-us.msi (32 bit Windows versions since Windows 7)

So what's inside? As far as I can tell the changes are not too dramatic but some nice improvements are there.

 

Xperf

Added -injectonly option to -merge.

    -injectonly                  Do not copy (merge) events, only inject 
                                 image identification information. Cannot be 
                                 used with -mergeonly.

What is the meaning of this? You can extract the image identification information from a kernel trace file into a separate file and then merge it with a user etl file, which then becomes readable with WPA (albeit the process names do not show up).

This will generate the extract of the kernel etl file:

xperf -merge  c:\temp\LowLatencyGCEvents.kernel.etl c:\temp\lowlatnetinjected.etl -injectonly

Now you can merge it with a user mode session etl file without ending up with a multi GB etl file if the user mode session contains only a few events you are interested in.

xperf -merge  c:\temp\LowLatencyGCEvents.etl c:\temp\lowlatnetinjected.etl c:\temp\mergedWithImage.etl

Besides this there are no new features to xperf that are worth noting.

WPA

Now you can open the enriched user mode etl file with WPA and finally you can read the events:

image

Which new features do you spot in the picture above?

  • Colors! You can select the colors you want for your graphs.
  • You can graph numerical data from custom events now!
    • This works currently only to some extent. I hope it will get better in the released version.
    • The main issue is that it cannot graph hex values which have defined in their event template as outType hex values:

<template tid="GCHeapStats_V1">
    <data name="GenerationSize0" inType="win:UInt64" outType="win:HexInt64"/>

    • Another issue that I would like the possibility to graph a running value as line chart (e.g. memory usage, durations, …). Currently it is displayed only as a spike in the line diagram. Ok I can use bar charts but the line diagram type looks not too useful to me since I never know how many events are hiding behind this bar. Yes I can hover over the bar and press Alt+Space (also a new feature) to see more information about the graphed values.
  • Finally you can add, at least for generic events, a DateTime (Local) column in the View Editor. This is a first step in the right direction. But why on earth is it still not possible to switch on local time for the x axis? This would really help when I spot an issue: I can look at the clock, write down the time, save the trace and continue with the tests. Later I can use the gathered data and the timepoints from my notes to see if something suspicious was going on. This is even more annoying for long term recordings (hours or even days) where the customer tells you that around 2 o'clock something bad did happen. Currently I have to look into Trace - System Configuration - Traces - Start/End (UTC) Time to calculate, from the seconds since trace start and the reported time, the actual time window I should look at.

Other Features

Symbol loading was significantly improved. Now you can deselect symbol locations you do not want via a checkbox instead of manually editing a long string. That is a good thing, although I miss the automatic heuristic that adds the xxx.etl.NGenPDB folder next to an etl file to the symbol path if that folder exists.

The symbol settings are remembered, which is good. What I do not like is that although I have disabled the MS symbol server it still tries to load missing symbols from MS when I select to load the symbols automatically after trace load.

image

When this is enabled there is no way to cancel the symbol load after it has started and reconfigure it. This is not the best solution for me because it can take a long time until all symbols are downloaded. I have not tried to check out what will happen on a machine with no internet connection. The old WPA had a cancel button which only worked after a long time, but at least it was possible to cancel the symbol download.

image

Custom stack tag files can now be added. Before that you needed to overwrite the default.stacktags file in the folder where WPA was installed. It was easy but I still felt dirty when I overwrote the file.

If you have not used stack tags yet you should definitely have a look at them. If you see the same issue over and over again you can add the offending method to your stack tag file under a problem node. When the problem node gets much CPU you can spot the "old" issue within seconds without the need to drill deep into call stacks.

image

 

What is still on my Wish List?

  • Path Tree as in File IO like in the Disc IO table view.
  • Add a process command line as column in CPU Usage graphs so I can see which generic process caused which CPU consumption.
  • Some help with wait chain analysis. I have seen this option in some WPA video but it seems not to be published yet.
  • WPA still knows nothing about managed code. Basic GC metrics as in PerfView should be part of it!
  • Region files where the SelfNest feature works as expected. Ideally I would like to create a "stack trace" like structure from my custom traces as regions.
    • Currently WPA seems to ignore the TID option and uses for method enter/leave traces simply the nearest trace in the timeline which gives garbage call stacks.
  • Improved Stack Tagging
    • I want to see for expensive methods how expensive they really are. This includes GC overhead. Currently this is grouped separately under CLR - GC because the stack tags "steal" from other methods if they have a better matching call stack. The exact logic is not obvious but it could be something like a match counter where the rule with the most matches wins. It would be nice to add some nesting to stacktags so I can visualize the GC overhead of specific known expensive methods.

image

My Stacktag file looks like this one:

  <Tag Name="MyApplication">
    <Tag Name="Main">
        <Entrypoint Module="LowLatencyGC.exe" Method="*Program::Main*"/>
        <Tag Name="GC">
            <Entrypoint Module="clr.dll" Method="*GC*"/>
        </Tag>

    </Tag>
  </Tag>


  <Tag Name="CLR"> 
    <Tag Name="GC">
      <Entrypoint Module="clr.dll" Method="*GC*"/>
    </Tag>
 
  </Tag>

      

I want to group all methods which call Main under MyApplication and then factor out the GC overhead. For all other methods the generic CLR GC node should match, so I can see the GC overhead for specific methods and the "other" GC overhead nicely lined up. This is currently not possible. Perhaps an IsChild="true" attribute which alters the tagging logic would do the trick?

That's all for now. I am sure I have missed some new features like the Analysis Assistant which currently looks like a scratch pad for my notes. Perhaps wait chain traversal will show up in a later build? Let's keep the fingers crossed that the existing bugs of the Beta get fixed and we get an even better tool which is already awesome.


SpinWait Is Dangerous


Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/06/18/165219.aspx

Some bugs are easy, some are hard. If you are dealing with a massively multithreaded distributed application consisting of many different processes talking to each other, performance issues are pretty much impossible to detect with standard profilers. But with WPT the word impossible can be replaced with possible, although much work is still needed. Recently I had an interesting regression issue where a use case became over one second slower, which was quite a big difference. So it should be easy to spot, right? Two things were easy but one fraction ( 60% ) was still hiding in the data. Since the title of this post is already very clear here is a small test application which shows the issue:

using System.Diagnostics;
using System.Threading;

class Program
{
    static Stopwatch StartTime = Stopwatch.StartNew();
    static TimeSpan WaitTime = TimeSpan.FromMilliseconds(2);

    static bool ReturnTrueAfter2ms()
    {
        if (StartTime.Elapsed.Ticks > WaitTime.Ticks)
        {
            StartTime.Restart();
            return true;
        }
        return false;
    }

    static void Main(string[] args)
    {
        while (true)
        {
            SpinWait.SpinUntil(ReturnTrueAfter2ms);
        }
    }
}
In the Main method we are continuously spin-waiting for an operation which usually takes some ms (e.g. network IO) to complete. The goal of SpinWait is to spare you context switches in favor of burning CPU cycles. We are talking about some microseconds saved instead of a context switch. SpinWait will therefore by default spin a few microseconds until it gives up and tries to yield to other threads with increasing persistence. If you look at the code of SpinWait

 

image

you find that every 20th run it will sleep for 1ms. That in itself is not a big deal if your method takes longer to execute. Just by looking at the code all looks sound and reasonable. What could possibly go wrong? Let's have a look at the costs for a context switch with Thread.Yield, Thread.Sleep( 0 ) and Thread.Sleep( 1 ). Thread.Yield is quite cheap if the other thread has not much work to do. The cost looks like 6 microseconds for doing only a context switch and up to 70 microseconds if the other thread is doing some work until it waits again for something.

image

Thread.Sleep( 0 ) on the other hand lets the thread run forever if no other threads want to run on this CPU. The scheduler looks after the thread ca. every 250ms, where a non-switching context switch event is logged, but apart from that nothing is stopping my thread.

Things become interesting for Thread.Sleep(1) where we forcefully give up our thread quantum.

 

image

Remember that the goal of SpinWait is to spare you an expensive context switch. It spins for about 20-80 microseconds. Then it tries from time to time a Thread.Yield, a Thread.Sleep( 0 ) and every 20th time a Thread.Sleep( 1 ). In the comments of SpinWait it is noted that a thread can wait for a long time. In fact a Thread.Sleep(1) can sleep up to 15ms because you give up your current thread quantum and the scheduler looks after your thread only during the next clock tick, which is 15,6ms later, to check if your thread is ready to run again. SpinWait is therefore only useful if your operation completes within a few microseconds in 99.9% of all cases. But if your operation frequently or nearly always takes much longer than a typical context switch then this class will not make you any faster. Instead of sparing a few microseconds you are potentially wasting up to 15 milliseconds every time you call SpinWait.SpinUntil!

If you have many call stacks in your application which cause SpinWait to sleep for a thread quantum of 15ms then you are using SpinWait wrongly!
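If the awaited operation routinely takes milliseconds, a kernel wait object is the better fit. Below is a hedged sketch of what the test program above could look like with a ManualResetEventSlim instead of SpinWait; the producer thread here is my stand-in for the real signaling code. ManualResetEventSlim itself spins briefly before it falls back to a real wait, so you keep the fast path for the rare quick completions without ever sleeping a full thread quantum in a loop:

```csharp
using System;
using System.Threading;

class Program
{
    static readonly ManualResetEventSlim DataReady = new ManualResetEventSlim(false);

    static void Producer()
    {
        while (true)
        {
            Thread.Sleep(2);      // stand-in for the ~2 ms network IO
            DataReady.Set();      // wake the waiter via a real event, not polling
        }
    }

    static void Main(string[] args)
    {
        new Thread(Producer) { IsBackground = true }.Start();
        while (true)
        {
            DataReady.Wait();     // spins a little, then blocks without burning CPU
            DataReady.Reset();
        }
    }
}
```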

image

This one was very hard to find because I usually search for long pauses when I am after a one second delay. But this time 600ms were scattered across dozens of calls, almost evenly distributed throughout the use case, because SpinWait was used in a central low level component which was frequently called.

If you looked at the Excel chart above and wondered where the difference between with and without Visual Studio for Thread.Sleep(1) comes from, I can tell you now. By default the windows timer ticks every 15,6ms which then triggers the thread scheduler to check out its running threads. The strange 15,6ms value comes from 64 ticks/s = 15,6 ms/tick. If someone uses a multimedia resolution timer or changes the system wide timer tick to increase the tick rate, the thread scheduler will look after your threads more often. That is the reason why Thread.Sleep(1) can wait for "only" 2ms or 15ms depending on which applications are running, since they may alter the system wide clock rate. Another great source of many more details can be found at https://randomascii.wordpress.com/2013/07/08/windows-timer-resolution-megawatts-wasted/.
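You can observe this effect yourself by raising the timer resolution from managed code. The sketch below uses the winmm.dll timeBeginPeriod/timeEndPeriod APIs (Windows-only; the printed durations will vary by machine, so treat the labels as expectations, not guarantees):

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading;

class TimerResolutionDemo
{
    [DllImport("winmm.dll")] static extern uint timeBeginPeriod(uint uPeriod);
    [DllImport("winmm.dll")] static extern uint timeEndPeriod(uint uPeriod);

    static void MeasureSleep(string label)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10; i++)
        {
            Thread.Sleep(1);      // actual sleep duration depends on the system timer tick
        }
        Console.WriteLine("{0}: 10x Thread.Sleep(1) took {1} ms", label, sw.ElapsedMilliseconds);
    }

    static void Main()
    {
        MeasureSleep("Default tick rate (often 15,6 ms)");

        timeBeginPeriod(1);       // raise the system timer to 1 kHz, like WPF apps / Visual Studio do
        try
        {
            MeasureSleep("1 ms tick rate");
        }
        finally
        {
            timeEndPeriod(1);     // always restore, the setting is system wide
        }
    }
}
```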

You can use the clockres tool from SysInternals to check out your current timer resolution. The dynamic windows timer tick rate makes it especially tricky to find performance degradations with unit tests on machines where Visual Studio or another WPF GUI is running, because there the tick rate is 1kHz. But on an installed server the clock tick rate is 64, which makes a big difference. Luckily the performance regression tests did show the issue long before any customer would get this version. It really pays off to have performance regression tests. It is even better if you not only execute performance regression tests but also use xperf to record profiling data for all performance tests and store the profiling data somewhere. Later when a test breaks you don't need to go to the testing guys and ask them to install the current baseline which has the issue and an old baseline which was good, then profile both use cases, and only then start analyzing. What a waste of time!

It is much easier to let your central performance tests run with ETW profiling, which will slow down even heavy use cases by less than 10%. But you get all the data you could wish for to find the root cause much faster. If you train people how to use WPA to analyze issues and how to use PerfView, Windbg, VMMap and memory dumps you can create software which has significantly fewer issues than your competitors because you really understand your system. The usual thinking of treating system libraries and the operating system as black boxes where miracles happen has its place, so you are not distracted by the many things you need to consider before you write one line of software. But you should not stop at the black boxes. Instead you should develop the ability to look into the black boxes when necessary and write a nice blog post about the found issues so other people can learn how things (not) work.
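Scripting such a profiled test run can be as small as the sketch below (the kernel flags, file names and the test executable are my choices for illustration; adjust them to your scenario):

```shell
rem Start kernel profiling with CPU sampling and context switch stack traces
xperf -on PROC_THREAD+LOADER+PROFILE+CSWITCH -stackwalk Profile+CSwitch

rem Run the performance test under profiling
PerformanceIssueGenerator.exe

rem Stop the session and merge the result for archiving next to the test results
xperf -d C:\PerfResults\TestRun.etl
```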

Windows 10 SDK With New Windows Performance Toolkit Available!


Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/07/31/166002.aspx

It has been a long wait to finally get the new version of the Windows Performance Toolkit. I had not much time to test the betas since until now managed call stack resolution was not working.

WPT for Windows 10 makes some nice progress which you can download here. Click on the Download Standalone SDK and run it. Then uncheck everything except Windows Performance Toolkit and it will install on your machine within minutes. If you download the complete Windows 10 SDK you are not downloading 170MB but several GB.

 

Management Summary

  • Graphs are more readable.
  • Graphs with short spikes are much better rendered without the need to zoom in to see them.
  • Colors for graphs are changeable (finally).
  • Filtering can be undone now in the context menu.
  • Quick Search in all tables.
  • Symbol load dialog was improved.
  • You can configure your own stack tag files in Trace Properties.
  • No new Graph Types (at least with the default providers I enable).
  • 5 Added columns to CPU Usage Sampled
    • Two could be extremely useful: Source File Name and Source Line Number
  • 12 Added columns to CPU Usage Precise.
    • Two are useful to user mode developers: ProcessOutOfMemory and CPU Usage (in view).
  • Occasional crashes are also featured now.

CPU Usage Sampled

image

The biggest improvement is certainly the possibility to see source file and line numbers. You can find out this way where most CPU is spent on which line if the CPU consumption happens in your code.

A quick recap for the newbies: the CPU Usage Sampled graph is generated by taking the stack traces of all running threads 1000 times per second (this is the default). These call stacks are then added together. A count of 1000 in your Main method means that the method has used one second of CPU time.
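Expressed as code, the conversion from sample counts to CPU time is simply count divided by sample rate (a sketch assuming the default 1 kHz rate; the class and method names are mine):

```csharp
using System;

class SampleToTime
{
    // Weight of a stack in CPU Usage (Sampled) = samples / sample rate.
    public static double CpuSeconds(long sampleCount, double sampleRateHz = 1000.0)
    {
        return sampleCount / sampleRateHz;
    }

    static void Main()
    {
        Console.WriteLine(CpuSeconds(1000)); // a count of 1000 -> one second of CPU time
        Console.WriteLine(CpuSeconds(250));  // 250 samples -> a quarter second of CPU time
    }
}
```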

If the call stack does not end in your e.g. Main method when the stack trace is taken then your source file will not be displayed there. As usual you have to take a sharp look at the numbers. Your method may be causing a high CPU consumption but it might never show up with a source and line number because the stack always ends in an external library call (printf in my case) for which I did not have the pdb loaded.

It would be nice to have file and line numbers for managed code as well, but this feature is of limited use as it is now. Ideally I want to see this stuff while I am drilling into a call stack all the way down, and not only if the call stack ends in a method which I did compile.

 

For reference here is the complete list of columns you can choose from. Green are unchanged columns, bold are new ones. I never have found out how to give a thread a name in WPT. If anyone knows more about the mystical thread names and how they can be set I would be very interested in it.

WPT 8.1 | WPT 10
% Weight | % Weight
Address | Address
All Count | All Count
Annotation | Annotation
Compiler Optimization | Compiler Optimization
Count | Count
CPU | CPU
Display Name | Display Name
DPC/ISR | DPC/ISR
Function | Function
 | Image RVA
Inlined Functions | Inlined Functions
Is PGO'ed | Is PGO'ed
Module | Module
PGO Counts | PGO Counts
PGO Dynamic Instruction Count | PGO Dynamic Instruction Count
Priority | Priority
Process | Process
Process Name | Process Name
Rank | Rank
 | Section Name
 | Section RVA
 | Source File Name
 | Source Line Number
Stack | Stack
Table | Table
Thread Activity Tag | Thread Activity Tag
Thread ID | Thread ID
Thread Name | Thread Name
Thread Start Function | Thread Start Function
Thread Start Module | Thread Start Module
TimeStamp | TimeStamp
Trace # | Trace #
Weight | Weight
Weight (in view) | Weight (in view)

 

CPU Usage Precise

This is by far the most complex table. With WPT 10 we have 66 columns to choose from to find out how our threads interact with each other.

I am not sure how WPA detects that a process was out of memory but I think the kernel knows quite well if something did go wrong. The ProcessOutOfMemory column could be handy to check stress tests when the machine was under heavy load, which could explain subsequent application failures.

CPU Usage (in view) seems to be an attempt to display a more accurate thread running time. I am not sure how this value is calculated but it seems that if all cores are in use and you get significant Ready times then it differs quite a lot. Otherwise the differences are marginal.

 

There has been quite some renaming of columns, which might render your custom filter and enablement conditions useless; WPA will show you an error.

image

If your old preset enabled by default all processes except the Idle process (which clutters up the view) it was set to:

([Series Name]:="NewProcess" AND NOT ([NewProcess]:="Idle (0)"))

With WPT 10 you need to change it to:

([Series Name]:="New Process" AND NOT ([New Process]:="Idle (0)"))

This happens to all users of WPT 8.1 who have saved a custom default profile which contains this enablement rule.

With an empty enablement rule your view will look like

image

and with the rule

image

you will see no difference when you apply it. Why? Because this rule only enables matching items but it does not disable the non matching processes! I am with you that this is very counter intuitive but it is the way it is. If you save this as new default profile and then open a new file you will no longer get a view where the Idle process is enabled by default.

For reference I list here all available columns with the old and new names so you have an easier time updating your custom filters, enablement and expansion rules.

Green lines are unchanged, yellow are renamed columns and bold are new columns in WPT 10.

WPT 8.1 | WPT 10
% CPU Usage | % CPU Usage
 | AdjustIncrement
 | AdjustReason
Annotation | Annotation
Count | Count
Count:Waits | Count:Waits
Cpu | Cpu
CPU Usage (ms) | CPU Usage
 | CPU Usage (in view)
 | DirectSwitchAttempt
IdealCpu | Ideal Cpu
 | KernelStackNotResident
LastSwitchOutTime (s) | Last Switch-Out Time
NewInPri | New In Pri
NewOutPri | New Out Pri
NewPrevOutPri | New Prev Out Pri
NewPrevState | New Prev State
NewPrevWaitMode | New Prev Wait Mode
NewPrevWaitReason | New Prev Wait Reason
NewPriDecr | New Pri Decr
NewProcess | New Process
NewProcess Name | New Process Name
NewQnt | New Qnt
NewState | New State
NewInSwitchTime (us) | New Switch-In Time
NewThreadId | New Thread Id
NewThreadStack | New Thread Stack
NewThreadStartFunction | New Thread Start Function
NewThreadStartModule | New Thread Start Module
NewWaitMode | New Wait Mode
NewWaitReason | New Wait Reason
 | NewThreadPrevRemainingQuantum
 | NewThreadRank
 | NewThreadRemainingQuantum
NextSwitchOutTime (s) | Next Switch-Out Time
OldOutPri | Old Out Pri
OldProcess | Old Process
OldProcess Name | Old Process Name
OldQnt | Old Qnt
OldState | Old State
OldInSwitchTime (us) | Old Switch-In Time
OldThreadId | Old Thread Id
OldThreadStartFunction | Old Thread Start Function
OldThreadStartModule | Old Thread Start Module
OldWaitMode | Old Wait Mode
OldWaitReason | Old Wait Reason
 | OldThreadRank
 | OldThreadRemainingQuantum
PrevCState | Prev CState
 | ProcessOutOfMemory
Ready (us) | Ready
ReadyThreadStack | Ready Thread Stack
ReadyTime (s) | Ready Time
ReadyingProcess | Readying Process
ReadyingProcess Name | Readying Process Name
ReadyingThreadId | Readying Thread Id
ReadyingThreadStartFunction | Readying Thread Start Function
ReadyingThreadStartModule | Readying Thread Start Module
 | ReadyThreadInDPC
SwitchInTime (s) | Switch-In Time
Table | Table
Thread Activity Tag | Thread Activity Tag
Thread Name | Thread Name
TimeSinceLast (us) | Time Since Last
Trace # | Trace #
Waits (us) | Waits

File IO View

The File IO view now has an additional column named File Path which acts like the Path Tree in the Disc IO view, which is extremely useful if you do differential IO analysis. It seems that my complaints about the incomplete feature parity between the Disc and File views have helped a bit.

image

This view still has a bug: if I zoom into it all graphs are disabled and I have to reenable them after every zoom operation. If I zoom out all is ok.

Generic Events

Besides some minor changes the biggest change is that it finally has got a DateTime (local) column. It shows the clock time and not the time since trace start. This can be very useful for advanced scenarios when someone reported an issue and cared to write down the time when it happened.

image

Process View

I did not know it until it was removed. Process Tree is no longer with us. It was not really useful anyway.

Update: Process Tree was added to WPT. That was the reason I never saw it before. After using it for some time I find it is easier to navigate in the tree if you have a parent process which creates many child processes.

 

Disc IO View

The Disc IO view has got only one additional column, named Boosted, which for my disc IO is always false. Not sure if this is a Windows 10 only feature. If someone knows more about it please comment.

 

Windows Performance Recorder

The recorder has not changed much. It has got some additional profiles like ones for .NET, the Edge browser, … The buffer size for .NET Activity is hard coded to a 160 MB ring buffer which might not be enough for machines where many .NET apps are running. I still do not buy into this easy to use UI since the xml configs need the values hard coded into them. A value can be a hard number or a percentage of the system memory. Both choices are bad if I want to automate performance regression testing with very specific workloads and tracing requirements. This GUI and the command line tool (wpr.exe) are just too inflexible.

Yes I know the complete xml has inheritance built into it so I could create derived profiles which override only the specific settings I care about. But this is still not dynamic enough if I want to create a configuration for each problem and set them depending on the use case at hand.

image

  • xperf is still a good tool to stop a profiling session because it does a full CLR rundown.
  • But PerfView is much faster at creating the managed pdbs by not trying to recreate them over and over again with NGen.
  • Tracelog from the Windows SDK is still the only tool to start a trace session where you can configure the Win 8.1 enhancements to the ETW infrastructure.
    • E.g. filter events from a specific process, or disable events or stackwalking for user mode providers.

Summary

The new WPA tool has got some nice new features but they still need some polishing to become truly awesome. In my daily work I miss the automatic addition of the etl.NGenPdb folder with the generated pdbs when it is present next to an etl file. This is a useful convention and the old WPA was so nice to automatically include it in my symbol path.

WPA is very sophisticated. To use it right is an art which needs nearly daily practice. That could explain why this tool, although it has a great UI, has still not got much traction. It is like Windbg in disguise which will quickly cause information overload to the untrained brain. But with some basic tricks which I will show in the near future you can crack otherwise unsolvable performance nuts with a few mouse clicks.

Make WPA Simple - Garbage Collection and JIT Times


Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/08/16/166270.aspx

Did you ever see a nice tool with a fancy UI and think: hey, that thing is powerful, I will try it! But later you were left scratching your head why others can use this tool and get great results but you just don't get it?

Then I have news for you: I have created a WPA profile which aims at user mode and not kernel mode developers. Many columns were removed from the context menus to give you all the power you need to find issues in your application where no one else has found the root cause.

The ETW profile can be downloaded from here as a zip file. Unpack it to a directory and you are ready to go. So what is inside it?

 

File | Description
Simple.wpaProfile | The main WPA profile you can use now.
JIT.xml | WPA region file referenced by Simple.wpaProfile to get JIT times of your application like PerfView.
GC.xml | WPA region file referenced by Simple.wpaProfile to get Garbage Collection metrics like in PerfView only better!
HookRegions.xml | WPA region file referenced by Simple.wpaProfile to see mouse clicks in your ETL file when ETWControler is running and capturing your keyboard and mouse events.
default.stacktags | Stacktag file which serves as base for your application to find common .NET issues like event handle leaks, too much regex usage, monitor contention and much more.

 

Besides this there are also some WPR profiles included to enable recording of specific .NET events like exceptions together with GC events into a larger 1 GB buffer. WPR of Windows 10 has under Scenario Analysis a .NET Activity profile which has a very small 160MB ring buffer that is way too small for my needs. It might be good for you, but I have added extra WPR profiles for GC and JIT as well.

To show you the difference I have created a small PerformanceIssueGenerator.exe application. This generates various issues which you can analyze with WPA.

I have recorded the data already and put the 7z file here. If you want to extract it you need to download the 7z from the official site. To view the ETL file

you need a >= Windows 8 machine and Windows Performance Toolkit from the Windows 10 SDK.

 

When you have downloaded the profile and the sample etl file you can apply the profile under Profiles - Apply…

image

Then you get two tabs. The first one contains .NET specific analysis stuff like GC and JIT. The second tab can be used for the usual bottleneck analysis regarding

CPU, disk and memory usage as well as wait chain analysis. You do not need to load the profile every time: you can save it as your default profile by clicking on Save Startup Profile, so that from now on all of your ETL files open in this view.

image

Normally I use PerfView for GC issues to check out the GC Stats of an application to see if anything significant is going on. If you have ever used PerfView then

you will have noticed that it is a great tool combined with a crappy UI. At least for some key scenarios we can now use WPA with hand crafted region files instead of PerfView.

 

GC Views

You now get a GC view like this

image

You can visualize each GC generation type and the time it took to execute. The Count column also tells you how many GCs you had. This can help a lot if you want to get GC metrics only for a specific time region in your application. Now it is easy to see how much time of your use case was available to your application and how much time your threads had to wait for the garbage collector. This was my main issue with PerfView's GCStats view: it is calculated for the complete duration of the profiling run. Most of the time I want GC metrics only for specific parts of my application because I usually optimize only a single place at a time.

Here is how you get the GC statistics in PerfView:

 

image

 

The numbers in GC Rollup By Generation match pretty well to the region file. Also the GC pause times correlate quite well with the distribution of generations; although the timings are not 100% the same, the ratios are a good fit.

image

Since PerfView and ETW use the same data you can rightfully ask the question why there is a difference at all. The answer is that WPA does not simply sum up all GC regions by their duration. A WPA region is defined by a start and a stop event which is then displayed in the UI as above. But if regions happen in parallel in several threads, WPA will count as sum only the time where at least one region was active. This diagram illustrates how WPA region summation works:

image
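The interval-union arithmetic behind this summation can be sketched in a few lines. This is a toy model with made-up region timestamps to show why the WPA sum can be smaller than the naive per-region sum; it is not WPA's actual implementation:

```python
def union_duration(regions):
    """Sum the time covered by at least one region (WPA-style),
    merging overlapping intervals instead of adding every duration."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(regions):
        if cur_end is None or start > cur_end:
            # Gap before this region: close the current merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap with the current block: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

# Two regions running in parallel on different threads:
regions = [(0, 10), (5, 15)]
naive_sum = sum(end - start for start, end in regions)  # 20
wpa_sum = union_duration(regions)                       # 15
```

With fully serialized regions (as with blocking foreground GCs) both sums agree; only concurrent regions, like parallel JIT compilation, make them diverge.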

This detail is not so relevant for GC, but it will be very important when we look at JIT statistics. Since a foreground GC blocks your complete application, there is little concurrency going on. We can also visualize when GCs of which type are happening: the view Generic Events GC Generations will show you how many GCs of which type were happening. Initially it is collapsed so you have to open it. This view takes a little while to build up since it needs to parse all ETW events for GC Start events, which contain as data point the generation number that is visualized here.

image

This is the greatest thing about WPA of Windows 10: now you can visualize the data of your custom traces. With a little WPA graph configuration you can create key graphs for your own application which show e.g. frames per second, # of db transactions, … Who needs performance counters when you can visualize your trace data much better now?

Marker Events

A key point to know where you want to start looking for an issue are Marks. In WPRUI you can press Ctrl+Win+x to open a window which allows you to write a marker event to your trace file while you are recording data. Unfortunately this functionality is not exposed to the outside world to set marks programmatically. If you revert to xperf you can set marks programmatically with xperf -m if you wish. But since marks are so useful to navigate in an ETL file, I really wonder why the method which xperf calls is not documented at all. If you want to know how real power users are using WPA then you need to call wpa /?

image

I guess at Microsoft they let their tests run with profiling enabled while saving screenshots as ETW events. If something happens, the ETL file is downloaded from a central repository and opened with a test specific profile, zoomed into the relevant test part, which is identified by markers or regions from a region file. At least some people at Microsoft use this tool so often that it makes sense to automate it even further; since the controls of WPA are UI automatable, you can script nearly your complete analysis. Now THAT is what I call good performance regression testing.

A little reverse engineering of which APIs xperf -m calls finally turned up the method EtwSetMark which is exported by ntdll.dll. I am sure the readers of my blog can figure out the rest as well.

But I really wish it would become an officially supported API since it is so useful. Sure, you can define your own marker events as well, but since the support in WPA is already built in it would really help. It would also be nice if xperf would emit the mark event not only to the "NT Kernel Logger" ETW session but to all active kernel sessions, so you could also mark the WPA kernel session, which is currently not possible.

 

JIT View

PerfView can also give you the timings of how much time each method took to compile. This is useful if you are compiling too much code on the fly for your WCF proxies or your serializers. Sure, it is done only once, but if this time plays a significant part of your application startup you should rethink how your application works. Since .NET 4.5 will NGen all assemblies on the fly if they are used often enough, you do not need to consider using NGen explicitly for your application. But if you have much dynamic code generation going on you can still suffer from long JIT times.

 

image

You can get a similar view by switching from Garbage Collection to JIT Time per Thread

image

As I have shown in the GC section, the summation of regions is not a simple sum of the duration of all JIT events. The JIT compiler can compile code in many different threads concurrently. The Duration sum of JIT time over all threads reported by WPA is therefore largely irrelevant if you want to compare two different runs of your application. Instead you need to look at the JIT times of each thread. You can copy the duration column with the context menu "Copy Column Selection" into Excel

image

image

which gives us the exact same JIT time as PerfView. Now I do not need PerfView for JIT analysis anymore. I am more and more getting away from programming toward configuring WPA to give me just the view I need. If you expand JITs you get at least the namespace of each JIT event. This is the best I could come up with since WPA does not support concatenating strings of different fields into a region name.

image

But you can still open the (also customized) view _Activity by Provider Task Process to view the "raw" data if you need more information. If you dropped e.g. Field 5 to the left of the yellow bar you would get a nice summary of how many methods in this namespace were compiled.

image

As you can guess there are lots more goodies inside the now published profiles to make WPA really usable for non-kernel developers. I have streamlined nearly every context menu, removing all useless or kernel-only columns. You now have a much easier time concentrating on the important things. With this profile WPA is nearly a new application and a central analysis station for managed and unmanaged performance analysis.

In the next posts I will walk you through the problems of PerformanceIssuesGenerator and how you can visualize them in WPA efficiently.

Make WPA Simple - Symbol Loading Is Not Slow


Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/08/17/166298.aspx

When you open an etl file you normally want to see which methods were executed most often. To decode the recorded stack pointers WPA needs your symbols. By default WPA adds the Microsoft symbol server to the list. If you click on Load Symbols

image

then you will experience some slowness which is very nicely explained by Sasha here: http://blogs.microsoft.co.il/sasha/2014/12/02/diagnosing-native-memory-leaks-etw-wpa/. It is ok to get all relevant symbols from Microsoft the first time, but afterwards make sure you disable the Microsoft server in the symbol search list!

image

The symbol download folder is by default C:\Symbols. The directory after the srv* is the pdb download folder and after the second * the symbol server address is appended.
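The srv* syntax can be split mechanically at the * separators. A small illustrative sketch (the parsing helper is mine, not any official API; the Microsoft server URL is just the usual example value):

```python
def parse_symbol_path(entry):
    """Split a 'srv*<download folder>*<server address>' symbol path entry
    into its local pdb cache folder and the symbol server address."""
    parts = entry.split("*")
    if parts[0].lower() != "srv" or len(parts) < 3:
        raise ValueError("not a srv* symbol path entry: " + entry)
    return parts[1], parts[2]

cache, server = parse_symbol_path(
    r"srv*C:\Symbols*http://msdl.microsoft.com/download/symbols")
# cache is the download folder, server is the address appended after the second *
```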

If you wonder why you need a SymCache folder:

image

The pdb format generated by the compilers is so arcane, bloated and parser unfriendly that the guys doing the pdb parsing within dbghelp.dll (part of Windbg) decided to convert it into something much less bloated and much easier to parse. The SymCache folder therefore contains not the pdbs but the converted pdb files.

Here is the comparison:

image

And here the original pdb

image

Now you know why the SymCache folder is a good thing. You can safely delete it if it gets too big; as long as you still have the original pdbs around, nothing is lost. The vigilant reader has certainly noticed that I have added a path to C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache. This is the path where WPR and xperf store the generated managed pdbs while saving an ETL file to disk. By default WPR will create besides your xxx.etl file an extra folder named xxx.etl.NGENPDB which contains all managed pdbs from NGenned binaries. Since WPA of Windows 10 does not follow its own convention of expanding the symbol search path of the loaded etl file with its accompanying pdb directory, I go directly to the source where WPR copies the generated pdbs from.

If you follow this advice and disable the MS symbol servers you will load the pdbs not within the next 20 minutes but in 3,6s for my sample ETL file. Kudos to the Symbol Cache and its inventors. While speaking of performance: PerfView has a much better (read: faster) managed pdb generation strategy. WPA and xperf will call ngen.exe createpdb xxxx.ni.dll for each and every native image in the trace recording. This is not necessary since you can read the pdb age guid directly from the dll and then check in the pdb cache folder whether the pdb already exists. No need to call ngen.exe all the time! Finally it is not necessary to copy the generated pdbs to the xxx.etl.NGENPDB directory. My own trace collection tool simply creates a directory link to the NGenPdbs folder and I have all pdbs I could need without any file copies.
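PerfView's shortcut can be sketched like this: derive the expected cache path from the pdb name plus the signature GUID and age read out of the dll, and only shell out to ngen.exe createpdb when that file is missing. The path layout below follows the common symbol-server convention (name\GUID+age\name); the helper name and parameters are mine, not PerfView's actual code:

```python
import os

def ngen_pdb_needed(symcache_root, pdb_name, guid, age):
    """Return the expected cached pdb path and whether ngen.exe createpdb
    would still have to be invoked for this native image."""
    # Symbol-server layout: <root>/<pdb name>/<GUID><age>/<pdb name>
    expected = os.path.join(symcache_root, pdb_name,
                            "%s%X" % (guid.replace("-", "").upper(), age),
                            pdb_name)
    return expected, not os.path.exists(expected)
```

The same check also makes the xxx.etl.NGENPDB copy step redundant: if the pdb is already where the debugger will look, nothing needs to be generated or copied.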

If you find that the default paths of WPA, xperf, … on your crowded C drive leave no room for pdbs, you can configure WPA to use a different storage folder. Or you leave all as is and create with mklink /D directory links from your C drive to a drive which has enough space. I prefer the latter approach since in my experience tools tend to rely on the defaults and at some point they will come back to them (e.g. Windbg …).

Why Is The First Start Of An Application Slow?


Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/08/25/166493.aspx

All of us have experienced it but very few of us have gone into the details to really find the reasons for it. First of all, why should I care? After the first time the application starts much faster. Well, you should care if you want to give your customer not only a good second startup time but also a decent first startup experience. To do that you need to do repeated tests. The first startup is commonly referred to as cold startup and all subsequent runs are warm startup scenarios. To test a cold startup scenario you would need to reboot your computer every time before starting your application. Not very practical, and I have never seen anyone going down that road. The difference between warm and cold startup time is that the first time the dlls and data of the application need to be loaded from disc, which can take quite some time. All subsequent runs can read the data not from disk but from the in-memory file system cache which is managed by the operating system.
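Such repeated startup tests are easy to automate with a small harness. Here is a sketch that uses the full start-to-exit duration as a simple proxy for startup time (the command line is a placeholder for the application under test):

```python
import subprocess
import time

def measure_startups(cmd, runs=5):
    """Start the target several times and record each start-to-exit duration.
    The first run only approximates a cold start if the file system cache
    was flushed beforehand; all later runs are warm."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        proc = subprocess.Popen(cmd)
        proc.wait()  # wait on the process handle, do not fire and forget
        times.append(time.perf_counter() - start)
    return times
```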

What Is The File System Cache?

Windows uses the memory of your system in some clever way. All memory that is used by your applications is shown in the Task Manager. The rest of the memory is free and not used by anything. Really? Of course not. The free memory consists of data of previously exited applications which was in use some time ago. This "free" memory is called the Standby List, which is where the file system cache of Windows really lives. It is therefore not a good idea to max out your memory usage in Task Manager, because then you are flushing the file system cache, which will result in a slow system. This cache is the reason why your second and all further startup times are much better. No disc accesses at all compared to the first run if all goes well.

How can you flush it?

If you want to measure cold startup times you need to clear the file system cache somehow. The easiest way is to use RAMMap from SysInternals. It has an option to flush the Standby List.

image

That is nice but unfortunately there is no command line tool available to do this in a test script to create repeatable tests. But not all is lost. Process Hacker is a better clone of Process Explorer which comes with full source code. There you can find the undocumented Windows API calls to flush the OS file system cache directly. It is basically

SYSTEM_MEMORY_LIST_COMMAND command = MemoryPurgeStandbyList;
NtSetSystemInformation(SystemMemoryListInformation, &command, sizeof(SYSTEM_MEMORY_LIST_COMMAND));

and you are done. There is also another trick I want to share: you want to clear the file system cache for a specific file? Windows will flush the cached data of a file if it was opened with the FILE_FLAG_NO_BUFFERING flag. Here is a little C# class that employs this trick:

class FileSystemTools
{
    const int FILE_FLAG_NO_BUFFERING = 0x20000000;
    const FileOptions Unbuffered = (FileOptions)FILE_FLAG_NO_BUFFERING;

    /// <summary>
    /// Flush the file system cache for this file. This ensures that all subsequent operations
    /// on the file really cause disc access and you can measure the real disc access time.
    /// </summary>
    /// <param name="file">full path to file.</param>
    public static void FlushFSCacheForFile(string file)
    {
        using (FileStream fileStream = new FileStream(file, FileMode.Open, FileAccess.Read,
            FileShare.ReadWrite, 4096, FileOptions.None | Unbuffered))
        {
        }
    }
}

Hidden Dragons

If the first test run is slow and all other are fast we know that the file system cache is to blame. If you see a pattern like this

image

you know that the first run was cold and all others were warm. If we now flush the file system cache we should get rock solid cold times, with no other influences except virus scanners, indexing services and other pesky applications making our test unpredictable. If we measure again - this time with flushing the file system cache - we get this:

image

 

This is not warm and not cold. What is it then? The answer is that it was mostly a warm startup because another instance of the application was still running. If a process is alive and its working set was not trimmed, then the dll data does not need to come from the cache since it still belongs to the working set of a running process. You therefore need to make sure that you close all applications which keep your dlls in memory (e.g. Visual Studio with its designers …) to test a real cold startup scenario. There are other pitfalls as well. If you kill a process without waiting until it has terminated you can get random good runs because a process kill is an asynchronous operation. You have to wait on the process handle until it becomes signaled to be sure that it is really gone. Or if you are using .NET you need to call Process.WaitForExit() to be sure.
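The same pitfall exists outside .NET: terminating a process is asynchronous everywhere, so a test script must block until the process is really gone before starting the next measurement. A minimal cross-platform sketch of the safe pattern (the sleeping child stands in for the application under test):

```python
import subprocess
import sys

# Start a long-running dummy process (placeholder for the app under test).
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

proc.terminate()  # the kill is only *requested* here - it is asynchronous
proc.wait()       # block until the process has really exited,
                  # the equivalent of .NET's Process.WaitForExit()
assert proc.returncode is not None
```

Skipping the wait() call is exactly what produces the random "good" cold runs described above.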

After we have fixed that we get this graph:

image

What the hell is going on here? This was quite hard to find out since there is very little information on the net about how to reliably test cold startup scenarios. The hidden thing is Superfetch.

Doesn't ring a bell? Besides speeding up the boot it also tries to prefetch often used application dlls into memory before the application is started, based on intelligent heuristics. This gets in our way if we flush the file system cache. After the flush, the Superfetch system service named SysMain, which is hosted in a generic svchost.exe process, tries to prefetch often used dlls at low IO priority. This additional IO collides with our test where we want to load our dlls as well. To get consistent IO numbers it is therefore best to stop the Superfetch service.

sc stop SysMain

After doing that the graph still looks the same. What is interesting is that if our application reads about 30MB more data it starts nearly 10s faster. This is a mystery which only xperf can solve. Here is a Disc IO graph of a slow 23s run

image

And here is a fast 14s run.

image

We read more data but the Disc Service Time is less! The average Disc Queue Depth is very high, which should indicate an overwhelmed disc. What stands out is that we have about 70% fewer disc read operations on the assembly folder where the native images of .NET applications are stored. Sometimes we read the disc in a slow and sometimes in a fast way. To answer the question why we read from the disc in such different ways we need to check out the Disc IO Init call stacks. These will tell us who initiated the read operations and, most importantly, when. First comes the slow run

image

It is no surprise that our slow cold startup causes many hard page faults which trigger the kernel to read the missing pages from disc. This is known to be slow. We can do little about it except execute less code during startup. The graph shows a nicely scattered pattern across the complete startup time. Now let's have a look at the fast run

image

That looks radically different. We load the same dlls but we have no hard faults? All dll data is read before the application main thread starts running. In our process the Prefetcher reads its prefetch database, which is 1,3MB in size (see PfSnPrefetchMetaData, 687ms), which allows it to read our dlls in PfSnPrefetchSections (151MB in 3,1s), mostly with sequential reads, in about half of the time. That is the real reason why we have such wildly varying startup times. The prefetcher will look at ca. the first 12s of application startup and will then update the prefetch database for your executable. Next time the startup is a little better and it will update the database again, causing it to prefetch a little more, and so on. This is a nice self tuning system which not many people are aware of. From time to time the prefetcher seems to do nothing, to check (that is my guess) if the prefetch database is helping at all. That explains the mostly good and sometimes very slow numbers where no prefetching is active.

If we aim at stable, repeatable cold tests we need to disable the Superfetch service, because this process is responsible for writing and updating the prefetch database. The prefetch database file is located at

C:\Windows\Prefetch\ExeName-FileHashPrefetchFlag.pf

We need to delete that file as well, and now we have real control over our IO during application startup. If you want to calculate the file name of your application's prefetch database you need to take the full path of your executable and run it through a hash function which is described at http://forensicswiki.org/wiki/Prefetch. There are still some sites out there which tell you to delete the prefetch folder to prevent caching of dlls to make your system faster. If these guys had really measured their startup time in a controlled and systematic manner they would not have written such misleading guidelines. As you can see, you need to take care of a lot of details to really understand your system. But it is worth it. If you have looked closely at the .pf file name there is a PrefetchFlag at the end. That is a number which you can pass to generic container processes to generate different prefetch databases for the same executable. This can be useful if you have generic container processes with wildly varying dll sets inside them. How do you tell the OS loader that you want to use a different prefetch database to speed up cold startup times?

Simple: pass it as a command line parameter! If you create e.g. an executable with the name prefetchTest.exe and pass /prefetch:dd as a command line argument to it, you get different prefetch files for each scenario:

prefetchTest.exe
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B1.pf 
prefetchTest.exe /prefetch:1
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B2.pf
prefetchTest.exe /prefetch:2
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B3.pf
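The hash part of these file names comes from hashing the executable's device path. For illustration, here is the Vista-era variant of the hash as described on the forensicswiki page linked above; the exact algorithm differs between Windows versions, so treat this as a sketch rather than a reference implementation:

```python
def prefetch_hash_vista(device_path):
    """SCCA Vista-style hash over the upper-cased device path of the
    executable; yields the 32 bit value that appears as the hex part
    of an EXENAME-XXXXXXXX.pf file name on matching Windows versions."""
    h = 314159
    for ch in device_path.upper():
        h = (h * 37 + ord(ch)) & 0xFFFFFFFF
    return h

# The input is the NT device path, not the drive-letter path, e.g.
# r"\DEVICE\HARDDISKVOLUME1\TOOLS\PREFETCHTEST.EXE" (hypothetical volume).
```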
That is a nice find. Perhaps you can make your own customers happier by including a cold startup test in your test suite to make the pain public. Most test reports I have seen include only warm numbers because it is quite difficult to create repeatable and stable cold startup tests. Here are my tips as short bullets for repeatable cold startup measurements:

Test Machine Preparation

  • Stop/Disable the Superfetch (SysMain) service.
  • Disable Windows Search (WSearch) service.
  • Disable Virus scanners.
  • Disable scheduled tasks and other system management software.
  • Delete the prefetch database for your executable.

Test Execution

  • Kill all applications which keep your dlls loaded.
  • Flush the file system cache before the test.

When you follow these tips you will easily get repeatable and stable cold startup numbers. If for some reason much more code is executed during startup, causing many page faults, you can easily spot it. The warm startup time might not be affected at all because the actual run time of the additional code might be low, but the new disc access pattern could be pricey in terms of cold startup time.
