Channel: Geeks with Blogs

Different Not Automatically Implies Better


Originally posted on: http://geekswithblogs.net/akraus1/archive/2013/11/05/154556.aspx

Recently I was digging into why a WCF hosted workflow application consumed quite a lot of memory although it basically only loaded a xaml workflow. The first tool of choice is Process Explorer or, even better, Process Hacker (it has more options, and its best feature is that copy & paste works). The three most important memory numbers of a process are Working Set, Private Working Set and Private Bytes.

  • Working Set is the currently consumed physical memory (parts of it can be shared between processes, e.g. loaded dlls which are read only)
  • Private Working Set is the physical memory needed by this process which is not shareable
  • Private Bytes is the amount of non-shareable memory which is only visible to the current process (e.g. all new, malloc and VirtualAlloc calls create private bytes)

A bigger workflow can, as a 64 bit process, easily consume 500 MB for a 1-2 MB xaml file. That does not look very scalable. Under 64 bit the issue is excessive private bytes consumption and not the managed heap. The picture is quite different under 32 bit, which looks a bit strange, but it seems that the hosted 32 bit VB compiler is a lot less memory hungry. I tried to repro the issue with a medium sized xaml file (400 KB) which contains 1000 variables and 1000 if clauses, which can be represented by C# code like this:

string Var1;
string Var2;
...
string Var1000;

if (!String.IsNullOrEmpty(Var1))
{
     Console.WriteLine("Var1");
}
if (!String.IsNullOrEmpty(Var2))
{
     Console.WriteLine("Var2");
}
....
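To experiment with different workflow sizes I generated this kind of code. A small script along these lines (a sketch of my own, the helper name is not from the original tooling) could emit the C# equivalent for any n:

```python
# Hypothetical generator for the repro source shown above: n variable
# declarations followed by n if clauses, so the test size scales easily.
def generate_repro(n):
    decls = "".join("string Var{0};".format(i) for i in range(1, n + 1))
    ifs = "".join(
        'if (!String.IsNullOrEmpty(Var{0}))\n'
        '{{\n'
        '     Console.WriteLine("Var{0}");\n'
        '}}\n'.format(i)
        for i in range(1, n + 1))
    return decls + "\n" + ifs
```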
 

Since WF is based on VB.NET expressions you are bound to the hosted VB.NET compiler, which results (on x64) in 140 MB of private bytes. That is ca. 140 KB for each if clause, which is quite a lot considering the actually present functionality. But there is hope: .NET 4.5 now allows C# expressions for WF, which is a major step forward for all C# lovers. I created a simple patcher to “cross compile” my xaml to C# expressions. Let's look at the result:

C# Expressions (x86): image
VB Expressions (x86): image

On my home machine I have only 32 bit, which gives quite exactly half of the memory consumption of 64 bit. C# expressions are 10 times more memory hungry than VB.NET expressions! I wanted to do more with less memory, but instead it consumed a magnitude more. That is surprising to say the least. The workflow initializes in about the same time under x64 and x86: the VB version does it in 2s whereas the C# version needs 18s, so it is also nearly ten times slower. That is too high a price to pay to convert any bigger xaml workflow from VB.NET to C# expressions. If I reduce the number of expressions to 500 it needs 400 MB, which is about half of the memory. It seems that the cost per if rises linearly with the total number of expressions in a xaml workflow.

Expression Language    Cost per IF    Startup Time
C# 1000 Ifs x64        1.5 MB         18s
C# 500 Ifs x64         750 KB         9s
VB 1000 Ifs x64        140 KB         2s
VB 500 Ifs x64         70 KB          1s

 

Update: The cost per if is not constant but rises with O(x^2), where x is the number of expressions. On x64 the same workflow with 3000 IFs needs 11 GB of memory.
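The table numbers fit a simple quadratic model: the cost per if clause (C#, x64) is roughly 1.5 KB times the total expression count n, so the total grows with n². A back-of-the-envelope sanity check (my own model, derived only from the table values):

```python
# Quadratic memory model for C# expressions on x64, derived from the table:
# cost per if clause ~ 1.5 KB * n, hence total ~ n ifs * 1.5 KB * n.
def total_gb(n, c_kb=1.5):
    return n * n * c_kb / (1024 * 1024)  # KB -> GB

# total_gb(1000) ~ 1.4 GB, total_gb(500) ~ 0.36 GB (cost per if halves),
# total_gb(3000) ~ 12.9 GB, in the same ballpark as the observed 11 GB.
```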

Now we can directly compare two MS implementations. It looks like both expression compilers use the same underlying structure, but the highly inefficient C# expression compiler has a much higher constant factor than the VB.NET one. I have filed a connect bug here with harsher wording about recent advances in memory consumption. The funniest thing is that a MS employee gave an Azure AppFabric demo around early 2011 which was so slow that he needed to investigate with xperf. He was after startup time, and his call stacks around VB.NET expression compilation were remarkably similar to mine. In fact I only found his post by googling for parts of my call stacks.

… “C# expressions will be coming soon to WF, and that will have different performance characteristics than VB” …

What did he know in January 2011 that I did not know until today? ;-). He knew that C# expressions would come, and that they would not automatically have a better footprint. It is about time to fix that. In its current state C# expressions are not usable for bigger workflows. That also explains the headline for today: you can cheat startup time by prestarting workflows so that the demo looks nice and snappy, but it hurts scalability a lot since you need much more memory than necessary.

I found the stacks by enabling virtual allocation tracking in XPerf, which is still the best tool out there. But first you need to look at your process to check where the memory is hiding:

image

For the C# expression compiler you do not need xperf: you can directly dump the managed heap and check with a profiler of your choice. But if the allocations happen in Private Data (VirtualAlloc) you can find them with xperf. There is a nice video on Channel 9 explaining VirtualAlloc tracking in greater detail.

If your allocations show up on the Heap it means that the C/C++ runtime created a heap for you from which all malloc and new calls allocate. You can enable heap tracing with full call stack support via xperf, as is also shown on Channel 9, or you can use WPRUI directly:

image

To make “Heap Usage” work you need to set the tracing flags for your executable (before you start it). For example for devenv.exe:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\devenv.exe

DWORD TracingFlags 1
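Because it is easy to forget to remove the flag again, I find it handy to script the change. A small sketch (the helper name is my own invention) that builds the matching reg.exe commands:

```python
# Builds reg.exe commands to toggle heap tracing (TracingFlags) for an
# executable under Image File Execution Options. Run the output as admin.
IFEO = r"HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options"

def tracing_flags_cmd(exe, enable=True):
    if enable:
        return 'reg add "{0}\\{1}" /v TracingFlags /t REG_DWORD /d 1 /f'.format(IFEO, exe)
    # deleting the value disables heap tracing again
    return 'reg delete "{0}\\{1}" /v TracingFlags /f'.format(IFEO, exe)
```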

Do not forget to disable it after you have completed profiling the process, or it will impact its startup time quite a lot. With xperf you can also attach directly to a running process and collect heap allocation information from a process gone wild. Very handy if you need to find out what a process that has arrived in a funny state was doing.

“VirtualAlloc usage” works without explicitly enabling anything for a specific process and is always on machine wide. I had issues with call stack collection on my Windows 7 machines with the latest Windows 8.1 Performance Toolkit. I was told that WPA from Windows 8.0 should work fine, but I do not want to downgrade.

11/09/2013 Update: WPT 8.1 has a bug which hides the VirtualAlloc stacks and a whole bunch of graphs for etl files taken on Windows 7 machines. Until this issue is fixed you can use WPT 8.0 to analyze the etl files. The etl files are ok, only the viewer is wrong.

Using WPT 8.0 works as expected. You can xcopy deploy it to a different directory and put the right dlls for your target OS from Windbg into the WPT folder:

  • dbghelp.dll
  • symsrv.dll
  • srcsrv.dll

The Windows ADK also contains the WPT, which you can download to your hard drive. If you change the msi with Orca and rename, in the Upgrade table, the property WIX_DOWNGRADE_DETECTED to WIX_DOWNGRADE_DETECTED1, the installer thinks that no newer version is installed and it will be fine.

Alternatively you can also remove the FindRelatedProducts action from the InstallExecuteSequence and InstallUISequence table to skip the check for previous versions.

Or you can simply uninstall the 8.1 SDK and install the 8.0 SDK, which works as well.


Recently Used Files And The Windows Filesystem Cache


Originally posted on: http://geekswithblogs.net/akraus1/archive/2013/11/14/154624.aspx

I constantly try out new and old tools to check whether they can be useful in my daily work. Did you ever wonder where Windows puts the contents of the 200 MB file you just read, and why reading the same file again is almost always faster? The answer to both questions is the file system cache. When you look at the performance counters you will find quite a lot of cache related counters, but their values are typically only a few hundred MB while the task manager shows very different values.

In theory, on my Windows 7 machine I should get the Task Manager value from “System Cache Resident Bytes” in the Memory section. The description sounds good:

“System Cache Resident Bytes is the size, in bytes, of the pageable operating system code in the file system cache. This value includes only current physical pages and does not include any virtual memory pages not currently resident. It does equal the System Cache value shown in Task Manager. As a result, this value may be smaller than the actual amount of virtual memory in use by the file system cache. This value is a component of Memory\System Code Resident Bytes which represents all pageable operating system code that is currently in physical memory. This counter displays the last observed value only; it is not an average.”

The bold statement about equality is indeed a very bold one. On my Windows 7 x32 machine Task Manager reports 504 MB as Cached whereas the performance counter only gives 185 MB. There seems to be quite a bit of confusion about what should be counted as cache and what not. This is an old description from Windows XP times, whose task manager did have a “System Cache” value.

So far I have not found a really good explanation of what exactly the task manager is showing as the Cached value.

image

But there are better tools around, like Resource Monitor (perfmon /res), which has more to offer. It shows tooltips at unexpected places, like the value descriptions in the Memory tab:

image

Now we can explain what Task Manager is actually showing there. Standby and Modified are the names of Windows kernel structures which are used by the Windows Memory Manager.

The tool RamMap gives you all the details about the current state of the Memory Manager and its priority lists for different types of memory. More details about what all these different types of memory mean and how the Memory Manager organizes its stuff can be found on the web.

image

Name      | Explanation                                              | Formula
Total     | Installed memory in your computer                        | Total = Active List + Standby List + Modified List + Transition List + Zeroed List + Free List
Cached    | Memory used by the file system cache                     | Cached = Standby List + Modified List
Available | Can be allocated as private memory by other applications | Available = Standby List + Free List
Free      | Unused memory                                            | Free = Zeroed List + Free List
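As a worked example of these formulas (the list sizes are made up by me, chosen so that the Cached value matches the 504 MB reported by my Task Manager above):

```python
# Made-up memory list sizes in MB, plugged into the RamMap formulas above.
lists = {"Active": 900, "Standby": 400, "Modified": 104,
         "Transition": 4, "Zeroed": 500, "Free": 100}

total     = sum(lists.values())                      # all physical memory
cached    = lists["Standby"] + lists["Modified"]     # what Task Manager calls Cached
available = lists["Standby"] + lists["Free"]         # can be handed out quickly
free      = lists["Zeroed"] + lists["Free"]          # truly unused pages
```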

Why should you care about all that stuff? Because if you are reading this you are generally interested in how things work. Besides curiosity, you can now answer questions like

  1. Why does a performance regression test need two warmup runs before the numbers become stable?
  2. Which files did my colleague open and then delete?

Let's start with the most disturbing question. If I have direct access with Admin rights, I can have a look at all files you had open on your computer. Open RamMap and check out the File Details tab. It lists all files which are currently completely or partially stored in the file system cache in one way or another.

image

With livekd from SysInternals I can also view the contents of the cached files:

0: kd> !db 7e3ee000
#7e3ee000 3b 20 66 6f 72 20 31 36-2d 62 69 74 20 61 70 70 ; for 16-bit app
#7e3ee010 20 73 75 70 70 6f 72 74-0d 0a 5b 66 6f 6e 74 73  support..[fonts
#7e3ee020 5d 0d 0a 5b 65 78 74 65-6e 73 69 6f 6e 73 5d 0d ]..[extensions].
#7e3ee030 0a 5b 6d 63 69 20 65 78-74 65 6e 73 69 6f 6e 73 .[mci extensions
#7e3ee040 5d 0d 0a 5b 66 69 6c 65-73 5d 0d 0a 5b 4d 43 49 ]..[files]..[MCI
#7e3ee050 20 45 78 74 65 6e 73 69-6f 6e 73 2e 42 41 4b 5d  Extensions.BAK]
#7e3ee060 0d 0a 33 67 32 3d 4d 50-45 47 56 69 64 65 6f 0d ..3g2=MPEGVideo.
#7e3ee070 0a 33 67 70 3d 4d 50 45-47 56 69 64 65 6f 0d 0a .3gp=MPEGVideo..
0: kd> !db
#7e3ee080 33 67 70 32 3d 4d 50 45-47 56 69 64 65 6f 0d 0a 3gp2=MPEGVideo..
#7e3ee090 33 67 70 70 3d 4d 50 45-47 56 69 64 65 6f 0d 0a 3gpp=MPEGVideo..
#7e3ee0a0 61 61 63 3d 4d 50 45 47-56 69 64 65 6f 0d 0a 61 aac=MPEGVideo..a
#7e3ee0b0 64 74 3d 4d 50 45 47 56-69 64 65 6f 0d 0a 61 64 dt=MPEGVideo..ad
#7e3ee0c0 74 73 3d 4d 50 45 47 56-69 64 65 6f 0d 0a 6d 32 ts=MPEGVideo..m2
#7e3ee0d0 74 3d 4d 50 45 47 56 69-64 65 6f 0d 0a 6d 32 74 t=MPEGVideo..m2t
#7e3ee0e0 73 3d 4d 50 45 47 56 69-64 65 6f 0d 0a 6d 32 76 s=MPEGVideo..m2v
#7e3ee0f0 3d 4d 50 45 47 56 69 64-65 6f 0d 0a 6d 34 61 3d =MPEGVideo..m4a=

C:\Windows\system32>type c:\Windows\win.ini
; for 16-bit app support
[fonts]
[extensions]
[mci extensions]
[files]
[MCI Extensions.BAK]
3g2=MPEGVideo
3gp=MPEGVideo
3gp2=MPEGVideo
3gpp=MPEGVideo
aac=MPEGVideo
adt=MPEGVideo
adts=MPEGVideo
m2t=MPEGVideo
m2ts=MPEGVideo

Yes, the cache is really working and I can view its contents, which is quite interesting and disturbing at the same time.

Now to question one: why do some people claim that they need two warmup runs before the numbers become stable? A cache is a cache. If you have a test which uses nearly all of your physical memory (>75%), the cache manager of Windows needs to throw out the least used cached files to make room for the memory hungry application.

The first run of your test loads the data from disc into the cache, which works well until the memory footprint of your application becomes so big that the Cache Manager needs to evict pages of the cached files. During a second run the same things happen until the cache manager has learned which usage pattern you follow and adjusts its internal policies. Since I do not want to learn every detail of the cache manager, it is much easier to switch to a machine with much more RAM than the application needs, so that during the second run mostly everything runs from the cache.

The Windows task manager lures millions of users into the false safety that they have plenty of free RAM and can consume a few more GB of it without negative effects. Especially notebook users with 64 bit Windows, 4 GB of RAM and a slow hard disc experience long pauses during their routine work. A colleague of mine asked me if I could investigate why his notebook was so slow. After I had seen his memory configuration I could tell him that he simply needed more RAM to make his work much smoother. Slow notebook hard discs show you very clearly that the cache is there for a reason: when you effectively turn it off by using all of the memory for application data, you will experience many hard page faults which result in a sluggish user experience.

But developers are not immune to such traps either. You want to make your compile times faster by exchanging your hard disc with a solid state disc? Sure you can, but how about getting more memory first? A few GB more RAM can make a huge difference to your compile times if most of your stuff is stored in the file system cache. That also explains why some tests with a RAM disc showed no improvement of compile times when we copied all of the data onto it. If you have enough RAM you do not need a RAM disc (most of the time)! At my work machine with 16 GB of RAM I see the file system cache begin to shrink already when I hit a 12 GB private bytes count.

What can you learn from this? Be nice to your cache and keep at least 1 GB of memory free to be sure that the cache can work as a cache. Today's hard discs are still about 100,000 times slower than memory, which is a good reason to keep your cache big.

If you want to check before a performance regression test whether your cache is filled, you could execute

rammap C:\temp\ram.xml

to get a nice xml file which shows which files are in the cache right now. The cached files are stored as xml nodes of the form

<File Key="3114984784" Path="C:\windows\system32\aelupsvc.dll"/>

which can be easily parsed. Then you can filter e.g. for your application path and the Windows directory, from where most dlls (especially NGen images for managed code) are loaded during startup. Once you have set a cache baseline you can verify during subsequent tests that the cache still looks similar to your baseline run. If not, you can skip the test run or try another warmup run to get consistent values for your performance regression tests. This can give you a hint why a performance regression test on that machine ran so slowly compared to the others: you do not need to debug but can immediately prove that the test most likely experienced many more hard faults during its execution because of a different cache state.
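A sketch of such a baseline check could look like this (assuming the export wraps the <File> nodes shown above in some root element; the function name, root tag and sample paths are my own):

```python
import xml.etree.ElementTree as ET

def cached_files(xml_text, prefix):
    """Return the paths of all cached files below a given path prefix."""
    root = ET.fromstring(xml_text)
    # one <File Key="..." Path="..."/> node per cached file
    return [f.get("Path") for f in root.iter("File")
            if f.get("Path", "").lower().startswith(prefix.lower())]

# Minimal made-up sample in the shape of the RamMap export
sample = """<FileDetails>
  <File Key="3114984784" Path="C:\\windows\\system32\\aelupsvc.dll"/>
  <File Key="42" Path="C:\\MyApp\\MyApp.exe"/>
</FileDetails>"""
```

Comparing the returned lists between a baseline run and the current run tells you whether the cache state is comparable before you trust the test numbers.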

Recently I talked to a MS program manager about performance tests, and I was surprised that their test strategy was quite the opposite. They flush the file system cache before the test, which always profiles the worst (cold startup) scenario and also leads to consistent numbers. He claimed that when you optimize cold startup, warm startup will also benefit. That may be true to a certain extent, but since warm startup is about a factor of two faster, you will experience concurrency during startup at places where none was before, and a much higher CPU utilization. If you are not careful this can shift the places where you should invest your optimization time quite a lot.

Windows 8/8.1 Enhancements to ETW


Originally posted on: http://geekswithblogs.net/akraus1/archive/2013/12/14/154884.aspx

When you try to make sense of Windows Event Tracing files (.etl) it is often helpful to add some special events to the event stream, so that you can examine more closely what happened near the “Here it was slow” event, which helps a lot to look at the right spots. If you are lucky you can do this by routing your tracing data to ETW, or you need to set some marker events by hand. The traditional way of doing this was to call xperf –m “Here it was slow” during your repro session. Since then I have observed several times that this command no longer works on some machines.

C:\Program Files (x86)\Windows Kits\8.1\bin\x64> xperf -m "hi"
xperf: error: Set Mark failed

This never happened on my Windows 7 machine, so I thought it was some oversight in Windows 8. Interestingly, if I started the tracing session with xperf all worked well again. But when I started a tracing session with the new WPR/UI tool from the Windows 8.1 SDK this error was consistently there. After digging a little deeper with xperf –Loggers I found that WPR/UI does not create the usual "NT Kernel Logger" session but instead a WPR_initiated_WprApp_WPR System Collector session, for which I cannot see the trace flags anymore.

C:\Windows\system32>xperf -Loggers
Logger Name           : WPR_initiated_WprApp_WPR System Collector
Logger Id             : 11
Logger Thread Id      : 0000000000000000
Buffer Size           : 1024
Maximum Buffers       : 1632
Minimum Buffers       : 1632
Number of Buffers     : 1632
Free Buffers          : 1624
Buffers Written       : 0
Events Lost           : 0
Log Buffers Lost      : 0
Real Time Buffers Lost: 0
Flush Timer           : 0
Age Limit             : 0
Log File Mode         : Secure Buffered PersistOnHybridShutdown SystemLogger
Maximum File Size     : 0
Log Filename          :
Trace Flags           :
PoolTagFilter         : *

That explains why I cannot add mark events anymore on Windows 8 machines: there is no "NT Kernel Logger" session to set the mark event in. This is bad news for xperf –m, but good news for all people who like to keep Process Explorer and other tools open while they are taking kernel trace events. It seems that xperf was not updated to display the tracing flags for the new type of kernel tracing sessions, because another tool from the Windows 8.1 SDK can display them: tracelog

                                                                    
C:\Program Files (x86)\Windows Kits\8.1\bin\x64>tracelog.exe -l

Logger Name:            WPR_initiated_WprApp_WPR System Collector
Logger Id:              0x11
Logger Thread Id:       0000000000000000
Guid:                   c7f66f95-6045-11e3-8265-d43d7ef06eb2
Session Security:       D:(A;;0x800;;;WD)(A;;0x120fff;;;SY)(A;;0x120fff;;;LS)(A;;0x120fff;;;NS)(A;;0x120fff;;;BA)(A;;0xee5;;;LU)(A;;LC;;;MU)(A;;
Buffer Size:            1024 Kb
Maximum Buffers:        1632
Minimum Buffers:        1632
Number of Buffers:      1632
Free Buffers:           1624
Buffers Written:        0
Events Lost:            0
Log Buffers Lost:       0
Real Time Buffers Lost: 0
AgeLimit:               0
Real Time Consumers:    0
ClockType:              PerfCounter
Log Mode:               Secure  Buffering-only
Hybrid Shutdown:        Persist
Maximum File Size:      not set
Buffer Flush Timer:     not set
Enabled tracing:        Process Thread ProcCounters ImageLoad DiskIo HardFaults CxtSwap Dpc Isr Profile Power MemInfo MemInfoWs Priority Dispatcher… 
Stack Tracing:         
DiskRead DiskWrite DiskFlush ThreadDCEnd CSwitch ReadyThread KernelQueueEnqueue KernelQueueDequeue Profile
Log Filename:

 

When you look at the documentation of tracelog you will find many Windows 8 and Windows 8.1 additions which make it a cool tool to record trace sessions. Although I must admit that the sheer number of tools to start traces is getting out of hand. We have:

ETW Control Tool    Part Of
WPR/UI              Windows Performance Toolkit
xperf               Windows Performance Toolkit
logman              Windows
tracelog            Windows 8.1 SDK (formerly only part of the WDK)
PerfView            Extra download from MS

Each tool has its own syntax and philosophy behind it, which makes it a bit cumbersome to pick the right tool. But as far as I can tell you should check out tracelog first, since it is the only tool which is able to make use of the cool new Windows 8/8.1 enhancements. Below is a list of the new ETW capabilities of Windows 8/8.1:

Since | Tracelog Feature | Description
8.1 | -bt n | Specifies the number (n) of buffers to fill before starting to flush them.
8.1 | -EventIdFilter {-in|-out} n id1 id2 ... | Specifies an event id filter with n event ids (maximum 64 event ids allowed).
8.1 | -ExeFilter Executable_file [;Executable_file ...] | Specifies the names of executable files to filter. You can specify a list of files. Separate the names of the files using semicolons. Files not listed are excluded.
8   | -hybridshutdown {stop|persist} | Controls hybrid shutdown logger behavior.
8.1 | -independent | Enables independent mode on the trace session.
8.1 | -PidFilter n pid1 pid2 ... | Specifies a Pid filter with n Pids (maximum 8 allowed).
8.1 | -PkgIdFilter "Package Full Name" [;"Package Full Name" ...] | Specifies a package ID filter. You can specify a list of package files. Separate the names of the files using semicolons.
8.1 | -PkgAppIdFilter PRAID [;PRAID ...] | Specifies a package-relative app identifier (PRAID) filter. The PRAID is the unique identifier of the application within the package. You can specify more than one PRAID. Separate the ids using semicolons.
8   | -Pmc Ctrs:Events | Configures performance monitor counter (PMC) sampling on events.
8   | -ProfileSource src | Configures the profiling source to use. For a list of sources, use the command tracelog -ProfileSource Help.
8   | -SetProfInt n src | Configures the profiling interval (n) for the specified source, where n represents units of 100 ns. The default is 10000 (equivalent to 1 ms).
8.1 | -StackWalkFilter {-in|-out} n id1 id2 ... | Specifies an event id filter with n event ids (maximum 64 event ids allowed).
8   | -systemlogger | The logger can receive SystemTraceProvider events. See Configuring and Starting a SystemTraceProvider Session.
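The -SetProfInt units are easy to get wrong, so here is the conversion spelled out (a trivial helper of my own):

```python
# -SetProfInt takes the sampling interval in units of 100 ns:
# interval_ms = n * 100 ns, so the default n = 10000 samples every 1 ms.
def prof_interval_ms(n):
    return n * 100 * 1e-9 * 1000  # 100 ns units -> seconds -> milliseconds
```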

By default tracelog starts loggers for kernel mode providers with the systemlogger option, so you can start up to 8 additional kernel loggers, which should be enough for everyone. To log e.g. all context switch events with call stacks you can use:

                                                                                                                                     
tracelog -start MySession -f c:\Kernel1.etl -img -eflag PROC_THREAD+LOADER+CSWITCH+DISPATCHER -stackwalk CSWITCH+ReadyThread

This creates a new kernel trace session (Windows 8 and later only!) without blocking the default Kernel Logger session, which is the most useful of the new capabilities of the tracelog tool. When I stop it with

                         
tracelog -stop MySession
                         

then no image identification rundown is performed, which means that I have call stacks but WPR cannot identify in which images the stack addresses reside. tracelog also does not seem to be able to merge etl files like xperf can, not to mention that it does not create the managed pdbs when stopping a trace session like wpr and xperf do. Still, tracelog is the only tool which exposes the full power of ETW as a command line tool. It has the most interesting options to start new trace sessions, but it seems (correct me, I really would like to know) to be the wrong tool to stop a trace session, since it does not initiate all the necessary image identification and managed pdb creation. I really would like to see additional parameters in xperf to create the managed pdbs later, when I have already gathered an etl file.

The potentially huge output of ETW can now be filtered, at least at process level for user mode providers (-PidFilter or -ExeFilter), and stack walking can be enabled/disabled for specific events instead of the complete provider (-StackWalkFilter …), which is a welcome addition for Windows 8.1 users of ETW. Based on my limited experience with the new tracelog tool I would use it to start trace sessions for sophisticated analysis tasks on Windows 8 machines, but stop the trace session with xperf to get the usual image identification and NGen pdb creation support.

The command

                                                 
xperf -stop MySession -d c:\temp\xperfStop.etl   
                                                 

did the job to stop the system logger, and I got readable call stacks with full .NET support. If you do not want to use a myriad of tools you can also use PerfView, which Vance Morrison uses a lot to troubleshoot VS issues. The command line syntax is vastly different from the previous tools, but it can be quite handy that you only need to deploy a 10 MB exe which can collect and view ETW data.

perfview start c:\temp\perfview.etl  -nogui

will start profiling with a sample interval of 1 ms and also capture all disc related stuff, which is a good default to start with. If you need to do some wait chain analysis you can use

                                                       
perfview start c:\temp\perfview.etl  -nogui -ThreadTime
                                                       

which enables a bunch of additional providers:

Enabled tracing:       Process Thread ProcCounters ImageLoad DiskIo DiskIoInit HardFaults CxtSwap TcpIp Profile Dispatcher
Stack Tracing:          DiskRead DiskWrite DiskFlush 0x0220 ProcessCreate ImageLoad ThreadCreate CSwitch ReadyThread Profile

To stop tracing you need to do

                           
perfview -zip:false stop   
                           

You can view the generated file with

                                
perfview c:\temp\perfview.etl   
                                

or with WPA or your favorite viewing tool. For some managed analysis PerfView is very handy. With handy I mean that it is the only tool to analyze managed processes for things like

  • GC wait times
  • JIT times
  • Managed Allocation Tracing (type and stack)
  • thrown exceptions without using a debugger, profiler or your own tracing

I refused to try PerfView for a long time because it always complained that nearly all my stacks were broken. The issue is that it handles loading pdbs for call stack resolution and filtering quite differently.

image

The first hurdle is to notice that there are always stack filters enabled (see GroupPats), which will often fold your stack into near uselessness. After deleting the GroupPats text box I get real stacks, but no symbols:

image

Now I have the stacks but still no symbols. To fix that I must use the context menu to load the corresponding symbols for this call stack. The MS symbol servers are configured by default. If you have your own symbol server you need to configure it in the File - Set Symbol Path menu first.

image

After that I get call stacks with method names as usual:

image

Once I realized that, I found PerfView quite useful. It is not only interesting for performance work, but also as a system wide debugger which captures all thrown managed exceptions in all processes at once. The tool itself has many more options which you can explore with

Perfview /?

It supports some really advanced scenarios, like collecting data in a ring buffer until a specific performance counter reaches a threshold, then capturing x seconds more and then stopping. The main drawback of ETW with full profiling and context switch capturing is that the amount of data on a busy system easily exceeds 1 GB within one minute. If you want to capture data for a really long time and stop when the interesting condition has happened, you need more sophisticated stop triggers. PerfView has captured this operational know-how in its command line parameters.

It can stop on

  • Performance Counter Thresholds
  • Application Event Log Messages
  • ETW Events
  • Gen 2 Collections
  • Long Gen 2 Collections
  • Managed exceptions of a specific type
  • AppFabric request time (Azure)

This makes it a cool tool to diagnose sporadic errors when you cannot afford to let the application run under a debugger. The biggest pain point of PerfView is that while it is great at collecting data, it cannot work directly with the generated etl file and needs to convert it to its own format (etlx) for further processing. This can take some additional minutes for bigger etl files, but I hope it will become faster in future versions.

Below is the full command line help of PerfView, which (just like tracelog) has not gotten enough attention so far. Tracelog can be excused by having lived its hidden life in the Windows Driver Kit until it was moved to the Windows SDK with 8.1. It seems that device driver developers do not share their tricks online like "normal" .NET developers do.

 

The PerfView application has a number of commands associated with it, each with its own set of parameters and
qualifiers.  They are listed below.  Options that are common to all commands are listed at the end.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView [DataFile]

  View profile data.

  Parameters:
    [DataFile]                           ETL or ETLX file containing profile data.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView run CommandAndArgs ...

  Starts data collection, runs a command and stops.

  Parameters:
    CommandAndArgs ...                   Command to run and arguments.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView collect [DataFile]

  Starts data collection, wait for user input, then stops.

  Parameters:
    [DataFile]                           ETL file containing profile data.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView start [DataFile]

  Starts machine wide profile data collection.

  Parameters:
    [DataFile]                           ETL file containing profile data.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView stop

  Stop collecting profile data (machine wide).

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView mark [Message]

  Add a PerfView 'Mark' event to the event stream with a optional string message

  Parameters:
    [Message]                            The string message to attach to the PerfView Mark event.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView abort

  Insures that any active PerfView sessions are stopped.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView merge [DataFile]

  Combine separate ETL files into a single ETL file (that can be decoded on another machine).

  Parameters:
    [DataFile]                           ETL file containing profile data.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView listSessions

  Lists active ETW sessions.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView ListCpuCounters

  Lists the ListCpuCounters CPU counters available on the system (win8+ only).

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView EnableKernelStacks

  On X64 machines if you have problems with broken stacks when the code is executing in the kernel, setting this option
  and rebooting may improve things

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView DisableKernelStacks

  Resets the registry keys set by EnableKernelStack.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView HeapSnapshot Process [DataFile]

  Take a snapshot of the CLR GC heap of a process.

  Parameters:
    Process                              The process ID or Process Name (Exe without extension) of the process take a
                                         heap snapshot.
    [DataFile]                           The name of the file to place the heap snapshot.

  Qualifiers:
    [-SaveETL]                           Save an ETL file along with the GCDump file when dumping the JS Heap.
    [-MaxDumpCountK:250]                 The maximum number of objects (in K or thousands) to place int the .gcDump
                                         file.  Sample sufficiently to hit this metric.
    [-Freeze]                            Freeze the dump while data is taken.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView ForceGC Process

  Forces a GC on the specified process

  Parameters:
    Process                              The process ID or Process Name (Exe without extension) of the process to force
                                         a GC.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView HeapSnapshotFromProcessDump ProcessDumpFile [DataFile]

  Extract the CLR GC heap from a process dump file specified.

  Parameters:
    ProcessDumpFile                      The name of the input process dump file.
    [DataFile]                           The name of the file to place the heap snapshot.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView GuiRun

  Opens the 'Run' dialog box.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView GuiCollect

  Opens the 'Collect' dialog box.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView GuiHeapSnapshot

  Opens the 'TakeHeapSnapshot' dialog box.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView UserCommand CommandAndArgs ...

  Runs a user defined command.  Type 'PerfView UserCommandHelp' to see the help for all the user commands.  See PerfView
  Extensions in the users guide for more on creating user commands.

  Parameters:
    CommandAndArgs ...                   User command to run and any arguments.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView UserCommandHelp

  Displays help for user commands.  Also see Help->User Command Help in the GUI.

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView CreateExtensionProject [ExtensionName]

  Creates a VS project for creates a perfView extension.

  Parameters:
    [ExtensionName.Global]               The name of the extension (no .DLL)

-----------------------------------------------------------------------------------------------------------------------
Usage: PerfView FetchSymbolsForProcess [DataFile]

  Fetch all the PDBs files needed for viewing locally.

  Parameters:
    [DataFile]                           ETL file containing profile data.
-----------------------------------------------------------------------------------------------------------------------
Qualifiers global to all commands:

    [-LogFile:STRING]                    Send messages to this file instead launching the GUI.  Intended for batch
                                         scripts and other automation.
    [-BufferSize:64]                     The size the buffers (in MB) the OS should use to store events waiting to be
                                         written to disk.
    [-Circular:0]                        Do Circular logging with a file size in MB.  Zero means non-circular.
    [-BufferSizeMB:64]                   The size the buffers (in MB) the OS should use to store events waiting to be
                                         written to disk.
    [-CircularMB:0]                      Do Circular logging with a file size in MB.  Zero means non-circular.
    [-MaxCollectSec:0]                   Turn off collection (and kill the program if perfView started it) after this
                                         many seconds.  Zero means no timeout.
    [-StopOnPerfCounter:STRING,...]      This is of the form CATEGORY:COUNTERNAME:INSTANCE OP NUM where
                                         CATEGORY:COUNTERNAME:INSTANCE, identify a performance counter (same as
                                         PerfMon), OP is either < or >, and NUM is a number.  When that condition is
                                         true then collection will stop.  You can specify this qualifier more than once.
                                         See 'Stop Trigger' in the users guide for more.
    [-StopOnGen2GC:STRING]               This will stop on any Gen2 GC from the given process (can be a process ID or a
                                         process Name (exe file name without path or extension) or * (any process)
    [-StopOnEtwEvent:STRING,...]         This is of the form PROVIDER:TASK:OPCODE here only PROVIDER and TASK are
                                         required.  (Opcode=Info, Keywords=ALL, Level=INFORMATIONAL are the defaults.
                                         Provider can be a Guid or Name (for EventSources you can use *Name to represent
                                         the Guid.  Task is the task Name or 'Task(NNN)' where NNN is the task number in
                                         decimal.  You can specify this argument more than once
    [-StopOnException:STRING]            Where the text is a regular expression that will be used to match the full name
                                         of the .NET Exception thrown.The empty string represents any exception.
    [-StopOnEventLogMessage:STRING]      Stop when an event log message that matches the given (ignore case) regular
                                         expression is written to the Windows 'Application' event log.  You can specify
                                         a particular event log with the syntax eventLogName@RegExp.
    [-StopOnRequestOverMsec:0]           Trigger a stop of a collect command if there is any IIS request that is longer
                                         than the given number of MSec.
    [-StopOnGCOverMsec:0]                Trigger a stop of a collect command if there is a .NET Garbage Collection (GC)
                                         is longer than the given number of MSec.
    [-StopOnAppFabricOverMsec:0]         Trigger a stop of a collect command if there is a AppFabric request is longer
                                         than the given number of MSec.
    [-DecayToZeroHours:0]                The trigger value used in StopOnPerfCounter or StopOn*OverMSec will decay to
                                         zero in this interval of time.
    [-DelayAfterTriggerSec:10]           Wait this number of seconds after a trigger before actually stopping the trace.
    [-CollectMultiple:0]                 Collect Multiple instance (used in conjunction with StopTrigger).
    [-MonitorPerfCounter:STRING,...]     This is of the form CATEGORY:COUNTERNAME:INSTANCE@NUM where
                                         CATEGORY:COUNTERNAME:INSTANCE, identify a performance counter (same as
                                         PerfMon), and NUM is a number representing seconds.  The @NUM part is optional
                                         and defaults to 2.  The value of the performance counter is logged to the ETL
                                         file as an event ever NUM seconds
    [-CpuSampleMSec:1]                   The interval (MSec) between CPU samples (.125Msec min).
    [-Merge:BOOLEAN]                     Do a merge after stopping collection.
    [-Zip:BOOLEAN]                       Zip the ETL file (implies /Merge).
    [-NoRundown]                         Don't request CLR Rundown events.
    [-NoNGenRundown]                     Don't do rundown of symbolic information in NGEN images (only needed pre V4.5).
    [-NoClrRundown]                      Don't do rundown .NET (CLR) rundown information )(for symbolic name lookup).
    [-RundownTimeout:120]                Maximum number of seconds to wait for CLR rundown to complete.
    [-MinRundownTime:0]                  Minimum number of seconds to wait for CLR rundown to complete.
    [-KeepAllEvents]                     A debug option to keep all events, even symbolic rundown events.
    [-MaxEventCount:0]                   Limits the total number of events.  Useful for trimming large ETL files.  1M
                                         typically yields 300-400Meg.  of data considered.
    [-SkipMSec:0]                        Skips the first N MSec of the trace.  Useful for trimming large ETL files in
                                         conjunction with the /MaxEventCount qualifier.
    [-CpuCoutners:STRING,...]            A comma separated list of hardware CPU counters specifications NAME:COUNT to
                                         turn on.  See Users guide for details.  See ListCpuCounters for available
                                         sources (Win8 only)
    [-Providers:STRING,...]              Additional providers.  This is comma separated list of
                                         ProviderGuid:Keywords:Level:Stack specs.  This qualifier has the same syntax as
                                         the Additional Providers TextBox in the collection window.  See help on that
                                         for more.
    [-OnlyProviders:STRING,...]          Like the Providers qualifier, but also turns off the default Kernel and CLR
                                         providers.
    [-ThreadTime]                        Shortcut for turning on context switch and readyThread events
    [-GCOnly]                            Turns on JUST GC collections an allocation sampling.
    [-GCCollectOnly]                     Turns on GC collections (no allocation sampling).
    [-DumpHeap]                          Capture a heap snapshot on profile stop
    [-ClrEventLevel:Verbose]             The verbosity for CLR events Legal values: Always, Critical, Error, Warning,
                                         Informational, Verbose.
    [-ClrEvents:Default]                 A comma separated list of .NET CLR events to turn on.  See Users guide for
                                         details.  Legal values: None, GC, GCHandle, Binder, Loader, Jit, NGen,
                                         StartEnumeration, StopEnumeration, Security, AppDomainResourceManagement,
                                         JitTracing, Interop, Contention, Exception, Threading,
                                         JittedMethodILToNativeMap, OverrideAndSuppressNGenEvents, SupressNGen, Type,
                                         GCHeapDump, GCSampledObjectAllocationHigh, GCHeapSurvivalAndMovement,
                                         GCHeapCollect, GCHeapAndTypeNames, GCHeapSnapshot,
                                         GCSampledObjectAllocationLow, PerfTrack, Stack, Default, All.
    [-KernelEvents:Default]              A comma separated list of windows OS kernel events to turn on.  See Users guide
                                         for details.  Legal values: None, Process, Thread, ImageLoad, ProcessCounters,
                                         ContextSwitch, DeferedProcedureCalls, Interrupt, SystemCall, DiskIO,
                                         DiskFileIO, DiskIOInit, Dispatcher, MemoryPageFaults, MemoryHardFaults,
                                         VirtualAlloc, VAMap, NetworkTCPIP, Registry, AdvancedLocalProcedureCalls,
                                         SplitIO, Driver, OS, Profile, Default, TheadTime, FileIO, FileIOInit, Verbose,
                                         All, ReferenceSet, PMCProfile.
    [-DotNetAlloc]                       Turns on per-allocation .NET profiling.
    [-DotNetAllocSampled]                Turns on per-allocation .NET profiling, sampling types in a smart way to keep
                                         overhead low.
    [-OSHeapProcess:0]                   Turn on per-allocation profiling of allocation from the OS heap for the process
                                         with the given process ID.
    [-OSHeapExe:STRING]                  Turn on per-allocation profiling of allocation from the OS heap for the process
                                         with the given EXE (only filename WITH extension).
    [-ForceNgenRundown]                  By default on a V4.0 runtime NGEN rundown is suppressed, because NGEN PDB are a
                                         less expensive way of getting symbolic information for NGEN images.  This
                                         option forces NGEN rundown, so NGEN PDBs are not needed.  This can be useful in
                                         some scenarios where NGEN PDB are not working properly.
    [-AcceptEULA]                        Accepts the EULA associated with PerfView.
    [-DataFile:STRING]                   FileName of the profile data to generate.
    [-NoView]                            Normally after collecting data the data is viewed.  This suppresses that.
    [-UnsafePDBMatch]                    Allow the use of PDBs even when the trace does not contain PDB signatures.
    [-ShowUnknownAddresses]              Displays the hexadecimal address rather than ? when the address is unknown.
    [-VS:BOOLEAN]                        Turn on Visual Studio Providers.  It also insures that theses providers are
                                         registered and can be decoded properly.
    [-NoGui]                             Use the Command line version of the command (like on ARM).  Brings up a console
                                         window.  For batch scripts use /LogFile instead.
    [-Process:STRING]                    A process name (exe file name without directory or extension).
    [-SafeMode]                          Turn off parallelism and other risky features.
    [-RestartingToElevelate:STRING]      Internal: indicates that perfView is restarting to get Admin privileges.
    [-SessionName:PerfViewSession]       Define the name for the user mode session, if kernel events are off.
    [-MaxNodeCountK:0]                   The maximum number of objects (in K or thousands) that will even be examined
                                         when dumping the heap.  Avoids memory use at collection time.  This is useful
                                         if heap dumping causes out of memory exceptions.

Scratch folder for a unit test


Originally posted on: http://geekswithblogs.net/akraus1/archive/2013/12/28/154992.aspx

Yes, I know. If a unit test accesses a file, it is by definition no longer a unit test. My personal definition of a unit test is: if it is fast, I can run many of them, which serves the main goal of unit tests: to provide fast feedback. Personally I am not a big fan of repetitive code, so I created a small class that takes care of creating a temp folder under the %TEMP%\utest\ directory, named after your unit test method, which is quite handy for debugging failed unit tests.

[Test]
public void Can_Create_Temp_Folder()
{
    using (var tmp = TempDir.Create())
    {
        var txtfile = Path.Combine(tmp.Name, "test.text");
        File.WriteAllText(txtfile, "hi");
    }
}

This will give you a fresh directory for every test

image

which, including its contents, is deleted after the using statement has been left. You can create variations of that, e.g. to leave the temp files in place when the unit test fails to make debugging easier (see here how to check in a finally block whether you are in an exception unwind scenario). Or you can postfix every test directory with a guid to get 100% separation of concurrently running unit tests and to make absolutely sure that you are not reusing old data of a unit test which had some files open that could not be deleted last time. The simplest form is a small code snippet like this:

interface ITempDir : IDisposable
{
    string Name
    {
        get;
    }
}

class TempDir : ITempDir
{
    public string Name
    {
        get;
        private set;
    }

    /// <summary>
    /// Create a temp directory named after your test in the %temp%\uTest\xxx directory
    /// which is deleted with all its sub directories when the ITempDir object is disposed.
    /// </summary>
    /// <returns></returns>
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static ITempDir Create()
    {
        var stack = new StackTrace(1);
        var sf = stack.GetFrame(0);
        return new TempDir(sf.GetMethod().Name);
    }

    public TempDir(string dirName)
    {
        if (String.IsNullOrEmpty(dirName))
        {
            throw new ArgumentException("dirName");
        }
        Name = Path.Combine(Path.GetTempPath(), "uTests", dirName);
        Directory.CreateDirectory(Name);
    }

    public void Dispose()
    {
        if (Name.Length < 10)
        {
            throw new InvalidOperationException(String.Format("Directory name {0} seems to be invalid. Do not recursively delete your hard disc.", Name));
        }

        // delete all files in the temp directory
        foreach (var file in Directory.EnumerateFiles(Name, "*.*", SearchOption.AllDirectories))
        {
            File.Delete(file);
        }

        // and then the directory itself
        Directory.Delete(Name, true);
    }
}

Small, simple but effective.

GDI+ Bug in Windows 7 x86 and Windows 8.0/8.1 x86/x64


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/01/03/155056.aspx

What is wrong with this code?

List<Bitmap> bitmaps = new List<Bitmap>();

private void button1_Click(object sender, EventArgs e)
{
    for (int i = 0; i < 10 * 1000; i++)
    {
        var bmp = Resources.SmallIcon;
        bitmaps.Add(bmp);
    }
}

From a correctness point of view, nothing. You allocate 10 000 bitmaps, each e.g. a 32x32 icon of about 1 KB in size. That would make another 10 MB of memory, which is nothing out of the ordinary. You let your application run for a few hours, Task Manager shows you 1 GB of memory consumption, and boom, you run out of memory. The interesting thing is that you have not used up all available memory, but your 32 bit application has run out of address space.

When I execute this on my home machine with Windows 8.1 x64 I get the following numbers:

image

We allocate 163 MB in private bytes. That is expected. But what about the 1.7 GB of reserved memory? OK, Windows might have a bad day. Let's try it with 5000 bitmaps.

 

# of Bitmaps    Reserved KB    KB Reserved per Bitmap    # 64 KB Blocks/Bitmap
10.000          1.709.096      171                       2.7
 5.000            986.024      197                       3.1

 

Comparing runs with zero and one bitmap shows that I get three 64 KB file mappings for each bitmap, which itself holds only one KB of data. When all address space is used up, Windows GDI seems to free some memory to leave room for future allocations (that explains the factor of 2.7 instead of 3).

Windows is reserving memory 192 times the size of our bitmaps!

This is true for 64 bit processes on Windows 8/8.1 as well but there it does not matter. The behavior is the same for all editions of Windows 7 x86 and Windows 8.x x86/x64. The only Windows edition where this does not happen is Windows 7 x64 where the GDI implementation is different. When I look at some applications which make use of bitmaps like Windows Live Writer I do see 80 MB of shareable memory where lots of 64 KB blocks are allocated. This is a real problem in all applications that read many bitmaps out of resources! Interestingly the most prominent example VS2013 on my machine has no such problems. It seems that MS is not using their own infrastructure…

Next time your managed application does run out of memory check your Shareable Reserved bytes count for many 64KB blocks. If you have many of them then you should check out your bitmap usage.

What is this Shareable Memory anyway? Good question. Shareable Memory consists of file mapping objects which are page file backed. Any file mapping that points to a real file on disc is shown as Mapped File with the full file name. That is the reason why there is not much to see inside VMMap except the allocation address. That was an interesting bug which involved some time with Windbg and a memory dump of a crashed process.

I can recommend the !address extension of Windbg to dump all pages of a process with their allocation mode and to export the data into Excel, from where it is easy to create pivot charts and find out that the Shareable Memory is all allocated as 64 KB blocks. Once that was known, I created some Windbg scripts to break into the CreateFileMapping and MapViewOfFile methods to get the allocation stacks. Unfortunately, due to the many file mapping objects created by the process, it became so slow that the application was unusable. A faster approach than Windbg was needed.

After a little thinking I thought: why not intercept the calls with a hooking library like EasyHook? It is a great library which allows you to execute managed code in your hook while intercepting nearly any Windows API call. Now I can get my hands on the mapping address (MapViewOfFile returns the mapped address, UnMapViewOfFile receives the mapping address to unmap) and allocation size, among other things. But one thing is missing: call stacks.

Here it is good to know that .NET supports ETW events by now. What is easier than to create an ETW provider in managed code which logs all calls to MapViewOfFile and UnmapViewOfFile as structured events? The kernel knows how to walk the stack very fast. That's what I did, and it worked perfectly. When you know how WPA displays custom ETW events, you can even use it to analyze unbalanced map/unmap calls. The key insight here is that you need to put the mapping address of Un/MapViewOfFile into the same column. Now you can group by this field, with the mapping address as the grouping key. Then sort by count and look for all addresses which have only one mapping call but no unmap. Select all unbalanced mapping calls and filter for them.

Now you can reap the fruits of your grouping by adding the StackTrace column and seeing in which methods most unbalanced allocations are happening. Perhaps I can get permission to make the code public so you can use it for other purposes as well. The concept is general. You need to log the allocation and deallocation events into ETW and put the handle (e.g. the address or HANDLE) as the first property into your ETW event. Then you can record all allocation and deallocation events with full call stacks. WPA will display the handle in the first column for your allocation and deallocation calls, which allows further grouping and filtering.

You can skip that part and use wpaexporter to export your events to a CSV file and write your own tool to do more sophisticated analysis. Or you could use the TraceEvent Library and process the ETW events directly. It is up to you.

If you are having a hard time finding leaks related to GDI objects like Bitmap, Brush, Device Context, Font, Metafile, Pen, Region or Palette, you can hook the calls and have an easy day. This approach is not new and was already described in 2003 in this MSDN magazine article, which also lists the method names you might be interested in hooking. The new thing is that it was never so easy to get full managed and unmanaged call stacks with nearly no performance impact. The MSDN article relies on SetWindowsHookEx to inject code into other processes and on dbghelp.dll to walk the stack with the debugging APIs. The combined ETW and EasyHook approach is much more performant and general (full mixed mode stack walking).

Without that it would have been much harder to find the root cause. With ETW tracing you have access to first class information which is based not on guessing but on exact measurements. That not only allows you to find bugs, you can now also make quantitative predictions of how much address space and private bytes memory you will get back if you fix the bitmap usage in this or that method. It is kinda cool to go to the devs and tell them: if you fix that method you will free up 300 MB of virtual address space. If we fix the top 5, then we can reduce the address space usage of Shareable Memory by 70%. Most of the time they say: "I do not believe that this innocent method is responsible for such a big amount of allocations. Here is the fix, check again." After measuring the gain of the fix I can compare the predicted results and tell them: "Looks like my prediction was correct and we have an exact match."

This is a real game changer. Now you can focus your work on the most important leaking methods based on accurate measurements. Thanks to the low impact of ETW tracing it is possible to run all your tests with hooks enabled, giving you, in the case of a leak, accurate data to find it without the need to repeat the tests! At least in an ideal world. But since setting up the hooks may cause random crashes, it is safer to employ this technique only once a leak has been found.

There is one drawback to hooking and ETW stack walking: on Windows 7, stack walking will only work in 32 bit processes, which (lucky me) was the case for this problem. In 64 bit processes the ETW stackwalker of Windows 7 stops at the first dynamically generated stack frame (the assembly code for the hooks), which gives you only a stack that ends in your trace call. Not terribly useful. This has been fixed in Windows 8+, where you can hook with full call stacks in 64 bit processes as well.

Since Windows 7 will stay for some time I really would like to get a fix for the Windows 7 ETW infrastructure to get full call stacks in 64 bit processes. For managed code you can NGEN your assemblies which fixes the issue if you do not have much JIT generated code running in your process. The stackwalker of Windows 7 will stop when it encounters the first frame which is not inside any loaded module.

I can only speculate how many 32 bit applications are affected by this issue and run out of memory much earlier than they would need to. At least for regular .NET applications this is a real issue nobody has noticed so far. The 64 KB value is no accident. It is the OS allocation granularity, which is the reason why always 64 KB chunks of memory are reserved although you can map smaller regions (e.g. 4 KB) as well. The OS allocation granularity seems to be used by GDI and put into the mapping calls explicitly. The called GDI method is GdipCreateBitmapFromStream, which assumes an IStream interface that is converted to a file mapping region under the hood. Other GDI methods seem not to be affected by this issue, but one can never know. If you have a bitmap intensive managed x32 application (many icons), you should check with VMMap the amount of Shareable Memory when your application has run out of memory.

Windows SWS and WPAD Network Hangs


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/01/08/155133.aspx

Many developers mainly know their application frameworks, which let them rapidly deliver business driven applications. This is a good thing, but still there are things like the OS, CPU, memory and the network which are critical to application performance.

The network is often viewed as a bidirectional connection between two computers

 

image

 

This is all well until things start to fail. The network topology can then quickly change from two arrows to a black hole which tends to suck up all information and

Side Note: Black holes are not entirely black and will eventually fully disclose their stored information as an exploding white hole or as a dim radiation originating from quantum effects, where e.g. a virtual electron positron pair is separated by the gravitational force. One particle lands in the black hole and the other one has no choice but to become a real particle, since its anti partner is no longer available. Whether this effect is real no one knows, but it is an interesting idea. Search for Hawking radiation if you want to know more.

never gives it back. Sudden hangs in network connections are exceptionally hard to analyze since they involve two computers and the magic between them.

 

image

In that case you need to check both computers and the network in one run to fully analyze the issue. To be fully able to analyze network glitches you need to use WPT (Windows Performance Toolkit) on both machines and Wireshark on one machine to capture all in/out packets. It is difficult to correlate both ETW traces with the Wireshark capture. Things become a lot easier if you have a remote interactive application, like an online game where the server reacts to user input. In that case you can simply capture all mouse/keyboard events, save them to ETW locally, and also send them over a dedicated port across the wire and save the captured keyboard/mouse events on the remote machine as well. Then it is much easier to correlate network traces with ETW events because you have the keyboard/mouse events as markers in your ETW and network event streams. I had some time to play with this idea and tried to learn something new during Christmas. In my case it was WPF, so I created a small tool to put this idea to work.

Currently it looks like this.

image

Here I can capture mouse and keyboard events and save them to ETW and over the network on both machines. At the same time I can start ETW traces on both machines.

image

I think this is a powerful capability to correlate network and ETW events captured from different machines. Perhaps you noticed the Slow button, which defines a hot key that logs an additional Slow ETW event into the stream to mark an interesting condition, e.g. a small hang. That makes it easy to correlate perceived user experience with a good marker event in the Wireshark and ETW traces. You can employ this concept for distributed systems in general if you know which action on one machine triggers operations on one or more other remote machines. If you log the start requests into your ETW and network streams (e.g. over an unencrypted diagnostics port to make reading easier), there is no room for dark spots. The downside, of course, is that this type of analysis triples the amount of work since you need to check two machines and one network trace. Reading TCP traces is not the most natural thing for a developer, but it is an interesting undertaking.

If you want to know more about the TCP protocol, HTTP, SPDY, …. you should have a look at High Performance Browser Networking which explains things from the bits up to the high level HTTP protocol and where you can tune things. Since the author is a performance engineer at Google he knows what he is talking about.

More details are the topic of a future post when I publish the code on CodePlex.

But back to the original topic of network hangs and the Silly Window Syndrome (SWS). If you are e.g. sending every keystroke over the network and you are fast at typing you can create the SWS since a lot of very small packets are going over the wire. In that case SWS avoidance algorithms kick in which delay the communication on purpose to force the receiver or the sender to generate bigger packets so the TCP window size can be increased from 0 bytes to a reasonable value. On Windows a Silly Window Syndrome avoidance timer is used for this. The receiver of a small packet needs to ACK it. The server will wait up to 5 s for the next packet when the client does not respond to the packet with an ACK and an updated, bigger TCP receive window size.
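A related knob on the sending side in .NET is the Nagle algorithm, which also delays small packets. The sketch below shows how a hypothetical interactive client could disable it; this does not fix the Windows 7 timer bug discussed next, but it reduces small-packet buffering delays in general:

```csharp
using System;
using System.Net.Sockets;

class NoDelayDemo
{
    static void Main()
    {
        // Hypothetical interactive client socket: with NoDelay the stack sends
        // tiny packets (e.g. single keystrokes) immediately instead of waiting
        // to coalesce them into bigger segments (Nagle's algorithm).
        using (var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
        {
            socket.NoDelay = true; // maps to the TCP_NODELAY socket option
            Console.WriteLine(socket.NoDelay);
        }
    }
}
```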

This is all deep network stuff. Why should I care on my local machine and the loopback device? Well, it happens that there is a bug in Windows 7 which causes your network connection to stall for 5 s sometimes. The issue is that the SWS timer is started AFTER the packet has been sent. If the ACK is received before the timer starts, it will wait for the ACK (which will never come again) for the full five seconds before sending the next packet.

This has been known since 2011 but it took many support requests to MS to release a hotfix for Windows 7 and Server 2008 R2 in July 2013. Windows is used in many mission critical soft real time systems where inter process communication via sockets can break at any time for 5 s. This issue has certainly caused much pain in the wild and very few people will have found the root cause of intermittent loopback device hangs.

If you have software communicating over the loopback device you should really get this hotfix to get rid of these sporadic errors. A workaround is to plug in a real loopback network cable which gives you enough delay so that the timer does start before the ACK is received. I am quite sure that this issue is the source of many failed UI automation tests where the UI sometimes did react to user input within the usual time frame (e.g. 5 s) but sometimes took much longer. It could be a good idea to add this fix to your test machines as well.

How does this issue look in reality? Let's have a look at some busy processes communicating with each other:

clip_image001

The region should be red but that is simply the color WPA uses for a highlighted region. All processes are busy but the whole machine hangs for 5 s. There is one innocent looking event named TcpSWSAvoidanceEnd which signals that it has done its full 5 s wait and has "fixed" the non-existing SWS issue. The TcpSwsAvoidanceBegin events are quite common in this scenario. Of 200 K packets, 40 K are TcpSwsAvoidanceBegin events which all spin up a 5 s timer that is cancelled when the ACK from the client is received. Due to the heavy network load the race condition has a good chance to happen, and it will happen from time to time.

Today is my side tracking day. I have another network hang which is until now not fully understood. If you enable Automatic Proxy Detection in Internet Explorer, your HTTP requests are routed by a JavaScript file downloaded from a proxy server. The mechanism behind this is the Web Proxy Autodiscovery Protocol. It uses your fully qualified computer name as starting point to find the server which hosts the HTTP URL routing script. Your computer has e.g. the name

myComputer512.department.bigdepartment.company.com

The browser will try to download the proxy redirection JavaScript from the following URLs:

WPAD.department.bigdepartment.company.com/wpad.dat

WPAD.bigdepartment.company.com/wpad.dat

WPAD.company.com/wpad.dat

WPAD.com/wpad.dat

If you have enabled this it can happen that some HTTP requests take a long time for no apparent reason. The fix is to disable automatic proxy detection and configure the proxy manually. This is a good thing anyway since, for reasons only known to the NSA, a wpad.de server exists. This means that if your company has no WPAD server configured you are downloading JavaScript code from a server someone else has control over, regardless of the web page you are trying to access …

Freaky.

ETWControler 1.0.0.0 Released!


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/01/25/155275.aspx

During the end of 2013 I had a little free time to create a tool to help diagnose performance problems in distributed systems. If you have a machine connected to a remote server and you want to find out why e.g. your game sometimes has hiccups you can run ETWControler on both boxes to get ETW tracing from both machines at the same time.

That alone of course could also be achieved with a shell script. But there is more to it. It comes with a full keyboard and mouse logger: all keyboard and mouse events are logged to ETW on the client and sent over the network to the server, where the received keyboard/mouse events are also written to ETW.

That allows you to correlate the local/remote and network traces with full timing information in an easy way. More info can be found at its CodePlex site.

https://etwcontroler.codeplex.com/

The main use cases are covered here

https://etwcontroler.codeplex.com/documentation

 

Here are some screenshots:

image

image

 

It was fun to learn WPF and MVVM by creating my first WPF application. WPF, although no longer hip, is a great way to create good looking desktop apps quite fast. Perhaps I was learning an old thing, but old things are likely to stay for the next 20 years. If you read the code you will find that it uses a socket quite directly to send the keyboard events as fast and clean over the wire as possible, which was also an interesting exercise to learn how hard it is to use plain sockets. Sockets have way too many options, which allow you to configure them in funny ways where the connection will never time out and things like that.
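As a minimal illustration of why plain sockets need explicit configuration (this is a sketch, not the tool's code): without timeouts, a blocking send or receive can hang forever, so two of the many options are worth setting on any interactive connection.

```csharp
using System;
using System.Net.Sockets;

class SocketTimeoutDemo
{
    static void Main()
    {
        // Bound every synchronous send/receive to 5 s so a dead peer cannot
        // block the thread forever. The default of 0 means "wait indefinitely".
        using (var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
        {
            socket.SendTimeout = 5000;    // milliseconds
            socket.ReceiveTimeout = 5000; // milliseconds
            Console.WriteLine(socket.SendTimeout);
            Console.WriteLine(socket.ReceiveTimeout);
        }
    }
}
```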

If you have found creative uses for this tool or interesting ideas how to extend it, drop me a note. Currently I am thinking about adding more control over what should be recorded by using TraceEvent instead of WPR to have more ETW collection options.

Desktop Heap Exhaustion / Hard Bugs


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/02/04/155370.aspx

Recently I had an interesting issue to diagnose. We all (readers of my blog) know that forgotten unregister calls to events are a common source of managed memory leaks. The weak event pattern is a common cure to this plague and WPF has a nice implementation for it. With .NET 4.5 WeakEventManager has got a new cousin which does not require you to derive from it, which is even nicer. Every time I needed to deal with weak event registrations I used it and it worked fine. Since WeakEventManager is a low level class I have used it also in non UI applications, which turned out to be a bad design decision.

What do you think will this fine sample application do?

using System;
using System.Threading;
using System.Windows; // WeakEventManager<,> lives in WindowsBase

class Program
{
    public static event EventHandler OnSomething;

    static void Main(string[] args)
    {
        var p = new Program();
        for (int i = 0; i < 1000 * 1000; i++)
        {
            ThreadStart func = () =>
            {
                // Register a weak event handler
                WeakEventManager<Program, EventArgs>.AddHandler(p, "OnSomething", (o, e) => { });
            };
            // Do the registration every time on a new thread
            var thread = new Thread(func);
            thread.Start();
            thread.Join();
        }
        // lets see how far we get
        Console.ReadLine();
    }
}

No not the answer ….

Still no

It will give you this strange looking exception:

System.ComponentModel.Win32Exception: Not enough storage is available to process this command

   at MS.Win32.UnsafeNativeMethods.RegisterClassEx(WNDCLASSEX_D wc_d)
   at MS.Win32.HwndWrapper..ctor(Int32 classStyle, Int32 style, Int32 exStyle, Int32 x, Int32 y, Int32 width, Int32 height, String name, IntPtr parent, HwndWrapperHook[] hooks)
   at System.Windows.Threading.Dispatcher..ctor()
   at System.Windows.Threading.Dispatcher.get_CurrentDispatcher()
   at MS.Internal.WeakEventTable..ctor()
   at MS.Internal.WeakEventTable.get_CurrentWeakEventTable()
   at System.Windows.WeakEventManager`2.CurrentManager(String eventName)
   at System.Windows.WeakEventManager`2.AddHandler(TEventSource source, String eventName, EventHandler`1 handler)
   at WhatsWrongWithThisCode.Program.<>c__DisplayClass3.<Main>b__0() in d:\Media\Blog\WhatsWrongWithThisCode\WhatsWrongWithThisCode\Program.cs:line 23.
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()

 

What happened here? The WeakEventManager has a hidden dependency on a Dispatcher object, which assumes that you live on a thread with an active window message pump. My newly created threads never pumped messages, which causes the scheduled background operations that clean up garbage collected event subscribers to hang forever. If you execute this logic on only a few threads nothing bad will happen. But if you use this in a server application which runs for weeks on dynamically generated threads, then your task manager will look like this:

 

image

 

All looks normal except that we seem to leak many handles. In fact we do leak thread handles. It seems that the RegisterClassEx causes the managed thread objects to get pinned. That is not nice but no big deal. Windows can handle many more handles. But here is the catch:

  • You have just screwed your whole machine!

All of a sudden you cannot open new applications. Windows cannot be created, the C++ compiler will stop with an error that no .NET Framework is installed, and many other very odd messages from any application can happen. At this point you usually reboot the server and hope that the error does not show up too often. Did you let this application run? Congratulations. A reboot is required ;-) quite soon.

When you google for the error message you get some links back saying that a non paged pool allocation failed and that you can fix it by changing some magic registry keys. That's what I did but it did not help. My application did not run out of non paged pool memory but it ran out of Desktop Heap. If you have never heard about Desktop Heap you are in good company. This heap is global for all applications in your session and many window related methods allocate stuff from it. According to the NT debugging blog it is mainly used by CreateWindow, but other functions (which are not documented) can also cause Desktop Heap allocations. Apparently we have just found another method named RegisterClassEx which also allocates from the Desktop Heap. You can change the desktop heap size for your interactive session by changing the registry key

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems\Windows

which has on my Windows 8.1 box this value

%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,768 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=sxssrv,4 ProfileControl=Off MaxRequestThreads=16

The second SharedSection value (20480, marked yellow in the original post) is the one you usually care about: the size of your interactive desktop heap in KB. 20 MB is not so much, but under Windows XP it was only 3 MB, where this error was much more common. By increasing this value I was able to turn this quite frequently happening issue into a very sporadic one, which made it a non-urgent issue. I tried to analyze the desktop heap usage of a kernel dump with !dheap but this Windbg command never worked, so I was stuck until I checked the handle leak. When I had tracked down the root cause of the handle leak to the WPF WeakEventManager class it all made sense. The bad thing was that I had always searched for high counts of User Objects (window handles), which never happened when the machine ran out of desktop heap, leaving few clues which process was responsible for allocating so much desktop heap.

What is the morale of this story? Never use WeakEventManager on threads which have no window message pump running, or you will run out of desktop heap sooner or later if you call it from many threads!

Reason found. Now what? When you search for weak event patterns on the net you will find many samples, most of which fail in one circumstance or another.

Although the concept is simple, a working implementation which works in all circumstances seems to be surprisingly hard to create. So far the best implementation I have found is FastSmartWeakEvent from Daniel Grunwald. It keeps the original += and -= syntax to register and unregister a weak delegate, and it even throws an exception when you try to register a delegate which captures a closure. Such a registration would cause the event never to fire, because the compiler generated closure class is not referenced by anyone except the weak (too weak) delegate, which is collected as soon as the method where the registration happened is left. Next time you see "Not enough storage is available to process this command" you now know where to look. If the handle count is high the odds are good that someone used the WPF WeakEventManager on non UI threads like I did.
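The closure check that such an implementation can perform is easy to sketch. The helper below is illustrative (not FastSmartWeakEvent's actual code): a delegate whose target is a compiler generated closure class would be collected immediately when only a weak reference keeps it alive, so it can be rejected at registration time.

```csharp
using System;
using System.Runtime.CompilerServices;

class ClosureDetectionDemo
{
    static void StaticHandler(object sender, EventArgs e) { }

    // Sketch: closure classes generated for captured locals carry the
    // [CompilerGenerated] attribute, so a weak event implementation can detect
    // and reject them instead of silently never firing.
    static bool IsClosure(Delegate handler)
    {
        return handler.Target != null &&
               handler.Target.GetType().IsDefined(typeof(CompilerGeneratedAttribute), false);
    }

    static void Main()
    {
        int captured = 42;
        EventHandler closureHandler = (s, e) => Console.WriteLine(captured);
        EventHandler staticHandler = StaticHandler;

        Console.WriteLine(IsClosure(closureHandler)); // target is a closure class
        Console.WriteLine(IsClosure(staticHandler));  // no target instance at all
    }
}
```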


Boot Time Analysis


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/02/06/155386.aspx

MS is investing heavily in its ETW infrastructure now and in the future. You can find very good online trainings at MS Virtual Academy: http://www.microsoftvirtualacademy.com/training-courses/windows-performance-jump-start#?fbid=g4hL6pwMe94. I have watched some videos today and they have really great content online about rare topics. Personally I found On/Off Transition Experience the most enlightening one. After this video you can rip your slow booting machine apart and make it boot much faster. With the content presented there I was able to cut my boot time down from over two minutes to one minute, which is really great once you know where to look. The biggest issue on my machine was that DebugSvc had over 90 s of service init time, which delayed the whole boot and the initialization of all other services for a very long time. By simply disabling the service I got a really good improvement there. Another thing to check is whether .NET 2 applications are still running during your boot. Most of them can also run under .NET 4.0 when you change the app.config file and add

<configuration>
  <startup>
    <supportedRuntime version="v4.0"/>
  </startup>
</configuration>

you can convince them to use .NET 4.0. This will get rid of all hard disc accesses to the .NET 2.0 dlls, which can cost you several seconds of hard disc access time. Another issue is SQL Server. Here is a slide which I have shamelessly copied from Analysis at Scale.

image_thumb5

 

This is real world data from over 50,000 computers. After seeing this graph I immediately changed my SQL Server instance to delayed boot so it gets out of the way of the other stuff which needs my hard disc more at that time. If you are working on a laptop with a slow hdd, you may go green with envy when you see how much faster your colleagues can work with their "old" desktop PCs. Here is a graph which shows boot time depending on the used hard disc:

image_thumb7

If you need more proof that your money is well spent on a new SSD for your boot drive you can show this to your boss. It really does make a huge difference, not only during boot time. What I found really fascinating in the On/Off Transition video is that there are companies out there which had Windows machines that needed two hours to boot, and they were ok with that. Well, not entirely. Finally they called in MS support to check their machines, and a hotfix was applied that reduced the boot time from 2 h down to 8 minutes. A few tweaks later they were down to two minutes. But the most shocking fact was that their previous workaround was that one member of the staff had to arrive two hours earlier than everyone else to boot up all office machines. Incredible. I wonder if the company is still in business if their other business processes are equally efficient…

Enable Activity Tracing For Workflow Foundations


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/02/06/155387.aspx

Most people using Workflow Foundation only know TrackingParticipants to enable activity tracing. With WF 4 the SQL tracking participant was deprecated and replaced by the EtwTrackingParticipant. WF has been wasteful with system resources since its very beginning but it has become better with WF 4. If you think that a SQL tracking participant is a good idea, consider why MS deprecated it:

[ObsoleteAttribute("The System.Workflow.* types are deprecated.  Instead, please use the new types from System.Activities.*")]
public sealed class SqlTrackingService : TrackingService,

You should not try to create your own version of it just because it looks like such a brilliant idea. It never was and never will be a well performing tracing destination. Let's see how far we get with the new ETW support for WF. The easiest way for e.g. a WF console application is to add it directly to your WorkflowInvoker:

static void Main(string[] args)
{
    Activity workflow1 = new Workflow1();
    var invoker = new WorkflowInvoker(workflow1);
    invoker.Extensions.Add(new EtwTrackingParticipant());
    invoker.Invoke();
}

Another option is to enable it indirectly via your App.config

<system.diagnostics>
  <sources>
    <source name="System.Activities" switchValue="Information, ActivityTracing" propagateActivity="true">
      <listeners>
        <add name="console" type="System.Diagnostics.DefaultTraceListener"/>
      </listeners>
    </source>
  </sources>
</system.diagnostics>

There is no mention of ETW in the configured listeners, but the ETW listener is added automatically when you have trace listeners for System.Activities configured.

Now you can enable ETW tracing with xperf like this:

xperf -on proc_thread+Loader -start workflow -on c651f5f6-1c0d-492e-8ae1-b4efd7c9d503  

The guid is the ETW provider for WCF and WF tracing. Its name is "Microsoft-Windows-Application Server-Applications". This xperf command will create in your root folder a kernel.etl and a user.etl file to which the events are logged. The default buffer size is 20 MB, which you can increase if the event rate is higher than the buffers can be flushed to disc.

To stop tracing you can use

xperf -stop -stop workflow -d c:\temp\workflow.etl   

Finally you want to view your traces in WPA. Navigate to Generic Events and open the "Microsoft-Windows-Application Server-Applications" event provider:

image

There you see directly how your activities were scheduled and executed along with their activity arguments, as long as they have a string representation. Since we have ETW we can add our own traces and the system wide profiling events as well to get the full picture in one file with a good analysis tool. Besides being central, you have also chosen one of the fastest possible ways to write tracing information to an output device. The slowest part is the DefaultTraceListener in your App.config, which writes the traces via OutputDebugString. These events show up in the VS output window, which is nice, but the output speed of OutputDebugString is not great. If you want maximum performance you need to write your own NullTraceListener which simply discards all traced data, so that writing happens only to ETW.
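Such a NullTraceListener is only a few lines. The sketch below shows the idea (the class name follows the text above; this is not a framework type): a listener that discards everything, so the configured System.Activities tracing only pays for the automatically added ETW listener.

```csharp
using System;
using System.Diagnostics;

// Discards all trace output instead of routing it through OutputDebugString.
class NullTraceListener : TraceListener
{
    public override void Write(string message) { }      // drop the data
    public override void WriteLine(string message) { }  // drop the data
}

class ListenerDemo
{
    static void Main()
    {
        Trace.Listeners.Clear();                // remove the DefaultTraceListener
        Trace.Listeners.Add(new NullTraceListener());
        Trace.WriteLine("this goes nowhere");   // no OutputDebugString cost
        Console.WriteLine("done");
    }
}
```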

What Every Developer Must Know About Fast Garbage Collection (+ more)


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/03/24/155766.aspx

.NET is, like Java, a very productive environment. The increased convenience comes at the cost of less control over your execution environment. Especially the burden of manual memory management is much less of a problem in managed environments. Garbage collection does not prevent memory leaks (a forgotten delete pData is no longer possible, but statics, timers, forgotten event unregistrations and others still remain) but makes it considerably harder to introduce them. A garbage collector is a very nice thing, but it can get in your way if you want to provide a responsive application. When your customers complain that the UI is frozen or that the UI redraw causes flickering all over it, you need to take a deeper look under the .NET covers.

How does Garbage Collection affect application performance?

As the name collection suggests it needs to do something. You can expect increased CPU usage while a garbage collection is running in the background. .NET 4.5 has made significant progress to perform more tasks during a garbage collection on dedicated garbage collection threads. CPU usage is normally not a problem as it does not eat up all of your cores. The prime reason why people complain about garbage collection is that their application is stopped while a full garbage collection is running. This behavior is called a stop-the-world collector: the garbage collector stops all application threads, looks at all objects, finds the garbage and compacts memory to prevent memory fragmentation. For the last step it is very important that the application does not run, since the address of the objects can change. It would be a disaster if your application would access a memory location that contains data of completely unrelated objects because the address of your object has changed but the garbage collector forgot to update your pointer while your thread was frozen. This process includes not only addresses on the managed heap but also those in the thread stacks and CPU registers. Every object address your application has access to needs to be updated by the GC somehow. The first question you might ask: is there a way around the GC having to stop my threads to update object addresses? The answer is yes, there are garbage collectors out there that do not stop all application threads, like the Azul C4. It can collect heaps of hundreds of GB in size with maximum delays of less than 10 ms. One secret ingredient to make this possible seems to be a kernel driver that locks portions of memory which are being compacted to prevent writes to them. I am not sure how they handle reads, but the docs talk about Loaded Value Barriers to ensure that object references are always valid. This looks like some overhead for every object reference access, which can be substantial.
Besides this it has given up the concept of generations for a fully asynchronous garbage collector.

Without having measured it, I would conclude from the docs that you pay for low latency with slower object reference access and slower memory reclamation, which increases the working set of your application. Every approach to memory management comes with its own set of drawbacks:

  • Manual Memory Management (e.g. C/C++)
    • High Complexity
    • Low CPU Consumption
    • Low Memory Consumption
    • Low Latency

 

  • Garbage Collected Environment (e.g. Java, .NET)
    • Lower complexity
    • Increased CPU Consumption
    • Increased Memory Consumption
    • Potentially High Latency

Is this conclusion correct? Only under the assumption that you need to allocate memory. What is the most efficient way to allocate memory? No allocation at all. The secret to real time performance in any environment is to make the hot path of your application allocation free. When you need hard real time (e.g. avionics), not only allocations are banned but also exceptions, because you cannot predict how long exception handling will take. Very few systems need to cope with hard real time. For 99% of all software you can live with small millisecond pauses, even if it means that your real time trading system will issue a million dollar order after your competitors. You can do this in C/C++, but there are also ways to become (mostly) allocation free in .NET.

In .NET and Java object pooling is an easy way to prevent future object allocations. This works to some extent, but you cannot pool data types like strings, which are immutable and can therefore not be changed later. Perhaps we will someday get mutable strings and UTF-8 strings to keep the memory footprint low and the throughput high. Besides becoming allocation free you can also try to allocate fewer objects (profile your app) or more value types to reduce the memory footprint, which results in smaller heaps that can be collected faster. In .NET different GC flavors are also available (server and desktop). The default is the desktop GC. Starting with .NET 4.5 both GC flavors are concurrent and employ a background GC thread that scans for garbage while your application threads are still running, which already reduces latency significantly.
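The pooling idea can be sketched in a few lines. The class below is illustrative only (names are made up, not a framework API); the point is that renting and returning instances keeps the hot path free of new allocations after warm-up:

```csharp
using System;
using System.Collections.Concurrent;

// Minimal thread-safe object pool sketch: Rent reuses a returned instance when
// one is available and only allocates when the pool is empty.
class ObjectPool<T> where T : new()
{
    readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();

    public T Rent() => _items.TryTake(out T item) ? item : new T();

    public void Return(T item) => _items.Add(item);
}

class PoolDemo
{
    static void Main()
    {
        var pool = new ObjectPool<object>();
        var first = pool.Rent();   // pool empty: allocates once
        pool.Return(first);
        var second = pool.Rent();  // reuses the returned instance, no allocation
        Console.WriteLine(ReferenceEquals(first, second));
    }
}
```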

I have seen an improvement in run time behavior of larger apps of ca. 15% only by switching from .NET 4.0 to .NET 4.5. A good portion of it can be attributed to the improved GC. Let's check how the GC behaves on my Windows 8.1 box with .NET 4.5.1 with a simple sample application. Here is the code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static int N = 25 * 1000 * 1000;  // Execute allocations in a loop 25 million times
    static int Clear = 1000 * 1000;   // Clear list after every 1 million allocations to give GC 25 chances to clear things up

    static void Main(string[] args)
    {
        // do some warmup
        AllocateRefContainer();
        GC.Collect();
        GC.Collect();

        var sw = Stopwatch.StartNew();
        ExecuteParallel(() => AllocateRefContainer(), "AllocateRefContainer");
        sw.Stop();
        Console.WriteLine("RefContainer Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

        GC.Collect();
        GC.Collect();
        sw = Stopwatch.StartNew();
        ExecuteParallel(() => AllocateValueContainer(), "AllocateValueContainer");
        sw.Stop();
        Console.WriteLine("ValueContainer Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

        GC.Collect();
        GC.Collect();
        sw = Stopwatch.StartNew();
        ExecuteParallel(() => AllocateValueTypes(), "AllocateValueTypes");
        sw.Stop();
        Console.WriteLine("ValueType Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

        GC.Collect();
        GC.Collect();
        sw = Stopwatch.StartNew();
        ExecuteParallel(() => PooledRefContainer(), "PooledRefContainer");
        sw.Stop();
        Console.WriteLine("PooledRefContainer Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);
        Console.WriteLine("Private Bytes: {0:N0} MB", Process.GetCurrentProcess().PrivateMemorySize64 / (1024 * 1024));
        Console.ReadLine();
    }

    class ReferenceContainer // Class with one object reference
    {
        public ReferenceContainer Obj;
    }

    static ThreadLocal<List<ReferenceContainer>> RContainer = new ThreadLocal<List<ReferenceContainer>>(() => new List<ReferenceContainer>());

    static void AllocateRefContainer()
    {
        var container = RContainer.Value;
        for (int i = 0; i < N; i++)
        {
            container.Add(new ReferenceContainer());
            Calculate();
            if (i % Clear == 0)
            {
                container.Clear();
            }
        }
    }

    class ValueContainer // Class with one pointer sized value type
    {
        IntPtr Obj;
    }

    static ThreadLocal<List<ValueContainer>> VContainer = new ThreadLocal<List<ValueContainer>>(() => new List<ValueContainer>());

    static void AllocateValueContainer()
    {
        var container = VContainer.Value;
        for (int i = 0; i < N; i++)
        {
            container.Add(new ValueContainer());
            Calculate();
            if (i % Clear == 0)
            {
                container.Clear();
            }
        }
    }

    // The object overhead is 8 or 16 bytes on x86/x64.
    // We add the payload to it to stay close to equal in size to the others to get a true comparison when
    // the actual size which needs to be moved is equal.
    struct ValueType
    {
        public IntPtr SyncBlock;
        public IntPtr MT;
        public IntPtr Payload;
    }

    static ThreadLocal<List<ValueType>> ValueTypeContainer = new ThreadLocal<List<ValueType>>(() => new List<ValueType>());

    static void AllocateValueTypes()
    {
        var container = ValueTypeContainer.Value;
        for (int i = 0; i < N; i++)
        {
            container.Add(new ValueType());
            Calculate();
            if (i % Clear == 0)
            {
                container.Clear();
            }
        }
    }

    // Uses the ReferenceContainer list but allocates the objects only once and does not clear the
    // array but overwrites the contents.
    static void PooledRefContainer()
    {
        bool bAllocate = true;
        var container = RContainer.Value;
        for (int i = 0; i < N; i++)
        {
            if (bAllocate)
            {
                container.Add(new ReferenceContainer());
            }
            else
            {
                var tmp = container[i % container.Count]; // Grab one item from the array inside the array bounds
                tmp.Obj = null; // set new value
            }

            Calculate();
            if (i % Clear == 0)
            {
                bAllocate = false;
            }
        }
    }

    // Simulate some CPU only calculation
    static void Calculate()
    {
        long lret = 0;
        for (int i = 0; i < 100; i++)
        {
            lret++;
        }
    }

    // Execute a scenario on 4 different threads.
    // The strange Directory.GetFiles calls are there to track the start and stop of one scenario when file access ETW tracing is enabled.
    static void ExecuteParallel(Action acc, string scenario)
    {
        Directory.GetFiles("C:\\", String.Format("ETW ExecuteParallel Start {0}", scenario));
        Parallel.Invoke(Enumerable.Repeat(acc, 4).ToArray());
        Directory.GetFiles("C:\\", String.Format("ETW ExecuteParallel Stop {0}", scenario));
    }
}

The tests were performed on this machine:

image

We basically allocate 100 million objects on 4 threads concurrently with various object designs.

  • Class with one object reference
  • Class with one pointer sized value
  • Struct of the same size like the class
  • Pooled class with one object reference

To enable the server GC you need to add to your app.config

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>
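To verify at run time which flavor the process actually got, a quick check (sketch) is:

```csharp
using System;
using System.Runtime;

class GCModeCheck
{
    static void Main()
    {
        // IsServerGC reflects the effective setting, regardless of whether it
        // came from app.config, runtimeconfig, or an environment variable.
        Console.WriteLine("Server GC: {0}", GCSettings.IsServerGC);
        Console.WriteLine("Latency mode: {0}", GCSettings.LatencyMode);
    }
}
```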

GC Behaviors (Server, Desktop in x86/x64)

When you measure the timings you get some interesting graphs.

image

The first obvious difference is that x64 is faster than x86, for value types up to a factor of 2. In the profiler we can see that the main difference comes from the relocate phase, where 64 bit achieves much higher throughput. The numbers on the x axis are the number of samples taken by the kernel.

image

A net win of 30% only by switching from x86 to x64 is nice. But it comes at the cost of a higher memory footprint (factor 2). When you look closer you see that the effect of server GC is much bigger than 30%. The same method ran nearly 6 times faster (e.g. check out RefContainer x64 Default GC vs x64 Server GC)! As with every miracle there are also some hidden costs associated with it. In this case it is memory consumption. When server GC is enabled the CLR allocates a managed heap for each logical core. If you have a 40 core machine with many processes running things concurrently, you will quickly run out of memory if you are not careful in your application design.

The main differences between workstation and server GC are

  • One GC Thread per core
  • One GC Heap per core

The allocations which previously happened on one heap are now spread across 4 of them. Each heap is compacted independently by its own GC thread. In theory we should have 4 times fewer GCs since each heap can become as big as the original one, and since we now have 4 GC threads we should be able to allocate 16 times faster. This is not what happens in reality since memory consumption has its limits, which for this workload seems to be a factor of 3. Besides this the GC must also cross-check the other heaps because objects can have references into different heaps. If no heap dependencies had to be checked we should be 12 times faster, but we are "only" 6 times faster, which shows that the heap cross-checks are not free, but they are necessary.
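The back-of-the-envelope reasoning above can be written down explicitly. This is just a sketch of the arithmetic; the factors 4, 3 and 6 are the core count, the observed memory growth and the measured speedup from this post:

```python
cores = 4                        # GC heaps == GC threads == logical cores

ideal_speedup = cores * cores    # 4 threads allocating into 4 independent heaps
memory_factor = 3                # the heaps only grew by ~3x, not 4x
no_crosscheck = cores * memory_factor  # expected speedup without cross-heap checks
measured = 6                     # what the benchmark actually showed

# factor apparently lost to cross-heap reference checks
crosscheck_cost = no_crosscheck / measured

print(ideal_speedup, no_crosscheck, crosscheck_cost)
```

So roughly half of the theoretically possible allocation throughput goes into keeping the per-core heaps consistent with each other.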

The test suite did consume nearly three times more memory while running nearly 6 times faster. Enabling server GC can be a good thing to do if you have plenty of free memory and you do not allocate huge amounts of data on many different threads.

 

image

When you look at the CPU consumption with default GC enabled in x64 Release

image

and here the same scenario with server GC

image

The GC suspension times have dropped dramatically and at the same time the summed CPU consumption has dropped by 30%. Except for the memory consumption this is a big win in terms of GC induced pause times and overall application performance. But you need to be aware that your now very scalable app will spike to 100% CPU from time to time to perform full GCs, which can become a problem if you have other work waiting that also needs all cores.

A very useful visualization is to select the GC suspension events and then right click in the graph to enable highlighting of the current selection (the green bars). Now you can search the other graphs for e.g. regions of no CPU activity and see at a glance whether your application was doing nothing because of GC induced latency or for some other reason. This helps a lot to differentiate between GC issues and other problems in your application logic.

I have beefed up my original GC regions file with GC suspension visualization which now looks like this:

GC_SuspendRegions.xml

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<?Copyright (c) Microsoft Corporation. All rights reserved.?>
<InstrumentationManifest>
  <Instrumentation>
    <Regions>
      <RegionRoot Guid="{d8d639a0-cf4c-45fb-976a-0000DEADBEEF}" Name="GC" FriendlyName="GC Activity">
        <Region Guid="{d8d639a1-cf4c-45fb-976a-100000000101}" Name="GCs" FriendlyName="GCs">
          <Region Guid="{d8d639a1-cf4c-45fb-976a-000000000001}" Name="GCStart" FriendlyName="GC Start">
            <Start>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="1" Version="1"/>
            </Start>
            <Stop>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="2" Version="1"/>
            </Stop>
          </Region>
        </Region>
        <Region Guid="{d8d639a2-cf4c-45fb-976a-000000000003}" Name="GCSuspends" FriendlyName="GC Suspensions">
          <Region Guid="{d8d639a2-cf4c-45fb-976a-000000000002}" Name="GCSuspend" FriendlyName="GC Suspension">
            <Start>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="9" Version="1"/>
            </Start>
            <Stop>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="3" Version="1"/>
            </Stop>
          </Region>
        </Region>
      </RegionRoot>
    </Regions>
  </Instrumentation>
</InstrumentationManifest>

To be able to record the GC events with the easy to use WPRUI tool I have created an extra profile

GC.wprp

<?xml version="1.0" encoding="utf-8"?>
<WindowsPerformanceRecorder Version="1.0">
  <Profiles>
    <EventCollector Id="EventCollector_MyEventSource" Name="GC">
      <BufferSize Value="2048" />
      <Buffers Value="1024" />
    </EventCollector>
    <EventProvider Id=".NETCommonLanguageRuntimeGCTracking" Name="e13c0d23-ccbc-4e12-931b-d9cc2eee27e4">
      <Keywords>
        <!-- GC 0x1 and 0x8000 Exceptions -->
        <Keyword Value="8001"/>
      </Keywords>
    </EventProvider>
    <Profile Id="GC.Verbose.File" Name="GC" Description=".NET GC Tracking" Base="" LoggingMode="File" DetailLevel="Verbose">
      <Collectors Operation="Add">
        <SystemCollectorId Value="SystemCollector_WPRSystemCollectorInFile">
          <SystemProviderId Value="SystemProvider_Base"/>
        </SystemCollectorId>
        <EventCollectorId Value="EventCollector_MyEventSource">
          <EventProviders>
            <EventProviderId Value=".NETCommonLanguageRuntimeGCTracking" />
            <EventProviderId Value="EventProvider_DotNetProvider" />
          </EventProviders>
        </EventCollectorId>
      </Collectors>
    </Profile>
    <Profile Id="GC.Verbose.Memory" Name="GC" Description=".NET GC Tracking" Base="GC.Verbose.File" LoggingMode="Memory" DetailLevel="Verbose" />
  </Profiles>
</WindowsPerformanceRecorder>

When you load this file into WPRUI you enable the necessary GC ETW events to track GC pause times with the region file above.

image

Armed with these tools and visualization capabilities it should be easy to see in your own application if enabling server GC does help at critical points of your application. You will not measure any difference if your scenario is IO or network bound and very few GCs are happening. But if your scenario is a GC heavy operation you can check out if these spots can be improved by simply switching a configuration flag in your App.config or if you need to rewrite your allocation strategy.

Did I forget anything? Yes. How do you navigate in an ETL file when your application has no ETW provider enabled? You only need to know that the kernel already has an ETW provider for all file I/O operations. If you want to issue an ETW event at the start/stop of a scenario you are interested in, it is sufficient to do a

Directory.GetFiles("C:\\",String.Format("ETW ExecuteParallel Start {0}", scenario));

to get, with file I/O tracing enabled, nice ETW events which you can search for. Searching for a nonexistent file is quite a fast operation which takes about 20us per call on my machine. The ETW collection overhead was within the usual run-to-run measurement differences and not a visible effect at all. This is of course not always true. If you have a network intensive app running and you use PerfView you also get network correlation tracing enabled by default, which adds quite some overhead to all socket operations. In the CPU profiling data networking operations then look much more expensive than they really are. You always need to take a sharp look at your data to check whether it looks reasonable.
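The same marker trick translates to any language with file system access. Below is a Python sketch; the function name is made up, and the point is only that a failed lookup of a file that cannot exist still generates a kernel file I/O event carrying the scenario name:

```python
import glob

def etw_marker(phase, scenario):
    # Probe for a file that cannot exist. The lookup itself is the point:
    # with kernel file I/O tracing enabled it shows up as a searchable
    # ETW event whose file name embeds the scenario name.
    pattern = "C:\\ETW ExecuteParallel {0} {1}".format(phase, scenario)
    return glob.glob(pattern)  # always empty; only the traced lookup matters

etw_marker("Start", "Scenario1")
# ... run the scenario under test ...
etw_marker("Stop", "Scenario1")
```

In the trace you can then search the file I/O events for "ETW ExecuteParallel" to jump straight to your scenario boundaries.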

image

When you are doing sampling profiling you should have a look at the article from Vance to make sure you have enough samples in your data so that you are seeing a real effect and not chasing some ghost. I hope that I have convinced you that the CLR emits much more useful data than you were ever able to collect from simple performance counters, which give you only very limited insights.

SIMD Support for .NET Arrived!


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/04/04/155858.aspx

Finally, after long years of stagnation, the JIT compiler got significant attention to support SIMD instructions. As an experienced .NET developer you might think: SIMD? What? Never heard of it! Don't worry: you will hear the same from most experienced C/C++ developers. SIMD stands for Single Instruction Multiple Data and is more of a geek thing for performance addicts.

SIMD support means that the compiler (Intel, MSVC++ and now the .NET JIT) can emit assembly instructions that take advantage of CPU features which can execute one operation (e.g. add) on multiple inputs at once. Why should I care? My Core i7 has 4 physical cores (8 with hyper threading). Now let's add SIMD support to the Mandelbrot sample.

image

Multithreading stays important, but when you can get a speedup of 3,4 on a single core simply by employing SIMD, that is something you cannot ignore. If your server today needs 40 cores you can cut the number of cores down by a factor of 3,4 and spend the saved money on other things than hardware (e.g. paying developers to write SIMD enabled code). This is all nice and well but how fast can it get? When I look at the generated assembler code

d:\Source\SIMDSample\C#\Mandelbrot\VectorFloatStrict.cs @ 162:
00007ffd`e99d2135 450f58f4        addps   xmm14,xmm12
00007ffd`e99d2139 4183461004      add     dword ptr [r14+10h],4
00007ffd`e99d213e 410f28c6        movaps  xmm0,xmm14
00007ffd`e99d2142 410f28c8        movaps  xmm1,xmm8
00007ffd`e99d2146 660fefd2        pxor    xmm2,xmm2
00007ffd`e99d214a 0f28d0          movaps  xmm2,xmm0
00007ffd`e99d214d 0fc2d102        cmpleps xmm2,xmm1
00007ffd`e99d2151 660f76c0        pcmpeqd xmm0,xmm0
00007ffd`e99d2155 0f28ca          movaps  xmm1,xmm2
00007ffd`e99d2158 660f76c8        pcmpeqd xmm1,xmm0
00007ffd`e99d215c 660f70d94e      pshufd  xmm3,xmm1,4Eh

I see these xmm registers. These are the SSE 128 bit wide registers which are the foundation of the current SIMD support of the JIT compiler. But this is still not the end. My Haswell CPU supports AVX2 which can cope with 256 bit wide registers, which should give me a speedup of 8. AVX-512 will arrive in some time, giving us 512 bit wide registers. We could then speed up a single thread by up to a factor of 16! In theory we should be able to get this graph with an AVX-512 enabled CPU and a perfect .NET JIT compiler
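These theoretical factors follow directly from the register width divided by the element size. A small sketch of that arithmetic (real-world gains are lower because of memory bandwidth and non-vectorizable code):

```python
def lanes(register_bits, element_bits):
    # Number of elements one SIMD instruction processes at once
    return register_bits // element_bits

FLOAT, DOUBLE = 32, 64  # element sizes in bits

print(lanes(128, FLOAT))   # SSE:     4 floats per instruction
print(lanes(256, FLOAT))   # AVX2:    8 floats per instruction
print(lanes(512, FLOAT))   # AVX-512: 16 floats per instruction
print(lanes(128, DOUBLE))  # SSE:     2 doubles, which is why doubles gain only ~2x
```

This is also why the article notes later that double based code only gains about a factor 2: half as many 64 bit elements fit into the same register.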

image

For some problem domains SIMD support may be more important than using more threads. You can cope with heavy load on a single core without the need for many threads. That can help a lot on servers which are shared by many users. Here are the facts on how to get started with SIMD in .NET:

  • Does .NET use SIMD automatically for my application?

No. You need to use a specialized SIMD library which offers SIMD data types which can be translated by the JIT compiler to SIMD instructions on CPUs supporting it.

  • Can I use the normal .NET Framework?

No. You need to download and install the new RyuJIT CTP3

  • Does it work on Debug Builds?

No. SIMD instructions are only used for Release builds with RyuJIT enabled (COMPLUS_AltJit=*) and SIMD instructions enabled (COMPLUS_FeatureSIMD=1) for this CTP on Windows 8 and above for x64 applications only.

  • What does SIMD vs. non-SIMD code look like?

Here is the "normal" Modelbrot core method

C#\Mandelbrot\ScalarFloat.cs

// Render the fractal with no data type abstraction on a single thread with scalar floats
public void RenderSingleThreadedNoADT(float xmin, float xmax, float ymin, float ymax, float step)
{
    int yp = 0;
    for (float y = ymin; y < ymax && !Abort; y += step, yp++)
    {
        int xp = 0;
        for (float x = xmin; x < xmax; x += step, xp++)
        {
            float accumx = x;
            float accumy = y;
            int iters = 0;
            float sqabs = 0f;
            do
            {
                float naccumx = accumx * accumx - accumy * accumy;
                float naccumy = 2.0f * accumx * accumy;
                accumx = naccumx + x;
                accumy = naccumy + y;
                iters++;
                sqabs = accumx * accumx + accumy * accumy;
            } while (sqabs < limit && iters < max_iters);
            DrawPixel(xp, yp, iters);
        }
    }
}

And here is the SIMD version. Not very different, but the performance differs a lot. If you use double types the benefit will be smaller (factor 2) because a double uses 8 bytes of memory whereas single precision (float) values use only 4 bytes. You might ask: what is this version with the strange name NoADT? It means No Abstract Data Types. The other version uses its own Complex data type which is a lot slower. This should be improved in future CTPs, either by getting the BCL Complex data type SIMD enabled or by optimizing custom data types which contain SIMD types in general.

C#\Mandelbrot\VectorFloat.cs

// Render the fractal on a single thread using raw Vector<float> data types
// For a well commented version, go see VectorFloatRenderer.RenderSingleThreadedWithADT in VectorFloat.cs
public void RenderSingleThreadedNoADT(float xmin, float xmax, float ymin, float ymax, float step)
{
    Vector<float> vmax_iters = new Vector<float>(max_iters);
    Vector<float> vlimit = new Vector<float>(limit);
    Vector<float> vstep = new Vector<float>(step);
    Vector<float> vxmax = new Vector<float>(xmax);
    Vector<float> vinc = new Vector<float>((float)Vector<float>.Length * step);
    Vector<float> vxmin = VectorHelper.Create(i => xmin + step * i);

    float y = ymin;
    int yp = 0;
    for (Vector<float> vy = new Vector<float>(ymin); y <= ymax && !Abort; vy += vstep, y += step, yp++)
    {
        int xp = 0;
        for (Vector<float> vx = vxmin; Vector.LessThanOrEqualAll(vx, vxmax); vx += vinc, xp += Vector<int>.Length)
        {
            Vector<float> accumx = vx;
            Vector<float> accumy = vy;

            Vector<float> viters = Vector<float>.Zero;
            Vector<float> increment = Vector<float>.One;
            do
            {
                Vector<float> naccumx = accumx * accumx - accumy * accumy;
                Vector<float> naccumy = accumx * accumy + accumx * accumy;
                accumx = naccumx + vx;
                accumy = naccumy + vy;
                viters += increment;
                Vector<float> sqabs = accumx * accumx + accumy * accumy;
                Vector<float> vCond = Vector.LessThanOrEqual<float>(sqabs, vlimit) &
                    Vector.LessThanOrEqual<float>(viters, vmax_iters);
                increment = increment & vCond;
            } while (increment != Vector<float>.Zero);

            viters.ForEach((iter, elemNum) => DrawPixel(xp + elemNum, yp, (int)iter));
        }
    }
}
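The interesting part of the vectorized loop is the `increment & vCond` trick: all lanes keep executing every instruction, but a mask zeroes out the iteration increment for lanes that have already escaped. Below is a plain Python sketch of that pattern (the inner per-lane `for` loop stands in for what the hardware does in a single instruction; function names are my own):

```python
def mandel_iters(x, y, limit=4.0, max_iters=50):
    # Scalar reference: iterate one point until it escapes or hits max_iters
    accumx, accumy, iters = x, y, 0
    while True:
        naccumx = accumx * accumx - accumy * accumy
        naccumy = 2.0 * accumx * accumy
        accumx, accumy = naccumx + x, naccumy + y
        iters += 1
        if not (accumx * accumx + accumy * accumy < limit and iters < max_iters):
            return iters

def mandel_iters_lanes(xs, ys, limit=4.0, max_iters=50):
    # "SIMD" version: one mask bit per lane plays the role of 'increment'
    n = len(xs)
    accumx, accumy = list(xs), list(ys)
    iters = [0] * n
    active = [True] * n
    while any(active):                      # loop until all lanes have escaped
        for i in range(n):                  # one hardware instruction in real SIMD
            if not active[i]:
                continue                    # masked-out lane: result discarded
            nx = accumx[i] * accumx[i] - accumy[i] * accumy[i]
            ny = 2.0 * accumx[i] * accumy[i]
            accumx[i], accumy[i] = nx + xs[i], ny + ys[i]
            iters[i] += 1
            sq = accumx[i] * accumx[i] + accumy[i] * accumy[i]
            active[i] = sq < limit and iters[i] < max_iters
    return iters
```

Both functions produce identical iteration counts; the vector version just pays for the longest-running lane in each group, which is why processing neighboring pixels together works well.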

This is a long awaited step into the right direction of the JIT compiler. Finally you can write performance sensitive code without much hassle in .NET as well. Still the JIT compiler could become smarter for many common code constructs to emit more efficient code.

The Case Of Slow Visual Studio Startup


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx

On my home machine VS2013 starts up in 2s from my SSD. But on another Windows 7 machine it takes over 13s. I never really bothered since I thought that some plugins and connecting to TFS simply take their time. It stayed an itch until I installed Resharper 8, where things got noticeably slower. No big deal: suspend Resharper and it will play dead. That works, except that the startup time of VS still looked bad. Time for another profiling session with WPT. The approach is as simple as starting up WPRUI

image

  1. Press Start.
  2. Start Visual Studio.
  3. Wait until it has started.
  4. Press "Save" button to stop profiling.
  5. Enter a file name to which the data should be saved
  6. Be patient while data is saved.
  7. Press "Open in WPA" button to analyze the data.

When I look at the CPU consumption until VS 2013 is usable it takes 13,1s to start up while consuming 9,6s of CPU time. The big gap in the middle looks like it needs to connect to the network. Let's ignore that for now.

image

To see what the CPU was doing I need sample profiling (CPU Usage Sampled). Many people are confused by the fact that WPT shows you two CPU usage graphs with nearly the same content. The graphs (should) look the same but the underlying data to get the graphs is fundamentally different.

CPU Usage Sampled

This graph comes into existence by parsing CPU sampling ETW events. When sample profiling is enabled the Windows kernel will log the call stacks of all running threads every 1ms. The sample rate is configurable from seconds between samples up to 8kHz (every 0,125ms, see xperf -SetProfInt for more info). Sample profiling gives you an accurate view of what your application was doing with the CPU. When you look into the call stacks you should have at least 10K CPU sample events in your selected view. Then you can check the call stacks in your application down to stacks with at least 100 samples, which represents an error of 1% and is usually enough. If you want to optimize a method which shows up in the sample profiling with only 10 samples you are chasing ghosts.
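A quick sanity check for whether a stack is worth optimizing can be derived from these numbers. This is a sketch of the rule of thumb above; the 1ms default interval and the 100-samples-per-10K guideline are taken from this post:

```python
SAMPLES_PER_SECOND = 1000            # default kernel sampling interval of 1ms

def samples_in(duration_s):
    # How many CPU samples a trace of the given length contains per running thread
    return int(duration_s * SAMPLES_PER_SECOND)

def stack_share(view_samples, stack_samples):
    # Fraction of CPU time a stack with stack_samples represents in the view
    return stack_samples / view_samples

print(samples_in(10))                # 10 seconds of trace -> 10000 samples
print(stack_share(10000, 100))       # 100 samples -> 1% of CPU time, trustworthy
print(stack_share(10000, 10))        # 10 samples -> 0.1%, chasing ghosts
```

In other words, a 13s startup trace gives you roughly 13000 samples per busy thread, so stacks below about 100 samples are statistical noise.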

CPU Usage Precise

Although the graphs look identical you have a lot more columns (54!) to choose from. This time you are not looking at sampled call stacks but at a parsed view of context switch ETW events. A context switch occurs when your thread stops or starts running on a CPU. A thread is switched in (NewThreadxxxx columns) and another thread stops running (Oldxxxxxx columns) while a third thread has enabled your thread to run by releasing a lock, signaling an event, a timer expiration or a few other reasons. The thread that allowed your thread to run shows up in the Readying Thread/Process/Stack columns. This very fine granular view allows you to find out e.g. why your thread was sitting in ReadFile for 5s to read a 1KB file.

A more thorough explanation of context switch handling can be found in the WPA documentation for CPU Analysis. To make sense of the many columns you should memorize this picture which shows all important times of a context switch. The picture below is shamelessly copied from the WPA documentation.

Figure 10 CPU Usage Precise Diagram

The most important thing to remember is that with CPU Usage Precise you can find out why your thread was not running and which thread made it runnable again, but not what it was doing while it was consuming CPU.

Now back to the original question. We are after a CPU consumption issue. What do we see in CPU Usage (Precise)? Only context switches. A context switch occurs when your thread stops running because it is waiting for something (WaitForSingleObject) or its thread quantum was used up. The quantum is 15ms by default but it may be more; on my Win 8.1 x64 box it is actually 32ms. The reason why a context switch event is logged is that your thread is waiting (not CPU bound) for something or has already executed code for 15-32ms and used up its allotted time slice. In effect you will see in the stacks of CPU Usage Precise all of your calls that did block for some reason, but you will almost never see a high CPU consuming operation.

How do I know that? Because I have tested it with a small CPU consuming application:

class Program
{
    [DllImport("kernel32.dll")]
    extern static int GetCurrentThreadId();

    static void Main(string[] args)
    {
        Action acc = () =>
        {
            Process proc = Process.GetCurrentProcess();
            var currentThread = proc.Threads.Cast<ProcessThread>()
                                    .Where(t => t.Id == GetCurrentThreadId())
                                    .Single();
            currentThread.ProcessorAffinity = (IntPtr)0x02; // Force thread affinity on core 2
            while (true)
            {
            }
        };

        // Execute two concurrent threads and force execution on one CPU
        Parallel.Invoke(Enumerable.Repeat(acc, 2).ToArray());
    }
}

This little app does nothing but spawn two threads which run on the same CPU to give the thread scheduler some work. Before that it was a simple while(true); loop in the Main method which did consume CPU, but while it was running it never caused any context switches for seconds. The first observation was that a context switch is only performed if another thread must run on this CPU. If not, there is no reason to bring the thread to a halt; it can run happily forever. Internally the thread dispatcher still checks if the thread quantum has expired, but since no other thread is scheduled for this CPU its quantum is renewed without a context switch. It turns out that you need to cause actual contention on a CPU to make the thread dispatcher do some real work. The very busy while loops show up in the profiler like this:

image

There we see that on my Win 8.1 machine when two threads compete for the same CPU

  • Every  thread is run again after 32ms (TimeSinceLast).
  • It runs for 32ms (NewInSwitchTime).
  • While it did wait for the CPU for 32ms (Ready).
  • It did spend no time in some WaitForSingleObject synchronization primitive (Wait).
  • The previous thread is put to sleep because it has used up its thread quantum (OldWaitReason with value WrQuantumEnd).
  • During the selected time of 1,62s 54 context switch events did happen which is nearly optimal 1,62/0,032=50 context switches with 32ms time slices (Count).
  • The CPU was maxed out. 12,5% is one core on my 8 core machine (% CPU Usage).
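The "nearly optimal" context switch count in the last bullet is just duration divided by quantum length, and the same arithmetic predicts the other columns for two CPU-bound threads sharing one core. A sketch using the 32ms quantum measured above:

```python
QUANTUM_S = 0.032      # thread quantum on the measured Win 8.1 x64 box

def expected_switches(duration_s, quantum_s=QUANTUM_S):
    # Each runnable thread is switched out once per expired quantum
    return int(duration_s / quantum_s)

def per_thread_times(quantum_s=QUANTUM_S):
    # Two runnable threads on one core: run for one quantum, then sit in
    # the ready queue for one quantum while the other thread runs.
    return {"NewInSwitchTime": quantum_s, "Ready": quantum_s, "Wait": 0.0}

print(expected_switches(1.62))   # ~50 switches in the 1,62s selection
print(per_thread_times())        # 32ms running, 32ms ready, no blocking wait
```

The zero Wait value is the telltale sign of a purely CPU-bound thread: it never blocks, it only waits its turn in the ready queue.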

With context switch events you can find out pretty much anything you ever wanted to know about why your thread was not running. Thread scheduling is explained in some free e-books by a Windows kernel developer named David Probert. He has written quite a lot about the internal mechanics of a context switch. If you want to know more about the possible values of OldWaitReason you will find some kernel defines but no explanation. The best explanation for some values I have found is in the .NET documentation for the ThreadWaitReason enum. In reality you will find for OldWaitReason almost always the value Executive, which means that the thread was waiting inside a WaitForSingleObject and has given up its current quantum.

Below is a picture of a normal Visual Studio startup which prints the time gaps in us of the main thread when it had to wait for something. With wait I mean waiting for the CPU to become ready to run the thread (high Ready times), waiting on something to complete, or simply waiting for the next window message. As you can see we have a pretty busy message pump which has 3402 context switches where the total time gap was 450ms while no code did run on the main UI thread. Since there are so many of these small gaps it does not make much sense to search for useful work on the UI thread there. Instead you should check out single events where e.g. for 145ms nothing happened. Perhaps there is room for improvement there. I have not seen TimeSinceLast used often in performance analysis sessions so far, but it is definitely useful to check out where the largest execution gaps of a supposed-to-be-busy thread are lurking, because it is basically the sum of wait and ready times, which catches both thread scheduling issues and wait times for the completion of async operations.

image

After showing off so many details of CPU Usage (Precise) I need to come back (again) to our CPU consumption issue. With CPU Usage Precise we would see at most one call stack every 15-32ms when a thread is switched out because its quantum expired. This is usually a lousy sample rate which is polluted by the wait call stacks. In theory it should be possible to filter for WrQuantumEnd switch-out reasons to get the same data over a longer period of time like with sampling. In practice this does not work since in the VS startup example not a single thread is ever switched out because its quantum has expired. The table below shows a typical thread switch-out distribution. Executive means that the thread did enter a WaitForSingleObject, which is nearly the only reason why a thread is switched out.

OldWaitReason       Count
Executive            4377
WrQueue               100
WrDispatchInt          26
UserRequest            10
WrYieldExecution        8
WrLpcReceive            8
WrUserRequest           2
WrLpcReply              1
WrResource              1

After all of the cool context switch stuff we must downgrade to the second CPU view named CPU Usage (Sampled) since this is still the only view where we can reliably track down the call stacks which consume the most CPU.

image

I ignore threads for now to view an accumulated call stack of CPU usage across all threads in Visual Studio. There it is immediately obvious that Resharper is using a lot of CPU while VS is starting up although it is suspended. I am not sure why in version 8 of their product they still think that startup time is not important and they seem to fully initialize their stuff on the main UI thread. To get a more complete picture I can select an expensive method and group it by function to see who is calling it and how much CPU is consumed down the call chain.

image

Interesting: Resharper has its own brand of MEF embedded which does the initialization. They also seem to use concepts of TPL DataFlow to trigger their components from multiple threads. That certainly looks sophisticated but the CPU graph shows that very little multi threading is happening during startup. On my 8 core machine very little action goes above single core usage. I do not know why they need to initialize so much of their stuff on the UI thread only to add some menus to Visual Studio while being suspended.

Looks like we have a suspect. In case of doubt let's uninstall Resharper 8 and check again.

image

The CPU consumption has dropped from 9,6s down to 2,8s. Startup time has dropped to 8,2s.

Time in s          With Resharper 8    Without Resharper    Difference
CPU Consumption    9,6                 2,8                  6,8
Startup Time       13,1                8,2                  4,9

We have huge savings in terms of CPU consumption here which translate almost 1:1 into reduced startup time. The Dataflow stuff of Resharper is able to move 14% of the CPU work during VS startup onto background threads, which helps a little bit, but still the amount of consumed CPU is enormous for a disabled plugin. To be fair, on my Windows 8.1 box with an SSD the startup time of VS 2013 with Resharper is only 3,1s vs 1,9s without it. On good dev machines this is not a big deal but on slower machines a disabled Resharper is quite noticeable. That's it for the CPU intensive stuff. Now let's check out the flat lines in CPU consumption.

What is causing such long delays during the startup of VS? It looks like the main thread is blocked for a long time. After reading so far your natural reaction must be: Open CPU Usage Precise and check out the Wait and Ready times! In this case the UI thread was blocked by something not very complex:

image

While Visual Studio is starting up it tries to initialize Direct3D as part of a WPF application. Then NVidia kicks in (nvd3dum.dll), which calls into some nvSCPAPI.dll which seems to poll for something over 5,4s. I have no idea why this is happening but it feels very wrong. The machine in question has an NVidia Quadro card installed and only NVidia knows why on some machines (not all machines exhibit this behavior) it tries to enable support for stereoscopic gaming. Since NVidia does not release pdbs for their drivers I cannot dig any deeper here, but this is a major issue which hits all WPF enabled applications running on NVidia boards under still unknown circumstances. To prove that the nvSCPAPI.dll was to blame I simply renamed the folder where the dll is located

C:\Program Files (x86)\NVIDIA Corporation\3D Vision

to something different. Now the startup time is down to 2,6s which feels really good. I kept the renamed folder since I have no need for stereoscopic gaming.

image

It is nice to see that WPT can help in these simple cases to find the root cause within minutes. I hope that JetBrains can make their good product even better by further reducing the startup time while it is suspended, and that NVidia fixes their drivers to stop searching in vain for hardware that is not installed on some machines. When NVidia fixes this issue it could help convince some people that it is not WPF and .NET that are slow, but that after a graphics card driver update some apps start much faster than before.

Fun With Process/Thread and IO Priorities


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/06/30/157203.aspx

When you want to schedule some not-so-important work you have the option to lower the thread priority of a thread or to change the base priority of the whole process. That can help to keep Windows responsive while CPU intensive tasks (e.g. some processes with Parallel.For instances) can use all cores if nothing else is going on. For that purpose you can change the process priority via SetPriorityClass, or if you are using .NET you can use the Process.PriorityClass property to get and set it. Please note that these methods have nothing to do with classes and object oriented programming at all. The xxxx_Class values define the base priority for all threads in a process. This value can be changed at runtime as well, e.g. via the task manager. The Windows thread scheduler "adds" your thread priority to your thread's base priority, which yields the effective thread priority. Thread priorities in Windows can have values from 0-31, but the many APIs that influence the effective thread priorities make things more indirect. Below is a table that shows the effect of setting the process priority (aka class) and the thread priority:

image

What is interesting is that if you set a process to Below Normal it can still interfere with threads from a Normal process if its thread priority is set to Highest. A Normal thread in a Normal process gets a thread base priority of 8, which is the same as a Below Normal process with a Highest priority thread. Which thread in a managed application runs with Highest priority? Yes, it is the finalizer. If you have much work in your finalizers in Below Normal processes then they compete with your Normal processes for CPU. This might catch you by surprise if you thought that Below Normal processes do not interfere with Normal processes.

There is an even more severe way in which threads of lower priority processes can compete with your foreground tasks. Windows has a nice service to keep your interactive application responsive by boosting its priority, so the window which just received keyboard input stays responsive. This is called priority boosting. Why should I care about priority boosts in a server application? Well, there are actually three reasons why Windows can decide to boost your thread priority (copied from MSDN):

  • When a process that uses NORMAL_PRIORITY_CLASS is brought to the foreground, the scheduler boosts the priority class of the process associated with the foreground window, so that it is greater than or equal to the priority class of any background processes. The priority class returns to its original setting when the process is no longer in the foreground.
  • When a window receives input, such as timer messages, mouse messages, or keyboard input, the scheduler boosts the priority of the thread that owns the window.
  • When the wait conditions for a blocked thread are satisfied, the scheduler boosts the priority of the thread. For example, when a wait operation associated with disk or keyboard I/O finishes, the thread receives a priority boost.

    You can disable the priority-boosting feature by calling the SetProcessPriorityBoost or SetThreadPriorityBoost function. To determine whether this feature has been disabled, call the GetProcessPriorityBoost or GetThreadPriorityBoost function.

I have marked the relevant sentence bold: If your server application has processes which are waiting for something their thread priority is also increased. The priority boost can add up to four (tested on my Win 8.1 x64 Box) to your current thread priority which can make your Lowest priority thread effectively a Highest priority thread!

Now we have all factors together to calculate the effective thread priority:

image

The effective thread priority is calculated from the process base priority, the current thread priority and an optional thread boost value. Although I have used the + operator in the formula it is not a simple addition of some values. Now you know how to read the thread priority and dynamic priority in Process Explorer. Base priority is the thread priority when the process priority and the thread priority have been "added" without any priority boost.
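The "addition" can be sketched as a table lookup plus a clamp. The base values and offsets below are taken from the documented scheduling priorities for the non-realtime classes (the Realtime class follows different rules and is omitted); the function names are my own:

```python
# Base priority contributed by the process priority class (non-realtime classes)
PROCESS_BASE = {"Idle": 4, "BelowNormal": 6, "Normal": 8,
                "AboveNormal": 10, "High": 13}

# Offset contributed by the thread priority
THREAD_OFFSET = {"Lowest": -2, "BelowNormal": -1, "Normal": 0,
                 "AboveNormal": 1, "Highest": 2}

def base_priority(process_class, thread_priority):
    # Process class and thread priority combined ("added")
    return PROCESS_BASE[process_class] + THREAD_OFFSET[thread_priority]

def dynamic_priority(process_class, thread_priority, boost=0):
    # Priority boosts raise the dynamic priority, capped at 15,
    # the top of the dynamic range for non-realtime processes
    return min(15, base_priority(process_class, thread_priority) + boost)

# A Normal thread in a Normal process vs. a Highest thread in a BelowNormal process
print(base_priority("Normal", "Normal"))              # 8
print(base_priority("BelowNormal", "Highest"))        # 8  -> they compete!
print(dynamic_priority("Normal", "Normal", boost=2))  # 10, as observed after waits
```

This reproduces both observations from this post: the finalizer of a BelowNormal process lands on the same base priority as a Normal thread in a Normal process, and a boost of 1-2 explains the dynamic priorities of 9 and 10 seen in Process Explorer.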

image

Dynamic priority is then the effective boosted thread priority value the OS scheduler works with. Now we can check our knowledge with my little GCAllocations tester from a previous post. As a normal process it should work with threads of priority 8 most of the time:

image

As it turns out this is not the case. Due to the many context switches (waits) the thread priorities were mainly boosted to 9 and 10. This effect is also there if the console application is not a foreground window and has no keyboard input. It is only due to the fact that the OS boosted your thread priorities because it thought it would be a good idea to give you some time to work after you were woken up. Can we prevent this somehow? Yes, there is an API called SetProcessPriorityBoost; when you call it with the second parameter set to 1 you disable the priority boost feature. Or you use

Process.GetCurrentProcess().PriorityBoostEnabled = false;
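The raw Win32 API mentioned above can be called via P/Invoke as well. The following is a minimal sketch (Windows only; the wrapper class name and error handling are mine for illustration, the kernel32 signatures are the documented ones):

```csharp
using System;
using System.Runtime.InteropServices;

static class BoostSwitch
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr GetCurrentProcess();

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool SetProcessPriorityBoost(IntPtr hProcess, bool bDisablePriorityBoost);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetProcessPriorityBoost(IntPtr hProcess, out bool bDisablePriorityBoost);

    public static void DisableBoost()
    {
        // true == disable dynamic priority boosting for all threads of this process,
        // equivalent to Process.GetCurrentProcess().PriorityBoostEnabled = false.
        if (!SetProcessPriorityBoost(GetCurrentProcess(), true))
            throw new InvalidOperationException("SetProcessPriorityBoost failed: " + Marshal.GetLastWin32Error());

        // Read the setting back to verify it took effect.
        GetProcessPriorityBoost(GetCurrentProcess(), out bool disabled);
        Console.WriteLine("Boost disabled: " + disabled);
    }
}
```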

image

Now that looks more like what we would expect from a well behaved process. So should we turn off Priority Boosting and do we gain anything from it? To check this I have run GCAllocations

  • Single Instance
  • Two instances bound to the same four cores, where four threads allocate in parallel. One process has Normal priority, the other BelowNormal.
    • One time with Priority Boosting enabled
    • Second time with Priority Boosting disabled.

I have measured for each instance how long it needed to execute, how much CPU it used and the thread ready times. Below are the results:

image

What can we learn from these graphs? First, if there is no contention on the cores we are fastest with 2,7s for the single instance which could run undisturbed on all four cores. When there is contention on the four cores (two instances running at the same time) things become slower by ca. 20%. The Normal priority process and the one with priority boost enabled show no difference in run time. But the BelowNormal process with priority boosting enabled gains ca. 7% in run time, which is sort of unexpected. The ready times are quite high with 6,5 and 6,3s for these processes, but since 4 threads are running at the same time we have to divide the process ready time by the number of active threads to get an estimation of the increased run time. This is shown in the Projected Duration graph where ReadyTime/4 is added to the single instance duration and graphed for each process. This approximation is not perfect either because it neglects the CPU usage of each process, which differs quite a lot. It is interesting that the BelowNormal process had the lowest total CPU consumption of 5,8s. How can this be? I did check with the sample profiler where the CPU consumption differs. It turns out that with priority boosting enabled the page faults from the Windows memory manager are much less costly.

image

This is a nice bonus for boosting the thread priority. In this sample application priority boosting had no negative effect on the foreground process and the background activity became 7% faster. From these experiments I would say it is safe to leave CPU priority boosting enabled. That's enough about thread and process priorities for now. There is another priority which was introduced with Windows Vista: IO priority. There are no direct APIs to influence the IO priority except if you call SetThreadPriority:

enum ThreadPriority
    {
        THREAD_MODE_BACKGROUND_BEGIN = 0x00010000,
        THREAD_MODE_BACKGROUND_END = 0x00020000,
        THREAD_PRIORITY_ABOVE_NORMAL = 1,
        THREAD_PRIORITY_BELOW_NORMAL = -1,
        THREAD_PRIORITY_HIGHEST = 2,
        THREAD_PRIORITY_IDLE = -15,
        THREAD_PRIORITY_LOWEST = -2,
        THREAD_PRIORITY_NORMAL = 0,
        THREAD_PRIORITY_TIME_CRITICAL = 15
    }


    [DllImport("kernel32.dll", SetLastError = true)]
    extern static int SetThreadPriority(IntPtr hThread, ThreadPriority nPriority);

    // Will set in a normal process the thread priority to 4 and the IO priority to Very Low
    SetThreadPriority(GetCurrentThread(), ThreadPriority.THREAD_MODE_BACKGROUND_BEGIN);

    // Do low priority work and do not forget to reset it when you want normal scheduling again
    SetThreadPriority(GetCurrentThread(), ThreadPriority.THREAD_MODE_BACKGROUND_END);

 

That can help to keep the hard disc responsive while you are indexing some data or running some low priority background task. There are actually 5 levels of IO priorities which can only be set for your process by calling the undocumented function NtSetInformationProcess. From a normal process you can set the IO priority only to 0-2 but not higher than Normal.

    [DllImport("kernel32.dll", SetLastError = true)]
    extern static IntPtr GetCurrentProcess();

    [DllImport("ntdll.dll", SetLastError = true)]
    public static extern int NtSetInformationProcess(IntPtr processHandle,
        PROCESS_INFORMATION_CLASS processInformationClass, ref IntPtr processInformation, uint processInformationLength);

    enum IOPriority : uint
    {
        VeryLow = 0,
        Low = 1,
        Normal = 2,
    }

    static int SetIOPriority(IntPtr hProcess, IOPriority newPrio)
    {
        IntPtr ioPrio = (IntPtr)newPrio;
        int lret = NtSetInformationProcess(hProcess, PROCESS_INFORMATION_CLASS.ProcessIoPriority, ref ioPrio, 4);
        return lret;
    }

ProcessIoPriority is a constant with the value 21 if you want to try it out. There are not many use cases out there except if you want to penalize some processes which scan your hard disc for no real purpose except to make your IT department happy. In that case it can help to give such processes a lower IO and process priority to teach them not to interfere with foreground processes.
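Putting the pieces together, a self-contained sketch that lowers the IO priority of the current process could look like this. The enum value 21 is the constant mentioned above; everything else is illustrative plumbing around the undocumented call, so treat the details as assumptions (Windows only):

```csharp
using System;
using System.Runtime.InteropServices;

static class IoPriorityDemo
{
    enum PROCESS_INFORMATION_CLASS : uint { ProcessIoPriority = 21 } // 21 as stated in the text

    enum IOPriority : uint { VeryLow = 0, Low = 1, Normal = 2 }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr GetCurrentProcess();

    [DllImport("ntdll.dll")]
    static extern int NtSetInformationProcess(IntPtr processHandle,
        PROCESS_INFORMATION_CLASS processInformationClass, ref IntPtr processInformation, uint processInformationLength);

    public static void Run()
    {
        // Lower our own IO priority before scanning the disc ...
        IntPtr prio = (IntPtr)IOPriority.VeryLow;
        int status = NtSetInformationProcess(GetCurrentProcess(), PROCESS_INFORMATION_CLASS.ProcessIoPriority, ref prio, 4);
        if (status != 0) // NTSTATUS: 0 == STATUS_SUCCESS
            throw new InvalidOperationException("NtSetInformationProcess failed: 0x" + status.ToString("X"));

        // ... do the low priority disc work here, then restore the default.
        prio = (IntPtr)IOPriority.Normal;
        NtSetInformationProcess(GetCurrentProcess(), PROCESS_INFORMATION_CLASS.ProcessIoPriority, ref prio, 4);
    }
}
```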

If you design server software which must stay responsive you must ensure that processes which require interactive performance are not held up by background jobs. It makes sense to ban background processes to lower process priorities to ensure that all cores can be utilized by your users if necessary. It is of no help if e.g. SQL Server uses all of your cores to index its database again, because the interactive users' processes will suffer from latency issues. Thread ready times can become arbitrarily high if you are in a CPU oversubscription scenario. In that case all Normal priority processes should belong to interactive users. This way you get a scalable server which can support up to N concurrent users with varying background load. If latency issues show up then the only valid reason should be that too many users are logged onto your server. But it should never happen that background activities disturb interactive tasks.

It is important to note that all priorities are used by the OS inside priority queues which are worked from the top down to the lowest priorities with some mechanisms to prevent starvation. But these priorities have absolutely no effect if you have e.g. no contention on a CPU. When only your task sits in a priority queue which is empty you will be served immediately. To see these priorities in action you need to load your system with >=100% CPU intensive tasks to watch how things are sorted out. SQL Server will not become slower if the system load never reaches 100% and you have lowered its CPU priority. But it can help to mitigate spikes of SQL server which can influence interactive tasks if the system is heavily loaded.

Today's software is often written in a scalable way by using all cores that are present. This is nice for a single process using all cores with no other activity. But if you have a system where more than one thing needs to run concurrently such "scalable" software is a recipe for disaster. Try running some batch scripts in parallel which use a multi core enabled compression algorithm and watch your system: it will not respond to user input at all anymore. Some time ago a colleague wanted to speed up a batch script by starting a tool 100 times concurrently. I told him, before he pressed enter to start it, that the script could render the system unresponsive. After letting it run for 10 minutes and seeing no progress he pressed the power button. Had he started the script with Idle process priority he might have had a chance to kill the offending processes.

If you start batch scripts in parallel to your normal workload you should try out

start /LOW xxxx.cmd

to start it with Idle process priority. That can make a huge difference when you sit in front of the machine and want to continue to work. You can even restrict it to run only on some cores, e.g. start /LOW /AFFINITY 0xF xxxx.cmd runs your batch script with Idle priority on only the first four cores.

If you want to check out thread IO priorities you cannot count on Process Explorer. These internals are only displayed by Process Hacker, a Process Explorer clone with more options and fewer bugs. The best feature of it is certainly that it supports Ctrl+C in all dialogs, which is a big plus if you want to get data out of the UI.

Thread Contention Effects


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/09/14/159021.aspx

Recently I have seen an interesting performance question on SO which deals with Parallel.For in .NET. The code was basically

            Parallel.For(0, 1000 * 1000 * 1000, i => { });

The actual work is simulated by an empty delegate, which is an easy way to measure the overhead of Parallel.For. When you let this run on some multicore machines in x86 Release you get numbers like

image


The running time is not constant as one might expect; sometimes there are slowdowns in the order of 6 times or more. With WPT I was quickly able to find the root cause: the parallel for loop counter increment method

clr.dll!COMInterlocked::ExchangeAdd64 

It is used to increment the loop counter in a lock free way. With all cores in use we have massive memory contention on it, but not all the time. By measuring the counts per thread I could see that sometimes one thread gets much less work than the others, but there was no clear pattern which related to the slowdowns. Since we deal here with memory contention it should surface in L2/L3 cache misses, because some other thread constantly manages to overwrite the value while this one is rarely able to update the counter.

The ExchangeAdd64 method basically uses the following approach:

static void InterlockedAdd()
{
    long old, @new;
    do
    {
        old = Counter;
        @new = Counter + Increment;
    } while (old != Interlocked.CompareExchange(ref Counter, @new, old));
}

It adds some value to the existing counter and then tries to atomically exchange it. This seems to work fine all the time in x64 but not if you let it run as a 32 bit application.
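The same compare-and-swap pattern can be tried out in a small self-contained program; the class and parameter names below are mine, not from the original code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class CasDemo
{
    static long Counter; // shared value all threads contend on

    public static long Run(int threads, int incrementsPerThread)
    {
        Counter = 0;
        // Same lock free add that ExchangeAdd64 performs: read, compute, then
        // publish only if nobody else changed the value in between, else retry.
        Parallel.For(0, threads, _ =>
        {
            for (int i = 0; i < incrementsPerThread; i++)
            {
                long old, @new;
                do
                {
                    old = Volatile.Read(ref Counter);
                    @new = old + 1;
                } while (old != Interlocked.CompareExchange(ref Counter, @new, old));
            }
        });
        return Counter;
    }

    static void Main()
    {
        // Contention only costs time; no increments are ever lost.
        Console.WriteLine(CasDemo.Run(4, 250_000)); // prints 1000000
    }
}
```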

The 64 bit code uses

lock xadd qword ptr [rdx+8],rax ds:xxxxx

whereas the x86 version uses

lock cmpxchg8b qword ptr [edi] ds:xxx

Both approaches use long (64 bit) values although the Parallel.For counter we used is only an integer counter. This seems to cause funny side effects in the CPU cache lines. There exist no Windows performance counters to measure cache miss rates, which makes it hard to verify this claim. But Intel has released the free Performance Counter Monitor (pcm.exe) which offers much deeper insights into what your processor is currently doing. To compile it and get it working is not entirely straightforward since you need to compile a kernel driver and install it test signed, or you download a publicly signed version of it from the net. The Intel devs seem to be more Linux guys since the project files can be compiled with VS2013 but for every project the binaries end up in a sub folder of the project in the Debug folder (even for Release). There is NO common bin folder where all binaries are compiled together. Anyway. The most interesting tool is pcm.exe which can show you per core statistics on how many instructions have been executed and what the cache hit/miss counters are for the L2/L3 caches. The executed instructions per core show a sharp decrease which explains why the running time of the Parallel.For loop is sometimes so slow. But we still do not see the reason for the drop in execution performance.

 

image

But when I look at L3 cache misses it becomes clear that one core seems to get into a death spiral where it consistently loses the race to update this value for several seconds. This distorts the parallel execution of instructions because the data needs to be read from the far slower main memory many times while the CPU caches are synchronized.

 

image

I am not sure how to fix it since it should not happen at all, and it will hopefully be fixed by a future CPU version. It makes no sense to emit extra code for such corner conditions. This example makes a nice demo of what memory contention can do to your application performance. If you have no contention any lock free version will be fast. But if you have many concurrent accesses to the same memory locations you can get much longer running times while at the same time you waste enormous amounts of CPU cycles. Does this matter? Well yes it does if you use e.g. the lock keyword in .NET, which internally first spins a bit in the hope that the lock will be released fast. If not it will take a real lock and stop wasting CPU cycles. If you have e.g. 5 threads updating some central variable and you use the lock keyword you will find many burned CPU cycles only to increment a counter. The Interlocked methods offer good alternatives to locks for simple tasks. The Interlocked methods are nice, but the JIT compiler will not always replace your code directly with the corresponding CPU instruction; you pay for an extra method call which can cost you in the worst case a factor 10 in performance, because you need to swap out quite a lot of data from registers onto the thread stack.
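To get a feeling for the difference you can time both variants yourself. This is a rough sketch (class and constant names are mine); the absolute numbers depend heavily on CPU, core count and contention, but both counters always end up correct:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class LockVsInterlocked
{
    static readonly object Sync = new object();
    public static long LockCounter;
    public static long CasCounter;

    static void Time(string name, Action body)
    {
        var sw = Stopwatch.StartNew();
        body();
        Console.WriteLine($"{name,-12}: {sw.ElapsedMilliseconds} ms");
    }

    public static void Run(int threads, int incrementsPerThread)
    {
        LockCounter = 0;
        CasCounter = 0;

        // Monitor based: spins briefly, then takes a real lock under contention.
        Time("lock", () => Parallel.For(0, threads, _ =>
        {
            for (int i = 0; i < incrementsPerThread; i++)
                lock (Sync) { LockCounter++; }
        }));

        // Lock free: a single interlocked add plus the method call overhead.
        Time("Interlocked", () => Parallel.For(0, threads, _ =>
        {
            for (int i = 0; i < incrementsPerThread; i++)
                Interlocked.Increment(ref CasCounter);
        }));
    }

    static void Main() => Run(4, 1_000_000);
}
```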

I really like the pcm tool from Intel because it gives me the ability to measure things. Previously I had to guess why this or that was slowing down my application. Now I can prove it. Here is some sample output of it so you can check if the tool might be useful to you as well.


EXEC  : instructions per nominal CPU cycle
IPC   : instructions per CPU cycle
FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state
L3MISS: L3 cache misses
L2MISS: L2 cache misses (including other core's L2 cache *hits*)
L3HIT : L3 cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
READ  : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
IO    : bytes read/written due to IO requests to memory controller (in GBytes); this may be an over estimate due to same-cache-line partial requests
TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature


Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK  | READ  | WRITE |  IO   | TEMP

   0    0     0.08   1.05   0.08    0.31     355 K   1013 K    0.65    0.32    0.21    0.08     N/A     N/A     N/A     66
   1    0     0.00   0.20   0.01    0.31     111 K    226 K    0.51    0.17    0.53    0.13     N/A     N/A     N/A     66
   2    0     0.01   0.27   0.02    0.30     193 K    517 K    0.63    0.19    0.35    0.15     N/A     N/A     N/A     68
   3    0     0.08   1.00   0.08    0.53     706 K   1838 K    0.62    0.50    0.38    0.17     N/A     N/A     N/A     68
   4    0     0.01   0.38   0.03    0.32     247 K    667 K    0.63    0.22    0.32    0.13     N/A     N/A     N/A     66
   5    0     0.01   0.56   0.02    0.40     177 K    479 K    0.63    0.26    0.37    0.15     N/A     N/A     N/A     66
   6    0     0.01   0.35   0.03    0.32     235 K    500 K    0.53    0.24    0.37    0.10     N/A     N/A     N/A     70
   7    0     0.02   0.43   0.05    0.57     116 K    807 K    0.86    0.41    0.11    0.21     N/A     N/A     N/A     70
-----------------------------------------------------------------------------------------------------------------------------
SKT    0     0.03   0.70   0.04    0.38    2145 K   6051 K    0.65    0.37    0.29    0.14     N/A     N/A    0.00     63
-----------------------------------------------------------------------------------------------------------------------------
TOTAL  *     0.03   0.70   0.04    0.38    2145 K   6051 K    0.65    0.37    0.29    0.14     N/A     N/A    0.00     N/A

Instructions retired:  917 M ; Active cycles: 1318 M ; Time (TSC): 3986 Mticks ; C0 (active,non-halted) core residency: 10.81 %

C1 core residency: 15.87 %; C3 core residency: 0.96 %; C6 core residency: 0.00 %; C7 core residency: 72.36 %;
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

PHYSICAL CORE IPC                 : 1.39 => corresponds to 34.79 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.06 => corresponds to 1.44 % core utilization over time interval


Your Application The System And A Performance Bug


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/09/21/159251.aspx

When you have a performance bug you usually look at your code to find out why something takes longer than expected. In conjunction with a debugger this is a great approach for easy bugs. Things get much harder if the issue only surfaces sporadically at customer sites. Application logging which is always on helps you to see that there was a problem, but since logging must not flood the hard disc you cannot log every method call with all parameters like IntelliTrace tries to do. The next level is application tracing, which can generate enormous amounts of output and is a good method to find production issues on customer sites. Now you see in your traces that something takes much longer than normal but you still have no clue WHY. This is the time where system wide profiling with ETW (yes I keep coming back to it) is a nice way to drill deeper and solve otherwise impossible bugs. Performance counters are also a good approach to check if something strange is going on, but then you still (usually) have no clue what the reason for the slowdown was. It is therefore best to bring your application specific tracing and ETW together in WPA, so you can find the slowdown in your application traces and then drill deeper into the other ETW profiling events.

As a rule of thumb, if you have carefully added tracing to your application you have also written the best trace viewer in the world to visualize your traces. No problem: you can write an extractor tool to store your traces as a text file in your own format which can be viewed by your good old trace viewer as usual (Tx is a nice library to extract events from an etl file). If you do want to stay within WPA you certainly want a better viewer than a plain list of your application traces. I use application traces to get an idea what the application was doing at a high level and then drill down to the details once I have found the slowdown in the traces. It would be cool to see in WPA your traces per thread with a duration timeline so I can see which methods take longest to execute. That allows a call stack like view where I can simply see in the graph which operations take most of the time.

image

The only issue with WPA is that in the graph there is no expand menu item (in the list view you can expand, but there seems to be a limit on how many items in the graph can be expanded) and there is no possibility to easily filter for regions e.g. with a method execution time >0,1s. But there are workarounds. You need to bring up the advanced graph settings dialog

image

image

and to force the graph to expand all items you can define in the Expansion tab a query that matches all regions.

image

Then you get in the UI graph a nice per thread expansion of all called methods. Once you have visually found an interesting hot spot it is easy to drill deeper into the CPU Usage Precise graph to find unexpected waits for virtually anything.

You can check it out for yourself when you write your traces to ETW, e.g. with this little ETW tracer which depends on the EventSource NuGet package.

[EventSource(Name = "ETWTracer", Guid = "950FD22D-086C-4D16-98A0-FCC2B8DE2B07")]
public sealed class Tracer : EventSource
{
    #region Singleton instance
    public static Tracer Log = new Tracer();
    #endregion

    [Event(1, Keywords = Keywords.InOut, Task = Tasks.InOut, Opcode = EventOpcode.Start)]
    public void MethodEnter(string Method)
    { WriteEvent(1, Method); }

    [Event(2, Keywords = Keywords.InOut, Level = EventLevel.Verbose, Task = Tasks.InOut, Opcode = EventOpcode.Stop)]
    public void MethodLeave(string Method, int DurationInus)
    { WriteEvent(2, Method, DurationInus); }

    [Event(3, Keywords = Keywords.Info, Task = Tasks.Info, Opcode = EventOpcode.Info)]
    public void Info(string Method, string Message)
    { WriteEvent(3, Method, Message); }

    #region Keywords / Tasks / Opcodes
    public class Keywords   // This is a bit vector
    {
        public const EventKeywords InOut = (EventKeywords)0x0001;
        public const EventKeywords Info = (EventKeywords)0x0002;
    }

    public class Tasks
    {
        public const EventTask InOut = (EventTask)0x1;
        public const EventTask Info = (EventTask)0x2;
    }
    #endregion
}

With this little class we can define a method/enter leave tracer with just a few lines

class SimpleTracer : IDisposable
{
    string Method;
    Stopwatch Watch = Stopwatch.StartNew();

    public SimpleTracer(string method)
    {
        Method = method;
        Tracer.Log.MethodEnter(method);
    }

    public void Info(string message)
    {
        Tracer.Log.Info(Method, message);
    }

    public void Dispose()
    {
        Watch.Stop();
        // Convert to microseconds before casting so sub-millisecond durations are not truncated to 0
        Tracer.Log.MethodLeave(Method, (int)(Watch.Elapsed.TotalMilliseconds * 1000));
    }
}

and use it right away
void Method1()
{
    using (new SimpleTracer("Method1"))
    {
        Thread.Sleep(1000);
        using (new SimpleTracer("Method2"))
        {
            using (new SimpleTracer("Method3"))
            {
                Looper();
                using (new SimpleTracer("Method4"))
                {
                }
            }
        }
    }
}

With PerfView you can enable collection for your brand new ETW event provider by specifying its Guid. I still prefer to use the Guid and register my own ETW providers in the normal way with EventRegister, since only PerfView and WPA know how to do the correct rundown based on the provider name, by calculating its Guid by convention when the ETW provider class name is used with a * before the name.

image

Now it is time to load the generated data and load our own region of interest file. Go to Trace - Trace Properties and add your region file to it.

image

Then you get a new graph in the System Activity diagram list that you can add to your current analysis pane as usual.

image

You might be interested in how the regions file looks. Here it is:

TraceRegions.xml

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<InstrumentationManifest>
  <Instrumentation>
    <Regions>
      <!-- This guid is simply a random number -->
      <RegionRoot Guid="{65D8D484-90A9-4BA4-9247-51E7C159F000}" Name="" FriendlyName="">
        <!-- This guid is simply a random number too -->
        <Region Guid="{65D8D484-90A9-4BA4-9247-51E7C159F001}" Name="" FriendlyName="">
          <Start>
            <!-- Method enter event. The guid is your ETW provider guid. Event id and version must match! -->
            <Event Provider="{950FD22D-086C-4D16-98A0-FCC2B8DE2B07}" Id="1" Version="0"/>
          </Start>
          <Stop>
            <!-- Method leave event. The guid is your ETW provider guid. Event id and version must match! -->
            <Event Provider="{950FD22D-086C-4D16-98A0-FCC2B8DE2B07}" Id="2" Version="0"/>
          </Stop>
          <Naming>
            <!-- The regions get the method names. It is a pity that no concatenation of more than one field is supported. -->
            <PayloadBasedName Field="Method"/>
          </Naming>
          <Match>
            <!-- Parent event must be from the same thread, otherwise strange durations and correlations would be displayed as region -->
            <Parent TID="true"/>
            <!-- The start/stop events must originate from the same thread -->
            <Event TID="true">
              <!-- Region start/stop must come from the same method. You can use several payload rules here to match several conditions. -->
              <Payload FieldName="Method" TargetFieldName="Method"/>
            </Event>
            <!-- Nesting does not really work although it would be a nice feature to get a call stack like view.
                 The main problem is that the region matching algo does not take the nearest possible event as parent but the first matching one,
                 which is almost always wrong: if you call yourself in a loop the nesting is awkward.
                 It would work if we gave each method enter/leave a unique activity guid so we could nest based on matching activity guids,
                 but no one is writing traces out this way.
            <SelfNest TID="true"></SelfNest>
            -->
          </Match>
        </Region>
      </RegionRoot>
    </Regions>
  </Instrumentation>
</InstrumentationManifest>

As you can see in the comments there are still quite a few things to be desired to make regions even more user friendly, but they are a nice way to get a visual "call stack" like view with durations as bars for one or all threads. The call stack view works only if you have no recursive or duplicate method names; otherwise the SelfNest feature does not work correctly. If more than one parent region is possible the first one in time is chosen and not the nearest one on the current thread, which is unfortunate if you want a more call stack like view. Zooming into the graph works well for the x axis, but there is no way to zoom out on the y axis to see where the time is spent when many threads are working at the same time.

Although many things are still left to be desired, the region based trace view is much better than the plain trace method enter/leave view as you get it for Generic Events:

image

This can be a killer feature if you have complex application tracing in your application and you want to combine your application know how with the system response to the actions of your application.

Advanced Memory Dump Debugging


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/09/29/159441.aspx

One common task while debugging is to search who owns/references a handle. For x86 processes you can simply search the full address space, but in x64 this approach no longer works. While debugging a hang from a memory dump I found out that the process MainWindowHandle had changed to a different window which was not the main window handle. Searching for the values of NativeWindow objects also left no trace on the managed heap, so it had to be some unmanaged window handle.

First I look at the Process objects which could contain interesting data

0:033> !DumpHeap -type System.Diagnostics.Process -stat
Statistics:
              MT    Count    TotalSize Class Name
000007fef3c7d090        1           24 System.Diagnostics.ProcessModuleCollection
000007fef3c76470        1           24 System.Diagnostics.ProcessThreadCollection
000007fef3c62c70        1           64 System.Func`2[[System.Diagnostics.Process, System],[System.Boolean, mscorlib]]
000007fef3c47718        8         2240 System.Diagnostics.Process
000007fef3c89140       57         3192 System.Diagnostics.ProcessThread
000007fef5b44458        2         3664 System.Object[]
000007fef3c71b90      291        16296 System.Diagnostics.ProcessModule
000007fef3c871c0      161        20608 System.Diagnostics.ProcessInfo

When I enable .prefer_dml 1 in Windbg I can click on the process instance and get a list of all 8 Process objects where I can look at each one. The -mt command with the Method Table (MT) as argument is the only way to get only Process objects and nothing else. This is exactly what the blue underlined links in Windbg do when you click on them.

0:033> !DumpHeap /d -mt 000007fef3c47718
         Address               MT     Size
0000000003b52658 000007fef3c47718      280     
0000000003c29970 000007fef3c47718      280     
0000000003c8b680 000007fef3c47718      280     
0000000003e192e0 000007fef3c47718      280     
0000000003e19d80 000007fef3c47718      280     
0000000003e1c818 000007fef3c47718      280     
0000000003e1ce90 000007fef3c47718      280     
0000000003e520d0 000007fef3c47718      280    

By clicking through the blue links I do not need to enter any command by hand which is quite convenient.

0:033> !DumpObj /d 0000000003b52658
Name:        System.Diagnostics.Process
MethodTable: 000007fef3c47718
EEClass:     000007fef38a2238
Size:        280(0x118) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System\v4.0_4.0.0.0__b77a5c561934e089\System.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
000007fef5b97de0  4002c1e       f0       System.Boolean  1 instance                1 haveProcessId
000007fef5b992b8  4002c1f       d8         System.Int32  1 instance             5848 processId
000007fef5b97de0  4002c20       f1       System.Boolean  1 instance                0 haveProcessHandle
000007fef3c4f2c8  4002c21       20 ...SafeProcessHandle  0 instance 0000000000000000 m_processHandle
000007fef5b97de0  4002c22       f2       System.Boolean  1 instance                0 isRemoteMachine
000007fef5b96508  4002c23       28        System.String  0 instance 0000000002b347d0 machineName
000007fef3c871c0  4002c24       30 ...stics.ProcessInfo  0 instance 0000000000000000 processInfo
000007fef5b992b8  4002c25       dc         System.Int32  1 instance          2035711 m_processAccess
000007fef3c76470  4002c26       38 ...sThreadCollection  0 instance 0000000000000000 threads
000007fef3c7d090  4002c27       40 ...sModuleCollection  0 instance 0000000000000000 modules
000007fef5b97de0  4002c28       f3       System.Boolean  1 instance                1 haveMainWindow
000007fef5b9a338  4002c29       b8        System.IntPtr  1 instance          11505b4 mainWindowHandle

The main question was who was owning the window handle. While examining all NativeWindow instances which wrap a window handle I could not find the handle in use by any managed window. In that case I want to search the complete address space which can be non trivial in a multi GB x64 application.

Luckily there exists a not so well known Windbg command to help you out. It is the !address command, which shows you all allocated regions of memory (mapped files, loaded dlls, thread stacks, native heaps, …). It also has scripting support to execute a command for every address region. Now I only need to search every allocated memory region for the window handle. %1 is the start address and %2 is the length of the memory region.

0:033> !address -c:"s -d %1 %2 11505b4"
00000000`00df6560  011505b4 00000000 00000004 00000000  ................
00000000`03b52710  011505b4 00000000 00000000 00000000  ................
00000000`28b57370  011505b4 00000000 0000200e 00000000  ......... ......
00000000`28ce4568  011505b4 00000000 000100d4 00000000  ................
00000000`4c603798  011505b4 00000000 00010052 00000000  ........R.......

Then I search one by one who references the pointers to the window handle

0:033> !address -c:"s -d %1 %2 28b57370"
00000000`1b6d2220  28b57370 00000000 000030f8 00000000  ps.(.....0......
00000000`1b6d2740  28b57370 00000000 000030f8 00000000  ps.(.....0......
00000000`1b6e5740  28b57370 00000000 00000000 00000000  ps.(............
00000000`1b706530  28b57370 00000000 00000003 00000000  ps.(............
00000000`203387e8  28b57370 00000000 00001fd7 00000000  ps.(............
00000000`20355450  28b57370 00000000 00000010 00000000  ps.(............
00000000`20448af8  28b57370 00000000 b60115d0 ffffffff  ps.(............
00000000`204854b0  28b57370 00000000 c15f5308 000007fe  ps.(.....S_.....
00000000`204d6a90  28b57370 00000000 00000000 00000000  ps.(............
00000000`28b53290  28b57370 00000000 00000000 00000000  ps.(............
00000000`2bb8f7a0  28b57370 00000000 00000000 00000000  ps.(............
00000000`2c0ffc40  28b57370 00000000 00000001 00000000  ps.(............

Knowing that some thread must hold the window handle, I searched all stack locations in the call stacks with method arguments. The memory region marked red is a region inside a thread stack.

!address
+        0`2b790000        0`2bb8c000        0`003fc000        Stack      [~33; 16d8.2bbc]
+        0`2bb8c000        0`2bb90000        0`00004000        Stack      [~33; 16d8.2bbc]

The chances are high that this thread is the owner of the window handle. From the call stack it becomes obvious:

0:051> ~33s
ntdll!NtWaitForMultipleObjects+0xa:
00000000`77c8186a c3              ret
0:033> kb
# RetAddr           : Args to Child                                                           : Call Site
00 000007fe`fdb01430 : 00000000`00000000 00000000`004dbe58 00000000`004dc200 00000000`77c8592e : ntdll!NtWaitForMultipleObjects+0xa
01 00000000`77a31723 : 00000000`2bb8f820 00000000`2bb8f810 00000000`00000000 00000000`2bb8f860 : KERNELBASE!WaitForMultipleObjectsEx+0xe8
02 00000000`77b48f7d : 00000000`2bb8f8a0 00000000`000927c0 00000000`00000001 00000000`000010f4 : kernel32!WaitForMultipleObjectsExImplementation+0xb3
03 00000000`77b462b2 : 00000000`2bb8f9f0 00000000`20481fb0 00000000`00000001 00000000`2bb8f9e0 : user32!RealMsgWaitForMultipleObjectsEx+0x12a
04 000007fe`ff67acd6 : 00000000`2bb8f9e0 00000000`00000000 00000000`2bb8f9f0 00000000`20481fb0 : user32!MsgWaitForMultipleObjectsEx+0x46
05 000007fe`ff79a2a2 : 00000000`80010115 00000000`2bb8faf0 00000000`80010115 00000000`203f1c22 : ole32!CCliModalLoop::BlockFn+0xc2
06 000007fe`c11c1e6d : 00000000`250ab220 00000000`20481f50 00000000`00000000 00000000`20481fb0 : ole32!CoWaitForMultipleHandles+0x102
07 000007fe`c11a33ba : 00000000`00000000 00000000`20481f50 00000000`00000000 00000000`00000000 : mshtml!CDwnTaskExec::ThreadExec+0x153
08 00000000`77a259ed : 00000000`00000000 000007fe`c10d0000 00000000`00000000 00000000`00000000 : mshtml!CExecFT::ThreadProc+0x4e
09 00000000`77c5c541 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
0a 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

It was a hosted Internet Explorer window which had become active for a few tenths of a second, breaking code which assumed that MainWindowHandle is the main window of the application, and causing sporadic hangs on some automated test agents.

There is one more thing. When you are working in a bigger software shop you have some form of continuous integration and automated testing after a build has succeeded. Usually the tests are automatically deployed to several test agents to run in parallel. This makes it a challenge to debug sporadically failing / hanging tests. What do you do? Try to repro on your machine, add tracing, run the tests in a loop in isolation? There is one thing you can do to get down to the root cause of sporadically failing tests without all of that:

[Test]
public void Test()
{
    try
    {
        // blahh bad test

    }
    catch (Exception)
    {
        ProcessDumper.DumpProcess(); // Take a dump of your own process, or several more if you need to
        throw;
    }
}

Too easy? If you have a hung test which only fails 1 in 50 times, and you cannot attach a debugger or add tracing because the error goes away if you change the timing, automatically taking memory dumps when the test fails can work miracles. This only makes sense if you can open the dump in Windbg. Even long time developers do not want to get into this arcane debugging stuff because "this happens only once a year, I do not want to learn all that strange stuff!". I can assure you that you will constantly take dumps once you know how to read them, because it is the fastest and most efficient way to diagnose random failures without much guessing at what the probable root cause could be.

For your test you can simply add procdump from the SysInternals suite as a resource, write the file to %temp%, and you are ready to create one or more dump files.
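The ProcessDumper helper in the test above is not a public library. A minimal sketch of such a helper could look like the following, assuming procdump.exe has already been extracted from the embedded resource to %temp% (the class and method names here are hypothetical):

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical sketch of the ProcessDumper helper used in the test above.
// It assumes procdump.exe (SysInternals) was embedded as a resource and
// has already been written to %TEMP%.
static class ProcessDumper
{
    // -accepteula suppresses the EULA dialog, -ma writes a full memory dump.
    public static string BuildArguments(int pid, string dumpFile)
        => $"-accepteula -ma {pid} \"{dumpFile}\"";

    public static void DumpProcess()
    {
        string procdump = Path.Combine(Path.GetTempPath(), "procdump.exe");
        string dumpFile = Path.Combine(Path.GetTempPath(),
            "TestFailure_" + DateTime.Now.ToString("yyMMdd_HHmmss") + ".dmp");
        string args = BuildArguments(Process.GetCurrentProcess().Id, dumpFile);

        var psi = new ProcessStartInfo(procdump, args) { UseShellExecute = false };
        using (var p = Process.Start(psi))
        {
            p.WaitForExit(); // block until the dump file has been written
        }
    }
}

class ProcessDumperDemo
{
    static void Main()
    {
        // Just show the command line that would be executed for pid 1234.
        Console.WriteLine(ProcessDumper.BuildArguments(1234, @"c:\temp\crash.dmp"));
    }
}
```

DumpProcess waits for procdump synchronously, so the dump is complete before the test rethrows the exception.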

I am a huge fan of as much data as I can get to nail a hard to find issue. Memory dumps are an important part of gathering forensic data of crashes and hangs.

If you still do not like Windbg, here is a secret: You can get quite far with Visual Studio 2013 and its latest Service Packs! The only thing you must have are matching pdbs from a symbol server or your drop folder. The Parallel Stacks window in Visual Studio is a great way to check a memory dump with many threads. Besides, Windbg is not really good at resolving STL containers correctly in release builds.

The following code will cause an unhandled exception in C++ in release builds because of a double free call:

#include "stdafx.h"
#include <comutil.h>

bstr_t *g_globalVar;

DECLSPEC_NOINLINE void Allocate()
{
    g_globalVar = new bstr_t(L"Hello world");
    delete g_globalVar;
    _tprintf(L"Created and deleted");
}

DECLSPEC_NOINLINE void FreeASecondTime()
{
    delete g_globalVar;
    _tprintf(L"Deleted a second time");
}

int _tmain(int argc, _TCHAR* argv[])
{
    ::Sleep(10 * 1000); // Give some time to attach dumper
    Allocate();
    FreeASecondTime();
    return 0;
}

With procdump you can create a dump for e.g. first chance exceptions (you can even filter for specific exception types):

C:\>procdump -accepteula -ma -e 1  DoubleFree c:\temp\DoubleFreeDump.dmp

ProcDump v6.00 - Writes process dump files
Copyright (C) 2009-2013 Mark Russinovich
Sysinternals - www.sysinternals.com
With contributions from Andrew Richards

Process:               DoubleFree.exe (5872)
Number of dumps:       1
Hung window check:     Disabled
Exception monitor:     First Chance+Unhandled
Exception filter:      *
Terminate monitor:     Disabled
Dump file:             c:\temp\DoubleFreeDump_YYMMDD_HHMMSS.dmp


Press Ctrl-C to end monitoring without terminating the process.

[23:35:48] Exception: C0000005.ACCESS_VIOLATION

First-chance Exception.
Writing dump file c:\temp\DoubleFreeDump_140929_233548.dmp ...
Writing 10MB. Estimated time (less than) 1 seconds.
Dump written.

Dump count reached.

Now you use VS2013 with File - Open menu to open the dump file:

image

 

Now you need to press Debug with Native Only to start "debugging" and open the Call Stack Window to find the second free call:

image

 

This works with unmanaged, managed and mixed code as well. A very useful VS addition is the Parallel Stacks window which can help you a lot to find hangs in your code. For our single threaded application it does not look terribly useful, but if you have 100 threads it can help you a lot.

image

What can you do if you do not have the pdbs because the build ran, the test failed, binaries and pdbs were deleted after a few days, and you only have a memory dump left? It is not as bad as it sounds. Since you know the baseline of your software you can rebuild the dlls and pdbs on your local machine.

When you load the dump initially and try to get a stack trace you get:

0:000> kb
ChildEBP RetAddr  Args to Child              
WARNING: Stack unwind information not available. Following frames may be wrong.
00c8f430 003f1d05 00c8f5f0 00c8f7cc 7f82f000 DoubleFree+0x11b9c
00c8f510 003f180b 00c8f6d0 00c8f7cc 7f82f000 DoubleFree+0x11d05
00c8f5f0 003f191b 00c8f7c4 00c8f7cc 7f82f000 DoubleFree+0x1180b
00c8f6d0 003f1b0b 00000001 00c8f898 00c8f7cc DoubleFree+0x1191b
00c8f7c4 003f1d6c 00000000 00000000 7f82f000 DoubleFree+0x11b0b
00c8f898 003f2fe9 00000001 00e6cad8 00e6cb60 DoubleFree+0x11d6c
00c8f8e8 003f31dd 00c8f8fc 7717919f 7f82f000 DoubleFree+0x12fe9
00c8f8f0 7717919f 7f82f000 00c8f940 7744a22b DoubleFree+0x131dd
00c8f8fc 7744a22b 7f82f000 e2df5dd0 00000000 kernel32!BaseThreadInitThunk+0xe
00c8f940 7744a201 ffffffff 7743f20b 00000000 ntdll!__RtlUserThreadStart+0x20
00c8f950 00000000 003f10d2 7f82f000 00000000 ntdll!_RtlUserThreadStart+0x1b

Not terribly useful. But it gets better if you add to your symbol path the output folder where your newly built pdb resides.

For this test I have converted the project to a managed C++ project for reasons you will see shortly:

.sympath+  D:\Source\DoubleFree\Debug\

and then force Windbg to load it even if the pdb timestamps do not match:

.reload /i DoubleFree.exe

you get full fidelity with line numbers back:

0:000> !ClrStack
OS Thread Id: 0xd74 (0)
Child SP       IP Call Site
00e3f2f0 7743cd7c [InlinedCallFrame: 00e3f2f0] Unknown
00e3f260 00fe1522 <Module>._bstr_t.Data_t.Release(Data_t*) [c:\program files (x86)\microsoft visual studio 12.0\vc\include\comutil.h @ 773]
00e3f270 00fe1791 <Module>.FreeASecondTime() [c:\program files (x86)\microsoft visual studio 12.0\vc\include\comutil.h @ 639]
00e3f27c 00fe1181 <Module>.wmain(Int32, Char**) [d:\source\doublefree\doublefree.cpp @ 28]
00e3f2f0 00f5a135 [InlinedCallFrame: 00e3f2f0] 
00e3f2ec 00fe112e DomainBoundILStubClass.IL_STUB_PInvoke()
00e3f2f0 73dc2552 [InlinedCallFrame: 00e3f2f0] <Module>._wmainCRTStartup()
00e3f498 73dc2552 [GCFrame: 00e3f498] 

Unfortunately Visual Studio has no way to load mismatched pdbs even if you have rebuilt the dll and pdb from the same source code base.

 

image

But all is not lost. You can get Visual Studio to load mismatched pdbs:

image

How? There is a little tool called chkmatch which modifies your "new" pdb to match an "old" binary. When you have a managed application you can save the dll to e.g. c:\temp with the managed debugger extension command !SaveModule in Windbg, which is the reason why I switched from a pure C++ to a managed C++ application: to make it possible to dump a dll from memory to disc. The .writemem command did not work, so I had to revert to a managed host process to debug a pure C++ issue.

Then you can make the "old" exe match your new pdb:

C:\Windows\system32>chkmatch -m c:\temp\DoubleFree.exe c:\temp\DoubleFree.pdb
ChkMatch - version 1.0
Copyright (C) 2004 Oleg Starodumov
http://www.debuginfo.com/

Executable: c:\temp\DoubleFree.exe
Debug info file: c:\temp\DoubleFree.pdb

Executable:
TimeDateStamp: 5429d867
Debug info: 2 ( CodeView )
TimeStamp: 5429d867  Characteristics: 0  MajorVer: 0  MinorVer: 0
Size: 68  RVA: 00007aec  FileOffset: 00006eec
CodeView format: RSDS
Signature: {36317c4a-a90c-482a-a8a7-c96ac8050a2e}  Age: 1
PdbFile: D:\Source\DoubleFree\Release\DoubleFree.pdb
Debug info: 12 ( Unknown )
TimeStamp: 5429d867  Characteristics: 0  MajorVer: 0  MinorVer: 0
Size: 20  RVA: 00007b30  FileOffset: 00006f30

Debug information file:
Format: PDB 7.00
Signature: {7cec4b85-0aca-4680-b797-82ead65f95ad}  Age: 1

Writing to the debug information file...
Result: Success.

Voilà. You have matching pdbs and file and line numbers back to debug your dump file with Visual Studio. A common scenario where this is useful: your build stores only the binaries in a repository but, for build performance reasons, does not upload the pdbs. When you get dumps of failed test runs from some test machines, you perhaps still have the binaries but not the pdbs. In that case Windbg or the chkmatch tool can help you out to get back file and line information from otherwise unreadable call stacks.

I know that the VS product group does not want a "load mismatched pdbs" feature in VS, but it would be very useful if you have a dump but no exactly matching symbols. At least for the folks who do not want to learn Windbg for such "rare" cases.

Can You Execute Code While Waiting For A Lock?


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/10/07/159558.aspx

Threading can sometimes lead to impossible bugs where you have a hard time finding out how you got into this state. Let's suppose you have a nice WPF application which you dutifully test via the UIAutomation provider, which can alter some state in your application:

/// <summary>
/// Interaction logic for TestWindow.xaml
/// </summary>
public partial class TestWindow : Window
{
    int Data;

    public TestWindow()
    {
        InitializeComponent();
    }

    public void AlterData(int newValue)
    {
        Debug.Print("Executing AlterData {0}", Environment.StackTrace);
        Data = newValue;
    }

    protected override AutomationPeer OnCreateAutomationPeer()
    {
        var lret = new OwnPeer(this);
        return lret;
    }

    class OwnPeer : WindowAutomationPeer
    {
        public OwnPeer(TestWindow owner)
            : base(owner)
        { }

        protected override HostedWindowWrapper GetHostRawElementProviderCore()
        {
            Debug.Print("GetHostRawElementProviderCore called");
            ((TestWindow)base.Owner).AlterData(500);
            return base.GetHostRawElementProviderCore();
        }
    }
}

When you look at the call stacks all is normal:

Executing AlterData
   at System.Environment.GetStackTrace(Exception e, Boolean needFileInfo)
   at System.Environment.get_StackTrace()
   at ThreadingIsHard.TestWindow.AlterData(Int32 newValue) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 52
   at ThreadingIsHard.TestWindow.OwnPeer.GetHostRawElementProviderCore() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 69
   at MS.Internal.Automation.ElementProxy.InContextGetHostRawElementProvider(Object unused)
   at MS.Internal.Automation.ElementUtil.<>c__DisplayClass1.<Invoke>b__0(Object unused)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.DispatcherOperation.InvokeImpl()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Windows.Threading.DispatcherOperation.Invoke()
   at System.Windows.Threading.Dispatcher.ProcessQueue()
   at System.Windows.Threading.Dispatcher.WndProcHook(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at System.Windows.Threading.Dispatcher.PushFrameImpl(DispatcherFrame frame)
   at System.Windows.Application.RunInternal(Window window)
   at System.Windows.Application.Run()
   at ThreadingIsHard.App.Main() in d:\Source\UIAutomationSample\UIAutomationTest\obj\Debug\App.g.cs:line 0

So far so expected. Now let's make our well behaved application a little more unexpected and take a lock for some time. To make it simple I deadlock the main UI thread by trying to enter a lock which is owned by another thread, like this:

private void Button_Click(object sender, RoutedEventArgs e)
{
    Deadlock();
}

public void Deadlock()
{
    Barrier b = new Barrier(2);
    Thread t = new Thread(() =>
    {
        lock (this)
        {
            b.SignalAndWait();              // let main UI thread run into lock
            Thread.Sleep(Timeout.Infinite); // never leave the lock
        }
    });
    t.Start();

    b.SignalAndWait();
    lock (this) // deadlock which never releases the lock
    {
    }
}

If you have never seen Barrier: it is a class introduced with .NET 4.0 which is very handy for synchronizing threads to a specific point in time. Only when both threads have signaled are they both released at the same time. In our case this ensures a nice deadlock. Here is the quiz:
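The rendezvous behavior of Barrier can be seen with a few lines of plain console code (a sketch, not taken from the application above): neither thread passes SignalAndWait until both have signaled, so both "before" lines are always logged before either "after" line.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Minimal Barrier(2) demo: SignalAndWait releases both threads together.
class BarrierDemo
{
    static void Main()
    {
        var log = new ConcurrentQueue<string>(); // preserves enqueue order
        var barrier = new Barrier(2);

        var worker = new Thread(() =>
        {
            log.Enqueue("worker before");
            barrier.SignalAndWait(); // released together with the main thread
            log.Enqueue("worker after");
        });
        worker.Start();

        log.Enqueue("main before");
        barrier.SignalAndWait(); // blocks until the worker has also signaled
        log.Enqueue("main after");
        worker.Join();

        foreach (var entry in log)
            Console.WriteLine(entry);
    }
}
```

The relative order of the two "after" lines is up to the scheduler, but they can never appear before a "before" line.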

Can something bad happen while the main UI thread is deadlocked?

As it turns out it can. Let's have a look at this impossible call stack:

Executing AlterData
   at System.Environment.GetStackTrace(Exception e, Boolean needFileInfo)
   at System.Environment.get_StackTrace()
   at ThreadingIsHard.TestWindow.AlterData(Int32 newValue) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 25
   at ThreadingIsHard.TestWindow.OwnPeer.GetHostRawElementProviderCore() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 44
   at MS.Internal.Automation.ElementProxy.InContextGetHostRawElementProvider(Object unused)
   at MS.Internal.Automation.ElementUtil.<>c__DisplayClass1.<Invoke>b__0(Object unused)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Internal.Automation.ElementUtil.Invoke(AutomationPeer peer, DispatcherOperationCallback work, Object arg)
   at MS.Internal.Automation.ElementProxy.get_HostRawElementProvider()
   at MS.Internal.Automation.UiaCoreProviderApi.RawUiaReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at MS.Internal.Automation.UiaCoreProviderApi.RawUiaReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at System.Windows.Automation.Provider.AutomationInteropProvider.ReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at System.Windows.Interop.HwndTarget.CriticalHandleWMGetobject(IntPtr wparam, IntPtr lparam, Visual root, IntPtr handle)
   at System.Windows.Interop.HwndTarget.HandleMessage(WindowMessage msg, IntPtr wparam, IntPtr lparam)
   at System.Windows.Interop.HwndSource.HwndTargetFilterMessage(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at System.Threading.SynchronizationContext.WaitHelper(IntPtr[] waitHandles, Boolean waitAll, Int32 millisecondsTimeout)
   at System.Windows.Threading.DispatcherSynchronizationContext.Wait(IntPtr[] waitHandles, Boolean waitAll, Int32 millisecondsTimeout)
   at System.Threading.Monitor.Enter(Object obj)
   at ThreadingIsHard.TestWindow.Deadlock() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 69
   at ThreadingIsHard.TestWindow.Button_Click(Object sender, RoutedEventArgs e) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 52
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.RaiseEventImpl(DependencyObject sender, RoutedEventArgs args)
   at System.Windows.Controls.Button.OnClick()
   at System.Windows.Controls.Primitives.ButtonBase.OnMouseLeftButtonUp(MouseButtonEventArgs e)
   at System.Windows.RoutedEventArgs.InvokeHandler(Delegate handler, Object target)
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.ReRaiseEventAs(DependencyObject sender, RoutedEventArgs args, RoutedEvent newEvent)
   at System.Windows.RoutedEventArgs.InvokeHandler(Delegate handler, Object target)
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.RaiseEventImpl(DependencyObject sender, RoutedEventArgs args)
   at System.Windows.UIElement.RaiseTrustedEvent(RoutedEventArgs args)
   at System.Windows.Input.InputManager.ProcessStagingArea()
   at System.Windows.Input.InputProviderSite.ReportInput(InputReport inputReport)
   at System.Windows.Interop.HwndMouseInputProvider.ReportInput(IntPtr hwnd, InputMode mode, Int32 timestamp, RawMouseActions actions, Int32 x, Int32 y, Int32 wheel)
   at System.Windows.Interop.HwndMouseInputProvider.FilterMessage(IntPtr hwnd, WindowMessage msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at System.Windows.Interop.HwndSource.InputFilterMessage(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at System.Windows.Threading.Dispatcher.PushFrameImpl(DispatcherFrame frame)
   at System.Windows.Application.RunInternal(Window window)
   at System.Windows.Application.Run()
   at ThreadingIsHard.App.Main() in d:\Source\UIAutomationSample\UIAutomationTest\obj\Debug\App.g.cs:line 0

Hm, the lock is still held, but how can AlterData be called while we are trying to enter the lock which was clearly not released? We even see the blocked Monitor.Enter and Deadlock frames in the call stack. To understand this you need to take a deep look back in history at STA threads and window message pumping. Monitor.Enter is not a replacement for WaitForSingleObject on an STA thread. Instead it calls into CoWaitForMultipleHandles which ultimately calls MsgWaitForMultipleObjectsEx. Here is the (simplified) call stack:

USER32!MsgWaitForMultipleObjectsEx
combase!CCliModalLoop::SetPointerInputModeAsAppropriateForQueueAttachmentConditions
combase!CCliModalLoop::BlockFn
combase!ClientCallRetryContext::StartTimer
combase!ClassicSTAThreadWaitForHandles
combase!CoWaitForMultipleHandles
System.Windows.Threading.DispatcherSynchronizationContext.Wait(IntPtr[], Boolean, Int32)
System.Threading.Monitor.Enter(System.Object) 
ThreadingIsHard.TestWindow.Deadlock() [d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs @ 69] 
ThreadingIsHard.TestWindow.Button_Click(System.Object, System.Windows.RoutedEventArgs) [d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs @ 52] 
System.Windows.EventRoute.InvokeHandlersImpl(System.Object, System.Windows.RoutedEventArgs, Boolean)

MsgWaitForMultipleObjectsEx is just like WaitForMultipleObjects with the important exception that the wait can also return when specific window messages are put into the window message queue. The official docs are silent about which window messages are pumped. I have tried

  • Keyboard press
  • Mouse move
  • Mouse button
  • WM_PAINT
  • BeginInvoke
  • Invoke
  • Dispatcher.Invoke

NONE of these window messages are pumped while the thread is blocked waiting for the lock. Chris Brumme says about this in his blog:

On lower operating systems, the CLR uses MsgWaitForMultipleHandles / PeekMessage / MsgWaitForMultipleHandlesEx and whatever else is available on that version of the operating system to try to mirror the behavior of CoWaitForMultipleHandles. The net effect is that we will always pump COM calls waiting to get into your STA. And any SendMessages to any windows will be serviced. But most PostMessages will be delayed until you have finished blocking.

Although the blog post is over 10 years old the information is still relevant. From what I have seen so far, all window messages are blocked except for COM calls, as is the case for UIAutomation calls, which use COM as the transport vehicle to marshal your UI automation calls across processes. If you have a custom UI Automation provider you should check twice whether you are calling into state altering methods of your UI objects which can be reentrantly called while your application is trying to get access to a lock. While you were waiting for the lock your window could already be disposed! Bad things can and will happen. UIAutomation is not only used for integration testing but also by screen readers and tools for people who have problems reading a computer screen. It is therefore possible that these issues show up not only during tests but also on customer machines. If you get mysterious crashes at places where you would not have expected them, check the call stack for AutomationInteropProvider calls. If there are some you most likely have hit a problem due to unexpected "continuations" while taking a lock. Unfortunately there is no really good solution known to me except to pay vigilant attention to your UI Automation provider objects, which must ensure not to call any methods which could alter the state of objects. If you have Begin/Invoke calls pending in your message queue these messages will still not get pumped. Only the direct COM calls and their side effects can tunnel through blocking calls. Perhaps I am telling you things you already knew, but this one was quite surprising to me. Perhaps it would be best to stop pumping COM messages at all so your calls nicely line up with the other window messages as they should. But I guess there is still too much legacy code out there which would deadlock if no message pumping occurred anymore. The current Windows 10 build still pumps messages as always. Perhaps we could get a flag to switch over to the "new" behavior to prevent such nasty problems.

Unexpected Window Message Pumping Part 2


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/10/12/159645.aspx

Although there was no part one, my other post "Can You Execute Code While Waiting For A Lock?" can be considered part one. I hope this will not become a larger series where code is executed in the wrong order. Suppose you want to execute the commands

  1. Init
  2. Action
  3. OtherAction
  4. Dispose
  5. Shutdown

in this order on the UI thread of an application from another thread. Then you would do something like this:

using System;
using System.Diagnostics;
using System.Threading;
using System.Windows.Forms;

namespace MessagePumping
{
    public partial class Form1 : Form
    {
        public SynchronizationContext WindowsFormsSynchronizationContext;

        public Form1()
        {
            InitializeComponent();
            var tmp = this.Handle;

            WindowsFormsSynchronizationContext = System.Windows.Forms.WindowsFormsSynchronizationContext.Current;
            BlockFor();
        }

        void BlockFor()
        {
            Barrier b = new Barrier(2);
            Thread t = new Thread(() =>
            {
                lock (this)
                {
                    b.SignalAndWait(); // let main UI thread run into lock
                    Thread.Sleep(300); // enter the lock on the UI thread

                    Action otherAction = () => Debug.Print("OtherAction");
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Init"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Action"), null);
                    WindowsFormsSynchronizationContext.Post((o) => otherAction(), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Dispose"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Shutdown"), null);
                }
            });
            t.Start();

            b.SignalAndWait();
            lock (this) // deadlock which never releases the lock
            {
            }
        }
    }
}
This will print the messages in order, just as you would expect:

Init
Action
OtherAction
Dispose
Shutdown

Now let's slightly change the code of BlockFor

 

void BlockFor()
{
    Barrier b = new Barrier(2);
    Thread t = new Thread(() =>
    {
        lock (this)
        {
            b.SignalAndWait(); // let main UI thread run into lock
            Thread.Sleep(300); // enter the lock on the UI thread

            Action otherAction = () => Debug.Print("OtherAction");
            WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Init"), null);
            WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Action"), null);
            WindowsFormsSynchronizationContext.Post((o) =>
            {
                WindowsFormsSynchronizationContext.Send((o2) => otherAction(), null);
            }, null);
            WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Dispose"), null);
            WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Shutdown"), null);
        }
    });
    t.Start();

    b.SignalAndWait();
    lock (this) // deadlock which never releases the lock
    {
    }
}

and check whether a Send issued while processing a posted message works as expected:

Init
Action
OtherAction
Dispose
Shutdown

The output is still the same. So far so good. Now keep the code but change something else, and we get:

Init
Action
Dispose
Shutdown
OtherAction

What has changed?

Still not it.

 

The SynchronizationContext! You were thinking you were using a WindowsFormsSynchronizationContext? No you weren't. WindowsFormsSynchronizationContext.Current does not return a WindowsFormsSynchronizationContext but a plain SynchronizationContext instance. The reason is that the static Current property comes from SynchronizationContext, so you were using the current SynchronizationContext, whatever it was. In the first run it was actually a WPF DispatcherSynchronizationContext instance which did work just as expected. But if we actually force the usage of a WindowsFormsSynchronizationContext by setting it explicitly

public Form1()
{
    InitializeComponent();
    var tmp = this.Handle;

    SynchronizationContext.SetSynchronizationContext(new WindowsFormsSynchronizationContext());
    WindowsFormsSynchronizationContext = System.Windows.Forms.WindowsFormsSynchronizationContext.Current;
    BlockFor();
}

Then we get the faulty behavior that Shutdown is executed before our OtherAction, which was not our intention. The first question is why we are getting this. When using window messages you are adding messages to a queue. When you post a message you add it at the end of the queue. The same happens if you Send a message, with the only difference that the call blocks until all pending messages and your message have been processed. Our OtherAction message is special since it calls Send on the message pump while it is being processed. The logical picture you have in mind is

image

 

but what actually happens is that, while your wrapper call is executed, it enqueues OtherAction at the end of the message queue and waits until OtherAction has been called. But since the Dispose and Shutdown messages come first, our OtherAction message will be called far too late.
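The reordering can be modeled without any UI at all. The following single-threaded sketch (a simplified model, not the real WinForms pump) treats Send-from-within-a-handler as what it effectively is in this scenario: another enqueue at the end of the queue that the handler then waits on.

```csharp
using System;
using System.Collections.Generic;

// Minimal single-threaded model of the behavior described above: a "Send"
// issued from inside a message handler appends to the END of the queue,
// so it runs only after everything already posted.
class QueueSim
{
    static readonly Queue<Action> queue = new Queue<Action>();
    static readonly List<string> log = new List<string>();

    static void Post(Action a) => queue.Enqueue(a);

    static void Main()
    {
        Post(() => log.Add("Init"));
        Post(() => log.Add("Action"));
        Post(() => Post(() => log.Add("OtherAction"))); // the "Send" re-enqueues
        Post(() => log.Add("Dispose"));
        Post(() => log.Add("Shutdown"));

        while (queue.Count > 0)
            queue.Dequeue()(); // pump the queue in FIFO order

        Console.WriteLine(string.Join(Environment.NewLine, log));
    }
}
```

The model reproduces the faulty order from above: Init, Action, Dispose, Shutdown, OtherAction.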

image

Ok that explains why WinForms sucks. But why did it work with the WPF Dispatcher? The WPF Dispatcher uses a little optimization: if you try to enqueue a message on the thread that runs the message pump then it will NOT be enqueued but directly executed. The WPF Dispatcher model is therefore much saner than the WinForms one and prevents you from shooting yourself in the foot. If you do not use the WinForms synchronization context but call Control.Begin/Invoke directly, you can fix this faulty model by checking InvokeRequired

Action otherAction = () => Debug.Print("OtherAction");
this.BeginInvoke((Action)(() => Debug.Print("Init")));
this.BeginInvoke((Action)(() => Debug.Print("Action")));
this.BeginInvoke((Action)(() =>
{
    if (this.InvokeRequired) // only enqueue to the message queue if we are being called from a different thread
    {
        this.Invoke((Action)(() => otherAction()));
    }
    else
    {
        otherAction(); // otherwise execute the callback directly
    }
}));
this.BeginInvoke((Action)(() => Debug.Print("Dispose")));
this.BeginInvoke((Action)(() => Debug.Print("Shutdown")));

The rule to check InvokeRequired before actually enqueuing messages works well and helps to alleviate the problem of unintended message reordering. Unfortunately this does not work with WindowsFormsSynchronizationContext since it has no InvokeRequired method to do the check yourself. Until then your best bet is to abandon the usage of WindowsFormsSynchronizationContext because of this, or simply use the WPF DispatcherSynchronizationContext which works well. But you should be aware that you then have up to three message queues in parallel. First there is the window message queue of your window. Then comes the WinForms message queue which uses its own window message to start working on queued messages. And as last queue you have the WPF message queue which also has its own private message queue and window message.

If you interleave WPF and WinForms messages in some calling code like this

public WindowsFormsSynchronizationContext WinFormsContext;
public DispatcherSynchronizationContext WPFContext;

public Form1()
{
    InitializeComponent();
    var tmp = this.Handle;

    WinFormsContext = new System.Windows.Forms.WindowsFormsSynchronizationContext();
    WPFContext = new DispatcherSynchronizationContext();
    BlockFor();
}

void BlockFor()
{
    Barrier b = new Barrier(2);
    Thread t = new Thread(() =>
    {
        lock (this)
        {
            b.SignalAndWait(); // let main UI thread run into lock
            Thread.Sleep(300); // enter the lock on the UI thread

            Action otherAction = () => Debug.Print("OtherAction");
            WPFContext.Post((o) => Debug.Print("Init"), null);
            WinFormsContext.Post((o) => Debug.Print("WinForms Init"), null);
            WPFContext.Post((o) => Debug.Print("Action"), null);
            WinFormsContext.Post((o) => Debug.Print("WinForms Action"), null);
            WPFContext.Post((o) =>
               WPFContext.Send((o2) => otherAction(), null), null);
            WPFContext.Post((o) => Debug.Print("Dispose"), null);
            WinFormsContext.Post((o) => Debug.Print("WinForms Dispose"), null);
            WPFContext.Post((o) => Debug.Print("Shutdown"), null);
            WinFormsContext.Post((o) => Debug.Print("WinForms Shutdown"), null);
        }
    });
    t.Start();

then you do not get interleaved messages as the calling code suggests but this:

Init
WinForms Init
WinForms Action
WinForms Dispose
WinForms Shutdown

Action
OtherAction
Dispose
Shutdown

This is another subtle difference between WPF and Windows Forms message pumping. WPF will execute only one posted message per dispatch, whereas WinForms will execute all pending posted messages at once until there are none left. If you have a mixed WPF/WinForms application and you want to get rid of Invokes in favor of BeginInvokes because message pumping is evil, you are just about to find the next trap door late during your system integration test. If you can, you should stick to only one message queue and prefer Invoke/Send where you can. If you are using WinForms BeginInvoke you not only do not know when it will run (perhaps too late), but if you need coordination with WPF window messages you are out of luck in this scenario. I am sure you will not design a system this way, but some day someone will add a hidden dependency between both message pumps, leaving you little chance to design your way out of this mess.

I hope that MS will address these (obvious?) deficiencies in Winforms message pumping so we do not have to worry about stuff we never wanted to know in such great detail in the first place.

Update 1

No I will not make a different post out of it. The last example has an interesting twist if you are using the WPF Dispatcher. Its behavior has changed between .NET Framework 4.0 and 4.5. Depending on the target platform you are compiling for (.NET 4.0 or 4.5+) you will get different behavior. The .NET 4.0 Dispatcher WILL pump messages just like WinForms. This means the WPF context Post calls in the code above produce the following output when compiled for .NET 4.5:

Init
Action
OtherAction
Dispose
Shutdown

But if you compile for .NET 4.0

image

you get

Init
Action
Dispose
Shutdown
OtherAction

just like in the bad case of Windows Forms. The magic happens in the DispatcherSynchronizationContext class at:

public override void Send(SendOrPostCallback d, object state)
{
    if (BaseCompatibilityPreferences.GetInlineDispatcherSynchronizationContextSend() && this._dispatcher.CheckAccess())
    {
        this._dispatcher.Invoke(DispatcherPriority.Send, d, state);
        return;
    }
    this._dispatcher.Invoke(this._priority, d, state);
}

GetInlineDispatcherSynchronizationContextSend returns true or false depending on what is set in the TargetFramework attribute of your currently executing assembly.

[assembly: TargetFramework(".NETFramework,Version=v4.5.2", FrameworkDisplayName = ".NET Framework 4.5.2")]
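
This attribute is emitted by the compiler from your project settings. Here is a minimal sketch to read it back at runtime (a hypothetical console check, not from the original post):

```csharp
using System;
using System.Linq;
using System.Reflection;
using System.Runtime.Versioning;

class ShowTargetFramework
{
    static void Main()
    {
        // Read the TargetFrameworkAttribute of the entry assembly - this is the
        // value the BaseCompatibilityPreferences quirk logic keys off.
        var attr = Assembly.GetEntryAssembly()
                           .GetCustomAttributes(typeof(TargetFrameworkAttribute), false)
                           .Cast<TargetFrameworkAttribute>()
                           .FirstOrDefault();
        Console.WriteLine(attr != null ? attr.FrameworkName
                                       : "<no TargetFramework attribute>");
        // e.g. ".NETFramework,Version=v4.5.2"
    }
}
```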

Interesting. You can find the future .NET Framework version numbers (but not their release dates) in the BinaryCompatibilityMap, which contains the promisingly named method AddQuirksForFramework.

 

if (buildAgainstVersion >= 50000)
{
    this.TargetsAtLeast_Desktop_V5_0 = true;
}
if (buildAgainstVersion >= 40504)
{
    this.TargetsAtLeast_Desktop_V4_5_4 = true;
}
if (buildAgainstVersion >= 40503)
{
    this.TargetsAtLeast_Desktop_V4_5_3 = true;
}
if (buildAgainstVersion >= 40502)
{
    this.TargetsAtLeast_Desktop_V4_5_2 = true;
}
if (buildAgainstVersion >= 40501)
{
    this.TargetsAtLeast_Desktop_V4_5_1 = true;
}
if (buildAgainstVersion >= 40500)
{
    this.TargetsAtLeast_Desktop_V4_5 = true;
    this.AddQuirksForFramework(TargetFrameworkId.Phone, 70100);
    this.AddQuirksForFramework(TargetFrameworkId.Silverlight, 50000);
    return;
}
…

So we can expect a .NET 4.5.3, 4.5.4 and .NET 5.0 in the future with some confidence now.

Windows Draws Some Unicode Characters Wrong (Windows 7-8.1)


Originally posted on: http://geekswithblogs.net/akraus1/archive/2014/12/01/160433.aspx

Recently I got an interesting drawing issue which looks like a Windows bug. Let's suppose you want to draw a string in Chinese in a Windows Forms or a WPF application. There you have two options.

  1. TextRenderer.DrawText
  2. Graphics.DrawString

The problem is that both methods lead to badly rendered text when the used font is bold. The "old" method, Graphics.DrawString, renders some Chinese glyphs wrong. If you try a new shiny WPF application you will find out that WPF uses the equivalent of Graphics.DrawString, since the text looks the same. Graphics.DrawString and TextRenderer draw the following string with an Arial font like this:

image

Did you notice that the last three characters are slightly shifted upwards and rendered in some bold font? This is normally a sign that the used font (Arial in this case) did not contain all Unicode characters of the string to draw. Before drawing empty rectangles Windows uses the next suitable font, which looks like Microsoft YaHei in this case.

To fix this you need to tweak the registry and add a Font Link key for Arial that prefers a different font, so the complete string is rendered with the same font. With the following setting you switch Arial to SimSun as the first preferred font (the Arial key did not even exist on my Windows 8.1 machine).

image
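
Spelled out, the Font Link setting above amounts to the following (a sketch; the standard SystemLink location is assumed, and the exact font file name may differ on your system):

```
Key:   HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink
Value: Arial   (REG_MULTI_SZ)
Data:  SIMSUN.TTC,SimSun
```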

After doing this modification you need to log out and log in again for the change to take effect. The only issue is that Graphics.DrawString draws the string just like before. No change here. Either it ignores Font Linking in favor of some awkward switch like if (char > dddd) "Microsoft YaHei" else "Arial", or it has a plain bug in it. The usual suspects on StackOverflow recommend switching to TextRenderer, which works much better. But it breaks if you try to render bold fonts, which look quite bad if you do not explicitly set a background color. Here are some pictures of how the text changes as the font size grows:

FontRendering

In my application I cannot set a background color, since I am drawing onto a Bitmap which is overlaid with other graphics which must remain visible behind the rendered string. The other option is to use e.g. Magenta as background and declare it as the transparent color. Unfortunately this doesn't work due to the anti-aliasing, which leaves some purple shadows on the drawn text and looks really ugly. So far it looks like Windows leaves you with two bad options if you want to make a professional looking, localizable graphical application:

  • Either the Chinese characters are drawn with two different fonts. (TextRenderer).
  • Or you end up with barely readable text  (Graphics.DrawString).

If you have solved the issue without setting an explicit background color I would love to hear about your solution. The code to draw the text in both Picture boxes is like this:

const string combined = "测试标记好不好This is a aber long";

private void cFontSize_ValueChanged(object sender, EventArgs e)
{
    Bitmap bmp = new Bitmap(750, 140);
    Font f = new Font("Arial", (float)cFontSize.Value, IsBold ? FontStyle.Bold : FontStyle.Regular);
    using (var g = Graphics.FromImage(bmp))
    {
        g.CompositingMode = System.Drawing.Drawing2D.CompositingMode.SourceOver;
        // Rendering with a background color solves the issue but
        // this would produce boxes of white or black color in the displayed image which looks even worse
        TextRenderer.DrawText(g, combined, f, new Point(0, 0), FontColor);
    }
    cPictureBox.Image = bmp;

    Bitmap bmp2 = new Bitmap(750, 140);
    using (var g = Graphics.FromImage(bmp2))
    {
        g.CompositingMode = System.Drawing.Drawing2D.CompositingMode.SourceOver;
        g.DrawString(combined, f, FontBrush, 0, 0);
    }

    cPicture2.Image = bmp2;
}

Pretty straightforward code, I would think. WPF is also not the solution, since it draws the text in the same way as Windows Forms. I wonder what market share Chinese users currently have that this issue is not fixed even in Windows 8.1.

Update 1

After adding Arial Bold as a Font Link with the same settings as above,

It looks like Graphics.DrawString is the winner with the best rendered string again.

image

WPF still does not love my Font Linking settings. Whatever. Graphics.DrawString looks good now in all languages which is just what I need for now.

image

Update 2

It turns out that a German Windows 8.1 behaves differently than an English Windows 8.1, although I have installed the Chinese language pack for it.

  1. You do not need to log out and log in again to see the changes in effect, and the Arial Bold setting was not required at all. It is sufficient to have one Font Linking setting in place on a German Windows 8.1.
  2. On an English Windows 8.1 the bold fonts are still rendered wrong. Windows 7 ditto. This is not a solution.

The only real solution I have come up with is to disable bold fonts for Chinese UI languages and live with non-bold fonts. The Old New Thing has an "old" post on this called The danger of using boldface for Chinese characters, which also points in the direction that there is no way to render small bold Chinese characters that remain readable.
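
That workaround can be centralized in one small factory method; here is a minimal sketch (a hypothetical helper, assuming the current UI culture is a good enough indicator for a Chinese UI):

```csharp
using System.Drawing;
using System.Globalization;

static class FontHelper
{
    // Workaround: never request bold when the UI language is Chinese,
    // since small bold CJK glyphs render badly (see the Old New Thing post).
    public static Font Create(string family, float size, FontStyle style)
    {
        bool isChineseUi = CultureInfo.CurrentUICulture.TwoLetterISOLanguageName == "zh";
        if (isChineseUi)
        {
            style &= ~FontStyle.Bold;
        }
        return new Font(family, size, style);
    }
}
```

Routing all Font creation through such a helper keeps the compromise in one place instead of sprinkling culture checks through the drawing code.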
