Originally posted on: http://blog.sumon.net/akraus1/archive/2016/02/27/172993.aspx
I had an interesting case where a new WPF control was added to a legacy WinForms application. The WPF control worked perfectly in a test application but for some strange reason it was very slow in final WinForms application where it was hosted with the usual System.Windows.Forms.Integration.ElementHost. The UI did hang and one core was always maxed out. Eventually it built up after some minutes but even simple button presses did cause 100% CPU on one core for 20s. If you have high CPU consumption the vegetative reaction of a developer is to attach a debugger and break into the methods to see where the issue is. If you use a real debugger like Windbg you can use the !runaway command to find the threads with highest CPU usage
0:006> !runaway
User Mode Time
Thread Time
0:368 0 days 0:00:11.625
4:13a0 0 days 0:00:06.218
6:301c 0 days 0:00:00.000
5:21c8 0 days 0:00:00.000
3:3320 0 days 0:00:00.000
2:20e4 0 days 0:00:00.000
1:39a0 0 days 0:00:00.000
but when I tried to break into the I was always just waiting for window messages:
# ChildEBP RetAddr Args to Child
00 0018f0d0 5cb1e13a dcd9ca4a 72f7a424 0018f370 USER32!NtUserWaitMessage+0xc
01 0018f154 5cb1db39 00000000 ffffffff 00000000 System_Windows_Forms_ni+0x1ae13a
02 0018f1a8 5cb1d9b0 024d09e4 1bda0002 00000000 System_Windows_Forms_ni+0x1adb39
03 0018f1d4 5cb06129 024d09e4 0018f29c 00000000 System_Windows_Forms_ni+0x1ad9b0
04 0018f1ec 00bc048b 02444410 0018f204 72f71396 System_Windows_Forms_ni+0x196129
Eventually I would find some non waiting stacks but it was not clear if these were the most expensive ones and why. The problem here is that most people are not aware that the actual drawing happens not in user mode but in an extended kernel space thread. Every time you wait in NtUserWaitMessage the thread on the kernel side can continue its execution but you cannot see what's happening as long as you are only looking at the user space side.
If debugging fails you can still use a profiler. It is about time to tell you some well hidden secret of the newest Windows Performance Toolkit. If you record profiling data with WPR/UI and enable the profile Desktop composition activity new views under Video will become visible when you open the trace file with WPA. Most views seem to be for kernel developers but one view named Dwm Frame Details Rectangle By Type is different. It shows all rectangles drawn by Dwm (the Desktop Window Manager). WPA shows not only the flat list of updated rectangles and its coordinates but it draws it in the graph for the selected time region. You can use this view as poor mans screenshot tool to visually correlate the displayed message boxes and other windows with the performed user actions. This way you can visually navigate through your ETL and see what windows were drawn at specific points in your trace!
That is a powerful capability of WPA which I was totally unaware until I needed to analyze this WPF performance problem. If you are more an xperf fan you need to add to your user mode providers list
- Microsoft-Windows-Dwm-Core:0x1ffff:0x6
and you are ready to record pretty much any screen rectangle update. This works only on Windows 8 machines or later. Windows 7 knows the DWM-Core provider but it does not emit the necessary events to draw the dwm rectangles in WPA. The rectangle drawing feature of WPA was added with the Win10 SDK Release of December 2016. Ok so we see more. Now back to our perf problem. I could see that only two threads are involved consuming large amounts of CPU in the UI thread and on the WPF render thread for a seemingly simple screen update. A little clicking around in the UI would cause excessive CPU usage. Most CPU is used in the WPF rendering thread
ntdll.dll!_RtlUserThreadStart
ntdll.dll!__RtlUserThreadStart
kernel32.dll!BaseThreadInitThunk
wpfgfx_v0400.dll!CPartitionThread::ThreadMain
wpfgfx_v0400.dll!CPartitionThread::Run
wpfgfx_v0400.dll!CPartitionThread::PresentPartition
wpfgfx_v0400.dll!CComposition::Present
wpfgfx_v0400.dll!CSlaveHWndRenderTarget::Present
wpfgfx_v0400.dll!CDesktopHWNDRenderTarget::Present
wpfgfx_v0400.dll!CDesktopRenderTarget::Present
wpfgfx_v0400.dll!CSwRenderTargetHWND::Present
wpfgfx_v0400.dll!CSwPresenter32bppGDI::Present
wpfgfx_v0400.dll!CMILDeviceContext::BeginRendering
user32.dll!NtUserGetDC
ntdll.dll!LdrInitializeThunk
ntdll.dll!_LdrpInitialize
wow64.dll!Wow64LdrpInitialize
wow64.dll!RunCpuSimulation
wow64cpu.dll!Thunk0Arg
wow64cpu.dll!CpupSyscallStub
ntoskrnl.exe!KiSystemServiceCopyEnd
win32kbase.sys!NtUserGetDC
ntoskrnl.exe!ExEnterPriorityRegionAndAcquireResourceShared
win32kbase.sys!_GetDCEx
wpfgfx_v0400.dll!CMILDeviceContext::EndRendering
user32.dll!ReleaseDC
user32.dll!NtUserCallOneParam
ntdll.dll!LdrInitializeThunk
ntdll.dll!_LdrpInitialize
wow64.dll!Wow64LdrpInitialize
wow64.dll!RunCpuSimulation
wow64cpu.dll!ServiceNoTurbo
wow64.dll!Wow64SystemServiceEx
wow64win.dll!whNtUserCallOneParam
wow64win.dll!ZwUserCallOneParam
ntoskrnl.exe!KiSystemServiceCopyEnd
win32kfull.sys!NtUserCallOneParam
ntoskrnl.exe!ExReleaseResourceAndLeavePriorityRegion
ntoskrnl.exe!KiCheckForKernelApcDelivery
ntoskrnl.exe!KiDeliverApc
win32kfull.sys!NormalAPCInvalidateCOMPOSITEDWnd
win32kbase.sys!EnterCrit …
If that does not make much sense to you, you are in good company. The WPF rendering thread is rendering a composite window (see CComposition::Present) which seems to use a feature of Windows which also knows about composite Windows. After looking with Spy on the actual window creation parameters of the hosting WinForms application
it turned out that the Windows Forms window had the WS_EX_COMPOSITED flag set. I write this here as if this is flat obvious. It is certainly not. Solving such problems always involves asking more people about their opinion what could be the issue. The final hint that the WinForms application had this extended style set was discovered by a colleague of me. Nobody can know everything but as a team you can tackle pretty much any issue.
A little googling reveals that many people before me had also problems with composite windows. This flag does basically inverse the z-rendering order. The visual effect is that the bottom window is rendered first. That allows you to create translucent windows where the windows below your window shine through as background. WPF uses such things for certain visual effects.
That is enough information to create a minimal reproducer of the issue. All I needed was a default Windows Forms application which hosts a WPF user control. The complete zipped sample can be found on Onedrive.
publicpartialclass Form1 : Form {protectedoverride CreateParams CreateParams { get { CreateParams cp = base.CreateParams; cp.ExStyle |= 0x02000000; // Turn on WS_EX_COMPOSITEDreturn cp; } }public Form1() { InitializeComponent(); cWPFHost.Child = new UserControl2(); } }
The WPF user control is also very simple
publicpartialclass UserControl2 : UserControl {public UserControl2() { InitializeComponent();this.Loaded += UserControl2_Loaded; }void UserControl2_Loaded(object sender, RoutedEventArgs e) { HwndSource hwnd = System.Windows.PresentationSource.FromVisual(this) as HwndSource; HwndTarget target = hwnd.CompositionTarget; target.RenderMode = RenderMode.SoftwareOnly; } }
To get the high CPU issue three things need to come together
- Hosting window must have set the WS_EX_COMPOSITED window style.
- WPF child window must use composition.
- WPF child window must use software rendering.
When these three conditions are met then you have a massive WPF redraw problem. It seems that two composite windows cause some loops while rendering inside the OS deep in the kernel threads where the actual rendering takes place. If you let WPF use HW acceleration it seems to be ok but I have not measured how much GPU power is then wasted. Below is a screenshot of the sample Winforms application:
After was found the solution was to remove the WS_EX_COMPOSITED window style from the WinForms hosting window which did not need it anyway.
Media Experience Analyzer
The problem was solved but it is interesting to see the thread interactions happening while the high CPU issue is happening. For that you can use a new tool of MS named Media Experience Analyzer (XA) which was released in Dec 2015 but the latest version is from Feb 2016. If you have thought that WPA is complex then you have not yet seen how else you can visualize the rich ETW data. This tool is very good at visualizing thread interactions in a per core view like you can see below. When you hover over the threads the corresponding context switch and ready thread stacks are updated on the fly. If you zoom out it looks like a star field in Star Trek just with more colors.
If you want to get most out of XA you can watch the videos a Channel 9 which give you a pretty good understand how Media Experience Analyzer (XA) can be used
- https://channel9.msdn.com/Shows/Defrag-Tools/DefragTools-149-Media-eXperience-Analyzer-part-1
- https://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-150-Media-eXperience-Analyzer-part-2-Video-Playback-Power-Saving
- https://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-151-Media-eXperience-Analyzer-part-3-Audio-Glitch-Analysis
- https://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-152-Media-eXperience-Analyzer-part-4-Video-Glitch-Analysis
- https://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-153-Media-eXperience-Analyzer-part-5-Audio-Glitch-Analysis-II
When should you use WPA and when Media Experience Analyzer?
So far the main goal of XA seems to be to find hangs and glitches in audio and video playback. That requires a thorough understanding of how the whole rendering pipeline in Windows works which is huge field on its own. But it can also be used to get a different view on the data which is not so easy to obtain in WPA. If threads are ping ponging each other this tool makes it flat obvious. XA is already powerful but I am not following entirely its UI philosophy where you must visually see the issue in the rendered data. Most often tabular data like in WPA is more powerful because you can sort by columns and filter away specific call stacks which seems not to be possible with XA. What I miss most in XA is a simple process summary timeline like in the first screenshot. XA renders some nice line graphs but that is not very helpful to get a fast overview of the total CPU consumption. If you look at the complete trace with the scheduler events and the per process CPU consumption in
XA
WPA
I am having a much easier time in WPA to identify my process with the table and color encoding. In XA you always need to hover over the data to see it actual value. A killer feature in XA would be a thread interaction view for a specific process. Ideally I would like to see all threads as bars and the bar length is either the CPU or wait time. Currently I can only see one thread color encoded on which core it is running. This is certainly the best view for device driver devs but normally I am not interested in a per core view but a per thread timeline view. Each thread should have a specific y-value and the horizontal bar length should show either its running or waiting time (or both) and with a line the readying thread as it is already done today.
That would be the perfect thread interaction view and I hope that will be added to XA. The current version is still a 1.0 so expect some crashes and bugs but it has a lot of potential. The issues I encountered so far are
- If you press Turn Symbol Off while is still loading it crashes.
- The ETL file loading time is very high because it seem to include some private MS symbol servers where the UI hangs for several minutes (zero CPU but a bit network IO).
- UI Redraws for bigger (>200MB) ETL files are very slow. Most time seems to be spent in the GPU driver.
- Spelling error in Scheduler view: Drivers, Processes, Threads per Core with Reaady threads.
XA certainly has many more features I have not yet found. The main problem with these tools is that the written documentation only touches the surface. Most things I have learned by playing around with the tools. If you want share your experiences with WPA or XA please sound off in the comments. Now stop reading and start playing with the next cool tool!