Debugging MS Word with WinDbg
The issue
One of the more exotic things we’ve had to deal with recently was debugging MS Word hanging in production. We use Word via COM Interop (it’s not supported and very ugly. We’re trying to move away from it) to convert word documents to various formats. The service responsible for printing runs a pool of Word instances and uses one of them when needed.
At some point a replica of the service exhausted all instances of MS Word in its’ pool by requests that acquired an instance from the pool, invoked into Word and never completed. All requests coming after that would wait to acquire an instance of Word indefinitely. On the front end server side, the request to the print service would eventually time out, but the print service thread would continue waiting.
An obvious bug was that we did not have timeouts for operations calling into MS Word and no mechanism to force a request to go to the next replica if a replica is reachable but not healthy. So we went to fix that and it reduced the number of errors our clients saw. But a single replica still continued to fail by exhausting its’ pool and becoming unavailable.
So we needed to figure out what was happening inside MS Word. We have a memory dump of MS Word (when dealing with something that happened in production get a memory dump first, ask questions later), so we can analyse that.
Using !uniqstack
immediately gives the reason why Word is hanging: it’s waiting for a modal dialog. Cool. MS warned that we shouldn’t be using Office on a server for this exact reason.
0:008> k
# Child-SP RetAddr Call Site
00 00000000`0676f708 00000000`776f4bc4 user32!NtUserWaitMessage+0xa
01 00000000`0676f710 00000000`776f4edd user32!DialogBox2+0x274
02 00000000`0676f7a0 00000000`77742920 user32!InternalDialogBox+0x135
03 00000000`0676f800 00000000`77741c15 user32!SoftModalMessageBox+0x9b4
04 00000000`0676f930 00000000`7774146b user32!MessageBoxWorker+0x31d
05 00000000`0676faf0 00000000`77741362 user32!MessageBoxTimeoutW+0xb3
06 00000000`0676fbc0 000007fe`e3e5ff93 user32!MessageBoxW+0x4e
07 00000000`0676fc00 00000000`775c59ed WWLIB!GetAllocCounters+0xafe8bf
08 00000000`0676fc40 00000000`777fc541 kernel32!BaseThreadInitThunk+0xd
09 00000000`0676fc70 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
The signature of MessageBoxW is:
int WINAPI MessageBox(
_In_opt_ HWND hWnd,
_In_opt_ LPCTSTR lpText,
_In_opt_ LPCTSTR lpCaption,
_In_ UINT uType)
. It seems that it could be useful to know what the message box is showing, so we need to find the values of parameters lpText and lpCaption. kp
seems to be able to show parameter values for a frame, but it does not work without private symbols. Too bad. kb
can show the first three arguments. Let’s try that.
0:008> kb
# RetAddr : Args to Child : Call Site
...
06 000007fe`e3e5ff93 : 00000000`00236c10 00000000`00000000 00000000`00000000 00000000`00000030 : user32!MessageBoxW+0x4e
07 00000000`775c59ed : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : WWLIB!GetAllocCounters+0xafe8bf
08 00000000`777fc541 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
09 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d
This doesn’t make sense because it seems to show that MessageBoxW was called with null strings. But we tried to plug some of the values we saw in args into the ‘Memory’ window of WinDbg, and saw some interesting strings near one of the addresses above.
00000000`00236d12 W o r d c o u l d n o t c r e a t e t h e
00000000`00236d46 w o r k f i l e . C h e c k t h e t e m p
00000000`00236d7a e n v i r o n m e n t v a r i a b l e . . . . .
...
00000000`00236e7e . . C : \ U s e r s \ u s e r \ A p p D a t a \ L o
00000000`00236eb2 c a l \ M i c r o s o f t \ W i n d o w s \ T e m p
00000000`00236ee6 o r a r y I n t e r n e t F i l e s \ C o n t e
00000000`00236f1a n t . W o r d \ ~ W R S { 7 9 A C F D C 1 - F E F 4
00000000`00236f4e - 4 A 9 C - B 3 1 0 - C 2 C 9 5 F 9 6 6 2 9 C } . t
00000000`00236f82 m p . . . . . . . . . . . . . . . . . . . . . .
I realise that being near an address is not scientific. But at this point we jumped to the conclusion that these strings are too specific to be just generic strings stored in resources ‘just in case’ so must have been generated at some point during the process’s lifetime and must have been the values shown in the message box.
We confirmed with procexp that a working installation of MS Word does actively use the Content.Word directory for temporary files, looked at the directory on the misbehaving machine and saw that it had incorrect permissions that prevented us from writing to it with normal Windows means. (Why did Word fail only intermittently and how did it manage to work the rest of the time?). Tried to reproduce this on a different machine by removing all permissions to the folder and got the expected result - Word hung at startup.
So we reset permissions on the folder, and the issue has not reoccured.
Fixed!
We need to go deeper
Some time later I thought it could be interesting to figure out where we took a wrong turn in our analysis, since it seemed like we stumbled onto the cause of the issue completely by accident. Time to get out our crash dump.
Turns out Windows uses different calling conventions on x86 and x64. WinApi x86 uses _stdcall, which means all function parameters are pushed onto the stack, so parameters of all functions above the current frame stay on the stack (where they can be viewed). On x64, the calling convention is a variation of _fastcall, so the first four parameters are passed using the registers rcx, rdx, r8 and r9. In non-optimised builds, the compiler generates code to store these registers on the stack in 20h bytes of ‘home space’ which must be allocated by a function’s caller. With optimisation on, it does whatever it wants to do. This article (see the section “The Nightmare Begins”) describes some debugging techniques.
Let’s try to find the values passed to MessageBoxW. First, dump its’ code.
0:008> uf user32!MessageBoxW user32!MessageBoxW: 00000000`77741314 4883ec38 sub rsp,38h 00000000`77741318 4533db xor r11d,r11d 00000000`7774131b 44391d1a0e0200 cmp dword ptr [user32!gfEMIEnable (00000000`7776213c)],r11d 00000000`77741322 742e je user32!MessageBoxW+0x3e (00000000`77741352) Branch user32!MessageBoxW+0x10: 00000000`77741324 65488b042530000000 mov rax,qword ptr gs:[30h] 00000000`7774132d 4c8b5048 mov r10,qword ptr [rax+48h] 00000000`77741331 33c0 xor eax,eax 00000000`77741333 f04c0fb115ec240200 lock cmpxchg qword ptr [user32!gdwEMIThreadID (00000000`77763828)],r10 00000000`7774133c 4c8b15dd240200 mov r10,qword ptr [user32!gpReturnAddr (00000000`77763820)] 00000000`77741343 418d4301 lea eax,[r11+1] 00000000`77741347 4c0f44d0 cmove r10,rax 00000000`7774134b 4c8915ce240200 mov qword ptr [user32!gpReturnAddr (00000000`77763820)],r10 user32!MessageBoxW+0x3e: 00000000`77741352 834c2428ff or dword ptr [rsp+28h],0FFFFFFFFh 00000000`77741357 6644895c2420 mov word ptr [rsp+20h],r11w 00000000`7774135d e856000000 call user32!MessageBoxTimeoutW (00000000`777413b8) 00000000`77741362 4883c438 add rsp,38h 00000000`77741366 c3 ret
Nothing interesting happening here to rdx or r8. Just proxying the call to user32!MessageBoxTimeoutW
with two additional arguments (they are in [rsp+20h] and [rsp+28h] because 20h bytes are reserved for home space). Moving along.
0:008> uf user32!MessageBoxTimeoutW user32!MessageBoxTimeoutW: 00000000`777413b8 48895c2408 mov qword ptr [rsp+8],rbx 00000000`777413bd 48896c2410 mov qword ptr [rsp+10h],rbp 00000000`777413c2 4889742418 mov qword ptr [rsp+18h],rsi 00000000`777413c7 57 push rdi 00000000`777413c8 4881ecc0000000 sub rsp,0C0h 00000000`777413cf 488bd9 mov rbx,rcx 00000000`777413d2 498bf0 mov rsi,r8 00000000`777413d5 488bfa mov rdi,rdx 00000000`777413d8 488d4c2420 lea rcx,[rsp+20h] 00000000`777413dd 33d2 xor edx,edx 00000000`777413df 41b898000000 mov r8d,98h 00000000`777413e5 418be9 mov ebp,r9d 00000000`777413e8 e8ab83faff call user32!memset (00000000`776e9798) 00000000`777413ed 0fb78424f0000000 movzx eax,word ptr [rsp+0F0h] 00000000`777413f5 488364243000 and qword ptr [rsp+30h],0 00000000`777413fb 833d3a0d020000 cmp dword ptr [user32!gfEMIEnable (00000000`7776213c)],0 00000000`77741402 6689442478 mov word ptr [rsp+78h],ax 00000000`77741407 8b8424f8000000 mov eax,dword ptr [rsp+0F8h] 00000000`7774140e c744242050000000 mov dword ptr [rsp+20h],50h 00000000`77741416 48895c2428 mov qword ptr [rsp+28h],rbx 00000000`7774141b 48897c2438 mov qword ptr [rsp+38h],rdi 00000000`77741420 4889742440 mov qword ptr [rsp+40h],rsi 00000000`77741425 896c2448 mov dword ptr [rsp+48h],ebp 00000000`77741429 8984249c000000 mov dword ptr [rsp+9Ch],eax ... user32!MessageBoxTimeoutW+0xa9: 00000000`77741461 488d4c2420 lea rcx,[rsp+20h] 00000000`77741466 e88d040000 call user32!MessageBoxWorker (00000000`777418f8) 00000000`7774146b 4c8d9c24c0000000 lea r11,[rsp+0C0h] ... 00000000`77741483 c3 ret
Okay, so here rdx and r8 are copied to registers and then passed on to user32!MessageBoxWorker
. Fortunately, this time they are passed as rsp+38h and rsp+40h. The stacktrace from windbg gives us Child-SP, the value of SP before a frame (MessageBoxTimeoutW) transfers control to its’ child (MessageBoxWorker). We can inspect the value of [rsp+38h] and [rsp+40h].
0:008> dq 00000000`0676faf0 00000000`0676faf0 01d1e926`00000000 00000000`00000000 00000000`0676fb00 00000000`00000030 000007fe`e46542f0 00000000`0676fb10 00000000`00000050 00000000`00000000 00000000`0676fb20 00000000`00000000 00000000`00236d12 00000000`0676fb30 000007fe`e46542f0 00000000`00000030 0:008> du (poi(00000000`0676faf0+38h)) 000007fe`e46542f0 "Microsoft Word 0:008> du (poi(00000000`0676faf0+40h)) 00000000`00236d12 "Word could not create the work file. Check the temp environment variable."
We can confirm that these actually are the strings shown in the message box by messing up permissions to Word’s temp folder and trying to create a new document via the GUI. The path to the temp folder does not appear anywhere in the GUI, but it is Googleable given the messages from the message box, so if we did not stumble onto it accidentally, we probably would have still been able to fix the issue.
Conclusion
We need to work on making our service more resilient to failures of individual machines or components
I learned something new: Using WinDbg and reading ASM code is actually not as scary as I thought. WinDbg is a useful tool. Debugging optimised x64 code without symbols is… possible, but requires luck.
Thanks to gusev_p for telling of his experience dealing with different calling conventions and shparun for suggesting WinDbg as a tool.