Life as a Windows developer has its ups and downs. On the positive side, you’re associated with the most popular computing platform in history, which translates into lots of potential clients. But this also means you’re subject to the design whims of a notoriously proprietary software company. And as often as not, these changes come back to bite you in the most unusual places.
Take my most recent case: Our largest commercial client, a financial services firm, asked us to modify our DMS Clarity Tracker agent to collect the GDI Object count for each running process. FYI, Tracker already collects a variety of process metrics, including critical CPU utilization and memory counters. However, since this client had some bad experiences with GDI Object handle leaks in the past, they were eager to see this metric added to our collection pool. And since this firm is our best commercial customer, with thousands of seats licensed across one of their largest divisions, we were eager to assist them.
And thus began my odyssey into the wonderful world of forgotten Win32 APIs. It started when I researched how to collect the GDI Object counter value. Since it’s not part of the regular process object in Performance Monitor, I was forced to step outside our normal methodology (i.e. using PDH.DLL to create a handle to the desired perfmon counter) and look at alternatives.
The first (and really, the only practical) suggestion I encountered was to use the Win32 API’s GetGuiResources function to read the value directly from the target process. However, since our current agent architecture samples the surrounding environment once every second (and then averages the collected values every 15 seconds), I was understandably concerned about overhead. The idea of executing 50 to 90 or more OpenProcess and GetGuiResources calls (depending on the task list) in quick succession, every second, gave me pause. After all, these calls aren’t necessarily optimized for low overhead the way the aforementioned PDH calls are, and I thought I might have to back off on the granularity and simply collect each value once every 15 seconds as an instantaneous reading.
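For readers who haven’t used these calls, here is a minimal sketch of the OpenProcess plus GetGuiResources pairing, written in Python with ctypes rather than in the agent’s own code. The function name gdi_object_count is my own invention for illustration, and the sketch simply returns None on non-Windows hosts or when the process can’t be opened.

```python
import ctypes
import os
import sys
from ctypes import wintypes

PROCESS_QUERY_INFORMATION = 0x0400  # access right GetGuiResources requires
GR_GDIOBJECTS = 0                   # uiFlags: count of GDI objects
GR_USEROBJECTS = 1                  # uiFlags: count of USER objects


def gdi_object_count(pid=None):
    """Return the GDI object count for pid (default: this process), or None."""
    if sys.platform != "win32":
        return None  # the Win32 API only exists on Windows
    if pid is None:
        pid = os.getpid()

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    user32 = ctypes.WinDLL("user32", use_last_error=True)
    kernel32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
    kernel32.OpenProcess.restype = wintypes.HANDLE
    kernel32.CloseHandle.argtypes = [wintypes.HANDLE]
    kernel32.CloseHandle.restype = wintypes.BOOL
    user32.GetGuiResources.argtypes = [wintypes.HANDLE, wintypes.DWORD]
    user32.GetGuiResources.restype = wintypes.DWORD

    handle = kernel32.OpenProcess(PROCESS_QUERY_INFORMATION, False, pid)
    if not handle:
        return None  # could not open the process (e.g. access denied)
    try:
        # Returns 0 on failure, so a 0 here is ambiguous without GetLastError.
        return user32.GetGuiResources(handle, GR_GDIOBJECTS)
    finally:
        kernel32.CloseHandle(handle)
```

Calling gdi_object_count() with no argument samples the calling process. A real collector would presumably keep the DLL bindings around between samples instead of rebuilding them on every call.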
Fortunately, the APIs proved to be quite lightweight, and I was able to quickly construct a routine that paralleled our normal PDH collection method, but that called the OpenProcess and GetGuiResources functions instead of the various PdhQuery functions we used for the other, PDH-based counters. The net result was an elegant solution that grabbed the data we needed and integrated it seamlessly with our existing collection model.
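The collection model itself (one sample per second, averaged every 15 seconds) is easy to picture in code. What follows is only a sketch of the idea, not Tracker’s actual implementation; the class name IntervalAverager and its window parameter are hypothetical.

```python
class IntervalAverager:
    """Collect one sample per tick and emit the mean every `window` ticks."""

    def __init__(self, window=15):
        self.window = window
        self.samples = []

    def add(self, value):
        """Record one sample; return the average once the window fills, else None."""
        self.samples.append(value)
        if len(self.samples) < self.window:
            return None
        average = sum(self.samples) / len(self.samples)
        self.samples = []  # start a fresh window
        return average
```

Feeding it one GDI Object count per second would yield one averaged value every 15 seconds, in the same way the PDH-based counters are described above.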
And more importantly, it worked – at least at first. Running from within the Visual Studio IDE, our new GDI Object collection logic functioned flawlessly. However, when we took the agent out of the IDE and compiled it to run in its native environment – as a service executing under the LocalSystem account – the GDI Object logic broke down. Instead of getting the desired count values, the GetGuiResources function returned zeros for nearly every process.
I say nearly every process because, for a handful of tasks – most notably, those running as services and which also consumed GDI Objects (i.e. not very many – GDI is mostly for interactive apps) – the function returned what seemed to be valid data. Worse still, the collection code worked perfectly under Windows XP, both interactively and as a service. It only broke down when we deployed the agent under Vista or Windows 7, and then only if we ran the agent as a service under the LocalSystem account.
I didn’t know it at the time, but I was about to start down a slippery slope into Win32 API debugging hell. My first theory was that it was a permissions issue: my OpenProcess calls must have been failing due to Vista/7’s tighter IPC security. However, a check of the LastError value showed no faults. And when I subsequently tested to see if I could read other, non-GDI metrics from the process – for example, using GetProcessMemoryInfo to read its Working Set counter – my calls succeeded each time, using the same handle that was failing with GetGuiResources.
I could even terminate the target process – running as LocalSystem gave me free rein over the system. However, no matter what I tried, I could not get GetGuiResources to return valid data. Another check of LastError, this time for the GetGuiResources call itself, left me even more confused. It reported a result code of ERROR_INVALID_PARAMETER, which made no sense since the only two parameters that the function accepts are the (now confirmed valid) process handle and the requested resource type (GDI or User object count). It was a real hair-pulling moment.
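Because GetGuiResources returns zero both on failure and for a process that genuinely holds no objects, the LastError check is what separates the two cases. Here is a rough sketch of that check in Python with ctypes. The helper names describe_result and probe are made up for this example, and probe assumes you already hold a process handle opened with PROCESS_QUERY_INFORMATION.

```python
import ctypes
import sys

ERROR_INVALID_PARAMETER = 87  # Win32 error code from winerror.h


def describe_result(count, last_error):
    """Turn a GetGuiResources return value and last-error code into a message."""
    if count != 0:
        return "ok: %d objects" % count
    if last_error == 0:
        return "zero objects, no error reported"
    if last_error == ERROR_INVALID_PARAMETER:
        return "failed: ERROR_INVALID_PARAMETER (87)"
    return "failed: Win32 error %d" % last_error


def probe(handle, flags=0):
    """Call GetGuiResources on an open process handle and describe the outcome."""
    if sys.platform != "win32":
        raise OSError("GetGuiResources is only available on Windows")
    user32 = ctypes.WinDLL("user32", use_last_error=True)
    user32.GetGuiResources.argtypes = [ctypes.c_void_p, ctypes.c_uint32]
    user32.GetGuiResources.restype = ctypes.c_uint32
    ctypes.set_last_error(0)  # clear any stale error before the call
    count = user32.GetGuiResources(handle, flags)
    return describe_result(count, ctypes.get_last_error())
```

In the failing case described above, a helper like this would report the ERROR_INVALID_PARAMETER branch even though the handle itself was usable for other queries.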
Eventually, I tried enough variations of the above methodology that a pattern began to emerge from the madness. For example, if I ran the code interactively on the desktop, it would dutifully record the GDI Object counts for all of the interactive tasks (e.g. explorer.exe and whatever else was running on the task bar or in the system tray). And when I ran the code as a service – either under the LocalSystem account or using an Administrator-level user account – it would record GDI Object count values only for tasks that were running as non-interactive services.
It was then that the light bulb finally came on. I remembered reading how Vista (and Windows 7) tighten security by moving all interactive user tasks into a second console session (Session 1) and away from the primary console session (Session 0), which was now dedicated solely to non-interactive services. The idea was to eliminate the kind of backdoor vector that led to the infamous “shatter attack” exploit under Windows XP. By isolating service processes in a separate console session, and prohibiting them from interacting with the user’s desktop (which was now running in a different console session), Microsoft could suppress such attacks and reduce Windows’ exposed surface area.
Of course, such a radical change introduced some notable compatibility issues. For starters, services that relied on the “Allow service to interact with desktop” option were immediately cut off from the users they were trying to interact with. And, apparently, the move to a dedicated services Session 0 also had the effect of breaking the GetGuiResources API call when executing across session boundaries. So while my agent service running in Session 0 could attach to the processes of, and read data from, tasks running in Session 1 (or any other user session), any attempt to read the GDI Object counter data off of these processes failed – ostensibly because the User and GDI resources these tasks rely on exist solely inside of the separate, isolated user session.
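One way to test the session-boundary idea is to compare the session ID of the collecting process with that of each target, using the kernel32 function ProcessIdToSessionId. The sketch below is in Python with ctypes. The helper names session_id_of and crosses_session_boundary are hypothetical, and the code only detects a cross-session target; it does not work around the failure.

```python
import ctypes
import os
import sys


def session_id_of(pid=None):
    """Return the session ID for pid (default: this process), or None."""
    if sys.platform != "win32":
        return None  # sessions in this sense are a Windows concept
    if pid is None:
        pid = os.getpid()
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.ProcessIdToSessionId.argtypes = [
        ctypes.c_uint32,
        ctypes.POINTER(ctypes.c_uint32),
    ]
    kernel32.ProcessIdToSessionId.restype = ctypes.c_int
    session = ctypes.c_uint32()
    if not kernel32.ProcessIdToSessionId(pid, ctypes.byref(session)):
        return None  # lookup failed (e.g. the process has exited)
    return session.value


def crosses_session_boundary(own_session, target_session):
    """True only when both session IDs are known and they differ."""
    if own_session is None or target_session is None:
        return False
    return own_session != target_session
```

If the theory holds, a service in Session 0 should see zeros from GetGuiResources for exactly those targets where crosses_session_boundary returns True.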
At least that’s my theory so far. The truth is that I’m not sure what the real problem is. While the above analysis seems to fit the facts, there is a dearth of information on the subject. Google searches for “GetGuiResources error” return lots of references to permissions issues and other false leads, but nothing about the call failing across session boundaries.
Fortunately for me, my financial services client is still running Windows XP. They have no plans to move to Windows 7 for at least another year, in large part due to the myriad undocumented incompatibilities they will have to mitigate – like the one I’ve outlined above.
Perhaps I’ll find a workaround some day and finally get Tracker’s GDI Object count logic working under Vista and Windows 7 (I’m open to suggestions – leave your ideas in the Comments section of this post). But regardless, this whole affair was a learning process for me. I gained some valuable new skills, and mastered a few unfamiliar techniques – all part of my quest to quash one very mysterious bug.
So count this as one instance in which a developer took one of those classic Microsoft lemons (i.e. the company breaking Windows in a way that’s both unobvious and difficult for 3rd parties to trace) and turned it into lemonade.