Monday, March 8, 2010

(Editorial) When Microsoft Breaks Windows

Life as a Windows developer has its ups and downs. On the positive side, you’re associated with the most popular computing platform in history, which translates into lots of potential clients. But this also means you’re subject to the design whims of a notoriously proprietary software company. And as often as not, these changes come back to bite you in the most unusual places.

Take my most recent case: Our largest commercial client, a financial services firm, asked us to modify our DMS Clarity Tracker agent to collect the GDI Object count for each running process. FYI, Tracker already collects a variety of process metrics, including critical CPU utilization and memory counters. However, since this client had some bad experiences with GDI Object handle leaks in the past, they were eager to see this metric added to our collection pool. And since this firm is our best commercial customer, with thousands of seats licensed across one of their largest divisions, we were eager to assist them.

And thus began my odyssey into the wonderful world of forgotten Win32 APIs. It started when I researched how to collect the GDI Object counter value. Since it’s not part of the regular process object in Performance Monitor, I was forced to step outside our normal methodology (i.e. using PDH.DLL to create a handle to the desired perfmon counter) and look at alternatives.

The first (and really, the only practical) suggestion I encountered was to use the Win32 API’s GetGuiResources function to read the value directly from the target process. However, since our current agent architecture requires sampling the surrounding environment once every second (and then averaging the collected values every 15 seconds), I was understandably concerned about overhead. The idea of executing multiple (50-90 or more, depending on the task list) OpenProcess and GetGuiResources calls in quick succession, every second, gave me pause. After all, these calls aren’t necessarily optimized for low overhead the way the aforementioned PDH calls are, and I thought I might have to back off on the granularity and simply collect the values once every 15 seconds as an instant value.

Fortunately, the APIs proved to be quite lightweight, and I was able to quickly construct a routine that paralleled our normal PDH collection method, but calling the OpenProcess and GetGuiResources functions instead of the various PdhQuery functions we did for the other, PDH-based counters. The net result was an elegant solution that grabbed the data we needed and integrated it seamlessly with our existing collection model.
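For readers who want to see the shape of such a routine, here is a minimal sketch in Python using ctypes. It is not Tracker's actual code: the function names read_gdi_count and average_gdi_counts are made up for illustration, and the 15-sample window simply mirrors the sample-every-second, average-every-15-seconds model described above. The Win32 calls themselves (OpenProcess, GetGuiResources, CloseHandle) and the constants are the documented ones.

```python
import ctypes
import sys
import time
from ctypes import wintypes

PROCESS_QUERY_INFORMATION = 0x0400  # access right GetGuiResources requires
GR_GDIOBJECTS = 0                   # uiFlags value: count of GDI objects


def read_gdi_count(pid):
    """Hypothetical helper: GDI object count for pid via OpenProcess + GetGuiResources."""
    if sys.platform != "win32":
        raise OSError("GetGuiResources is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    user32 = ctypes.WinDLL("user32", use_last_error=True)
    kernel32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
    kernel32.OpenProcess.restype = wintypes.HANDLE
    kernel32.CloseHandle.argtypes = [wintypes.HANDLE]
    user32.GetGuiResources.argtypes = [wintypes.HANDLE, wintypes.DWORD]
    user32.GetGuiResources.restype = wintypes.DWORD

    handle = kernel32.OpenProcess(PROCESS_QUERY_INFORMATION, False, pid)
    if not handle:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        # Returns 0 on failure as well as for a process with no GDI objects.
        return user32.GetGuiResources(handle, GR_GDIOBJECTS)
    finally:
        kernel32.CloseHandle(handle)


def average_gdi_counts(pids, reader=read_gdi_count, samples=15,
                       interval=1.0, sleep=time.sleep):
    """Sample each pid once per interval and return {pid: mean count}."""
    totals = {pid: 0 for pid in pids}
    for i in range(samples):
        for pid in pids:
            totals[pid] += reader(pid)
        if i < samples - 1:
            sleep(interval)  # no need to wait after the final sample
    return {pid: totals[pid] / samples for pid in pids}
```

The reader and sleep arguments are injectable only so the averaging can be exercised without a Windows box; on a real system you would call average_gdi_counts with a list of process IDs and leave the defaults alone.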

And more importantly, it worked – at least at first. Running from within the Visual Studio IDE, our new GDI Object collection logic functioned flawlessly. However, when we took the agent out of the IDE and compiled it to run in its native environment – as a service executing under the LocalSystem account – the GDI Object logic broke down. Instead of getting the desired count values, the GetGuiResources function returned zeros for nearly every process.

I say nearly every process because, for a handful of tasks – most notably, those running as services and which also consumed GDI Objects (i.e. not very many – GDI is mostly for interactive apps) – the function returned what seemed to be valid data. Worse still, the collection code worked perfectly under Windows XP, both interactively and as a service. It only broke down when we deployed the agent under Vista or Windows 7, and then only if we ran the agent as a service under the LocalSystem account.

I didn’t know it at the time, but I was about to start down a slippery slope into Win32 API debugging hell. My first theory was that it was a permissions issue. My OpenProcess calls must have been failing due to Vista/7’s tighter IPC security. However, a check of the LastError value showed no faults. And when I subsequently tested to see if I could read other, non-GDI metrics from the process – for example, using GetProcessMemoryInfo to read its Working Set counter – my calls succeeded each time, using the same handle that was failing with GetGuiResources.

I could even terminate the target process – running as LocalSystem gave me free rein over the system. However, no matter what I tried, I could not get GetGuiResources to return valid data. Another check of LastError, this time for the GetGuiResources call itself, left me even more confused. It reported a result code of ERROR_INVALID_PARAMETER, which made no sense since the only two parameters that the function accepts are the (now confirmed valid) process handle and the requested resource type (GDI or User object count). It was a real hair-pulling moment.
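Part of what makes this confusing is that GetGuiResources returns zero both when it fails and when a process simply holds no objects of the requested type, so the return value alone can't tell the two apart. The sketch below shows one way to sort the outcomes using the count plus the LastError value. The function name and the labels it returns are invented for illustration; the numeric error codes are the standard Win32 values. It assumes the caller cleared the last-error value before making the call, since I would not rely on a successful call resetting it.

```python
ERROR_SUCCESS = 0
ERROR_ACCESS_DENIED = 5
ERROR_INVALID_PARAMETER = 87


def classify_gui_result(count, last_error):
    """Hypothetical helper: label one GetGuiResources outcome.

    count      -- the DWORD returned by GetGuiResources
    last_error -- GetLastError() read right after the call, with the
                  last-error value cleared to 0 just before it (assumption)
    """
    if count > 0:
        return "ok"
    if last_error == ERROR_SUCCESS:
        # Zero with no error: the process holds no objects of that type.
        return "zero"
    if last_error == ERROR_ACCESS_DENIED:
        # The permissions theory: the handle lacks the required access.
        return "access-denied"
    if last_error == ERROR_INVALID_PARAMETER:
        # The puzzling case described above.
        return "invalid-parameter"
    return "error-%d" % last_error
```

From ctypes on Windows, loading user32 with use_last_error=True and bracketing the call with ctypes.set_last_error(0) and ctypes.get_last_error() supplies the two inputs.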

Eventually, I tried enough variations of the above methodology that a pattern began to emerge from the madness. For example, if I ran the code interactively on the desktop, it would dutifully record the GDI Object counts for all of the interactive tasks (e.g. explorer.exe and whatever else was running on the task bar or in the system tray). And when I ran the code as a service – either under the LocalSystem account or using an Administrator-level user account – it would record GDI Object count values only for tasks that were running as non-interactive services.

It was then that the light bulb finally came on. I remembered reading how Vista (and Windows 7) tighten security by moving all interactive user tasks into a second console session (Session 1) and away from the primary console session (Session 0), which was now dedicated solely to non-interactive services. The idea was to eliminate the kind of backdoor vector that led to the infamous “shatter attack” exploit under Windows XP. By isolating service processes to a separate console session, and prohibiting them from interacting with the user’s desktop (which was now running in a different console session), they could suppress such attacks and reduce Windows’ exposed surface area.

Of course, such a radical change introduced some notable compatibility issues. For starters, services that relied on the “Allow service to interact with desktop” option were immediately cut off from the users they were trying to interact with. And, apparently, the move to a dedicated services Session 0 also had the effect of breaking the GetGuiResources API call when executing across session boundaries. So while my agent service running in Session 0 could attach to the processes of, and read data from, tasks running in Session 1 (or any other user session), any attempt to read the GDI Object counter data off of these processes failed – ostensibly because the User and GDI resources these tasks rely on exist solely inside of the separate, isolated user session.
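If that theory holds, an agent could at least tell in advance which processes it can expect valid counts from by comparing session IDs. Here is a rough sketch of that check, again in Python with ctypes. ProcessIdToSessionId is a real kernel32 function; session_of and split_by_session are names I made up, and treating "different session" as "GetGuiResources will fail" is an assumption based only on the behavior described in this post.

```python
import ctypes
import sys
from ctypes import wintypes


def session_of(pid):
    """Hypothetical helper: terminal-services session ID that pid runs in."""
    if sys.platform != "win32":
        raise OSError("ProcessIdToSessionId is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.ProcessIdToSessionId.argtypes = [
        wintypes.DWORD, ctypes.POINTER(wintypes.DWORD)]
    kernel32.ProcessIdToSessionId.restype = wintypes.BOOL
    session = wintypes.DWORD()
    if not kernel32.ProcessIdToSessionId(pid, ctypes.byref(session)):
        raise ctypes.WinError(ctypes.get_last_error())
    return session.value


def split_by_session(pids, own_session, lookup=session_of):
    """Partition pids into (same session as the caller, other sessions)."""
    same, other = [], []
    for pid in pids:
        # Assumption: only same-session processes yield valid GDI counts.
        (same if lookup(pid) == own_session else other).append(pid)
    return same, other
```

A service in Session 0 would pass its own session (for example, session_of(os.getpid())) as own_session and either skip the "other" list or flag those processes as unmeasurable rather than record misleading zeros.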

At least that’s my theory so far. The truth is that I’m not sure what the real problem is. While the above analysis seems to fit the facts, there is a dearth of information on the subject. Google searches for “GetGuiResources error” return lots of references to permissions issues and other false leads, but nothing about the call failing across session boundaries.

Fortunately for me, my financial services client is still running Windows XP. They have no plans to move to Windows 7 for at least another year, in large part due to the myriad undocumented incompatibilities they will have to mitigate – like the one I’ve outlined above.

Perhaps I’ll find a workaround some day and finally get Tracker’s GDI Object count logic working under Vista and Windows 7 (I’m open to suggestions – leave your ideas in the Comments section of this post). But regardless, this whole affair was a learning process for me. I gained some valuable new skills, and mastered a few unfamiliar techniques – all part of my quest to quash one very mysterious bug.

So count this as one instance in which a developer took one of those classic Microsoft lemons (i.e. the company breaking Windows in a way that’s both unobvious and difficult for 3rd parties to trace) and turned it into lemonade.

Cheers!

RCK


Figure 1 – Get This and Similar Charts at www.xpnet.com

4 comments:

DrPizza said...

I don't really see any good reason to collect this data (yes, I know what happens when you run out of GDI or USER objects, no, that doesn't seem a good reason to query this information), but I would imagine that maybe you could:
1) iterate through each process
2) get a handle to its WindowStation with GetProcessWindowStation
3) possibly duplicate the handle to give you more access to it
4) set your process to use that WindowStation with SetProcessWindowStation
5) call GetGuiResources

It's possible that you'll have to both set the thread Desktop and the process WindowStation, but one would hope that setting the WindowStation would be sufficient.

Randall C. Kennedy said...

We have a very good reason to collect this data: Our largest paying customer is asking us to. And since they're responsible for a big chunk of our revenue over the past couple of years, we do our best to BOGU whenever they come calling. :-)

As for your suggestion, thanks! I'll likely try that approach in the future. However, for now the customer doesn't care about Windows Vista/7. They're sticking with XP until at least next year, which means I can finally spend time on the major overhaul of our commercial offering I've been putting off for nearly a year.

Screenshots and a new promo web site coming soon, so...stay tuned.

RCK

DrPizza said...

Well, I dunno that I agree. I've always been happy to push back on customer demands to ensure that they're really the right demands. The reality is that a lot of customers don't know what they want, they just think they know what they want (and often enough, they don't even know that).

It's my job as a responsible developer to assist them in constructing the right set of requirements, and sometimes that means questioning them and suggesting other things. "Because I'm getting paid" isn't enough justification to do something. If it's pointless busy-work I ain't gonna do it, and I'm not going to bill them for it. I have better things to be doing with my time, and they have better things to be doing with their money.

Randall C. Kennedy said...

Peter,

Understand, we're talking about a company that operates with a nearly $2 Billion IT budget and has an army of skilled, in-house developers. This isn't some "mom & pop" shop asking blindly for data they "think" they need.

The metric in question - GDI objects allocated per process - is one that has tripped them up repeatedly in the past. They've had both in-house and commercial apps hit the upper limits of this resource and cause system instability. Ditto Handles per process. Given the nature of the environment (live trading floor operating in real-time), such an outage is unacceptable. They have to be proactive and monitor anything they believe might become a threat, and that includes GDI resource allocations.

Bottom Line: They've been looking for a reliable way to collect this data for over a year now. It's not exposed by perfmon/pdh.dll, and going into kernel mode to fetch it is unacceptable. So they asked us to step in to try and resolve it for them. We've had success in pulling off this sort of near-impossible task in the past - one of the reasons we're in our 9th year supporting them, with our software deployed across thousands of their systems.

Our solution is only a stopgap - we'll have to revisit it down the road when they eventually move to Windows 7. But for now, we're just happy to be able to once again meet their requirements.

RCK