Aborting a Windows service start-up request in .NET
published: Thu, 24-Mar-2005 | updated: Thu, 27-Oct-2005
Yesterday we were debugging a Windows service written in .NET. The issue was that sometimes (it seemed to happen mostly on Windows 2000 Server, but not every time) the service would hang while stopping. This bug had been observed a while back, when I theorized that a thread deadlock was causing the hang and spent some time desk-checking the code (finding a thread-related bug in the process), but annoyingly the hang had "disappeared" a week ago, only to resurface yesterday.
First let me tell you that I hate multithreaded bugs with a passion. They're awful to debug because of their non-deterministic nature. We all have Xeons here so multithreaded bugs tend to appear fairly quickly (trying to reproduce these kinds of bugs on a single processor machine is a short trip to the funny farm), but even so they are nowhere near deterministic. Bug reports tend to be of the form: "run this for a while and it'll crash". And even worse, putting the debugger on the case alters the environment enough that sometimes the bug doesn't even show up.
Well after yesterday I can now say that I hate multithreaded bugs in services with a virulent my-vision-turns-red type of passion. Even contemplating getting to the point where you have to run the bloody service in the debugger is enough to make me toss my cookies.
So, the first thing to do is to run the service in a normal application. It gets rid of the whole Service Control Manager thing for a start and lets you use the debugger in a normal way. We had a small console app that created the service object, called its OnStart() method, and then, on a cue from the user, would call its OnStop() method and dispose the object.
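For illustration, here is a minimal sketch of that kind of console host. One wrinkle: OnStart() and OnStop() are protected members of ServiceBase, so a console app can only call them if the service class exposes them somehow. The sketch assumes the start and stop logic lives in public Start() and Stop() methods that the real overrides would delegate to. HostableService and ConsoleHost are hypothetical names, not the classes from our service.

```csharp
using System;

// Hypothetical stand-in for the real service class. In an actual service the
// protected ServiceBase.OnStart/OnStop overrides would simply delegate to
// these public methods, so a console host can drive the same code.
public sealed class HostableService : IDisposable
{
    public bool IsRunning { get; private set; }

    public void Start(string[] args)
    {
        Console.WriteLine("service starting");
        IsRunning = true;
    }

    public void Stop()
    {
        Console.WriteLine("service stopping");
        IsRunning = false;
    }

    public void Dispose()
    {
        // Make sure a forgotten Stop() does not leave the service running.
        if (IsRunning) Stop();
    }
}

public static class ConsoleHost
{
    public static void Main(string[] args)
    {
        using (var service = new HostableService())
        {
            service.Start(args);
            Console.WriteLine("Press Enter to stop the service...");
            Console.ReadLine();   // the cue from the user
            service.Stop();
        }                         // leaving the using block disposes the object
    }
}
```

With this shape the debugger attaches to an ordinary process, and breakpoints in Start() and Stop() behave as they would in any console app.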
Luckily we did have a small clue: the service made use of data from a SQL Server instance, and if SQL Server wasn't running we would have a much better chance of triggering the hang. After another hour's worth of experimentation (the poor QA guy was getting a little irritated at us monopolizing his server machine) we worked out that if SQL Server was not available as the service started up then the service would hang as it was stopped. If SQL Server went down after the service had started up there would be no problem during stopping. That hour's worth of work had actually simplified the issue a great deal.
Now we went back to the console app and ran it in the debugger. We started by putting breakpoints throughout the OnStop() code, but quickly ran into a problem with the design. To ensure that the service responds quickly to the SCM on start up and stop, all of the service's real work was done inside a separate thread (the work thread). That way OnStart() would merely spin up the work thread and return. SCM was happy since the service was really quick at starting up. The OnStop() code was a little more complex: it had to signal a manual reset event (the work thread would wait on it every now and then) and then return. SCM would then wait for the service's process to go away. That was where the hang showed up: the process never went away because it still had active threads.
Needless to say, one of the work thread's jobs on start up was to create and initialize a thread pool (one we'd written, not the standard one), so on stopping it had to stop all of these threads. All in all there were a myriad threads to look at and track and worry about.
So we put a breakpoint at the statement where the work thread waited on the "service is stopping" manual reset event. It was never hit. Bizarre. We peppered breakpoints throughout the work thread's main processing loop code. They weren't hit either. Umm. It looked as if the work thread wasn't even running, but the threads in the thread pool were ticking over just nicely (actually, they were all waiting on their own "there's a job to be done" manual reset event).
Well that explains the "deadlock" problem; not really a deadlock per se, just a bunch of threads waiting on manual reset events which weren't being fired because the thread that would fire them wasn't even running.
Onto OnStart() to find out why the work thread wasn't being spun up. The OnStart() code was really simple: a couple of messages logged, and the work thread created and started. It certainly looked as if the work thread was running. Obviously something was causing the work thread to be terminated abnormally. Our SQL Server problem?
And indeed that was the case: the work thread was timing out on connecting to its database, a SQL Exception was being caught and logged (and we'd seen that: all it said was that the connection to SQL Server was broken), and the work thread then terminated. The service code was completely unaware of this of course (it had finished its work aeons ago in computer-time).
The work thread was creating its thread pool before it started looking for SQL Server, but it didn't dispose of it if there was an exception. Hence all the threads just sitting there waiting for work that would never come. Bzzzt. Fixing the bug therefore meant altering the work thread so that it would dispose of its ancillary objects (like the thread pool) in the case of an exception.
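The shape of that fix can be sketched as follows. This is not our actual code: WorkerPool is a hypothetical stand-in for the custom thread pool, and the connect and mainLoop delegates stand in for opening the SQL Server connection and running the processing loop. The point is only that the pool is disposed in a finally block, so its threads are released whether the work thread ends normally or through an exception.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the custom thread pool. Its workers are
// foreground threads, so any that are left waiting keep the process alive.
public sealed class WorkerPool : IDisposable
{
    private static int liveWorkers;
    private readonly ManualResetEvent shutdown = new ManualResetEvent(false);
    private readonly Thread[] workers;

    // Number of pool threads (across all pools) that have not yet exited.
    public static int LiveWorkers
    {
        get { return Volatile.Read(ref liveWorkers); }
    }

    public WorkerPool(int size)
    {
        workers = new Thread[size];
        for (int i = 0; i < size; i++)
        {
            Interlocked.Increment(ref liveWorkers);
            workers[i] = new Thread(() =>
            {
                // A real pool would also wait on a "job available" event.
                shutdown.WaitOne();
                Interlocked.Decrement(ref liveWorkers);
            });
            workers[i].Start();
        }
    }

    public void Dispose()
    {
        shutdown.Set();                          // release every waiting worker
        foreach (Thread t in workers) t.Join();  // wait for them to exit
        shutdown.Dispose();
    }
}

public static class WorkThread
{
    // Body of the work thread. Both delegates are assumptions for the sketch.
    public static void Run(Action connect, Action<WorkerPool> mainLoop)
    {
        var pool = new WorkerPool(4);
        try
        {
            connect();        // may throw, for example when SQL Server is down
            mainLoop(pool);   // returns when the service is told to stop
        }
        catch (Exception ex)
        {
            Console.WriteLine("work thread failed: " + ex.Message);
        }
        finally
        {
            pool.Dispose();   // runs on the normal path and the failure path
        }
    }
}
```

Calling WorkThread.Run with a connect delegate that throws leaves WorkerPool.LiveWorkers at zero afterwards, which is exactly what the original code failed to guarantee.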
We actually went a bit further. We decided that the service shouldn't even start up if the SQL Server connection couldn't be established. So the first thing the work thread now does is to establish its connection to the database. If it can, it signals a new manual reset event to say that it succeeded and then continues working. The OnStart() method waits on this event with a timeout. If the event is signalled, OnStart() completes normally. If the wait times out, OnStart() throws a "cannot start service" exception.
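A minimal sketch of that handshake, under a few assumptions: HandshakeService is a hypothetical plain class rather than a ServiceBase subclass (in a real service, OnStart and OnStop would be the protected overrides), the connect delegate stands in for opening the SQL Server connection, and the exception type and timeout values are chosen only for illustration.

```csharp
using System;
using System.Threading;

public sealed class HandshakeService
{
    private readonly ManualResetEvent connected = new ManualResetEvent(false);
    private readonly ManualResetEvent stopping = new ManualResetEvent(false);
    private readonly Action connect;        // hypothetical: opens the database connection
    private readonly TimeSpan startTimeout;
    private Thread worker;

    public HandshakeService(Action connect, TimeSpan startTimeout)
    {
        this.connect = connect;
        this.startTimeout = startTimeout;
    }

    // In a real service: protected override void OnStart(string[] args)
    public void OnStart(string[] args)
    {
        worker = new Thread(Work);
        worker.Start();

        // Wait for the work thread to report a successful connection.
        if (!connected.WaitOne(startTimeout))
        {
            stopping.Set();   // make sure a late-connecting work thread exits
            throw new InvalidOperationException(
                "Cannot start service: no database connection.");
        }
    }

    // In a real service: protected override void OnStop()
    public void OnStop()
    {
        stopping.Set();
        worker.Join();
    }

    private void Work()
    {
        try
        {
            connect();
        }
        catch (Exception ex)
        {
            Console.WriteLine("connect failed: " + ex.Message);
            return;           // 'connected' is never set, so OnStart times out
        }

        connected.Set();      // tell OnStart that start-up succeeded

        // Check the "service is stopping" event every now and then.
        while (!stopping.WaitOne(TimeSpan.FromMilliseconds(100)))
        {
            // real work would go here
        }
    }
}
```

One limitation of this simple version: if the connection fails immediately, OnStart() still waits for the full timeout before throwing. A second "start-up failed" event, waited on together with the first, would let it fail fast.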
And that last point is important to note. Nowhere that I could find in the MSDN documentation about the ServiceBase class did it mention how to fail a start-up request. The documentation always assumes that the service will just start up normally. Indeed the original developer who'd written the service code had assumed that this meant he couldn't let any exceptions escape out of OnStart() and had written code to swallow them all. It took a good half-hour spelunking with Reflector in the .NET service code to get to the point where I understood that to fail a start-up request you could just throw an exception (there's callback indirection galore going on in there).
So your helpful hint for today is "throwing an exception from OnStart() will abort a service start-up request in .NET".