A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. -Leslie Lamport
On my way to QCon Tokyo and QCon China, I had some time to kill so I headed over to Delta's Skyclub lounge. I've been a member for a few years now. And why not? I mean, who could pass up tepid coffee, stale party snacks, and a TV permanently locked to CNN? Wait... that actually doesn't sound like such a hot deal.
Oh! I remember, it's for the wifi access. (Well, that plus reliably clean bathrooms, but we need not discuss that.) Being able to count on wifi access without paying for yet another data plan has been pretty helpful for me. (As an aside, I might change my tune once I try a mifi box. Carrying my own hotspot sounds even better.)
Like most wifi providers, the Skyclub has a captive portal. Before you can get a TCP/IP connection to anything, you have to submit a form with a checkbox to agree to 89 pages of terms and conditions. I'm well aware that Delta's lawyers are trying to make sure the company isn't liable if I go downloading bootlegs of every Ally McBeal episode. But I really don't know if these agreements are enforceable. For all I know, page 83 has me agreeing to 7 years of indentured servitude cleaning Delta's toilets.
Anyway, Delta has outsourced operations of their wifi network to Concourse Communications. And apparently, they've had an outage all morning that has blocked anyone from using wifi in the Minneapolis Skyclubs. When I submit the form with the checkbox, I get the following error page:
Including this bit of stacktrace:
There's a lot to dislike here.
- Why is this yelling at me, the user? To anyone who isn't a web site developer, this makes it sound like the user did something wrong. There's a ton of scary language here: "instance-specific error", "allow remote connections", "Named Pipes Provider"... heck, this sounds like it's accusing the user of hacking servers. "Stack trace" sure sounds like the Feds are hot on somebody's trail, doesn't it?
- Isn't it fabulous to know that Ken keeps his projects on his D: drive? If I had to lay bets, I'd say that Ken screwed up his configuration string. In fact, the whole problem smells like a failed deployment or poorly executed change. Ken probably pushed some code out late on a Friday afternoon, then boogied out of town. My prediction (totally unverifiable, of course) is that this problem will take less than 5 minutes to resolve, once Ken gets his ass back from the beach.
- We mere users get to see quite a bit of internal information here. Nothing really damaging, unless of course Wilson ORMapper has some security defects or something like that.
- Stepping back from this specific error message, we have the larger question: is it sensible to couple availability of the network to the availability of this check-the-box application? Accessing the network is the primary purpose of this whole system. It is the most critical feature. Is collecting a compulsory boolean "true" from every user really as important as the reason the whole damn thing was built in the first place? Of course not! (As an aside, this is an example of what John Gall's Systemantics calls Le Chatelier's Principle: "Complex systems tend to oppose their own proper function.")
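To make that last point concrete, here is a minimal sketch of what decoupling might look like, assuming the portal only needs a best-effort record of the checkbox. Every name in it (`handle_terms_form`, `record_acceptance`, `AcceptanceStoreDown`, the `store` object and its `save` method) is hypothetical; I have no idea what the real portal code looks like.

```python
import queue

# Hypothetical names throughout: nothing here reflects the real portal's code.
pending_acceptances = queue.Queue()  # acceptances not yet recorded durably


class AcceptanceStoreDown(Exception):
    """Raised when the (non-critical) acceptance database is unreachable."""


def record_acceptance(store, mac_address):
    """Try to persist the checkbox; buffer it locally if the store is down."""
    try:
        store.save(mac_address)
        return True
    except AcceptanceStoreDown:
        pending_acceptances.put(mac_address)  # replay later, after the fact
        return False


def handle_terms_form(store, mac_address, accepted):
    """Grant network access whenever the user ticked the box.

    Recording the acceptance is best-effort, so a database outage never
    blocks the critical feature, which is getting the user online.
    """
    if not accepted:
        return "denied"
    record_acceptance(store, mac_address)
    return "granted"
```

With something like this, a dead database means a backlog of acceptances to replay, not a lounge full of people with no wifi. Whether the lawyers would accept a delayed record is a separate question.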
We see this kind of operational coupling all the time. Non-critical features are allowed to damage or destroy critical features. Maybe there's a single thread pool that services all kinds of requests, rather than reserving a separate pool for the important things. Maybe a process is overly linearized and doesn't allow for secondary, after-the-fact processing. Or, maybe a critical and a non-critical system both share an enterprise service, producing a common-mode dependency.
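The thread pool case is the easiest one to sketch. Below is a rough illustration using Python's standard `concurrent.futures`, with one pool reserved for critical requests and another for everything else. The pool sizes, the `submit` router, and the `demo` function are all assumptions made up for the example, not anything from a real system.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Separate pools so a stall in one kind of work cannot starve the other.
# The sizes are arbitrary assumptions.
critical_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="critical")
noncritical_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="noncritical")


def submit(request_kind, fn, *args):
    """Route work to the pool reserved for its kind (hypothetical router)."""
    pool = critical_pool if request_kind == "critical" else noncritical_pool
    return pool.submit(fn, *args)


def demo():
    """Stall the non-critical pool, then show critical work still completes."""
    gate = threading.Event()
    # Five tasks that block until the gate opens: more than the pool can run.
    stuck = [submit("noncritical", gate.wait) for _ in range(5)]
    try:
        return submit("critical", lambda: "online").result(timeout=2)
    finally:
        gate.set()  # release the stalled tasks so the pool can drain
```

Had both kinds of request shared one pool, the blocked tasks could have occupied every worker and left the critical request waiting behind them.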
Whatever the proximate cause, the underlying problem is lack of diligence in operational decoupling.