Riding the Escalator
by Mark K. Mellis
<mkm@mellis.com>
Mark K. Mellis is principal engineer of Mellis and Associates, a
Silicon Valley consulting firm specializing in Unix systems and IP
networking. His interests include computer telephony, reliable network
infrastructure design, jazz, and rubber chickens.
I've always been challenged by the management of flipped-out users:
ones who have a problem that, to them, is the most pressing
thing in the world. I "feel their pain," yet I have to balance their
needs in the context of the entire organization that I support. As
usual, I'm not the only one with this problem, and also as usual, the
other folks have a pretty good tool for managing the problem. In my
best "egoless sysadmin" style, I've adopted it for my own use. My new
magic bullet is the defined escalation procedure.
How It Works
You classify your incoming calls (or tickets or issues or whatever your
favorite euphemism might be) by priority. Each classification has
expected service goals associated with it. Those service goals are
published. If a call isn't progressing according to the service goals
for its classification, it moves to a higher classification, and those
involved are notified. This continues until the call is closed or it
reaches the top of the escalation ladder. The goal is never to have to
escalate calls beyond their initial level. In addition to
autoescalation, a call can be escalated at will by the submitter or by
the sysadmin organization. Either way, the path for each call is defined,
and all parties know in advance what their options are when progress
isn't as apparent as they would like. By defining the path through
which a problem is addressed and providing a clear mechanism to
escalate the issue when necessary, you help move frustration away from
the personalities involved and focus on the problem and its resolution.
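For readers who think in code, the mechanism can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a description of any particular ticketing system: the `Call` class, the level names, the hour limits, and the `notify` callback are all hypothetical placeholders.

```python
from dataclasses import dataclass

# Ladder from lowest to highest priority; the level names are illustrative.
LADDER = ["routine", "minor", "major", "disaster"]

# Hypothetical service goals: business hours allowed at each level before
# the call moves up. The top level has no limit because it cannot escalate.
ESCALATE_AFTER = {"routine": 24, "minor": 8, "major": 4, "disaster": None}


@dataclass
class Call:
    summary: str
    level: str = "routine"
    hours_at_level: float = 0.0  # business hours spent at the current level
    closed: bool = False


def escalate(call: Call, notify) -> bool:
    """Move a call up one level and notify those involved.

    Returns False when the call is already at the top of the ladder.
    """
    position = LADDER.index(call.level)
    if position == len(LADDER) - 1:
        return False
    call.level = LADDER[position + 1]
    call.hours_at_level = 0.0
    notify(call)
    return True


def autoescalate(call: Call, notify) -> bool:
    """Escalate an open call only if it has outlived its service goal."""
    limit = ESCALATE_AFTER[call.level]
    if call.closed or limit is None or call.hours_at_level < limit:
        return False
    return escalate(call, notify)
```

A scheduler would call `autoescalate` periodically for every open call, while a submitter or sysadmin escalating at will would call `escalate` directly. Both routes go through the same notification step, which is the point: nobody is surprised.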
Implications for the Organization
A defined and published escalation procedure helps relieve users'
stress because it gives them a way to bring their problems to higher
authority when they don't think they are getting the attention they
deserve.
Implicit within the escalation procedure is a defined priority for each
issue. When you have a clear idea of the priority of a problem, you can
allocate resources to it in a fair manner, and if it is escalated, the
increased resources of the higher level become available, assuring that
appropriate resources are channelled to critical problems.
Managers don't like to be blindsided. There's nothing like having your
boss called on the carpet for a problem you haven't told her about yet.
If escalations are actively managed, these situations are less likely
to occur.
Don't forget, you must define the escalation procedure in writing and
publish it throughout your user community and their management. If
users don't know about it, they can't use it.
Like all policies and procedures, defined escalation is unlikely to
succeed as a sysadmin tool without management support. Your boss needs
to "sign up" to be on the escalation path and needs to back you up in a
constructive manner when contacted by a constituent who is escalating
an issue.
Defined escalation is not a project management tool. Project work
should be managed separately.
Here is an example of the escalation procedure for a medium-sized
company:
- Routine request
  - Submit via Web form, email, or telephone call to x4357 (HELP), or call: Duty Sysadmin pager +1 408 555 5554.
  - Worked by duty sysadmin.
  - Routine requests are account creations, file restores, network moves, alias maintenance, and so forth.
  - Status reported at time of request and every 16 business hours thereafter.
  - Escalates to next level in 24 business hours.
- Minor outage, impairment
  - Submit via Web form, email, or telephone call to x4357 (HELP), or call: Duty Sysadmin pager +1 408 555 5554, Duty Manager pager +1 408 555 5555.
  - Worked by duty sysadmin. Duty Sysadmin manager and submitter's manager automatically notified.
  - Minor outages affect eight or fewer users. For example, a single 10BaseT hub failure or a single workstation failure is a minor outage. Impairments are error conditions that have not yet caused an outage. For example, high error rates on a disk drive or network connection are impairments.
  - Status reported at time of request and every 4 business hours thereafter.
  - Escalates to next level in 8 business hours.
- Major outage
  - Submit via Web form, email, or telephone call to x4357 (HELP), or call: Duty Sysadmin pager +1 408 555 5554, Duty Manager pager +1 408 555 5555, IS Director pager +1 408 555 5556.
  - Worked by duty sysadmin and other resources as dispatched by management. Duty Sysadmin manager, IS Director, submitter's manager, and director automatically notified.
  - Major outages affect nine or more users or a major service, such as DNS or Internet connectivity. A major fileserver failure or a security incident in progress is a major outage.
  - Status reported at time of request and hourly thereafter.
  - Escalates to next level in 4 business hours.
- Disaster
  - Submit via Web form, email, or telephone call to x4357 (HELP), or call: Duty Sysadmin pager +1 408 555 5554, Duty Manager pager +1 408 555 5555, IS Director pager +1 408 555 5556, Office of the President pager +1 408 555 5557.
  - Worked by all available resources. All Sysadmin management, all directors, and the Office of the President automatically notified.
  - Disaster is a company-wide failure of IS infrastructure. A fire in the data center or a major security penetration is a disaster.
  - Status reported at time of request and hourly thereafter.
  - This is the highest escalation level.
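If you plan to feed a procedure like this to a ticketing system, it helps to write the numbers down as data. The following is a rough sketch, assuming Python; the `Level` type and both helper functions are hypothetical names of my own, and only the hour figures and the eight-versus-nine-user threshold come from the procedure above.

```python
from typing import NamedTuple, Optional


class Level(NamedTuple):
    name: str
    status_every_hours: int              # hours between status reports
    escalate_after_hours: Optional[int]  # None at the top of the ladder


# Values transcribed from the example procedure above.
LEVELS = [
    Level("Routine request", 16, 24),
    Level("Minor outage, impairment", 4, 8),
    Level("Major outage", 1, 4),
    Level("Disaster", 1, None),
]


def classify_outage(users_affected: int, major_service: bool = False,
                    company_wide: bool = False) -> Level:
    """Pick the initial level for an outage report (not a routine request)."""
    if company_wide:
        return LEVELS[3]
    if major_service or users_affected >= 9:
        return LEVELS[2]
    return LEVELS[1]


def worst_case_hours_to_top(start: int = 0) -> int:
    """Business hours for an untouched call to climb from `start` to the top."""
    return sum(level.escalate_after_hours for level in LEVELS[start:-1])
```

With these figures, `worst_case_hours_to_top()` comes to 36: a routine request that nobody touches reaches the Office of the President after 24 + 8 + 4 business hours. That is a useful number to show your boss before she signs up.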
Let's walk through a few cases. The network drop for Bill's workstation
fails. He calls x4357 and opens a routine ticket. The ticket number he
gets in return is his initial notification. The duty sysadmin comes by
in 20 minutes, repatches him to a working port, and closes the ticket.
Work flow is normal, so there is no escalation.
Susan needs an alias created to support her latest project and wants it
done immediately. She knows that alias creation is a routine event and
can reasonably take up to three business days. She can escalate it, but
her management will be automatically notified if she does. Susan
chooses to wait. The alias is created in a timely manner and the ticket
is closed.
Jake needs a special CAD package updated on his workstation. He opens a
routine ticket on Monday, but his request gets lost in the workload. On
Thursday, it is autoescalated from routine request to minor outage, and
the appropriate managers are notified. Jake's software is updated early
Friday morning by a chastened sysadmin.
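Jake's Monday-to-Thursday timeline follows from counting business hours rather than wall-clock hours. Here is a minimal sketch of that arithmetic, assuming a 09:00 to 17:00 business day, Monday through Friday, with no holidays. Those hours, the function name, and the sample date are my assumptions; the procedure itself only says "business hours."

```python
from datetime import datetime, time, timedelta

DAY_START = time(9, 0)   # assumed opening of the business day
DAY_END = time(17, 0)    # assumed close, giving 8 business hours per day


def add_business_hours(start: datetime, hours: float) -> datetime:
    """Return the moment at which `hours` business hours have elapsed since `start`."""
    current = start
    remaining = timedelta(hours=hours)
    while True:
        opens = datetime.combine(current.date(), DAY_START)
        closes = datetime.combine(current.date(), DAY_END)
        if current.weekday() < 5 and current < closes:  # Monday is 0
            current = max(current, opens)
            available = closes - current
            if remaining <= available:
                return current + remaining
            remaining -= available
        # Move to the start of the next calendar day and try again.
        current = datetime.combine(current.date() + timedelta(days=1), DAY_START)


# A routine ticket opened on a Monday at 10:00 (1 January 2024 was a Monday).
opened = datetime(2024, 1, 1, 10, 0)
due = add_business_hours(opened, 24)
print(due)  # 2024-01-04 10:00:00, a Thursday
```

Under these assumptions, 24 business hours is three full working days, which is why a ticket opened Monday morning autoescalates on Thursday morning.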
Dorothy requests that her workgroup be moved to its own subnet to
improve performance. Router interfaces are in short supply and the
request can't be fulfilled in the designated time, so the sysadmin
escalates the request. During the subsequent automatic management
review, Dorothy is persuaded that she doesn't need her own router
interface after all.
Defined escalation is a tool, and will benefit you only if you use it
properly. It must be accepted by your organization, including your user
community. You have to be willing to live with its consequences. In
return, you may reap the benefits of more businesslike interactions
with your constituency and emergencies that are really "emergent."