[IP] more on LA ATC Failure The Risks Digest Volume 23: Issue 54
Begin forwarded message:
Re: LA ATC Failure (RISKS-23.53)
<Paul Cox <pcox@xxxxxxxxxx>>
Thu, 23 Sep 2004 13:03:40 -0700
I'm an air traffic controller in Seattle Center, which is a facility just
like the one in LA that had the crash.
To do their job, air traffic controllers need one thing above/beyond all:
They need the ability to communicate with the aircraft they're controlling.
We can control planes even without radar, because we can get position
reports from the airplanes and provide safe separation via altitude,
spacing, and so forth. But without comm, we're completely and utterly
hosed.
(Some of the FAA spokesflacks had the audacity to suggest that the system
was still safe, because the radar system continued working just fine. Sure,
the controllers could still *see* the airplanes; they just couldn't do
anything about it as they watched them get closer, and closer, and
closer... they'd have had a wonderful view of the targets merging as the
passengers were converted instantly into a thin pink mist had the planes
collided. But hey, the system was safe.)
The VSCS (Voice Switching and Control System) puts all of our
communications into one spot- ground-to-ground calls to other facilities,
calls within our own facility to other controllers, and air-to-ground comm.
It's a purely digital system; all the incoming feeds are converted to bits
and bytes and switched through a series of servers and such until they're
turned back into analog and put into the controller's ear through his
headset.
Of course, this means that power to the system is absolutely critical, and
we've had power failures in the past (see past RISKS for that info).
The VSCS system was designed and built by Harris Corporation, but their
contract ran out some time ago. The FAA, coming to the end of the contract,
decided to go a much less expensive route- and replace all the servers with
Dell boxes and their own programming.
In theory, there's nothing wrong with this; do the required maintenance, and
there's no problem. But the system does have the design flaws referred to
in the RISKS articles.
Basically, the system needs to be reset about once a month- or more
specifically, once every 30 days or so. I heard a rumor that part of the
problem in LA was that they'd done the reset at the beginning of August, but
had put it off for September... and were planning to do it at the end of the
month.
There's a RISK right there; "once a month" probably means "once every 30 or
so days", not "once in a calendar month" which could leave an interval as
long as nearly 60 days in between resets.
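To put numbers on the "once a month" ambiguity, here is a minimal Python
sketch of the interval arithmetic. The dates are hypothetical (I only heard
"beginning of August" and "end of the month" as a rumor), and the 30-day
limit and the function names are my own assumptions for illustration, not
anything taken from the actual maintenance procedure.

```python
from datetime import date

# Assumption: "about once a month" is read as a hard 30-day limit.
MAX_INTERVAL_DAYS = 30


def days_between_resets(previous: date, following: date) -> int:
    """Number of days elapsed between two resets."""
    return (following - previous).days


def is_overdue(last_reset: date, today: date,
               limit_days: int = MAX_INTERVAL_DAYS) -> bool:
    """True once more than limit_days have passed since the last reset."""
    return (today - last_reset).days > limit_days


# Hypothetical dates: a reset early in August, the next one put off
# until the end of September. Both fall "once in a calendar month".
early_august = date(2004, 8, 1)
late_september = date(2004, 9, 30)

gap = days_between_resets(early_august, late_september)
print(gap)                                       # 60
print(is_overdue(early_august, late_september))  # True
```

Scheduling against the date of the last reset, rather than against the
calendar month, is what keeps the gap from drifting out toward 60 days.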
(On a side note, the voice recordings are only kept for the past 15 days,
and it's done by an entirely separate system. The main reason for the reset
has to do with file and memory buffers overloading.)
Now, there's a backup system for VSCS. It's called VTABS, and is basically
a reduced-capability server that normally runs the VSCS system on the ATC
simulator that's used to train students.
The VTABS system, with much less server power, cannot run the entire control
room and all of the frequencies that the control center has, so it's a
hassle to go to VTABS.
When the reset on VSCS is done, you have to run on VTABS for a while, which
usually means it's done on graveyard shifts to reduce the impact on live
traffic. The downside to this is that the VTABS system also doesn't get a
full workout.
So the next RISK pops up: The backup system isn't really fully checked out,
and if/when ATC needs it... it might not work.
Sure enough, that happened. When VSCS died, LA Center switched to
VTABS... which also didn't work right. Big trouble, now.
Finally, the FAA (in its infinite wisdom) a while back decided to remove a
last-ditch backup system called EARS.
EARS was basically a hard-wired, all-analog system that only provided the
most crucial thing- air-to-ground communications.
EARS required power to run, but the reason it had a big advantage over VSCS
or VTABS is that if the power died for, say, 20 seconds, as soon as the
power was back on EARS would work with no spool-up startup time. VSCS takes
up to 45 minutes to completely start up, and VTABS has a significant delay
in startup time as well.
Seattle Center (where I work) is the only facility of its type that still
has EARS (our variant is called VEARS). We have it because a fairly wise
manager asked our technicians to keep the system when it was slated for
removal. The tech side agreed, and has kept VEARS going by moving a little
money around in its budget (since the FAA cut VEARS nationally, it doesn't
give the facilities any money to maintain the system).
Fortunately (and perhaps a bit unbelievably) VEARS costs very very little to
maintain, because it's just a set of switches that sit there unused the huge
majority of the time. We test them for functionality about once a week.
The LA failure was both ridiculous and scary. It's ridiculous on several
levels; the fact that the system is designed to shut itself down is silly in
a way, because from the user's perspective the system basically crashes to
protect itself from crashing.
Well, when suddenly you can't talk to the airplanes, you don't much give a
damn whether it's an intentional shutdown or an accidental/buggy shutdown.
Therefore, they might as well remove this intentional design.
It's ridiculous that the technicians weren't doing the reset. This issue is
NOT NEW, and has been known for some time... and had any of the 10 airplanes
(with 200 passengers each) managed to smack into another plane, you can bet
that the FAA would have been paying the families for a long, long, long
time.
It's ridiculous that the first backup system didn't work right simply
because people were too lazy/unmotivated to test it properly. VTABS is an
acceptable backup; it's not perfect, but for the money it cost (essentially
nothing for hardware, some reprogramming costs for the servers) it's nearly
ideal.
It's ridiculous that a perfectly good SECOND backup was thrown away by the
FAA that cost even less. The technology in EARS has been around since, oh,
about as long as there's been radio; it's tried and true, and it's pathetic
that there's only one facility in the nation (out of 21) that still has
EARS.
And it's scary to think that this could've happened in an even busier
facility than LA. The morning crush of traffic in New York or Boston or
Indy or Cleveland Centers, for example, where there's even more traffic
packed into even less airspace than out west in LA.
The RISKS here are many and silly, because nearly all of them could have
been easily avoided with some diligence and forethought.
RISK 1) programming the system to shut down to try and prevent a shutdown.
If you don't expect it either way, it doesn't matter.
RISK 2) being lazy or not really understanding that "once a month" actually
means "once every 30 days" and ensuring that a critical job is done, on
time, and correctly.
RISK 3) having a backup system that isn't checked to see if it can actually
do the job. You rely upon it, it better work, and if/when it doesn't,
you're screwed.
RISK 4) throwing out a perfectly good second backup system because you think
it's "old fashioned" and that the primary/secondary system you have now is
so much better. Hey, the new stuff is all digital, it's gotta be better,
right?
Finally, on a personal note, the manager at Seattle Center who managed to
talk the technical guys into keeping our VEARS system should be considered a
hero and an example for the rest of the FAA. He's already a hero to me-
he's my father. :)
Paul Cox, Seattle Center