RE: DOE Releases Interim Report on Blackouts/Power Outages, Focus on Cyber Security
Well, they did specifically discount both current (at the time) Internet
worms/activity, and terrorist activity, as having any part in the blackout. As
for the RTU failures, FE told investigators they believed the cause was that
the RTUs "started queuing and overloading the terminals buffers".
Given that the EMS Alarm program had already crashed at this stage, it's
feasible to see a real-time reporting terminal not know what to do (other than
to page an FE IT person) when its host can't accept its input. Since they refer
to the RTUs' connectivity both as "dial-ups" and as "data links", it's hard to
say what they were. Nothing else in the EMS system failed until 14:54 when both
primary and backup EMS servers were down, so it's unlikely that any "network"
connectivity between RTUs and EMS was interrupted due to the problems with the
EMS systems until that time. Ergo we're left with "comms" problems between RTU
and EMS that led some FE personnel to describe them as "network" problems. It
may all have simply been the fact that the RTUs had stalled, sending no "comms".
Interesting that a page went to FE IT folks when the RTUs stopped, but nothing
went to them when the EMS Alarm program "stalled".
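For what it's worth, the gap described above (a page for the RTUs, nothing for
the alarm program) is the sort of thing a generic staleness check is meant to
cover. Below is a minimal sketch of that idea only. Everything in it is my own
assumption for illustration: the class name, the process names and the 60
second timeout are hypothetical, and none of it reflects how FE's EMS actually
worked.

```python
import time


class HeartbeatWatchdog:
    """Hypothetical staleness monitor; not taken from any real EMS product."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s   # assumed threshold before a process counts as stalled
        self.clock = clock           # injectable so the check can be exercised without waiting
        self.last_beat = {}          # process name -> time of its last heartbeat

    def beat(self, name):
        """Record that the named process has just reported in."""
        self.last_beat[name] = self.clock()

    def stale(self):
        """Return, sorted by name, the processes silent for longer than timeout_s."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_beat.items()
            if now - seen > self.timeout_s
        )
```

Each monitored process would call beat() on every cycle, and whatever does the
paging would poll stale() and page on any name it returns. The point is simply
that a stalled process stops calling beat(), so its silence becomes the signal,
rather than relying on the stalled process to report its own failure.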
I think the refresh rate of the EMS consoles isn't actually a factor. The alarm
function "stalled", or "froze", and did not produce any alarms. That EMS
consoles were being refreshed after 59 seconds didn't alter the fact operators
weren't seeing new alarms. The lack of alarms, coupled with the arrogance of
the staff who insisted reports by others were mistaken, led to critical
failures in line load which ultimately left them unable to recover.
During the same period of time MISO's State Estimating system, which was
receiving telemetry from much of FE's network, experienced a normal mis-match in
load calculations. A manual process is used to correct this, and was done
within ~30 minutes of its first occurrence near the FE problem time-frame. An
operator at MISO, however, left the estimating system in manual mode and went
to lunch. It was put back into automatic mode 93 minutes later, at which time
it again had a mis-match solution...so it had to be manually corrected again.
It wasn't back into automatic mode until 16:04. Hard to say it would have made
a big difference if it had been running in automatic mode during this whole
time. Probably yes, but given FE's adamancy they had good data, they may have
spent an equal amount of time arguing over who knew what.
If either of these events had occurred independently, it's likely the blackout
could have been avoided.
If FE's operators had not been so sure of themselves, it's likely the blackout
could have been avoided.
Finally, FE's IT staff took 54 minutes to complete their first attempt at
recovering the alarm process, this after both primary and backup servers had
failed (14 minutes after both had failed.) They were obviously relying on the
failure not transferring from the primary to the backup. 34 minutes after the
first warm reboot, and 4 minutes before the EMS crashed again, they discussed
with FE operators the possibility of doing a complete cold boot because only
then were they informed that the alarm function wasn't running (still). FE
operators dissuaded the IT staff from doing so, fearing they'd have less data
than they already had (arrogance again; they had already demonstrated their
inability to perform adequately with the "less" data.)
Unfortunately, nobody tells us how long it would have actually taken to do a
cold boot, and FE's IT staff say they didn't find out that was the only way to
recover the alarm system until after the blackout (meaning the warm boot was a
useless effort in the first place.)
And during all this time there were those damn trees!!!
MISO failed to adequately warn, and FE failed to adequately control its
security space (physically and electronically). And it all happened on a hot
August afternoon.
Cheers,
Russ - NTBugtraq Editor