RE: DOE Releases Interim Report on Blackouts/Power Outages, Focus on Cyber Security

To: "Geoff Shively" <gshively@xxxxxxxx>
Subject: RE: DOE Releases Interim Report on Blackouts/Power Outages, Focus on Cyber Security
From: "Russ" <Russ.Cooper@xxxxxxxx>
Date: Fri, 21 Nov 2003 17:54:41 -0500
Cc: <bugtraq@xxxxxxxxxxxxxxxxx>

Well, they did specifically discount both current (at the time) Internet
worms/activity and terrorist activity as having any part in the blackout. As
for the RTU failures, FE told investigators they believed the cause was that
the RTUs "started queuing and overloading the terminals buffers". Given that
the EMS Alarm program had already crashed at this stage, it's feasible that a
real-time reporting terminal would not know what to do (other than page an FE
IT person) when its host can't accept its input. Since they refer to the RTUs'
connectivity both as "dial-ups" and as "data links", it's hard to say what
they were. Nothing else in the EMS system failed until 14:54, when both primary
and backup EMS servers were down, so it's unlikely that any "network"
connectivity between the RTUs and the EMS was interrupted by the problems with
the EMS systems before that time. Ergo we're left with "comms" problems between
RTU and EMS that led some FE personnel to describe them as "network" problems.
It may all have simply been that the RTUs had stalled and were sending no
"comms".
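
To make the "queuing and overloading" idea concrete, here is a minimal sketch
in Python of a fixed-size telemetry buffer on a terminal whose host has stopped
accepting input. Everything in it is an assumption for illustration (the
TelemetryBuffer name, the capacity, the drop-oldest policy); the report does
not say how FE's RTUs actually buffer data.

```python
from collections import deque


class TelemetryBuffer:
    """Hypothetical fixed-size buffer for readings awaiting upload to a host."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently discards the oldest item when full.
        self._queue = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, reading):
        # Count the reading that is about to be pushed out, so the overflow
        # is visible instead of silent.
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1
        self._queue.append(reading)

    def drain(self, send):
        # Try to hand readings to the host, oldest first. Stop at the first
        # refusal and keep the unsent readings queued.
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break
            self._queue.popleft()
            sent += 1
        return sent
```

With a host that refuses everything, drain() returns 0 and the dropped counter
keeps climbing as new readings arrive, which is about the point where you would
want the terminal to page somebody.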

Interesting that a page went to FE IT folks when the RTUs stopped, but nothing
went to them when the EMS Alarm program "stalled".
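
For what it's worth, a check that notices a stalled alarm process is not
complicated. Below is a minimal sketch in Python, assuming the alarm process
can record a heartbeat each time it completes a processing cycle and that an
independent monitor polls for silence. The names (HeartbeatWatchdog, the page
callback) and the 60-second limit in the usage note are mine, purely for
illustration; nothing here reflects how FE's EMS is actually built.

```python
import time


class HeartbeatWatchdog:
    """Flags a monitored process as stalled when its heartbeat goes quiet."""

    def __init__(self, max_silence_s, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_beat = clock()
        self._paged = False

    def beat(self):
        # Called by the monitored process each time it finishes a cycle.
        self._last_beat = self._clock()
        self._paged = False

    def check(self, page):
        # Called periodically by a monitor that runs outside the alarm
        # process, so a stall in one does not silence the other.
        silence = self._clock() - self._last_beat
        if silence > self.max_silence_s and not self._paged:
            page(f"alarm process silent for {silence:.0f}s")
            self._paged = True       # page once per stall, not every poll
        return silence <= self.max_silence_s
```

A monitor would construct HeartbeatWatchdog(60) and call check() on a timer,
passing whatever function sends the page. The important design choice is that
the monitor lives outside the thing it watches.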

I think the refresh rate of the EMS consoles isn't actually a factor. The alarm
function "stalled", or "froze", and did not produce any alarms. That EMS
consoles were being refreshed after 59 seconds didn't alter the fact that
operators weren't seeing new alarms. The lack of alarms, coupled with the
arrogance of the staff who insisted reports by others were mistaken, led to
critical failures in line load, which ultimately left them unable to recover.

During the same period of time MISO's State Estimating system, which was
receiving telemetry from much of FE's network, experienced a normal mis-match
in load calculations. A manual process is used to correct this, and was done
within ~30 minutes of its first occurrence near the FE problem time-frame. An
operator at MISO, however, left the estimating system in manual mode and went
to lunch. It was put back into automatic mode 93 minutes later, at which time
it again had a mis-match solution...so it had to be manually corrected again.
It wasn't back in automatic mode until 16:04. Hard to say whether it would have
made a big difference if it had been running in automatic mode during this
whole time. Probably yes, but given FE's insistence that they had good data,
they may have spent an equal amount of time arguing over who knew what.

If either of these events had occurred on its own, it's likely the blackout
could have been avoided.

If FE's operators had not been so sure of themselves, it's likely the blackout
could have been avoided.

Finally, FE's IT staff took 54 minutes to complete their first attempt at
recovering the alarm process, this after both primary and backup servers had
failed (14 minutes after both had failed). They were obviously relying on the
failure not transferring from the primary to the backup. 34 minutes after the
first warm reboot, and 4 minutes before the EMS crashed again, they discussed
with FE operators the possibility of doing a complete cold boot, because only
then were they informed that the alarm function (still) wasn't running. FE
operators dissuaded the IT staff from doing so, fearing they'd have less data
than they already had (arrogance again; they had already demonstrated their
inability to perform adequately with the "less" data).

Unfortunately, nobody tells us how long it would have actually taken to do a 
cold boot, and FE's IT staff say they didn't find out that was the only way to 
recover the alarm system until after the blackout (meaning the warm boot was a 
useless effort in the first place.)

And during all this time there were those damn trees!!!

MISO failed to adequately warn, and FE failed to adequately control its 
security space (physically and electronically). And it all happened on a hot 
August afternoon.

Cheers,
Russ - NTBugtraq Editor
