Begin forwarded message:

From: "DL Neil, Newsletter account" <IP@xxxxxxxxxxxxxxxxxxxxxx>
Date: September 28, 2004 10:01:47 AM EDT
To: dave@xxxxxxxxxx
Subject: Re: [IP] Microsoft server crash nearly causes 800-plane pile-up

Dave,

This is reprehensible, not least because the controllers obviously lost their ability to communicate with planes despite radio and radar still working, so one questions the fault-tolerance of their overall system design, backups, practicing, etc.
Next up is the idea that someone regards the reboots as 'critical' and then makes it (a) a manual process, and (b) an unsupervised process at that.
Next come questions about the wisdom of running a 24x7 non-stop system on servers which quite plainly cannot run non-stop. Who signed off and paid that invoice?
Finally, are they sure that there really is such a problem necessitating these timed shutdowns? I recall ages ago that there was a fault with (client) PCs that would 'die' if they ran continuously for too long. The joke at the time was that MS wouldn't rush to fix the problem on the grounds that very few machines stayed up for more than 49.7 MINUTES without falling over for some other (spurious) reason, so the problem wasn't exactly 'high profile' or urgently in need of a fix! Further, when I did Y2K verification work I recall this question rearing its ugly head, and that it was necessary to install a patch/service pack download including the fix to this problem in order to make Win9n machines Y2K-compliant.
So, sarcasm aside, I went a-looking and sure enough the Win9n "Computer Hangs after 49.7 days" problem appears in the MS Knowledgebase as http://support.microsoft.com/default.aspx?scid=kb;en-us;216641 and there's even a single-purpose downloadable fix available from http://www.microsoft.com/downloads/details.aspx?FamilyID=5533e105-bb49-41dd-8dfb-fdd9df4db39a&displaylang=en. The best I could find for WinNT is the problem of losing perf counters, at http://support.microsoft.com/default.aspx?scid=kb;EN-US;169847 which is fixed within Service Packs that should be ancient history (even for those still using NT) - but I can't believe that Civil Aviation would be so bothered about uptime counters that they would reboot whole servers - so perhaps there was once more to the issue/in the MS-KB. The Knowledgebase does not appear to have anything to say about this fault appearing under Win2000 (the system named in the article) - even if that, I have to agree from experience, is hardly conclusive evidence.
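Neither the article nor the Knowledgebase titles quoted here say where the figure of 49.7 days comes from. One common explanation, offered as an assumption and not as a statement about the FAA system, is that a 32-bit unsigned counter of milliseconds wraps around to zero after 2^32 ticks. The Python sketch below only does that arithmetic; the names wrap_days and elapsed_ticks are made up for illustration.

```python
def wrap_days(bits: int = 32, ticks_per_second: int = 1000) -> float:
    """Days until an unsigned counter of the given width wraps to zero."""
    seconds_per_day = 24 * 60 * 60
    return (2 ** bits) / (ticks_per_second * seconds_per_day)


def elapsed_ticks(start: int, now: int, bits: int = 32) -> int:
    """Elapsed ticks between two counter readings, tolerating one wrap."""
    return (now - start) % (2 ** bits)


# Assumed case: a 32-bit counter of milliseconds.
print(round(wrap_days(), 1))  # 49.7

# A reading taken 100 ticks before the wrap, then one 50 ticks after it.
print(elapsed_ticks(2 ** 32 - 100, 50))  # 150
```

Taking the difference modulo 2^32 is the usual way to keep an elapsed-time calculation correct across a single wraparound, which is one reason a counter wrapping need not in itself force a reboot.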
I don't often find myself in the role of apologist for Microsoft (fortunately) and an anti-softie will quickly point out that the *NIX system replaced by the Dell boxes wouldn't have had this problem (take note!) but it really sounds like the question to be asked is not "why did THE procedure not work?" but "why is this even (still) in the procedure?". In other words it is not a technical person whose head should be on the block, but 'management'! However the root of all evil within Civil Aviation systems rarely seems to be technical design, but more often a result of weak management decisions and faceless compromises. Perhaps a 'management' or 'mission' fix here is more urgent, and will yield more benefit (in the words of the President, "keep our people safe"), than all this chasing after the chimera of pre-identifying terrorists?
Regards,
=dn

David Farber wrote:
Begin forwarded message:

From: Bruce R Koball <bkoball@xxxxxxxx>
Date: September 27, 2004 7:33:48 PM EDT
To: David Farber <dave@xxxxxxxxxx>
Cc: ip@xxxxxxxxxxxxxx
Subject: Microsoft server crash nearly causes 800-plane pile-up

Dave,

This is a week old, but I don't recall seeing it on IP.

-brk-

http://www.techworld.com/opsys/news/index.cfm?NewsID=2275

Microsoft server crash nearly causes 800-plane pile-up

Failure to restart system caused data overload.

By Matthew Broersma, Techworld

A major breakdown in Southern California's air traffic control system last week was partly due to a "design anomaly" in the way Microsoft Windows servers were integrated into the system, according to a report in the Los Angeles Times.

The radio system shutdown, which lasted more than three hours, left 800 planes in the air without contact to air traffic control, and led to at least five cases where planes came too close to one another, according to comments by the Federal Aviation Administration reported in the LA Times and The New York Times. Air traffic controllers were reduced to using personal mobile phones to pass on warnings to controllers at other facilities, and watched close calls without being able to alert pilots, according to the LA Times report.

The failure was ultimately down to a combination of human error and a design glitch in the Windows servers brought in over the past three years to replace the radio system's original Unix servers, according to the FAA.

The servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days. An improperly trained employee failed to reset the system, leading it to shut down without warning, the official said. Backup systems failed because of a software failure, according to a report in The New York Times.
The contract for designing the system, called Voice Switching and Control System (VSCS), was awarded to Harris Corporation in 1992 and the system was installed in the late 1990s, initially using Unix servers, according to Harris. In 2001, the company completed testing of the VSCS Control Subsystem Upgrade (VCSU), which replaced the original servers with off-the-shelf Dell hardware running Microsoft Windows 2000 Advanced Server. The upgrade was installed in California last year, according to the FAA.

Soon after installation, however, the FAA discovered that the system design could lead to a radio system shutdown, and put the maintenance procedure into place as a workaround, the LA Times said. The FAA reportedly said it has been working on a permanent fix but has only eliminated the problem in Seattle. The FAA is now planning to institute a second workaround - an alert that will warn controllers well before the software shuts down.

The shutdown is intended to keep the system from becoming overloaded with data and potentially giving controllers wrong information about flights, according to a software analyst cited by the LA Times.

Microsoft told Techworld it was aware of the reports but was not immediately able to comment.

-------------------------------------
You are subscribed as IP@xxxxxxxxxxxxxxxxxxxxxx
To manage your subscription, go to http://v2.listbox.com/member/?listname=ip
Archives at: http://www.interesting-people.org/archives/interesting-people/