Website and Res System Down

Hopefully this is nothing more than a power outage and not a cyber attack... there have been scores of cyber attacks on major corporations within the last few months...
 
Hopefully this is nothing more than a power outage and not a cyber attack... there have been scores of cyber attacks on major corporations within the last few months...
Good point freedom...how do we know it wasn't a conspiracy between the Federal Reserve and illegal aliens??
 
As much as I hate to quote anything from anet, there is always room for an exception. From a poster:
There is no valid excuse whatsoever for a single datacenter failure to affect a passenger-carrying operation. The CIO would be handing me his resignation, or at the very least their VP of Datacenter Ops would be. There should be as many 9s as possible, at least in any operational process. They could allow some things like email to fail, sure, but the phones, revenue systems, gates, ramp, all of that stuff has to be as close to absolute as can be achieved.

If I ever had something like this happen, I would fire myself immediately.


A good point indeed!
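For anyone wondering what "as many 9s as possible" actually buys you, the arithmetic is simple. A rough sketch (illustrative only, nothing airline-specific):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Downtime budget per year implied by each availability tier.
	year := 365.25 * 24 * time.Hour
	for _, avail := range []float64{0.99, 0.999, 0.9999, 0.99999} {
		downtime := time.Duration((1 - avail) * float64(year))
		fmt.Printf("%.3f%% uptime allows about %v of downtime per year\n",
			avail*100, downtime.Round(time.Minute))
	}
}
```

Two nines is roughly 88 hours of downtime a year; five nines is about 5 minutes. A res system that disappears for even a couple of hours has already blown through several years' worth of a five-nines budget.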
 
A valid comparison is what kind of experience US' airline peers have had with system reliability over the past several years. I don't know, but I don't think AA, CO, DL, or UA have had system failures of this magnitude... though I'm not sure.

Also, while the failure was widespread, it did not appear to last very long... does anyone know how long everything was really down?

Other carriers have had multi-hour failures of separate/component systems.
 
A valid comparison is what kind of experience US' airline peers have had with system reliability over the past several years. I don't know, but I don't think AA, CO, DL, or UA have had system failures of this magnitude... though I'm not sure.

Also, while the failure was widespread, it did not appear to last very long... does anyone know how long everything was really down?

Other carriers have had multi-hour failures of separate/component systems.

The only one in recent memory was the Comair failure.

In December 2004, a glitch developed in Comair's flight crew scheduling software, known as the SBS Legacy System. It forced the company to shut down all operations during the busy holiday season: 1,100 flights were cancelled and 30,000 passengers were grounded. During the disaster the company maintained that the winter storm that hit the Ohio Valley was the main part of the problem, not their IT system. The storm did cause Comair to cancel or delay more than 90 percent of its flights between December 22nd and 24th, but it was only part of the problem, not the single cause of it. On Christmas Day the SBS legacy system, which was nearly two decades old, crashed. What no one at the company knew was that the system crashed because it had reached its limit. The system had an antiquated counter that logged schedule changes, and by that day it had logged more than its monthly limit of 32,768 changes. The weather caused so many schedule changes that the counter finally hit its ceiling and the system shut down. All the flights for December 25th were wiped out, and most of those for the 26th. They had no backup system, and their software vendor needed a full day to repair it.
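That 32,768 figure is the tell: it is exactly 2^15, the capacity of a 16-bit signed counter. The published accounts only say the counter hit its monthly limit, so treat this as an illustrative sketch (in Go, obviously not the language SBS was written in) of how that kind of counter quietly rolls over and takes the application down with it:

```go
package main

import "fmt"

func main() {
	// Hypothetical stand-in for the SBS schedule-change counter.
	// A 16-bit signed integer tops out at 32,767; the 32,768th
	// increment wraps around to a negative number.
	var changes int16 = 32765
	for i := 0; i < 5; i++ {
		changes++ // Go wraps silently on signed-integer overflow
		fmt.Println(changes)
	}
	// Prints: 32766, 32767, -32768, -32767, -32766
	// Any limit check or index built on that counter now sees a
	// nonsense value, and the application falls over even though
	// the rest of the system is perfectly healthy.
}
```

Once a counter like that goes negative, better weather doesn't bring the schedule back; only a vendor fix or a counter reset does.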

By the time the problem was resolved, the damage had already been done. Delta, which had acquired Comair in 2000, lost almost all of the profit Comair had earned in the previous quarter. The system failure cost them $20 million.

If I recall correctly, one of the senior managers at Comair voluntarily tendered his resignation shortly thereafter. Luckily for Comair, US Airways had its legendary "Christmas Meltdown" at the same time, which caught the media's attention more than the Comair event did.

Consider that some of the systems in place have source code dating back to the early 1960s. In fact, Frank Lorenzo bought Eastern Airlines in order to get what Eastern called "System One," which all of you know as SHARES. There are some US agents floating around the system who were actually trained on System One and can make SHARES purr like a kitten if you, as a customer, get stuck. Two that I met are in BOS.

The system failure is yet another example of the high cost of cheap and the spreadsheet mentality of current management.

If you look at US IT from the Beery era going forward, you'll notice a few things.
SHARES was selected based solely on the estimated cost savings over SABRE. Does SHARES now perform as well as SABRE does? From where I sit, the answer is NO! Has it improved since US's other IT debacle, known as the Res Migration? Yeah, it has, mainly because it had no place to go but up. Is it "World Class" or on par with other Star partners? Again, I'd have to say NO.

When you run a bare-bones IT operation, things like redundancy and disaster recovery take a back seat. US has likely saved many millions more than this will cost. The thinking goes: yeah, it cost a few million, but over the last 3 quarters we saved $X by not doing the upgrades, and even though this cost us $Y we still saved enough to meet our targets, so we all get our bonus.

This is the way the current team operates, and you see it at every turn. NOWHERE in their spreadsheet-driven world is there room for the Customer, or for the fact that a great many people who were affected may never fly US Airways again. They don't care, because they're focused on this quarter and not two years from now when Mr. & Mrs. Volvo and their 2.2 kids go to see Mickey. It doesn't mean they're evil, it's just who they are, and like Oprah says, "When people show you who they are, BELIEVE THEM."
 
To be fair to IT, I would imagine that they, at some point, recommended system redundancy (aka backup systems). However, it was probably seen as an unnecessary cost. In the years I was in IT at Texaco, our backup systems had backups for critical operations.
 
To be fair to IT, I would imagine that they, at some point, recommended system redundancy (aka backup systems). However, it was probably seen as an unnecessary cost. In the years I was in IT at Texaco, our backup systems had backups for critical operations.

Excellent point, Jim. Backup and redundancy are "Costs" right up until the day AFTER a system failure, when the finger pointing starts and the dollars start rolling out the door to address the problem.

Having sold into IT environments, it has always amazed me how wide a range of redundancy, backup, and security companies have (or don't have), and the size of the company didn't seem to matter. Over the years I've been in some places so lax that I thought to myself, "If I pulled that big power cable out of the wall, I could likely shut down the company," while others had double and triple redundancy.

There is a company in the Midwest that does disaster recovery, and their computer facility is underground in an abandoned salt mine. Supposedly it can withstand a nuclear blast directly overhead.

This stuff is tricky: how much protection is enough, and what does it cost?
 
If all of those disparate systems are unavailable, my guess is it's a network problem, which may or may not be within US's control.

That assumption only holds up if US lacks proper redundant systems and they all failed. Could happen, but it's unlikely if they were implemented properly.

This stuff is tricky: how much protection is enough, and what does it cost?

A properly sized UPS, generator set, and transfer switch plus the diesel for today's fun can be had for less than $250k.

IOW, if US is claiming a "brownout" did it, they either did not spend the money for backup power, did not test it adequately, or had a cascading failure across their power infrastructure. Even then, you can avoid the whole thing by having your important applications clustered, load balanced, or replicated to a second data center with a different power feed. This is not rocket science for a Fortune 1000 company.
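To make the second-data-center point concrete, here is a minimal sketch of the health-check-and-failover idea. The endpoints and URLs below are made up for illustration; a real deployment would do this with proper clustering, load balancers, or DNS failover rather than a toy loop like this:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Hypothetical endpoints, one per data center on independent power feeds.
// These URLs are placeholders, not anything US actually runs.
var endpoints = []string{
	"https://res-dc1.example.com/health",
	"https://res-dc2.example.com/health",
}

// pickHealthy returns the first endpoint that answers its health check,
// so traffic can shift to the surviving site when one facility goes dark.
func pickHealthy() (string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	for _, url := range endpoints {
		resp, err := client.Get(url)
		if err != nil {
			continue // site unreachable, try the next data center
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return url, nil
		}
	}
	return "", fmt.Errorf("no healthy data center found")
}

func main() {
	target, err := pickHealthy()
	if err != nil {
		fmt.Println("total outage:", err)
		return
	}
	fmt.Println("routing traffic to", target)
}
```

The point is that the routing decision is made against live health checks, not against the assumption that the primary facility still has power.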
 
