Stupid Human Programming
Talk on software development.

Sunday, December 26, 2004
 

Scale Kills: Comair System Crash

An interesting article on Slashdot (http://it.slashdot.org/article.pl?sid=04/12/26/052212):
> 30,000 people have had their flights cancelled by Comair this
> weekend thanks to a computer system shutdown.

A couple of posters said they didn't think it could be, or shouldn't be, the software's fault. This post was a good example:
> Computers don't freak out or get depressed
> when work piles up. Backlogs mean nothing;
> they just keep processing one piece at a
> time until the pieces run out. I think
> someone was speaking imprecisely.

In my experience, it's just the opposite. Systems usually break seriously only when scale increases. That's why unit testing is never even close to good enough coverage. To find scale problems you need to test at scale, and few people want to pay for that. So all hell breaks loose when scale starts happening.

Increases in backlog may make queue sizes too small, which causes drops, which causes retransmissions, which makes the problem spiral even worse. Maybe an OS network stack queue, one you can't control, fills up and you are in a downward spiral.
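
Here's a minimal sketch of that spiral, with made-up numbers rather than anything from Comair's system: the queue and drain rate were sized for the old load, and every drop comes back as extra offered load on the next tick.

```cpp
// Toy model of a drop/retransmit spiral. All numbers are invented.
#include <algorithm>
#include <cstdio>

int main() {
    const int capacity = 100;       // queue sized for the old workload
    const int drain_per_tick = 80;  // what the consumer can actually handle
    int queued = 0, retransmit = 0;

    for (int tick = 0; tick < 10; ++tick) {
        int offered  = 100 + retransmit;              // new load plus retries
        int accepted = std::min(offered, capacity - queued);
        int dropped  = offered - accepted;
        queued       = std::max(0, queued + accepted - drain_per_tick);
        retransmit   = dropped;                       // every drop is retried next tick
        std::printf("tick %d: offered=%d dropped=%d queued=%d\n",
                    tick, offered, dropped, queued);
    }
}
```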

Or the queues may not be flow protected and your memory use skyrockets, which causes a cascade of failures, including out-of-memory conditions that may reassert themselves even after a reboot, which causes continuous failure.
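
For contrast, here's a sketch of what flow protection can look like. The class name and capacity are mine, purely illustrative: a bounded queue that blocks the producer when it's full, so memory stays capped instead of growing until the box falls over.

```cpp
// A bounded, flow-protected queue: producers block instead of growing memory.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

    void push(T item) {                 // blocks when full: backpressure on the producer
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(std::move(item));
        not_empty_.notify_one();
    }

    T pop() {                           // blocks until an item is available
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return item;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<T> q_;
    std::size_t cap_;
};

int main() {
    BoundedQueue<int> q(4);             // deliberately small cap
    q.push(1); q.push(2);
    return (q.pop() + q.pop() == 3) ? 0 : 1;
}
```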

Any algorithms sized for X are now way too slow at 10X, which causes scaling problems everywhere else, or pathologically slow times for certain operations.
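
A back-of-the-envelope sketch, with made-up record counts: an algorithm that does a linear scan per request is fine at size X, but total work grows with the square of the data size, so 10X data means roughly 100X work, while an indexed lookup barely notices.

```cpp
// Rough work estimates for linear-scan vs. indexed lookup. Numbers are invented.
#include <cmath>
#include <cstdio>

int main() {
    const long sizes[] = {10000, 100000};             // old size X and 10X
    for (long n : sizes) {
        double linear_total  = static_cast<double>(n) * n;            // n requests * O(n) scan
        double indexed_total = n * std::log2(static_cast<double>(n)); // n requests * O(log n)
        std::printf("n=%ld: linear-scan work %.1e, indexed work %.1e\n",
                    n, linear_total, indexed_total);
    }
}
```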

CPU time is sucked up, which again causes pushback and scaling problems everywhere else. Priorities that worked with a certain workload may now cause too much work to be done, which kills responsiveness and starves other parts of the system, which spirals into more problems.
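
A toy dispatcher, entirely hypothetical, makes the starvation concrete: strict priority scheduling that was harmless at the old arrival rate never gets around to the low-priority work once high-priority work arrives faster than it can be drained.

```cpp
// Strict-priority dispatcher under overload: low-priority work starves completely.
#include <cstdio>
#include <deque>

int main() {
    std::deque<int> high, low;
    const int high_arrivals_per_tick = 12;   // 10X the old rate
    const int budget_per_tick = 10;          // what the CPU can actually do
    int served_low = 0;

    for (int tick = 0; tick < 1000; ++tick) {
        for (int i = 0; i < high_arrivals_per_tick; ++i) high.push_back(tick);
        low.push_back(tick);                 // a steady trickle of housekeeping work
        for (int budget = budget_per_tick; budget > 0; --budget) {
            if (!high.empty())       high.pop_front();             // always served first
            else if (!low.empty()) { low.pop_front(); ++served_low; }
        }
    }
    std::printf("low-priority items served: %d, still queued: %zu\n",
                served_low, low.size());
}
```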

Improperly used mutexes may only show up under scale because the triggering conditions were never created before. It only takes a mutex being off by one instruction to cause a problem. Maybe the OS or application holds a common mutex much longer than before, which creates new control flows that can cause deadlock or data structure corruption.
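
Here's a sketch of the off-by-one-instruction case, using an illustrative work queue of my own: the broken version does its emptiness check just outside the lock, a window that essentially never fires at low load and corrupts data once enough threads are hammering the queue.

```cpp
// Check-then-act: the check must live inside the same critical section as the act.
#include <mutex>
#include <queue>

std::mutex m;            // illustrative shared state, not from the post
std::queue<int> q;

// Broken: the emptiness check is one instruction outside the lock. At 10X load
// another thread can drain the queue between the check and the pop.
int pop_racy() {
    if (q.empty()) return -1;               // checked without holding m
    std::lock_guard<std::mutex> g(m);
    int v = q.front();                      // may run against an empty queue
    q.pop();
    return v;
}

// Correct: check and act under the same lock.
int pop_safe() {
    std::lock_guard<std::mutex> g(m);
    if (q.empty()) return -1;
    int v = q.front();
    q.pop();
    return v;
}

int main() {
    q.push(42);
    return (pop_safe() == 42) ? 0 : 1;
}
```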

Message packets that assumed a certain size or a certain number of items may start failing because those limits are exceeded.
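
A sketch with a hypothetical wire format, not anything Comair actually used: a count field sized on the assumption of "a few dozen items at most" silently wraps once the real count passes 255.

```cpp
// A fixed-width count field silently wrapping at scale. The format is invented.
#include <cstdint>
#include <cstdio>

struct MsgHeader {
    std::uint8_t  version;
    std::uint8_t  item_count;    // wraps modulo 256
    std::uint16_t payload_len;   // wraps modulo 65536
};

int main() {
    const int items = 300;       // 10X the old typical count
    MsgHeader h{1, static_cast<std::uint8_t>(items), 0};
    std::printf("sender packed %d items, receiver sees %u\n",
                items, static_cast<unsigned>(h.item_count));
    return 0;
}
```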

Protocols that have never been tested under the new timing, error, and resource conditions may start failing or deadlocking.
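
One common version of this, sketched with invented numbers: a client timeout tuned for the old response time turns every merely-slow reply into a failure and a retry, so a protocol that never failed in testing starts failing exactly when the server is busiest.

```cpp
// A fixed timeout tuned for light load misbehaving under heavy load. Numbers invented.
#include <cstdio>

int main() {
    const double timeout_ms = 500;                       // chosen when replies took ~100 ms
    const double response_ms[] = {100.0, 400.0, 1200.0}; // light, heavy, 10X load
    for (double r : response_ms) {
        bool timed_out = r > timeout_ms;
        std::printf("response %.0f ms -> %s\n", r,
                    timed_out ? "timeout, retry (adds more load)" : "ok");
    }
}
```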

Counters may start overflowing, and critical accounting and alarm data may be lost.
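
A sketch with made-up rates: a 16-bit counter that never came close to wrapping at the old event rate wraps within a minute at 10X, and anything keyed off it silently loses data.

```cpp
// A 16-bit event counter wrapping silently at the new event rate. Rates are invented.
#include <cstdint>
#include <cstdio>

int main() {
    std::uint16_t events_seen = 0;      // tops out at 65,535
    const long rate_per_sec = 2000;     // 10X the old event rate
    const long generated = rate_per_sec * 60;   // one minute of traffic
    for (long i = 0; i < generated; ++i)
        ++events_seen;                  // wraps around with no error
    std::printf("events generated: %ld, counter reads: %u\n",
                generated, static_cast<unsigned>(events_seen));
}
```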

Timer services that worked fine with X outstanding timers may become very inaccurate at 10X.
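
A back-of-the-envelope sketch, where the per-timer cost is an assumption rather than a measurement: a timer service that scans its whole timer list every tick stops fitting inside the tick interval once there are enough timers, and every expiration starts firing late.

```cpp
// When the per-tick scan outgrows the tick interval, every timer drifts. Costs invented.
#include <cstdio>

int main() {
    const double tick_ms = 10.0;           // the service's tick period
    const double cost_per_timer_us = 2.0;  // assumed cost to examine one timer
    const int counts[] = {1000, 10000, 100000};
    for (int timers : counts) {
        double scan_ms = timers * cost_per_timer_us / 1000.0;
        std::printf("%6d timers: scan takes %.1f ms of a %.0f ms tick%s\n",
                    timers, scan_ms, tick_ms,
                    scan_ms > tick_ms ? " -> ticks slip, timers drift" : "");
    }
}
```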

Parts like network adapters that you were told have certain bandwidth, error, latency, and priority characteristics may stop living up to their contracts.

The rates for alarms, logging, transactions, notifications, etc. can get so large that there simply aren't enough available resources (memory, disk, database, CPU, network) to handle the increased load.
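
A quick back-of-the-envelope, again with made-up rates: the same per-event logging that gave you a week and a half of disk headroom fills the same partition in about a day at 10X the event rate.

```cpp
// Disk headroom under 10X the logging rate. All numbers are invented.
#include <cstdio>

int main() {
    const double bytes_per_event = 512;
    const double disk_bytes = 100e9;                 // a 100 GB log partition
    const double rates[] = {200.0, 2000.0};          // old event rate and 10X
    for (double events_per_sec : rates) {
        double hours_to_full =
            disk_bytes / (events_per_sec * bytes_per_event) / 3600.0;
        std::printf("%.0f events/s -> log partition full in %.0f hours\n",
                    events_per_sec, hours_to_full);
    }
}
```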

Scale kills.

I've talked about these problems and possible solutions at http://www.possibility.com/epowiki/Wiki.jsp?page=Scale



7:58:13 AM    



© Copyright 2006 todd hoff.