Sunday, December 26, 2004
Scale Kills: Comair System Crash
An interesting article on Slashdot (http://it.slashdot.org/article.pl?sid=04/12/26/052212):
30,000 people have had their flights cancelled by Comair this weekend thanks to
a computer system shutdown
A couple of posters said they didn't think the software could be, or should be,
the cause. This post was a good example:
> Computers don't freak out or get depressed
> when work piles up. Backlogs mean nothing;
> they just keep processing one piece at a
> time until the pieces run out. I think
> someone was speaking imprecisely.
In my experience, it's just the opposite. Systems usually only break seriously
when scale increases. That's why unit testing never comes close to giving good
enough coverage: to find scale problems you have to test at scale, and few
people want to pay for that. So all hell breaks loose when scale starts
happening.
Increases in backlog may make queue sizes too small, which causes drops, which
causes retransmissions, which makes the spiral worse. Maybe an OS network stack
queue, a queue you can't control, gets full and you are in a downward spiral.
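Here's a minimal sketch of that spiral (all the numbers are made up): a queue
capped at 100 entries, a server that drains 50 per tick, and a base load of
1000 per tick where every drop comes back as a retry. The drop count grows
every tick because the retries stack on top of the base load.

#include <cstdio>
#include <queue>

int main() {
    const size_t kCapacity = 100;          // assumed queue limit
    std::queue<int> q;
    long offered = 1000;                   // base load per tick

    for (int tick = 0; tick < 5; ++tick) {
        long arrivals = offered, drops = 0;
        offered = 1000;
        for (long i = 0; i < arrivals; ++i) {
            if (q.size() >= kCapacity) {   // queue full: drop it
                ++drops;
                ++offered;                 // ...and it returns as a retransmission
            } else {
                q.push(0);
            }
        }
        for (int s = 0; s < 50 && !q.empty(); ++s) q.pop();  // slow server
        printf("tick %d: dropped %ld, next tick's load %ld\n",
               tick, drops, offered);
    }
}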
Or the queues may not be flow protected, and your memory use skyrockets, which
causes a cascade of failures, including out-of-memory conditions that may
reassert themselves even after a reboot, causing continuous failure.
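The usual defense is a flow-protected (bounded, blocking) queue: when it fills,
the producer waits instead of memory growing without bound. A minimal sketch,
with names of my own invention:

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>

template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(size_t limit) : limit_(limit) {}

    void push(T item) {
        std::unique_lock<std::mutex> lock(m_);
        // Backpressure: block the producer rather than grow without bound.
        notFull_.wait(lock, [&] { return q_.size() < limit_; });
        q_.push(std::move(item));
        notEmpty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [&] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        notFull_.notify_one();
        return item;
    }

private:
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    std::queue<T> q_;
    size_t limit_;
};

int main() {
    BoundedQueue<int> q(1000);   // memory use is capped at construction time
    q.push(42);
    printf("%d\n", q.pop());
}

The point is that the memory bound is a property of the queue itself, not of
the workload.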
Any algorithm tuned for size X may now be way too slow at 10X, which can cause
scaling problems everywhere else, or pathologically slow times for certain
operations.
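The classic version is a hidden O(n^2): a scan inside a loop that was invisible
at size X. A toy illustration (the sizes are arbitrary):

#include <cstdio>
#include <vector>

// Fine at n = 1,000; at n = 10,000 it does ~100x the work, not 10x.
static long countDuplicates(const std::vector<int>& v) {
    long dups = 0;
    for (size_t i = 0; i < v.size(); ++i)
        for (size_t j = i + 1; j < v.size(); ++j)  // rescans the whole tail
            if (v[i] == v[j]) ++dups;
    return dups;
}

int main() {
    // ~500 thousand comparisons at 1,000 items; ~50 million at 10,000.
    printf("%ld\n", countDuplicates(std::vector<int>(10000, 7)));
}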
CPU time is sucked up, which again causes push back and scaling problems
everywhere else. Priorities that worked for a certain workload may now let too
much work be done, which kills responsiveness and starves other parts of the
system, which spirals into more problems.
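A sketch of how a strict-priority scheme goes wrong under load (the queues and
rates are hypothetical): once high-priority work arrives faster than it can be
served, low-priority work never runs at all.

#include <cstdio>
#include <queue>

int main() {
    std::queue<int> highQ, lowQ;
    for (int i = 0; i < 10; ++i) lowQ.push(i);   // background housekeeping

    long lowServed = 0;
    for (long tick = 0; tick < 1000; ++tick) {
        highQ.push(0);                  // at 10X load, high-priority work
        highQ.push(0);                  // arrives twice per tick...
        if (!highQ.empty()) {
            highQ.pop();                // ...but is served once per tick
        } else if (!lowQ.empty()) {     // strict priority: never reached
            lowQ.pop();
            ++lowServed;
        }
    }
    printf("low-priority items served: %ld of 10\n", lowServed);  // prints 0
}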
Improperly used mutexes may only bite under scale because the trigger
conditions were never created before. It only takes a mutex being off by one
instruction to cause a problem. Maybe the OS/application holds a common mutex
for much longer than before, which creates new control flows that can cause
deadlock or data structure corruption.
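The textbook case is two locks taken in opposite orders. The window is only a
few instructions wide, so under light load it essentially never triggers; under
heavy contention it eventually does. A deliberately buggy sketch (it may hang
when run, and that's the point):

#include <mutex>
#include <thread>

std::mutex a, b;

void worker1() {
    std::lock_guard<std::mutex> first(a);
    std::lock_guard<std::mutex> second(b);   // waits for worker2's lock
}

void worker2() {
    std::lock_guard<std::mutex> first(b);    // reversed order: the bug
    std::lock_guard<std::mutex> second(a);   // waits for worker1's lock
}

int main() {
    std::thread t1(worker1), t2(worker2);
    t1.join();
    t2.join();
}

Enforcing one global lock order (or acquiring both locks in a single atomic
step) removes the window entirely.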
Message packets that assumed a certain size or a certain number of items may start failing because those limits are exceeded.
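For example (the field widths here are invented, not from any real protocol), a
wire format with an 8-bit item count silently wraps once the real count passes
255:

#include <cstdint>
#include <cstdio>

struct Packet {
    uint8_t  itemCount;    // can never represent more than 255
    uint16_t items[16];    // hard cap baked into the format
};

int main() {
    Packet p = {};
    int actual = 300;                            // load grew past the design
    p.itemCount = static_cast<uint8_t>(actual);  // 300 wraps to 44
    printf("actual items: %d, packet claims: %u\n",
           actual, static_cast<unsigned>(p.itemCount));
}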
Protocols that have never been tested under different timing, error, and resource conditions may start failing or deadlocking.
Counters may start overflowing, and critical accounting and alarm data may be lost.
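A sketch with an invented 16-bit transaction counter: it works for years at
normal volume, then one busy period pushes it past 65,535, it silently wraps,
and everything keyed on it is wrong.

#include <cstdint>
#include <cstdio>

int main() {
    uint16_t txnCount = 65000;      // a normal month's accumulated volume
    for (int i = 0; i < 1000; ++i) {
        ++txnCount;                 // the unusually busy weekend
    }
    // 66,000 transactions happened, but the counter wrapped past 65,535.
    printf("recorded transactions: %u\n",
           static_cast<unsigned>(txnCount));    // prints 464
}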
Timer services that worked with X timers may become very inaccurate at 10X.
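One way this happens (a sketch, not any particular timer implementation): a
timer service that scans its whole timer list on every tick. The scan cost
grows linearly with the timer count, so past some count the scan takes longer
than the tick interval and every timer fires late.

#include <chrono>
#include <cstdio>
#include <vector>

// Scan every timer once per tick, the way a naive timer list does.
static long scanTicks(const std::vector<long>& deadlines, int ticks) {
    long expired = 0;
    for (int now = 0; now < ticks; ++now)
        for (long d : deadlines)            // whole-list scan, every tick
            if (d <= now) ++expired;
    return expired;
}

int main() {
    using Clock = std::chrono::steady_clock;
    const size_t sizes[] = {10000, 100000};       // X timers, then 10X
    for (size_t n : sizes) {
        std::vector<long> deadlines(n, 1000000L); // none expire in this run
        auto start = Clock::now();
        long expired = scanTicks(deadlines, 100);
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      Clock::now() - start).count();
        printf("%zu timers: %lld us per 100 ticks (expired=%ld)\n",
               n, (long long)us, expired);
    }
}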
Parts like network adapters that you were told have certain bandwidth, error, latency,
and priority characteristics may stop living up to their contracts.
The rates for alarms, logging, transactions, notifications, etc. can get
so large that there simply aren't enough available resources (memory,
disk, database, CPU, network) to handle the increased load.
Scale kills.
I've talked about these problems and possible solutions at http://www.possibility.com/epowiki/Wiki.jsp?page=Scale
7:58:13 AM