10.txt

Stupid Human Programming
Talk on software development.

Radio Home

Programming Weblog

News You'll Never See

My Home Page

C++ Coding Standards

SciFaiku

Weblogs

Friday, March 10, 2006

WTF: The Least Used Resource Error

There's an unexpected and often fatal type of error that happens when you add resources to a horizontally scaled architecture. When the new resource comes online all traffic can be immediately redirected to the new resource, because it has the least load, and it just folds up and dies. You are left wondering WTF (what the f*ck) and it is really hard to track to down and even harder to fix.

The idea behind a scaling out horizontally is that you can add new resources to handle load. This sounds great and it works, but it has some subtle and surprising error conditions that you may want to keep in mind.

Imagine you have a load balancing appliance using the least load metric. Now you add same some new slave MySql servers to handle the load. Your load balancer will redirect traffic to the new slaves, but the slaves are trying to sync, yet they can't sink because they are getting hammered by the new traffic. Deadlock.

Imagine you have storage network that is full to the gills with data. You add a new appliance to give yourself more storage. Now what happens? All the new data goes to the new appliance. That means all users are hitting the same appliance for their data. Your performance slows to a crawl because the appliance can't handle that. You sort of expected parallel IO among all your appliances to handle the load. Now you say WTF?

A related idea is the dark side of partitioning. You partition data to get high performance via parallelization. For example, you hash on the user name to a cluster dedicated to handle those users. Unless your system is very flexible you can't scale anymore by adding resources because you can't repartition the data. All users are handled by their cluster. If you want a different organization you would have to redistribute data across all the clusters. Most systems can't handle that and you end not being able to scale out as easily as you wished.

All these problems have solutions of course, but when you first hit them, you get that deep WTF feeling only a Cosco jar size of antacids can soothe.

comment[]
7:30:54 AM