Stupid Human Programming
Talk on software development.










Friday, March 10, 2006
 

WTF: The Least Used Resource Error

There's an unexpected and often fatal type of error that happens when you add resources to a horizontally scaled architecture. When the new resource comes online, all traffic can be immediately redirected to it because it has the least load, and it just folds up and dies. You are left wondering WTF (what the f*ck), and the problem is really hard to track down and even harder to fix.

The idea behind scaling out horizontally is that you can add new resources to handle load. This sounds great and it works, but it has some subtle and surprising error conditions that you may want to keep in mind.

Imagine you have a load balancing appliance using the least load metric. Now you add some new slave MySQL servers to handle the load. Your load balancer will redirect traffic to the new slaves, but the slaves are still trying to sync, and they can't sync because they are getting hammered by the new traffic. Deadlock.
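Here's a toy simulation of that scenario, just to make the failure mode concrete. It's a sketch, not how any particular load balancer works: the server names, the connection counts, and the assumption that no connection finishes during the burst are all made up for illustration.

```python
from collections import Counter


def pick_least_loaded(active):
    """Return the name of the server with the fewest active connections."""
    return min(active, key=active.get)


def route_burst(active, n_requests):
    """Route n_requests using the least-load rule.

    Mutates `active` (connections are assumed to stay open for the whole
    burst) and returns how many requests each server received.
    """
    received = Counter()
    for _ in range(n_requests):
        server = pick_least_loaded(active)
        active[server] += 1
        received[server] += 1
    return received


# Hypothetical pool: three warmed-up slaves plus one that just came online
# and is still trying to sync.
active = {"db1": 100, "db2": 100, "db3": 100, "db-new": 0}
received = route_burst(active, 50)
print(dict(received))  # {'db-new': 50}
```

Every request in the burst lands on the new slave, because it stays the least loaded until it catches up with the others, which is exactly when it can least afford the traffic.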

Imagine you have a storage network that is full to the gills with data. You add a new appliance to give yourself more storage. Now what happens? All the new data goes to the new appliance. That means all users are hitting the same appliance for their data. Your performance slows to a crawl because the appliance can't handle that. You sort of expected parallel IO among all your appliances to handle the load. Now you say WTF?
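The storage case can be sketched the same way. This assumes a placement rule of "write to whichever appliance has the most free space", which is only one plausible policy; the appliance names and sizes are invented.

```python
def place_writes(free_gb, writes_gb):
    """Place each write on the appliance with the most free space.

    Mutates `free_gb` and returns the number of GB placed on each appliance.
    """
    placed = {name: 0 for name in free_gb}
    for size in writes_gb:
        target = max(free_gb, key=free_gb.get)
        free_gb[target] -= size
        placed[target] += size
    return placed


# Hypothetical storage network: three nearly full appliances and one new one.
free_gb = {"nas1": 5, "nas2": 4, "nas3": 6, "nas-new": 1000}
placed = place_writes(free_gb, [1] * 200)  # 200 writes of 1 GB each
print(placed)  # {'nas1': 0, 'nas2': 0, 'nas3': 0, 'nas-new': 200}
```

All the fresh data, which tends to be the data users ask for, ends up behind a single appliance, so the parallel IO you were counting on never happens.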

A related idea is the dark side of partitioning. You partition data to get high performance via parallelization. For example, you hash on the user name to a cluster dedicated to handling those users. Unless your system is very flexible, you can't scale any further by adding resources, because you can't repartition the data. All users are handled by their cluster. If you want a different organization, you would have to redistribute data across all the clusters. Most systems can't handle that, and you end up not being able to scale out as easily as you wished.
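To see why repartitioning hurts, here's a small sketch that assumes the simplest scheme, hash of the user name modulo the number of clusters. The hash function, the user names, and the cluster counts are illustrative assumptions, not anything a real system is required to use.

```python
import hashlib


def cluster_for(user, n_clusters):
    """Map a user name to a cluster index with a stable hash, modulo n_clusters."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_clusters


# Hypothetical user base, partitioned across 4 clusters, then grown to 5.
users = [f"user{i}" for i in range(10000)]
moved = sum(1 for u in users if cluster_for(u, 4) != cluster_for(u, 5))
fraction_moved = moved / len(users)
print(f"{fraction_moved:.0%} of users would change clusters")
```

With plain modulo hashing, going from 4 to 5 clusters reassigns roughly four out of five users, so adding one cluster means moving most of the data across every cluster.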

All these problems have solutions of course, but when you first hit them, you get that deep WTF feeling only a Costco-size jar of antacids can soothe.


7:30:54 AM    



© Copyright 2006 todd hoff.
Last update: 7/13/2006; 9:37:30 PM.