At the March 2002 Web Services DevCon, Steve Loughran talked about scheduling loads. His philosophy is this: it takes 10 minutes to load a server, but only 1 minute to find a bug. Therefore, you should load at most once a day, to avoid running in circles.
How true this is. I've had a lesson in load discipline over the past few weeks. In the system I'm working on, it takes about an hour from the time a build is kicked off until the code is delivered from the staging area. Loading our "copy" system, which customers develop code against, takes about 5 minutes each for the 2 app servers and another 15 minutes each for the 2 web servers. And that's the problem: it takes about 90 minutes and at least 3 people (me, the sysadmin who actually executes the install, and the change management coordinator) to build and install the system, and a minute to find the bug. Case in point: this morning, we loaded the system with the output of Friday's daily build. But somebody had checked in a stylesheet, apparently without much testing, and it completely broke one of the web services, which we found out not 5 minutes after the load was complete. If we'd let the build settle on a testing system, we'd have shaken that problem out there rather than on a (semi-)production system. So the first point is that you have to get a day of testing on some server that doesn't fall under change control.
The second problem is that when you do find a bug, the worst thing to do is to patch the bug and slam an update onto the server you've just loaded. This just puts you further behind the 8-ball; now everybody's life is on hold until the server's up. The better solution is to do some triage: can the server live until tomorrow in its current state? If so, then work towards shaking out the problem on development servers; if not, then roll back the install and fix the problem on development servers. Either way, if a load fails, it's not getting fixed until tomorrow. I don't care who's screaming for a fix; unless somebody's life is at risk, it's not happening until tomorrow.
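For what it's worth, the triage rule above is small enough to write down as a decision function. This is only a sketch of the rule as I've described it; the names `Action` and `triage_failed_load` are made up for illustration and don't come from any real tool.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the possible outcomes of triage.
    LEAVE_AND_FIX_ON_DEV = "leave the load in place; shake out the bug on development servers"
    ROLL_BACK_AND_FIX_ON_DEV = "roll back the install; fix the bug on development servers"
    FIX_NOW = "somebody's life is at risk; fix it today"


def triage_failed_load(can_live_until_tomorrow: bool, life_at_risk: bool = False) -> Action:
    """Pick the next step after a freshly loaded server turns up a bug."""
    if life_at_risk:
        # The only exception to "it's not happening until tomorrow".
        return Action.FIX_NOW
    if can_live_until_tomorrow:
        # The server limps along; the fix goes through development servers.
        return Action.LEAVE_AND_FIX_ON_DEV
    # The server can't stay as it is: roll back first, then fix on development servers.
    return Action.ROLL_BACK_AND_FIX_ON_DEV


if __name__ == "__main__":
    print(triage_failed_load(can_live_until_tomorrow=True).value)
    print(triage_failed_load(can_live_until_tomorrow=False).value)
```

Note that no branch says "patch the server you just loaded." Outside the life-at-risk case, every path ends on development servers, and the next load waits for tomorrow.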
What's so wrong with doing more than 1 load a day? Number one, it wastes time. Even in a heavily automated environment, a load takes a lot of time, much longer than finding bugs. Number two, with all the wasted time, the amount and quality of testing goes down, and you end up testing on production servers, where bugs are most expensive to fix. And finally, it puts too much pressure on everyone involved. Developers start to try to squeeze changes in before the build without adequate testing. The build monkey screws up an install script. Sysadmins start making typos. People under the gun make mistakes. This isn't trench warfare, it's software development, so don't go acting like you'll get a medal for heroics. It's far better to get into a rhythm and let the process work itself out.
[Gordon Weakliem's Radio Weblog]