The Problem of Scale

Jeremy Allaire's Radio
An exploration of media, communications and applications over the Internet.

Problems in scaling and managing large-scale computing systems and applications are not new to the IT industry. Over several decades, enormous investments have been made in software, hardware and network systems that improve the scalability and manageability of computing applications. Through the client/server era, and into the web applications era, major systems vendors have advanced the state of the art in scaling software and hardware.

In particular, the advent of the Internet and a return to host-based applications (i.e. server-based application execution and delivery) forced the hand of the industry and created new companies and ideas centered around clustering and traffic management. Thousands of startups and enterprises alike invested in new hardware and software that would ensure that no visitor would be turned away. The famous "holiday e-commerce blitzes" of 1999 and 2000 underscored the importance of preparing for large-scale application deployments. For the first time in the history of computing, software applications would need to scale to support upwards of hundreds of millions of users throughout a day. This was not the domain of the LAN system administrators who were scattered throughout corporate IT departments.

I'm personally familiar with some of the incredible environments that have been stitched together to support these deployments: hundreds of servers running on clusters of clusters; hacks used for data redundancy and replication; and home-grown "systems management" applications. We've all heard and read about how the largest-scale "web services" on the planet (Yahoo and Google) deploy their systems on mostly home-grown platforms built out of Linux, FreeBSD and other commodity platforms.

While mainstream attention on these issues has faded, they continue to be a major focus for the corporate IT world, media companies, and those few surviving Internet pure-plays who continue to grow and draw business to their web sites. Utilization of Internet computing continues to grow both inside of corporations and on the consumer Internet, and we're seeing the advent of new dynamics in software applications that will force the issue even further: distributed applications built around web services standards, and real-time rich media communications applications. These create new types of topologies for application distribution and use, and they drive up bandwidth and network requirements as audio and video gain more use.

Diseconomies of Scale

It's useful to step back from the immediate issue of how one scales an Internet application, and look at broader trends in the use of computing resources in general. At one level, the current models for application delivery and IT infrastructure provisioning strike me as incredibly irrational. We're caught in a cycle of infrastructure investment where corporations have been taught (by experience and by industry dynamics) that they must continue to invest in and operate the hardware, software and network infrastructure needed to run their business.

Hundreds of thousands of organizations around the world have been building and operating incredibly complex computing infrastructures, each company replicating the same infrastructure, and each company (often) creating a surplus of computing resources (bandwidth, CPU, storage, etc.) that is never used. In absolute terms, the worldwide available computing resources (it would be great to have a metric here that was a worldwide composite index of network/CPU/storage) must far, far outstrip the actual utilization of those resources.

Is it possible that in fifty years, when we look back, we'll see this initial era of infrastructure building and provisioning as wildly inefficient and irrational? It's useful to consider historical parallels, the most common being the initial deployment and use of telephone networks. The story is well known: in the early years of telephones and phone switches, most companies operated their own phone networks, eventually abandoning those in favor of national and regional networks such as AT&T. Similarly, during the industrial revolution, factories and mills developed large-scale and very sophisticated energy plants before public networks such as the power grid and gas lines were available. It seems that there is an inevitable shift to public systems that provide a higher degree of scale than vertical, isolated resources.

Indeed, it is ultimately the economic forces that seem to be driving corporations (and the IT industry that serves them) to focus more attention on how to deliver systems that operate in a more efficient manner and deliver true economies of scale.

Software as Service and Scale

The vision of software as service also plays an important role in defining this landscape. Here, I'm referring to two things, forming a holistic view of web services:

The use of XML-based standards to distribute application logic and data across networks (SOAP, WSDL, UDDI, etc.)
The delivery of rich, compelling software applications over the Internet to end-users.

These are crucial in a couple of ways. At one level, the advent of web services standards provides the potential basis for a more rational, efficient model of computing resource utilization. The theory is that, over time, corporations will be able to leverage application functionality (APIs and data) provided by others as a service, offloading computing resources to the service provider and simplifying their own native infrastructure. This appears at face value to be roughly true, though the management and security considerations of these distributed applications have yet to be fully worked out in a standards-based fashion.

The second item, the delivery of rich applications and experiences, forms as important a backdrop. This is really the fulfillment of the original vision of hosted applications or ASPs, where both consumer and corporate applications are used as services from remote computers. This could be everything from a CRM application to a niche, regional application for managing a dentist office's appointments and accounting. It does seem clear that rich applications delivered over the network will over time replace desktop-based applications, forcing the same set of scaling and delivery challenges onto ISVs (service providers) small and large.

Both of these aspects of software as service require a network-based computing environment that can dynamically adapt to resource utilization. Will we see the same level of over-investment and provisioning, the same incredible, irrational redundancy in computing resources that we've seen in internal IT infrastructure? And as the worlds of "internal" infrastructure and network-based services collide, what type of scaling platform can be used?

Computing as a Utility

There are a variety of visions (and efforts) emerging to help address this situation, and all of them share a common theme: that computing resources (networks/CPU/storage) should be increasingly virtualized to both the application and the operator of those systems.

At one level, basic commodity clustering, where applications are copied onto multiple machines and a hybrid of software and hardware clustering allocates connections to those machines, is a crude form of computing as a utility. Machines get used as they are needed, and the resource load is balanced amongst those machines. That's the primary best practice in Internet applications today.
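As a rough illustration of the connection allocation described above, here is a minimal sketch of a "least connections" balancer in Python. The node names and the choice of least-connections as the policy are assumptions made for the example; real deployments typically do this in dedicated hardware or proxy software rather than application code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One machine in the cluster holding a copy of the application."""
    name: str
    active: int = 0  # connections currently being served


@dataclass
class LeastConnectionsBalancer:
    """Sends each new connection to the node with the fewest active ones."""
    nodes: list = field(default_factory=list)

    def acquire(self) -> Node:
        # min() returns the first node among ties, so idle nodes fill in order.
        node = min(self.nodes, key=lambda n: n.active)
        node.active += 1
        return node

    def release(self, node: Node) -> None:
        # Called when a connection finishes, freeing capacity on that node.
        node.active -= 1


# Hypothetical three-machine cluster receiving six connections.
balancer = LeastConnectionsBalancer([Node("web1"), Node("web2"), Node("web3")])
assigned = [balancer.acquire().name for _ in range(6)]
print(assigned)  # ['web1', 'web2', 'web3', 'web1', 'web2', 'web3']
```

With equal load the policy degenerates into round-robin; it only starts to differ once some connections last longer than others and are released at different times.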

We've also seen this evolve into more sophisticated systems that combine peer-to-peer computing, load balancing and dynamic application provisioning to help with the challenge. Macromedia has been working on some of the most advanced approaches to application server clustering along these lines. Macromedia JRun4 uses a Jini-based clustering architecture for load balancing and application distribution. Clusters are defined as collections of physical computers. A Cluster Manager operates on each node and is aware of all machines and their state, eliminating any single point of failure. "Services" can be registered as clusterable, and when changes to the "services" (e.g. a Java application package such as a WAR or EAR file) occur, they are automatically replicated and hot deployed into all of the nodes in the cluster. The nodes all act as peers and communicate their state under certain conditions (e.g. a load balancing algorithm). While JRun4 distributed clustering uses JMX services as the basis for clusterable entities, this approach can easily be extended to properly defined WSDL-based services.
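To make the peer model concrete, here is a toy sketch in Python of nodes that each carry their own manager, know about every other node, and replicate a deployed service to all peers. This is not JRun4's API or its Jini/JMX machinery; the class, method and package names are hypothetical and everything runs in one process.

```python
class ClusterManager:
    """Runs on one node; knows every peer, so no central coordinator exists."""

    def __init__(self, name):
        self.name = name
        self.peers = {}      # peer name -> ClusterManager
        self.services = {}   # service name -> deployed version

    def join(self, other):
        """Introduce two nodes and share full membership with everyone known."""
        members = {self.name: self, other.name: other,
                   **self.peers, **other.peers}
        for member in members.values():
            member.peers = {n: p for n, p in members.items()
                            if n != member.name}

    def deploy(self, service, version):
        """Deploy locally, then replicate to every peer (the 'hot deploy')."""
        self.services[service] = version
        for peer in self.peers.values():
            peer.services[service] = version


# Hypothetical three-node cluster; any node can accept a deployment.
a, b, c = ClusterManager("node-a"), ClusterManager("node-b"), ClusterManager("node-c")
a.join(b)
b.join(c)
c.deploy("store.war", 2)
print(a.services, sorted(a.peers))
```

Because membership is held by every node rather than a master, a deployment can start anywhere and still reach the whole cluster, which is the property the paragraph above attributes to the peer design.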

While commodity clustering helps with the immediate need of "private scaling", it does not address the broader long-term issue of "public computing" and actually operating computing resources as a utility that multiple applications and companies can execute within.

There's an enormous amount of research going into "grid computing", largely efforts to create software and hardware platforms that can leverage a distributed network of nodes to operate as a single computing image. Much of this is academic research, but it appears to be gaining traction with systems vendors such as IBM and Sun.

Solutions to Scale

I believe there are fundamentally two ways this can go: a Big Bang or Pragmatic Incrementalism.

The Big Bang approach reflects efforts to create ground-up platforms that span hardware, software, networks and storage systems, and that provide layers of virtualization for service provisioning and delivery. The two closest approaches to this are Sun's N1 effort and IBM's Autonomic Computing. While very little is known about these efforts publicly, they both appear to grapple with massive-scale application and service provisioning and delivery. They appear to require upgrades to many components in the infrastructure stack.

While I don't doubt that these Big Bang approaches will work, I'm not sure that they fully escape the centralized essence of today's IT infrastructure. They still require that corporations invest in massive infrastructure, though apparently with much less "overhead" comparatively speaking. They look to me like contemporary equivalents of mainframe computing, but with a sensibility towards the heterogeneity of today's computing landscape.

The other approach, Pragmatic Incrementalism, seeks to leverage existing networks, hardware and software platforms in innovative ways, creating a similar end result. The only example I know of this today is Akamai's Edge platform, in particular Akamai's efforts to deliver not just content but applications at the edge of the network. Again, not a lot is known publicly about this effort, except for these press releases with IBM and Microsoft.

Akamai has built an operating platform that exists inside the Internet. They've installed over 13,000 commodity PC servers running Linux and Windows inside thousands of ISPs and network points. Inside that network and those machines, they operate a sophisticated set of software which monitors the state of the network, machines and storage. In some respects, this massive network is a virtualized delivery platform. That is certainly how it operates today for content delivery. These partnerships with IBM and Microsoft appear to be geared towards leveraging that incredible distributed footprint for application delivery. Presumably, a corporation with a J2EE or .NET application can deploy into a staging environment, and then as the application is used around the world, it is dynamically provisioned into the appropriate nodes. Pricing is unknown, but given this distribution model, it would seem a likely candidate for utilization-based pricing (e.g. total CPU cycles and memory used worldwide).
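The "provisioned into the appropriate nodes" step is speculation on my part, but the idea can be sketched simply: watch where requests come from, and place a copy of the application wherever demand crosses some threshold. The Python below is only an illustration of that idea. The region names, the threshold and the function name are made up, and nothing here reflects how Akamai actually works.

```python
from collections import Counter


def plan_deployment(request_regions, min_requests):
    """Return the regions whose edge nodes should host a copy of the app.

    request_regions: one region label per observed request (hypothetical).
    min_requests: demand needed before a region gets its own copy.
    """
    demand = Counter(request_regions)
    return sorted(region for region, count in demand.items()
                  if count >= min_requests)


# Made-up traffic sample: nine requests from three regions.
observed = ["eu-west"] * 5 + ["us-east"] * 3 + ["ap-south"] * 1
print(plan_deployment(observed, min_requests=3))  # ['eu-west', 'us-east']
```

Regions below the threshold would keep being served from a more distant copy until their demand justified a local one.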

This approach begins to reach the vision of software as a service, where service providers merely provision their application into a cloud, and are billed entirely on utilization of resources, avoiding the incredibly costly and inefficient effects of over-provisioning. Ultimately, it may be that the results of the Big Bang R&D become the "nodes in the network" that are contained in a network like Akamai's.
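A small worked example shows why billing on utilization matters. The figures below are invented for illustration (an hourly load profile measured in "servers' worth" of demand, and an arbitrary price per server-hour); the point is only the shape of the arithmetic, sketched in Python.

```python
# Hypothetical load over eight hours, in servers' worth of demand.
HOURLY_DEMAND = [2, 2, 3, 5, 9, 20, 9, 4]
# Hypothetical price per server-hour, in arbitrary currency units.
RATE_PER_SERVER_HOUR = 3


def provisioned_cost(demand, rate):
    """Own enough servers for the peak hour and pay for them every hour."""
    return max(demand) * len(demand) * rate


def metered_cost(demand, rate):
    """Pay only for the capacity actually used in each hour."""
    return sum(demand) * rate


fixed = provisioned_cost(HOURLY_DEMAND, RATE_PER_SERVER_HOUR)
metered = metered_cost(HOURLY_DEMAND, RATE_PER_SERVER_HOUR)
print(fixed, metered)  # 480 162
```

In this made-up profile the peak-provisioned infrastructure is used at 54 of a possible 160 server-hours, about a third of its capacity, and the unused remainder is the over-provisioning cost that a utility model would avoid.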

Will we soon see a day when corporations and software providers both big and small can rely on a public computing infrastructure that automatically scales, that is efficient, and that delivers on the vision of computing as a utility?