Impressions of the Apple G5 Dual Supercluster at Virginia Tech
By Juha Haataja
Version 0.1 / October 15, 2003
Version 0.2 / October 16, 2003 (re-organize text, correct details, add question about PBS)
Version 0.3 / October 23, 2003 (include preliminary R_max value)
Version 0.4 / October 30, 2003 (updated with info from TenCon keynote)
Version 0.5 / November 3rd, 2003 (updated top-500 supercomputer data)
Version 0.51 / November 5th, 2003 (added some small details)
Version 0.52 / November 17th, 2003 (updated with official top-500 info)
The Terascale Facility at Virginia Tech
Here you will find some insights and opinions about the G5 supercomputing cluster at Virginia Tech.
The G5 cluster contains 1100 Apple G5 systems each having two IBM PowerPC 970 processors rated at 2 GHz. This is a rather low-cost system in the top-10 supercomputer class, even when including all the personnel, facilities, and software costs.
The TOP500 list of supercomputers (November 2003) was announced at the SC2003 conference in Phoenix, Arizona. The Apple G5 dual supercluster is listed as the third fastest supercomputer, and the fastest academic supercomputer. The listed R_max speed is 10.3 teraflop/s. The vendor is given as "self-made." (The next self-made system is in position 63.)
Of course, the benchmark performance R_max (maximal LINPACK performance) is lower than the theoretical peak (see below for results). Initially, I speculated a speed in the 5-10 teraflop/s range. But, based on experiences with clusters built from commodity PC processors, I predicted an R_max score of about 3 teraflop/s. However, the system performed better than I expected.
Update 1: The New York Times reported that the R_max speed of the system is 7.41 teraflop/s, which is below the theoretical peak but still rather good. However, the experts at Virginia Tech are still polishing the benchmark setup, so the final result may be higher. Already this value puts the system in the top-10 category of supercomputers. Here is a quote from the NYT story on Low-Cost Supercomputer Put Together From 1,100 PC's:
The Virginia Tech supercomputer, put together from 1,100 Apple Macintosh computers, has been successfully tested in recent days, according to Jack Dongarra, a University of Tennessee computer scientist who maintains a listing of the world's 500 fastest machines.
The official results for the ranking will not be reported until next month at a supercomputer industry event. But the Apple-based supercomputer, which is powered by 2,200 I.B.M. microprocessors, was able to compute at 7.41 trillion operations a second, a speed surpassed by only three other ultra-fast computers.
The fastest computers on the current Top 500 list are the Japanese Earth Simulator; a Los Alamos National Laboratory machine dedicated to weapons design; and another weapons oriented cluster of Intel Pentium 4 microprocessors at the Lawrence Livermore National Laboratories.
Officials at the school said that they were still finalizing their results and that the final speed number might be significantly higher.
Update 2: There is a preliminary report (in pdf, dated October 22, 2003) of the benchmark results in supercomputing compiled by Jack Dongarra. See page 53 for the highly parallel benchmark for supercomputers. On this list the G5 cluster at Virginia Tech is performing at 8164 gigaflop/s, or 8.16 teraflop/s. Thus, the machine would be at position 4 of the top-500 list. And there might still be room for improvement in the benchmark speed.
Update 3: Jack Dongarra updated his report on supercomputers on October 28, 2003, and now the Apple G5 dual supercluster is third on the list of top 500 supercomputers. The R_max speed is now 9.6 teraflop/s, an almost 20% improvement over the previous value. Apparently they are still tuning the communications network to get maximum performance out of the system. I wonder if the Infiniband network will be able to match the speed of the G5 processors.
Update 4: In the report dated November 2nd, the G5 supercluster has achieved a speed of over 10 teraflop/s.
Here is a listing (updated November 2nd, 2003) of the top-5 systems from the draft report:
All the speeds are in gigaflop/s (10^9 floating point operations per second).
System                Nr of procs   R_max   R_peak
Earth Simulator           5120      35860   40960
ASCI Q AlphaServer        8160      13880   20480
Apple G5 dual             2200      10280   17600
HP RX2600 Itanium 2       1936       8633   11616
ASCI Q AlphaServer        4096       7727   10240
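As a sanity check, the LINPACK efficiency (R_max/R_peak) of each listed system can be computed directly from the table. A minimal sketch, using only the numbers above:

```python
# LINPACK efficiency (R_max / R_peak) for the top-5 systems listed
# above; all values in gigaflop/s, from the draft report.
systems = [
    ("Earth Simulator",     5120, 35860, 40960),
    ("ASCI Q AlphaServer",  8160, 13880, 20480),
    ("Apple G5 dual",       2200, 10280, 17600),
    ("HP RX2600 Itanium 2", 1936,  8633, 11616),
    ("ASCI Q AlphaServer",  4096,  7727, 10240),
]

for name, procs, r_max, r_peak in systems:
    efficiency = 100.0 * r_max / r_peak
    print(f"{name:20s} {procs:5d} procs  {efficiency:5.1f} % of peak")
```

By this measure the G5 cluster reaches about 58% of its theoretical peak, while the vector-based Earth Simulator achieves roughly 88%.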
The Institute for Computational Science and Engineering (ICSE) at Virginia Tech has impressive facilities in addition to the recent Apple G5 supercluster with 2200 processors. The facilities include a Linux Opteron cluster with 400 processors, an SGI Origin 2000 with 20 processors, a CAVE, and network connections to the National Lambda Rail. This is serious hardware for anyone interested in cutting-edge computational science.
Virginia Tech to upgrade supercomputer to Xserve G5: "The new system, which went online toward the end of last year and which Virginia Tech said was the most powerful supercomputer at any university in the world at the time, will be completed by May. By moving to the thinner servers, the supercomputer will consume less power and generate less heat [...] The price of the upgrade has not yet settled on, but Varadarajan said it would be minimal compared to the cost of building a new supercomputer from scratch." [The Macintosh News Network]
This was to be expected. The new PowerPC 970FX processors in the Xserve G5 require only half the power of the original 970 processors, thanks to IBM's move from a 130 nm to a 90 nm manufacturing process. The upgrade is not a small matter, but if someone can make it happen, Virginia Tech is the place to try it first.
Why I am writing this
I have worked for 15 years in computational science and supercomputing. As a long-time Mac user I'm interested in seeing how the G5 system performs with real application codes and in a production environment.
Highlights from a presentation by Dr. Srinidhi Varadarajan
Dr. Srinidhi Varadarajan is the director of the Terascale Computing Facility at Virginia Tech. He gave a presentation at TenCon (see discussion at MacSlash) with a lot of relevant details. Tom Bridge's notes on the talk were a really nice contribution to all who are interested in the G5 supercluster.
What was the cost of the system?
The total cost of the system, including systems, memory, storage, primary and secondary communications fabrics and cables, is $5.2 million. The facilities upgrade was $2 million: $1 million for the upgrades, $1 million for the UPS and generators. Arguably the cheapest world-class supercomputer.

What about the timetable?

The facility is targeted to be available for applications in the middle of November, 2003. Any user with operational HPC (MPI) codes can access the facility at this point. Full production is targeted at January 1st, 2004.
1100 dual Apple G5 2 GHz CPU based nodes. Each node has 4 GB of main memory and 160 GB of Serial ATA storage. 176 TB total secondary storage. 4 head nodes for compilations/job startup. 1 management node.

It seems that they are using MVAPICH (MPI for InfiniBand on VAPI Layer) for parallel communications with MPI. This was developed in the Network-Based Computing Laboratory (NBCL), based on MPICH and MVICH. Here are some benchmark results.
Each G5 has 2 double-precision FPUs. Each unit can complete 1 fused multiply-add operation per cycle, which is the most common operation in numerical computations. Thus, each processor can deliver 2 DP units * 2 flops/cycle * 2 GHz = 8 Gflop/s. That's more than one Cray X1 node. In a desktop.
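The peak-performance arithmetic can be written out explicitly; note that the 2 GHz clock frequency enters the calculation as well. A back-of-the-envelope sketch:

```python
# Theoretical peak of one PowerPC 970 processor and of the full cluster.
fpu_units = 2        # double-precision FPUs per processor
flops_per_fma = 2    # one fused multiply-add counts as two flops
clock_hz = 2.0e9     # 2 GHz clock frequency
processors = 2200    # 1100 dual-processor nodes

peak_per_proc = fpu_units * flops_per_fma * clock_hz  # flop/s
peak_cluster = peak_per_proc * processors

print(peak_per_proc / 1e9)   # 8.0 (Gflop/s per processor)
print(peak_cluster / 1e12)   # 17.6 (Tflop/s, the cluster's R_peak)
```

The 17.6 Tflop/s cluster total matches the R_peak value listed for the system in the top-5 table.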
Here are details about the fast communication network:
First version of the Mellanox driver and Verbs API was delivered in mid-August. Infiniband achieved 800 MBps, with MPI performance of 700 MBps (MPI latency 8-14 µs). Changes to PCI-X timing have increased Infiniband performance to 870 MBps over the Verbs API.

The latency of about 10 µs at the MPI level is good. However, you can do a lot of computation on a G5 in 10 µs: 8 Gflop/s * 10 µs = 80 000 flop. So, to get good load balancing, the tasks have to execute more than 80 000 floating point operations between communication steps.
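The load-balancing estimate can be expressed directly (a sketch, using the 8 Gflop/s peak per processor and the roughly 10 µs MPI-level latency quoted above):

```python
# Floating point operations one G5 processor could execute during a
# single MPI-level communication latency.
peak_flops = 8.0e9   # flop/s per processor (theoretical peak)
latency_us = 10      # ~10 microseconds MPI latency

ops_per_latency = peak_flops * latency_us / 1e6
print(ops_per_latency)   # 80000.0
```

In other words, each task should do well over 80 000 flops of work per communication step, or the processors end up waiting on the network.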
This shows how to get superior performance on a G5:
They used the BLAS libraries; the core routine has a GEMM efficiency of 84.1% (fairly phenomenal). Their benchmark used a mix of Goto's libs and Apple's veclib framework. IBM is nowhere near this good. Goto has the fastest library on pretty much every processor.

Currently they're at 9.555 teraflops. They want another 10% boost pretty quick, crossing the 10 teraflop line and being the first academic machine to do so. That makes them #3. Worldwide. Period.

About the disk space:

We're not sure yet. 40-50 TB eventually.

Some questions and answers:
What's the cost on Infiniband?
All the switches and cards $1.6 mil. $176k for the cables.
Why did they use the G5 instead of the Opteron or Itanium?
Both are fairly nice, but they're expensive. First, they didn't pass the price/performance test. The Opteron doesn't do what the G5 does: 4 Gflop/s at peak, while the G5 is twice that. The Itanium is phenomenally efficient, but only at 1.5 GHz, not 2 GHz. The #4 system is an 8.6 teraflop Itanium 2 cluster (on about 2000 procs).
You built it all on 10.2.7; are you planning on upgrading to Panther?
They're upgrading to Panther in the next few weeks. The driver runs, the memory manager runs, everything else, no problem.
There's a lot of interest in departmental clusters. Is there documentation anywhere?

We hope to put up a full-fledged package to duplicate this setup, from 64 nodes and up. They hope to see many more clusters after this one.
How much coke and how many pizzas?
Some reactions to the G5 cluster
Here is a pointer to BBC reporting on the cluster:
Apple powers college supercomputer: '[S]taff and students at Virginia Tech have built one of the world's most powerful supercomputers for just $5m by plugging together hundreds of the latest computers from Apple. The project involved placing 1,100 brand new Apple G5 towers side by side, making it the world's most powerful "homebuilt" system.' [BBC News | Technology | UK Edition]

Wired writes:
The brand new "Big Mac" supercomputer at Virginia Tech could be the second most powerful supercomputer on the planet, according to preliminary numbers.

Early benchmarks of Virginia Tech's brand new supercomputer -- which is strung together from 1,100 dual-processor Power Mac G5s -- may vault the machine into second place in the rankings of the world's fastest supercomputers, second only to Japan's monstrously big and expensive Earth Simulator.

The Big Mac's final score on the Linpack benchmark won't be officially revealed until Nov. 17, when the rankings of the Top 500 supercomputer sites are made known at the International Supercomputer Conference.

But Jack Dongarra, one of the compilers of the Top 500 list, said Tuesday that preliminary numbers submitted to him suggest Big Mac could be ranked as high as second place. "They're getting about 80 percent of the theoretical peak," Dongarra said. "If it holds, and it's unclear if it will, it has the potential to be the world's second most powerful machine."

Dongarra said in terms of the number of processors, Big Mac's closest analog is a cluster of 2,300 2.4 GHz Xeon processors at Lawrence Livermore National Laboratory. Clocked at 7.6 teraflops, the cluster is currently ranked third. "It will be interesting to see where the G5 comes in comparison to this machine," he said.

Here is some background on the partnership, from Top 10 InfiniBand SuperComputer for $5M:

Virginia Tech's partners for building this supercomputer in less than three months are Apple, Mellanox, Cisco, and Liebert. Mellanox is the leading provider of the InfiniBand semiconductor technology, the primary communications fabric, drivers, cards, and switches for the project. Cisco's Gigabit Ethernet switches were the choice for the secondary communications fabric to interconnect the cluster. Cisco provided a significant educational discount to support the project. Liebert, a division of Emerson Network Power, known for its comprehensive range of protection systems for sensitive electronics, provided the cooling system.

The Virginia supercluster will be discussed at the O'Reilly Mac OS X Conference 2003, which will have a G5 supercluster session.
Here is a review of the dual G5 system from the end-user perspective:
Dual G5 review now online: "[The] Dual G5 is an amazingly quick and powerful computer. It runs quietly (unless you boot it into single user mode and use it that way!) and is capable of some truly impressive number crunching. It also represents the first steps into a whole new performance realm for Apple, and that bodes well for the future." [macosxhints]

Here are some additional opinions about the Apple G5 systems: Apple takes a powerful leap forward with new G5, Crunch time: The dual G5 tackles life sciences data, and Second look: Apple's dual 2-GHz G5 by the numbers.
Big Mac by Apple vs. Lonestar by Dell in supercomputer race
At the same time, Dell joins UT's $38M supercomputer project: "With the purchase of 300 computer servers from Round Rock-based Dell Inc., the new "Lonestar" computing cluster gives UT scientists and engineers the power of more than 3 trillion computer operations per second, or 3 teraflops." See also: Cost of supercomputer only part of $38 million.
Comparisons between supercomputer-level clusters built with Apple and Dell technology have generated some dispute. However, both of these systems are in the same ballpark when all costs are taken into account (buying the hardware is just a part of the total cost).
It certainly is time to forget the notion that Apple hardware is more expensive than other brands. When comparing similarly configured high-end systems, Apple may turn out to be cheaper, because much of the required functionality comes only as paid options on other systems.
Questions about the technology in the G5 cluster
I have tried to find as much information on the system as possible, but a lot of the actual details are still missing. So, here are some answers, and several open questions.
There seems to be a problem with the power consumption requirement of 3 MW, which is stated in the presentation material. Half of this, 1.5 MW, is said to be needed for the G5 cluster. If you estimate that each of the 2200 CPUs consumes 100 W, that makes only 0.22 MW, almost a factor of seven difference. Where is the rest of the electrical power needed?
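The discrepancy can be made explicit. This is only a rough estimate: the 100 W per-CPU figure is an assumption, and the remaining power presumably goes to memory, disks, the Infiniband fabric, and above all cooling.

```python
# Stated vs. naively estimated power draw of the G5 cluster.
stated_mw = 1.5        # MW stated for the G5 cluster itself
cpus = 2200
watts_per_cpu = 100    # assumed per-CPU power consumption

estimated_mw = cpus * watts_per_cpu / 1e6
print(estimated_mw)               # 0.22 (MW for the CPUs alone)
print(stated_mw / estimated_mw)   # ~6.8, almost a factor of seven
```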
I was a bit sceptical about the latency and speed of the communications network, but there is some serious technology involved. According to an expert on clusters and grids, the Infiniband technology is in the same class as Myrinet: in the middle category both in speed and price. The MPI-level latency is about 10 µs, which is good. And the communication network scales up rather well, as seen in the R_max benchmark figure.
However, to repeat what I wrote above: to get good load balancing the tasks have to execute more than 80 000 floating point operations between communication steps. Thus, this machine is not really suited for fine-grained parallelism.
Compilers and the programming environment
According to the presentation material, the system will run Mac OS X, which provides a "typical" Unix environment (its userland is derived in part from FreeBSD).
The networking will consist both of standard Ethernet for "non-computational" tasks, and special Infiniband hardware for high-bandwidth and low-latency communications. It will be interesting to see how reliable and fast the drivers and the middleware will be.
Apparently an MPI-2 (MPICH-2) implementation from Argonne will be used for parallel programming.
The standard GCC 3.3 compiler of Mac OS X 10.3 will be available, as well as IBM xlc. I suspect that the IBM compiler will generate faster code than GCC.
Also, two different Fortran 90/95 compilers will be available, from IBM and NAG. The NAG compiler is cutting-edge on Fortran 2000 features, but its speed will not be on the same level as the IBM compiler's. Probably all users will want to use the IBM compiler, which is robust and generates fast code on a G5.
A question: how much will the licenses of the IBM C and Fortran compilers cost on a system of this size?
Deja vu: migrating parallel MPI jobs
The software "Deja vu" was reported to make it possible to carry on computations even when a node of the system fails. It seems that Deja vu is a migratable MPI implementation, allowing a task to be migrated to another node of the system. (See a pdf presentation about this.)

On the Cray T3E supercomputer you could migrate tasks between nodes, and also suspend a task when a higher-priority task had to be started. (An example: the weather forecast system has to be executed several times a day within a certain time limit.) If Deja vu really works and does not need programmer intervention, this is really great news for all computational clusters.
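Deja vu's internals are not publicly documented in detail, so as an illustration only, here is a toy checkpoint/restart sketch showing the generic idea of resuming a computation from saved state; the file name, state layout, and "computation" are all made up for this example, and this is not Deja vu's actual mechanism.

```python
# Toy checkpoint/restart: save task state periodically so the
# computation can be resumed (possibly on another node) after a failure.
import os
import pickle

CHECKPOINT = "task_state.pkl"   # hypothetical checkpoint file

def run(total_steps):
    # Resume from the last checkpoint if one exists.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "partial_sum": 0.0}

    while state["step"] < total_steps:
        state["partial_sum"] += state["step"]   # stand-in for real work
        state["step"] += 1
        if state["step"] % 100 == 0:            # checkpoint every 100 steps
            with open(CHECKPOINT, "wb") as f:
                pickle.dump(state, f)

    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)   # clean up after a successful run
    return state["partial_sum"]

print(run(1000))   # 499500.0 (sum of 0..999)
```

A real system like Deja vu must additionally capture in-flight MPI messages and migrate the process image transparently, which is the hard part.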
The G5 system at Virginia Tech is impressive, although there is little detail about the actual performance and reliability at the moment.
Big open questions are:
- How easy is it to port computational codes to this system?
- How well does the communications network scale in actual parallel applications? (The R_max figure suggests that scaling is possible.)
- What parallel batch system will be used? OpenPBS, LSF, or some other? It seems that MVAPICH is used for parallel computing, but how exactly are the runs scheduled? Making the batch system work reliably and efficiently will be the biggest hurdle in getting good performance out of the system in operational use.
- Is there support for parallel I/O? What is the performance when using, e.g., an NFS-mounted disk on several nodes simultaneously? In principle MPI-2 supports parallel I/O, but there are few working examples of this in real-world applications.
- How much work is required to maintain the system in a production environment? For example, how are software updates and installations automated, and is there a tool for locating failed systems? Well, it seems that at least an upgrade to Panther (Mac OS X 10.3) will happen shortly.
- How well will the cooling facilities and power supply work in the long run?
I have been involved with supercomputing since 1988. I have worked with vector supercomputers (IBM 3090 VF, Cray X-MP, Cray C94, Convex C3800), parallel supercomputers (Cray T3E, SGI Origin 2000, IBM SP, IBM SC), and clusters (PC and Alpha).
My responsibilities have usually included documentation, user support, training, and consultation, for example in code optimization, parallel computing, and numerical methods. The user guides we have written for these systems have also been used at sites other than ours. (See a listing of some books available on the web.)
Click here to send an email to Juha Haataja.