Bill Camp & Jim Tomkins, from Sandia National Laboratories, have published a 77-page document about the architecture of the Red Storm supercluster built by Cray Inc. The new nickname for the 40 teraflops system is "Thor's Hammer."
Please read the full presentation if you have the time (PDF format, 3.54 MB).
Here are the major characteristics of the system, which will be operated for classified work ("red" nodes) or unclassified research ("black" nodes).
- True MPP, designed to be a single system
- Distributed memory MIMD parallel supercomputer
- Fully connected 3D mesh interconnect
- 108 compute node cabinets and 10,368 compute node processors (AMD Opteron running at 2.0 GHz)
- [Note: the full system layout appears on page 68 of the presentation.]
- About 10 TB of DDR memory at 333 MHz
- 8 Service and I/O cabinets on each end (256 processors for each color)
- 240 TB of disk storage (120 TB per color)
- Partitioned Operating System (OS): LINUX on service and I/O nodes, a lightweight kernel (Catamount, from Sandia) on compute nodes, stripped-down LINUX on RAS nodes
- Less than 2 MW total power and cooling
- Less than 3,000 square feet of floor space
- Fully operational by August 2004
Obviously, with such a system, which will be expanded to 30,000 processors in the future, scalability is a primary concern. The presentation also returns to supercomputing fundamentals, noting that in a well-balanced scalable system, memory bandwidth must keep pace with processor speed.
Scalability is also limited by Amdahl's law, which here states that the speedup on N processors is equal to (1 + fs) / (1/N + fs), where fs is the serial (non-parallelizable) fraction of the work to be done.
So, in order to achieve a speedup of 8,000 on 10,000 processors, or an 80% parallel efficiency, fs must be less than 0.000025. This looks extremely small, but well-written parallel codes can easily achieve it.
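The arithmetic above is easy to verify with a short script. The formula and the numbers are the ones quoted from the presentation; the function and variable names are mine:

```python
def amdahl_speedup(n_procs, f_serial):
    """Speedup on n_procs processors, using the formula quoted
    in the presentation: (1 + fs) / (1/N + fs)."""
    return (1 + f_serial) / (1 / n_procs + f_serial)

N = 10_000
fs = 0.000025  # serial fraction quoted in the article

speedup = amdahl_speedup(N, fs)
efficiency = speedup / N

print(f"speedup    = {speedup:.0f}")     # ~8000
print(f"efficiency = {efficiency:.0%}")  # ~80%
```

With fs = 0.000025 the script reproduces the article's figures: a speedup of about 8,000 on 10,000 processors, i.e. 80% parallel efficiency.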
The presentation discusses another aspect of Amdahl's law by looking at the actual scaled speedup achieved when the time spent on communications is taken into account. This time the previous speedup has to be divided by (1 + fcomm × Rp/c), where fcomm is the fraction of work devoted to communications and Rp/c is the ratio of processor speed to communications speed.
It is easy to see that if this ratio is equal to 1, the global architecture will not be greatly affected by communications, which is one reason they decided on a custom interconnect.
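To make the balance argument concrete, here is a small sketch of the correction factor. The formula is the one quoted above; the fcomm value and the function name are illustrative assumptions of mine, not numbers from the presentation:

```python
def comm_adjusted_speedup(base_speedup, f_comm, r_pc):
    """Scale the Amdahl speedup by 1 / (1 + fcomm * Rp/c), per the
    presentation's formula. f_comm is the fraction of work spent on
    communications; r_pc is the processor-to-network speed ratio."""
    return base_speedup / (1 + f_comm * r_pc)

# Illustrative example (fcomm = 0.1 is assumed, not from the source):
# with 10% of the work in communications, a balanced machine
# (Rp/c = 1) keeps about 91% of its speedup, while a 10x
# processor/network imbalance cuts the speedup in half.
for r_pc in (1, 10):
    print(f"Rp/c = {r_pc:2}: speedup = {comm_adjusted_speedup(8000, 0.1, r_pc):.0f}")
```

This is why a balanced custom interconnect matters: keeping Rp/c near 1 keeps the communication penalty small regardless of fcomm.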
Needless to say, such a system needs to be reliable. Sandia would like to reach an MTBI (Mean Time Between Interrupts for hardware and system software) of 100 hours to be sure to get useful work done in chunks of at least 50 hours.
So, what performance does the system reach? The answers appear on pages 70, 71, 74 and 75. Here are the major numbers:
- Peak of 41.47 teraflops, based on 2 floating point instruction issues per clock
- Expected MP-Linpack performance greater than 20 teraflops
- Aggregate system memory bandwidth of around 55 TB/s
- Maximum latency between nodes of 2 µs (nearest neighbor) and 5 µs (across the full machine)
- Sustained file system bandwidth of 50 GB/s for each color
- Sustained external network bandwidth of 25 GB/s for each color
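As a sanity check, the peak figure follows directly from the processor count, clock speed, and issue rate given earlier (variable names are mine):

```python
procs = 10_368         # compute-node Opterons
clock_ghz = 2.0        # 2.0 GHz clock
flops_per_clock = 2    # two floating point instruction issues per clock

# GFLOPS per processor times processor count, converted to teraflops
peak_tflops = procs * clock_ghz * flops_per_clock / 1000
print(f"peak = {peak_tflops} teraflops")  # 41.472
```

The result, 41.472 teraflops, matches the 41.47 teraflops peak quoted in the presentation, and the expected MP-Linpack figure of 20+ teraflops is roughly half of that peak.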
Well, I wish good luck to all the members of the project, and I hope they reach their goals.
Source: William J. Camp and James L. Tomkins, CCIM, Sandia National Laboratories