TOP500 Supercomputers

Page 1

TOP500

ANDREW FECHEYR LIPPENS

TMC CM-5 (1993) #1 on the first top 500

CAC Actividad 1: Top 500 Supercomputers Sites “Anyone can build a fast CPU. The trick is to build a fast system” -- Seymour Cray In the early 1990s, a new definition of supercomputer was needed to produce meaningful statistics. After experimenting with metrics based on processor count in 1992, the idea was born at the University of Mannheim to compose a detailed listing of installed systems and use it as the basis. This was the beginning of the TOP500 list. In early 1993 Jack Dongarra was convinced to join the project with his Linpack benchmark. A first test version was produced in May 1993, partially based on data available on the Internet. Since June 1993 the TOP500 is produced bi-annually based on site and vendor submissions only. Today, the list is compiled by Hans Meuer of the University of Mannheim, Germany, Jack Dongarra of the University of Tennessee, Knoxville, Erich Strohmaier and Horst Simon of NERSC/Lawrence Berkeley National Laboratory. The list is updated twice a year. The first of these updates always coincides with the International Supercomputer Conference in June, the second one is presented in November at the IEEE Super Computer Conference in the USA. With two lists compiled every year from 1993 up until now, it brings the total number to 33 lists. Next we will take a closer look at the last iteration of the list, the June 2009 edition.

834

683(5&20387(5 6,7(6

Content table Current TOP500 The World’s Fastest Computer Previous #1 Machines Evolution of Supercomputers The Linpack Benchmark

2 3 4 4 5

The TOP500 project ranks and details the 500 most powerful known computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The project aims to provide a reliable basis for tracking and detecting trends in high-performance computing and uses a portable implementation of the Linpack benchmark


# Name

OS

Arch

Interconnect

1 RoadRunner U.S.A. IBM

129600 1105000 1456700

2483 AMD64 Opteron PowerXCell 8i

Linux

Cluster

Infiniband

2 Jaguar Cray Inc.

U.S.A.

150152 1059000 1381400

6951 AMD64 Opteron Quad Core

CNL

MPP

XT4 Internal Interconnect

3 Jugene IBM

Germany 294912

825500 1002700

2268 PowerPC 450

CNK/SLES 9

MPP

Proprietary

4 Pleiades SGI

U.S.A.

51200

487005

608829

2090 Intel EM64T Xeon E54xx

SLES10 + SGI ProPack 5

MPP

Infiniband

5 BlueGene/L U.S.A. IBM

212992

478200

596378

2330 PowerPC 440

CNK/SLES 9

MPP

Proprietary

U.S.A.

66000

463300

607200

CNL

MPP

XT4 Internal Interconnect

7 BlueGene/P U.S.A. IBM

163840

458611

557056

1260 PowerPC 450

CNK/SLES 9

MPP

Proprietary

6 Kraken XT5 Cray Inc.

Country

Cores

RMax

RPeak Power Processor Arch

AMD64 Opteron Quad Core

8 Ranger Sun

U.S.A.

62976

433200

579379

2000 AMD64 Opteron Quad Core

Linux

Cluster

Infiniband

9 Dawn IBM

U.S.A.

147456

415700

501350

1134 PowerPC 450

CNK/SLES 9

MPP

Proprietary

10 Juropa Bull SA

Germany

26304

274800

308283

1549 Intel EM64T Xeon X55xx

SUSE Linux

Cluster

Infiniband QDR Sun M9 / Mellanox / ParTec

The June 2009 TOP500 Ranking The 33rd edition of the TOP500 list of the world’s most powerful supercomputers is still led by Roadrunner, but shows that two of the top 10 positions are now claimed by new systems in Germany. The latest listing also includes a brand-new player, an IBM BlueGene/P system at the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia, ranked at No. 14.

10 system, JUROPA, which is built from Bull Novascale and Sun SunBlade x6048 servers and achieved 274,8 teraflops.

Maintaining its hold on second place is the Cray XT5 Jaguar system installed at the DOE’s Oak Ridge National Laboratory. Jaguar reached 1,059 petaflops shortly after its installation but due to its heavy workload no further measurements were possible.

The U.S. is clearly the leading consumer of HPC systems with 291 of the 500 systems. Followed by Europe (145 systems) and Asia (49 systems). The bubbles are a graphical representation of the computing power of the most represented countries.

In third place, a new contender has emerged: a new IBM BlueGene/P system called JUGENE and installed at the Forschungszentrum Juelich (FZJ) in Germany. It achieved 825,5 teraflop/s on the Linpack benchmark and has a theoretical peak performance of just above 1 petaflops. FZJ is also home to the new No.

Looking at the architectures used in the whole TOP500, we notice that most machines are classified as Clusters (82%), a smaller set as MPP (17,6%) and a tiny remainder as the aging Constellations architecture (0,4%). We will go more into detail on the trends in supercomputering after we discussed Roadrunner.

MPP

Massive parallel processing (MPP) refers to a computer system with many independent arithmetic units or entire microprocessors, that run in parallel. The term massive connotes hundreds if not thousands of such units. All of the processing elements are connected together to be one very large computer.

Cluster

A group of PCs connected though a switched network, working together closely to solve the same problems. Clusters are often built from commercial off-the-shelf components to produce a cost-effective alternative to an MPP supercomputer.

Constellation

A cluster of shared-memory multiprocessors. “If there are more microprocessors in a node than there are nodes in the commodity cluster, it is referred to as a constellation” -- Jack Dongarra et al. 2003


The fastest computer in the world: IBM Roadrunner The fastest computer in the world, built by IBM for the U.S. Department of Energy’s Los Alamos National Laboratory is called “Roadrunner”. It achieved a performance of 1,026 petaflops in May 2008 becoming the first supercomputer ever to reach the petaflop milestone. It captured the top spot on the June 2008 TOP500 list, beating IBM BlueGene/L at DOE’s Lawrence Livermore National Laboratory. BlueGene/L, with a performance of 478.2 teraflop/s is now ranked No. 5 after holding the top position from November 2004 until June 2008. Roadrunner differs from many contemporary supercomputers in that it is a hybrid system, using two different processor architectures. The hybrid design consists of dual-core Opteron server processors manufactured by AMD using the standard AMD64 architecture. Attached to each Opteron core is a Cell processor manufactured by IBM using the Power architecture. As a supercomputer, Roadrunner is considered an Opteron cluster with Cell accelerators. The machine takes up 560 square meters, uses 92 kilometers of fiber optic cable, weighs in at 270.000 kilograms and requires 2,9 megawatts of power. The cluster is composed of specially designed TriBlade servers connected by Infiniband. Logically, a TriBlade consists of two dualcore Opterons and four PowerXCell 8i CPUs. Physically, a TriBlade consists of one LS21 Opteron blade, an expansion blade, and two QS22 Cell blades. The LS21 has two 1.8 GHz dual-core Opterons with 16 GB memory. Each QS22 has two PowerXCell 8i CPUs, running at 3.2 GHz and 8 GB memory. The expansion blade connects the two QS22 via four PCIe x8 links to the LS21, two links for each QS22. It also provides outside connectivity via an Infiniband 4x DDR adapter. This makes a total width of four slots for a single TriBlade. Three TriBlades fit into one BladeCenter H chassis. A Connected Unit is 60 BladeCenter H full of TriBlades, that is 180 TriBlades. All TriBlades are connected to a 288-port Voltaire ISR2012 Infiniband switch. Each CU also has access to the Panasas file system through twelve System x3755 servers. The final cluster is made up of 18 connected units, which are connected via eight additional (second-stage) Infiniband ISR2012 switches.

International Business Machines Corporation, also know as “Big Blue”, is a multinational computer technology and IT consulting corporation. IBM is the leader in supercomputing as provider of 35 of the world's 100 most powerful supercomputers in the TOP500 list and is also leading energy-efficiency rankings with 19 of 20 highest megaflops per watt systems in the Green500 list.

LANL has always been an early adopter of transformational high performance computing (HPC) technology. For example, in the 1970s when HPC was scalar; LANL adopted vector (Cray 1). In the 1980s when HPC was vector; LANL adopted data parallel (TMC CM-1). In the 1990s when HPC was data parallel; LANL adopted distributed memory (TMC CM-5). In the 2000s HPC was distributed memory; LANL adopted hybrid (Roadrunner).


Evolution Ten machines have occupied the first spot on the list since 1993 1. TMC CM-5 The Thinking Machines Corporation CM5 at LANL was #1 on the first list in ‘93, and used a MIMD architecture based on a fat tree network of SuperSPARC I 32 MHz processors. 2. Fujitsu Numerical Wind Tunnel Conquered #1 on the November ‘93 list and stayed on top until ’95. The Numerical Wind Tunnel used a Full distributed crossbar to connect 140 Fujitsu 105 MHz cores. 3. Intel Paragon XP/S140 Briefly took the #1 spot in June ’94. This MPP had an astonishing 3680 Intel 80860 50 MHz cores and used a 2D mesh interconnect. 4. Hitachi SR2201 In June ‘96 this machine captured the top spot with 220,4 Gflops. 1024 PARISC HARP-1E 150 MHz processors, using a Hyper crossbar . 5. Hitachi CP-PACS Top performer on the November ’96 list. Doubled the amount of PA-RISC HARP-1E 150 MHz processors of it’s predecessor. 6. Intel ASCI Red The ASCI Red was a mesh-based MIMD MPP machine initially consisting of 4510 Intel Pentium Pro processors @ 200MHz. First to break the teraflop barrier and stayed on top until June 2000. 7. IBM ASCI White Nov 2000. Based on IBM's RS/6000 SP computer. 512 nodes of 16 POWER3 375 MHz processors. 8. NEC Earth Simulator Jun ’02. NEC’s ES runs global climate models at 35,8 teraflops using 640 nodes of 8 vector processors. 9. IBM Blue Gene/L Nov ’04. The last top ranked MPP. 70,72 teraflops with PowerPC cores. 10. IBM Roadrunner Current #1 since June ‘08. A hybrid (Opteron + Cell) cluster with Infiniband interconnects.

Evolution of architecture types 100%

SMP MPP

Constellations SIMD

Cluster Single Processor

75%

50%

25%

jun 93

jun 95

jun 97

jun 99

jun 01

jun 03

jun 05

jun 07

jun 09

On the graph above, the different architecture types are plotted by share of the total computing power (% of total Rmax) since the first list was compiled in June 1993. In the early years supercomputers were often SMP or MPP machines, with the latter of both architectures dominating the list for almost 16 years, only to be challenged by the recent and often more cost-efficient Cluster architecture. Clusters started to appear in the TOP500 list in 1997. The Berkeley NOW (Network of Workstations), a Myrinet cluster of 100 UltraSPARC-I computers, ranked #344 in the June 1997 TOP500 list with a LINPACK performance of 10,140 Gflops. The adoption of clusters, collections of workstations/PCs connected by a local network, has virtually exploded since the introduction of the first Beowulf cluster in 1994. The attraction lies in the (potentially) low cost of both hardware and software and the control that builders/users have over their system. Clusters represent 88% of the systems count and 58,6% of the processing power in the latest TOP500 edition.

Evolution of interconnections 100% 75% 50% 25%

jun 93

jun 95

jun 97

jun 99

Cray Interconnect Infiniband Gigabit Ethernet Crossbar

jun 01

jun 03

NUMAlink Quadrics Proprietary

jun 05

jun 07

jun 09

SP Switch Myrinet Fat Tree

The choice of interconnect has a major impact on the efficiency (Rmax/ Rpeak) of a supercomputer. Even more so for clusters with a huge amount of commodity nodes, as communication between them can easily become the bottleneck. The first supercomputers to occupy the TOP500 lists made use of Fat Trees to communicate. During the ‘90s both the proprietary Cray


Interconnect and Crossbars were hugely popular, followed by SP Switches during the crossing of the millennium. Myrinet, the low latency interconnect, was highly in demand starting from 2001 until the higher bandwidth Infiniband technology gained the upper hand in 2007. Nowadays most supercomputers are interconnected using conventional Gigabit Ethernet (56% of all systems) and Infiniband (30%), which is often used on higher ranked systems.

The Linpack Benchmark As a yardstick of performance the TOP500 list uses the ‘best’ performance as measured by the LINPACK Benchmark. LINPACK is a software library for performing numerical linear algebra on digital computers. It was written in Fortran by Jack Dongarra, Jim Bunch, Cleve Moler, and Gilbert Stewart, and was intended for use on supercomputers in the 1970s and early 1980s. It makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations. The LINPACK Benchmark was introduced to the TOP500 list by Jack Dongarra and chosen because it is widely used and performance numbers are available for almost all relevant systems. The benchmark measures a system's floating point computing power: it calculates how fast a computer solves a dense N by N system of linear equations Ax = b, which is a common task in engineering. The solution is obtained by Gaussian elimination with partial pivoting, with 2/3 N3 + O(N2) floating point operations. The result is reported in millions of floating point operations per second (Mflops or Gflops). For the TOP500, a version of the benchmark is used that allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine. This performance does not reflect the overall performance of a given system, as no single number ever can. It does, however, reflect the performance of a dedicated system for solving a dense system of linear equations. Since the problem is very regular, the performance achieved is quite high, and the performance numbers give a good correction of peak performance. The TOP500 list uses the “Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers” (HPL) which can be found at http://www.netlib.org/benchmark/hpl/. By measuring the actual performance for different problem sizes n, a user can get not only the maximal achieved performance Rmax for the problem size Nmax but also the problem size N1/2 where half of the performance Rmax is achieved. These numbers together with the theoretical peak performance Rpeak are the numbers given in the TOP500. In an attempt to obtain uniformity across all computers in performance reporting, the algorithm used in solving the system of equations in the benchmark procedure must conform to LU factorization with partial pivoting. In particular, the operation count for the algorithm must be 2/3 N3 + O(N2) double precision floating point operations. This excludes the use of a fast matrix multiply algorithm like "Strassen's Method" or algorithms which compute a solution in a precision lower than full precision (64 bit floating point arithmetic) and refine the solution using an iterative approach.

Running HPL Machine To put the numbers from the TOP500 list in perspective, I ran the HPL Linpack Benchmark on a 4 year old Supermicro 4020C-T Server. The server has been in use since 2005 as a web application server and consists of two Opteron 246 processors running at 2.0Ghz. The server has access to 2GB of DDR400 ECC Registered memory, which could not be fully used during the benchmark, as normal operations of the server could not be shutdown. The server is running the Gentoo Linux distribution. The OpenMPI, Blas, Lapack and HPL source code was compiled with the default settings. No time was spend trying to optimize the compilation of the binaries.

Commands As the reference Blas version is written in fortran, I first had to recompile GCC with fortran support. USE="fortran" emerge -av gcc Downloading and compiling HPL and it’s dependencies is done automatically in Gentoo with the emerge command. emerge -av sys-cluster/hpl Copy the /usr/share/hpl/HPL.dat configuration file into a working directory and edit it. To start the benchmark with 2 processes run this command: mpirun -np 2 /usr/bin/xhpl

Achieved results To find a decent Rmax I tried some combinations of the N (problem size) and NB (block size) parameters of HPL. With N=550 and NB=12 the system managed to achieve

1,509 Gflops. In 1993 this result would have put this dual processor AMD Opteron system on position #180 of the TOP500 list.


Bibliography http://www.top500.org/static/lists/1993/06/top500_199306.pdf http://en.wikiquote.org/wiki/Seymour_Cray http://en.wikipedia.org/wiki/Top500 http://en.wikipedia.org/wiki/IBM_Roadrunner http://www.ibm.com/systems/deepcomputing/top500.html http://www.top500.org/project/linpack http://www.gentoo.org/proj/en/science/blas-lapack.xml http://www.top500.org/lists/2008/11 http://en.wikipedia.org/wiki/LINPACK http://www.top500.org/orsc/2006/clusters http://now.cs.berkeley.edu Jack Dongarra, Thomas Sterling, Horst Simon, Erich Strohmaier, "High-Performance Computing: Clusters, Constellations, MPPs, and Future Directions," Computing in Science and Engineering, vol. 7, no. 2, pp. 51-59, March/April, 2005.


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.