Contents

Preface
Computers and computing in Moscow State University
MSU supercomputers: “Lomonosov”
MSU supercomputers: SKIF MSU “Chebyshev”
MSU supercomputers: IBM Blue Gene/P
MSU supercomputers: Hewlett-Packard “GraphIT!”
Perspective supercomputing technology: reconfigurable supercomputers
Preface

The history of computers at Moscow State University goes back to the mid-fifties of the 20th century, when the Research Computing Center of Moscow State University was founded in 1955 and equipped with up-to-date computing hardware. This made it possible for university researchers to solve many challenging problems in meteorology, satellite and manned space flights, aerodynamics, structural analysis, mathematical economy, and other fields of science. Between 1955 and the early 1990s, more than 25 mainframe computers of various architectures and performance were installed and actively used at Moscow State University.

Since the end of the 1990s, Moscow State University has been exploiting high-performance computing systems based on cluster technologies. The first high-performance cluster, installed at Moscow State University in 1999, was also the first such system among Russian education and science institutions. It was able to perform 18 billion operations per second. Now Moscow State University Supercomputing Center has two systems included in the Top500 list, “Lomonosov” and SKIF MSU “Chebyshev”. Other MSU supercomputers are IBM Blue Gene/P, Hewlett-Packard “GraphIT!” and the FPGA-based RVS-5. The major computing facility of the Center is the “Lomonosov” supercomputer, whose peak performance was recently increased to 1.3 PFlops.

Today more than 500 scientific groups from Moscow State University, institutes of the Russian Academy of Sciences, and other educational and scientific organizations of Russia are users of Moscow University Supercomputing Center. The main areas of fundamental research with supercomputer applications are magnetohydrodynamics, quantum chemistry, seismology, drug design, geology, materials science, global climatic changes, nanotechnology, cryptography, bioinformatics, bioengineering, astronomy, and others. In recent years, the range of supercomputer applications has expanded dramatically, and Moscow State University is looking forward to reaching the exaflops frontier.
Computers and computing in Moscow State University

In 1956, the Research Computing Center (RCC) of Moscow State University received its first computer, “Strela”. It was the first serially manufactured mainframe in the USSR. A total of seven mainframes were produced; the one supplied to RCC had serial number 4. The “Strela” mainframe used a three-address instruction set and executed approximately 2,000 operations per second. It had a clock cycle of 500 microseconds, RAM of 2,048 words of 43 bits each, and a power consumption of 150 kW. The computer occupied up to 300 square meters.
The “Setun” computer was designed at RCC, with N.P. Brusentsov as chief designer. In 1959, RCC launched the “Setun” prototype, and in 1961 “Setun” went into serial production. It was an impressive and extraordinary computer, the first one in the world based on ternary rather than binary logic. A trit, whose capacity is superior to that of a bit, can exist not in two but in three states: 1, 0, -1. The “Setun” computer took up 25-30 square meters and required no special cooling. Its frequency was 200 kHz. Fifty computers were produced from 1961 to 1965.
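To give a flavor of the ternary idea, the sketch below converts an integer to balanced ternary, where every digit (trit) is one of -1, 0 or 1. It is a minimal illustrative example only and does not reproduce any actual “Setun” encoding or instruction format.

```python
def to_balanced_ternary(n):
    """Return the balanced-ternary trits of n (least significant first).

    Each trit is -1, 0 or 1 -- the three states a "Setun" trit could take.
    """
    if n == 0:
        return [0]
    trits = []
    while n != 0:
        r = n % 3            # remainder 0, 1 or 2
        if r == 2:           # represent 2 as -1 and carry 1 to the next trit
            r = -1
        n = (n - r) // 3
        trits.append(r)
    return trits

# 8 = (-1)*1 + 0*3 + 1*9  ->  trits [-1, 0, 1] (least significant first)
assert to_balanced_ternary(8) == [-1, 0, 1]
```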
In May 1961, the M-20 computer was installed at RCC. It is worth mentioning that mainframes of the “M” series (M-20, M-220, M-222), built under the supervision of the distinguished academician S.A. Lebedev, were widespread in the USSR. The M-20 mainframe provided 20,000 operations per second. It had ferrite-core RAM with a capacity of 4,096 words, with external memory on drums and magnetic tapes. These common and efficient mainframes had an essential influence on the development of computational mathematics in the former Soviet Union. For instance, a block method for solving complicated algebraic problems was developed specially for these mainframes; it could deal with systems of any rank while using only 300 words of RAM. With this method, both the matrix and the system's right-hand-side vector reside in slow memory, yet problems are solved almost as fast as if all data were stored in RAM. Programs based on this technique were rather efficient: it took only 9 minutes to solve an algebraic system of rank 200 on the M-20.
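The brochure does not spell out the exact block algorithm used on the M-20, but the general idea of solving a system whose matrix lives in slow memory can be sketched as block elimination, where only blocks small enough for fast memory are handled at a time. The NumPy fragment below is a minimal sketch under that assumption (a 2x2 block split with a Schur complement), not a reconstruction of the original method.

```python
import numpy as np

def solve_blocked(A, b, k):
    """Solve A x = b by 2x2 block elimination.

    Only k-sized blocks of A are touched at a time, mimicking a machine
    whose fast memory holds a few hundred words while the full matrix
    stays in slow (drum/tape) storage.  Illustrative only.
    """
    A11, A12 = A[:k, :k], A[:k, k:]
    A21, A22 = A[k:, :k], A[k:, k:]
    b1, b2 = b[:k], b[k:]

    # Eliminate the first block: Schur complement S = A22 - A21 A11^{-1} A12
    Y = np.linalg.solve(A11, A12)        # A11^{-1} A12
    y = np.linalg.solve(A11, b1)         # A11^{-1} b1
    S = A22 - A21 @ Y
    x2 = np.linalg.solve(S, b2 - A21 @ y)
    x1 = y - Y @ x2
    return np.concatenate([x1, x2])

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10)) + 10 * np.eye(10)   # well-conditioned test matrix
b = rng.standard_normal(10)
assert np.allclose(solve_blocked(A, b, 4), np.linalg.solve(A, b))
```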
The BESM-4 computer became part of RCC computational facilities in 1966. BESM-4 ferrite-core memory capacity varied from 4,096 to 8,192 words of 45 bits each. Numbers were represented in binary floating-point format, with the range of absolute values from 2⁻⁶³ to 2⁶³. Its memory cycle was 10 microseconds; the total storage space on drum memory was 65,536 words (4 drums of 16,384 words each), and the external memory on magnetic tapes comprised 8 blocks of 2 million words each. BESM-4 occupied three cabinets on 65 square meters. It required 8 kW to operate and had an automatic internal air cooling system.
The BESM-6 computer was, and is still, considered to be of great importance in the Russian history of computer development. The chief designer of this model was again S.A. Lebedev. The design of BESM-6 was completed in 1967, and its serial production started in 1968. The same year RCC received its first BESM-6 computer, and despite its serial number 13 it proved to be lucky for the Center. RCC installed its second BESM-6 computer in 1975, and then the third and the fourth ones in 1979. In total, 355 BESM-6 mainframes were produced in the USSR during this period. BESM-6 had ferrite-core RAM capable of storing 32,000 50-bit words; this number was later increased to 128,000 words. The BESM-6 peak performance was one million instructions per second. The computer had about 60,000 transistors and three times as many diodes. It had a frequency of 10 MHz, occupied up to 150-200 square meters and consumed 30 kW of power.

Parallel processing of instructions was widely used in the BESM-6 architecture: 14 single-address instructions at different stages of execution could be processed simultaneously. Buffers for intermediate storage of instructions and data allowed three subsystems (the RAM modules, the control unit and the arithmetic unit) to work in parallel and asynchronously. Content-addressable memory on fast registers (a predecessor of cache memory) allowed the computer to keep the most frequently used operands and thus to decrease the number of references to RAM. Interleaved RAM allowed simultaneous access to separate RAM modules from different parts of the mainframe.

RCC has also used mainframes of other series. In 1981, along with four BESM-6 mainframes, RCC was equipped with two ES-1022, two MIR-2 and MINSK-32 computers. In 1984, a two-processor ES-1045 was installed. Since 1986, RCC has also used a series of minicomputers: SM-3, SM-4 and SM-1420.
Since 1999, the Research Computing Center has focused its main attention on cluster supercomputers. The outcome of this decision wasn't obvious at the time, but it has since proved to be the right one. The first cluster consisted of 18 compute nodes connected via a high-speed SCI network. Each node contained two Intel Pentium III/500 MHz processors, 1 GB of RAM and a 3.2 GB HDD. The system peak performance was 18 GFlops. The SCI network, with its high data transfer rate (80 MB/s) and low latency (5.5 ns), made this system very effective for solving a wide range of problems. Research groups formed around the first cluster started using a new type of technology, parallel computers with distributed memory, to boost their research.

In 2002, a second cluster, using standard low-cost and effective Fast Ethernet technology for communication and control, was installed. This cluster contained 20 nodes of one type (2 x Intel Pentium III/850 MHz, 1 GB RAM, 2 x 15 GB HDD) along with 24 nodes of another type (2 x Intel Pentium III/1 GHz, 1 GB RAM, 20 GB HDD). With a total of 88 processors, it had a peak performance of 82 GFlops.

In 2004, within a joint project of three departments of Moscow State University (the Research Computing Center, the Skobeltsyn Institute of Nuclear Physics and the Faculty of Computational Mathematics and Cybernetics), new data storage was installed. It included a Hewlett-Packard XP-1024 disk array along with an automated Hewlett-Packard ESL 9595 tape library with a total capacity of 40 TB. In the same year, a new Hewlett-Packard cluster with 160 AMD Opteron 2.2 GHz processors and the then-new InfiniBand network technology was launched in the supercomputing center. Its peak performance exceeded 700 GFlops. By that time more than 50 research groups from MSU, the Russian Academy of Sciences and other Russian universities had become active users of MSU supercomputing facilities.

Now Moscow State University Supercomputing Center operates the “Lomonosov”, SKIF MSU “Chebyshev”, “GraphIT!” and IBM Blue Gene/P supercomputers and several small HPC clusters, with the peak performance of the “Lomonosov” flagship at 1.3 PFlops. Having taken the supercomputing road more than ten years ago, Moscow State University Supercomputing Center plans to move forward to exaflops and beyond.
MSU supercomputers: “Lomonosov”
Moscow State University hosts a number of HPC systems. The SKIF MSU “Chebyshev” supercomputer was the most powerful of them until recently. This 60 TFlops supercomputer was installed in 2008, and soon after deployment it became clear that the demand for computing power far exceeded its capabilities. By 2009 a significant expansion of MSU supercomputing facilities had become a necessity, and MSU decided to acquire a new, much more powerful system enabling researchers to expand their computations and to perform more accurate simulations. It became evident that the new supercomputer would have to contribute to the growth of Russia's overall competitiveness by fostering discoveries and innovations in the country's leading research centers.

Robust price/performance, scalability, and fault tolerance were the key requirements for the new system. The “Lomonosov” supercomputer, delivered by the Russian company T-Platforms, currently has a peak performance of 1.3 PFlops.

“Lomonosov” is divided into two partitions by node architecture: an x86 part with a peak performance of 510 TFlops and a GPU part with a peak performance of 863 TFlops. In total, “Lomonosov” uses six types of compute nodes and incorporates processors of different architectures. The resulting hybrid installation is flexible enough to deliver optimum performance for a wide range of applications.

The primary compute nodes, generating over 94% of the x86 part's performance, are based on the T-Platforms T-Blade2 system. Using six-core Intel Xeon X5670 Westmere processors, T-Blade2 brings up to 27 TFlops of compute power in a standard 42U rack. “Lomonosov” also contains a number of T-Blade 1.1 compute nodes with an increased amount of RAM and local disk storage for memory-intensive applications. The third type of compute nodes is based on the T-Platforms PeakCell S platform with PowerXCell 8i processors.

The GPU part of the “Lomonosov” supercomputer is based on the next generation of T-Platforms blade systems, TB2-TL, built around the newest TL-blade design. With 16 TL blades, a single TB2 enclosure packs 32 Tesla X2070 GPUs and 32 Intel Xeon 5630 CPUs and delivers 17.8 TFlops of peak double-precision performance. With six TB2-TL systems installed into a 42U rack cabinet, a total performance of 106.6 TFlops per rack is reached.
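The quoted enclosure and rack figures are easy to reproduce with a back-of-the-envelope estimate. The per-device rates assumed below (4 double-precision flops per cycle for a Westmere/Nehalem-class Xeon core, roughly 515 GFlops of double precision for a Tesla 20-series GPU) are standard published values and are not taken from this brochure, so treat the sketch as an assumption-laden illustration rather than the vendor's calculation.

```python
# Rough double-precision peak estimate for one TB2-TL enclosure (assumed rates).
XEON_5630_GHZ = 2.53          # clock quoted in the specification table below
FLOPS_PER_CYCLE = 4           # assumed: SSE, 2 adds + 2 multiplies per cycle per core
CORES_PER_CPU = 4
TESLA_X2070_DP_GFLOPS = 515   # assumed vendor peak for a Tesla 20-series GPU

cpu_part = 32 * CORES_PER_CPU * XEON_5630_GHZ * FLOPS_PER_CYCLE   # GFlops
gpu_part = 32 * TESLA_X2070_DP_GFLOPS                             # GFlops
enclosure_tflops = (cpu_part + gpu_part) / 1000
print(f"per enclosure: {enclosure_tflops:.1f} TFlops")                 # ~17.8 TFlops
print(f"per rack (6 enclosures): {6 * enclosure_tflops:.1f} TFlops")   # ~106.7 TFlops (brochure quotes 106.6)
```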
“Lomonosov” uses 40 Gb/s QDR InfiniBand technology as its primary interconnect. To ensure fast data transfer and to reduce network congestion, the T-Blade2 chassis incorporates additional external InfiniBand ports, providing an impressive 1.6 TB/s of overall external bandwidth from the integrated QDR InfiniBand switches. The dedicated global barrier network of T-Blade2 allows fast synchronization of computing jobs running on separate nodes, while the global interrupt network significantly reduces the influence of OS jitter by synchronizing process scheduling over the entire system. As a result, processors communicate much more efficiently, enabling high scalability of the most demanding parallel applications.

The supercomputer uses a 3-level storage system:
• 500 TB of T-Platforms ReadyStorage SAN 7998 external storage with the Lustre parallel file system. The solution enables parallel access of compute nodes to data, with sustained aggregate read throughput of 30 GB/s and sustained aggregate write throughput of 24 GB/s;
• 300 TB of high-availability NAS storage for users' home directories;
• a 1 PB tape library with hierarchical storage management software.

A very high degree of fault tolerance is a necessity for installations of this scale. To this end, redundancy of all critical subsystems and components was implemented, from cooling fans and power supplies on compute nodes to the entire engineering infrastructure. To ensure even greater reliability, the primary compute nodes have neither hard disks nor cables inside the chassis, and contain a number of special hardware features such as fault-tolerant memory module slots.
“Lomonosov”
Peak performance: 1,373 TFlops
Linpack performance: 674 TFlops
Linpack efficiency: 49%
Primary / secondary compute nodes: T-Blade2, TB2-TL / T-Blade1.1, PeakCell S
4-core Intel Xeon X5570 2.93 GHz CPUs: 8,840
6-core Intel Xeon X5670 2.93 GHz CPUs: 1,360
4-core Intel Xeon X5630 2.53 GHz CPUs: 1,554
NVIDIA X2070 GPUs: 1,554
Other processor types: PowerXCell 8i
Total RAM: 85 TB
Total number of cores: 94,172
Primary / secondary interconnect: QDR InfiniBand 4x / 10G Ethernet, Gigabit Ethernet
External storage: 3-level storage: 500 TB T-Platforms ReadyStorage SAN 7998 with Lustre; 300 TB NAS storage; 1 PB tape library
Operating system: Clustrx T-Platforms Edition
Total area (supercomputer): 252 m²
Power consumption: 2.8 MW
MSU supercomputers: SKIF MSU “Chebyshev”

On March 19, 2008, Moscow State University, the T-Platforms company, the Program Systems Institute of the Russian Academy of Sciences and Intel Corporation announced the deployment of the most powerful supercomputer in Russia, the CIS and Eastern Europe, SKIF MSU “Chebyshev”, built in the framework of the supercomputer program “SKIF-GRID” sponsored by the Union State of Russia and Belarus. The peak performance of the supercomputer, based on 1,250 Intel Xeon E5472 quad-core processors, is 60 TFlops. Its Linpack performance of 47.17 TFlops (78.6% of peak performance) was the best efficiency result among all quad-core Xeon-based systems in the top hundred of the June 2008 edition of the Top500 list, where SKIF MSU “Chebyshev” was ranked No. 36. It was ranked No. 5 in the recent (March 2011) edition of the Top50 rating list of the most powerful supercomputers in the Commonwealth of Independent States.

The supercomputer is based on T-Blade modules developed by T-Platforms. A T-Blade enclosure incorporates up to 20 Intel Xeon quad-core processors (3.0 GHz, 45 nm) in 5U, which at the moment of system delivery provided the best computing density among all Intel-based blade solutions on the market. The system network is based on DDR InfiniBand technology with 4th-generation Mellanox microchips.

The T-Platforms ReadyStorage ActiveScale Cluster storage system, specifically designed for Linux clusters, provides direct parallel access to data for all compute nodes, eliminating the bottlenecks of traditional network storage. The data storage capacity of SKIF MSU “Chebyshev” is 60 TB. The unique feature of the T-Platforms ReadyStorage ActiveScale Cluster system is its scalability: when new storage modules are added, not only the storage capacity but also the overall network performance is increased.
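The peak and efficiency figures quoted above are straightforward to reproduce. The sketch below assumes the usual 4 double-precision flops per cycle per Harpertown-class Xeon core, a rate not stated in the brochure, so it is an illustrative cross-check rather than an official calculation.

```python
# Peak and Linpack efficiency of SKIF MSU "Chebyshev" (assumed 4 flops/cycle/core).
cpus, cores_per_cpu, ghz, flops_per_cycle = 1250, 4, 3.0, 4
peak_tflops = cpus * cores_per_cpu * ghz * flops_per_cycle / 1000
linpack_tflops = 47.17
print(f"peak: {peak_tflops:.0f} TFlops")                     # 60 TFlops
print(f"efficiency: {linpack_tflops / peak_tflops:.1%}")     # 78.6%
```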
SKIF MSU “Chebyshev”
Peak performance: 60 TFlops
Linpack performance: 47 TFlops
Linpack efficiency: 78.6%
Compute racks / total racks: 14 / 42
Blade enclosures / blade nodes: 63 / 625
Number of CPUs / cores: 1,250 / 5,000
Processor type: 4-core Intel Xeon E5472 3.0 GHz
Total RAM: 5.5 TB
Primary / secondary interconnect: DDR InfiniBand / Gigabit Ethernet
Power consumption: 330 kW
Top500 position: 36 (June 2008)
MSU supercomputers: IBM Blue Gene/P Since 2008 the IBM Blue Gene/P supercomputer
energy and space in comparison with the earlier
has been operating at the Faculty of Computational
systems.
Mathematics and Cybernetics of MSU. The MSU Blue Gene/P computer was one of the first systems
The configuration of MSU Blue Gene/P includes
of this series in the world. Blue Gene architecture
two racks, containing totally 2 048 compute nodes,
has been developed by IBM in the framework
each consisting of 4 PowerPC 450 cores, working at
of the project seeking for new solutions in high-
850 MHz frequency. The peak performance of the
performance computing. MSU Blue Gene/P was at
system is 27.9 TFlops.
the 128-th place in the Top500 issued in November 2008. It was ranked #15 in the March 2011 Top50 list
The Blue Gene/P architecture has been developed
of the CIS most powerful supercomputers.
for programs that scale well up to hundreds and thousands of processes. Individual cores work at a
18
The IBM Blue Gene/P system is a representative of a
relatively low frequency, but applications being able
supercomputer family providing high performance,
to effectively use large numbers of processor units
scalability, and facility to process large datasets
demonstrate higher performance as compared to
and at the same time consuming significantly less
many others supercomputers.
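The system's programming technologies (see the table below) center on MPI, optionally combined with OpenMP/pthreads on the node. As a minimal, generic illustration of the flat MPI model such machines favor, the following mpi4py sketch distributes a sum over all ranks; it is not MSU- or Blue Gene-specific code, and mpi4py is only an assumed stand-in for the usual C/Fortran MPI bindings.

```python
# Minimal MPI data-parallel sketch (mpi4py assumed available).
# Run with e.g.: mpiexec -n 4 python sum_mpi.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1_000_000
# Each rank works on its own slice of the index range [0, n).
lo = rank * n // size
hi = (rank + 1) * n // size
local = np.arange(lo, hi, dtype=np.float64).sum()

# Combine the partial sums on rank 0.
total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print(f"sum over {size} ranks: {total:.0f}")   # equals n*(n-1)/2
```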
IBM Blue Gene/P
Peak performance: 27.9 TFlops
Linpack performance: 23.9 TFlops
Number of racks: 2
Number of compute nodes / I/O nodes: 2,048 / 32
CPU model: 4-core PowerPC 450, 850 MHz
Number of CPUs / cores: 2,048 / 8,192
Total RAM: 4 TB
Programming technologies: MPI, OpenMP/pthreads, POSIX I/O
Performance per watt: 372 MFlops/W
Top500 position: 128 (November 2008)
MSU supercomputers: Hewlett-Packard “GraphIT!”

“GraphIT!” is the first cluster of MSU Supercomputing Center based on GPUs, an innovative supercomputing architecture. GPUs, originally designed for real-time 3D graphics acceleration, are now widely used to accelerate HPC. Compared to traditional CPUs, GPUs provide higher parallelism, higher FLOPS and memory bandwidth per chip, and also offer better cost- and energy-efficiency.

“GraphIT!” was originally envisioned as a pilot GPU-based cluster to be used as a testbed for practicing with hybrid programming technologies. It had to be small enough to fit into the existing server room, yet powerful enough to be used for real-world applications. As a result, a configuration based on four HP S6500 4U chassis occupying a total of two racks was chosen. Each chassis holds 4 nodes, and each node has 3 NVIDIA “Fermi” Tesla M2050 CUDA-enabled GPUs, for a total of 16 compute nodes and 48 GPUs in the cluster. All compute nodes are connected by a high-speed 4x QDR InfiniBand network. This provides a total performance of 26.76 TFlops, of which 24.72 TFlops, or more than 92%, come from the GPUs. The cluster achieves a Linpack performance of 11.98 TFlops, with 44% efficiency.

The “GraphIT!” cluster is used to solve problems in molecular dynamics, cryptanalysis, quantum physics and climate modeling, as well as other computationally intensive problems which benefit from GPU usage. It is used by researchers from various MSU departments as well as other research institutions.
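As a flavor of the hybrid CPU+GPU programming the cluster serves as a testbed for, here is a minimal CUDA-style kernel written with Numba's cuda module. The brochure does not name specific software, so Numba is purely an illustrative assumption; on “GraphIT!” such kernels would more typically be written in CUDA C.

```python
# Minimal GPU-offload sketch (Numba assumed available, CUDA-capable GPU required).
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    """out[i] = a * x[i] + y[i], one GPU thread per element."""
    i = cuda.grid(1)
    if i < x.shape[0]:
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads = 256
blocks = (n + threads - 1) // threads
saxpy[blocks, threads](2.0, x, y, out)    # Numba copies the arrays to and from the GPU

assert np.allclose(out, 2.0 * x + y)
```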
Hewlett-Packard “GraphIT!”
Peak performance (CPU / GPU / CPU+GPU): 2.04 / 24.72 / 26.76 TFlops
Linpack performance: 11.98 TFlops
Racks / compute nodes: 2 / 16
Node type: DL380G6
Number of 6-core Intel Xeon X5650 CPUs: 32
CPUs per node: 2
Number of GPUs: 48
GPU type: NVIDIA “Fermi” Tesla M2050
Total CPU RAM / GPU RAM: 768 GB / 144 GB
Per-node CPU RAM / GPU RAM: 48 GB / 9 GB
Data storage capacity: 12 TB
Primary / secondary interconnect: QDR InfiniBand 4x / Gigabit Ethernet
Power consumption: 22 kW
Perspective supercomputing technology: reconfigurable supercomputers
The reconfigurable supercomputer RVS-5, installed in the Research Computing Center of MSU, is one of the most powerful reconfigurable computing systems in the world. The system was designed at the Research Institute of Multiprocessor Computing Systems of Southern Federal University (Taganrog, Russia). The heads of the design team were Prof. I. Kaliaev and Dr. I. Levin.
The main computational element of the RVS-5 computer is the Alkor base module. Each Alkor module contains 16 Xilinx Virtex-5 FPGA chips. Base modules are connected via LVDS channels, which allow several base modules to be effectively assigned to a single program. Four base modules form a computational block, with four blocks per rack.

The reconfigurable computing system RVS-5 outperforms all known general-purpose FPGA-based computing systems. Most programs for this supercomputer are written in the high-level Colamo language, created by the developers of RVS-5. The main features of this language are the high efficiency of Colamo programs and the possibility of using a large number of FPGAs, up to all the FPGAs of a rack, for any program.

Various scientific applications have been successfully implemented on RVS-5. Among them are:
• tomographic studies of near-surface layers of the Earth using acoustic and electromagnetic waves;
• modeling and forecasting of hydrophysical and biogeochemical processes in the Sea of Azov;
• modeling of natural objects and processes in the area around the Rostov nuclear power station;
• modeling of astrophysical processes and correction of instrumental distortion of optical images;
• creation of fundamentally new drugs and new-generation materials.
“RVS-5” FPGA system
FPGA model: Xilinx Virtex-5
Number of racks: 5
Number of FPGAs (11 million gates each): 1,280
Total size of dynamic memory: 100 GB
Power consumption: 24 kW

Base module features
Number of processor elements: 512
Memory size: 2 GB
Performance, SP (DP): 200 (100) GFlops
Board frequency: 330 MHz
Frequency of information exchange: 1,200 MHz
Size: 6U
Power consumption: 190 W