High Performance Cluster Computing: Architectures and Systems
Book Editor: Rajkumar Buyya
Slides: Hai Jin and Raj Buyya, Internet and Cluster Computing Center
Cluster Computing at a Glance (Chapter 1, by M. Baker and R. Buyya)

Agenda: Introduction; Scalable Parallel Computer Architecture; Towards Low Cost Parallel Computing and Motivations; Windows of Opportunity; A Cluster Computer and its Architecture; Clusters Classifications; Commodity Components for Clusters; Network Services/Communication SW; Cluster Middleware and Single System Image; Resource Management and Scheduling (RMS); Programming Environments and Tools; Cluster Applications; Representative Cluster Systems; Clusters of SMPs (CLUMPS); Summary and Conclusions. http://www.buyya.com/cluster/

Resource Hungry Applications
Solving grand challenge applications using computer modeling, simulation and analysis:
- Internet & E-commerce
- Life Sciences / Digital Biology
- Aerospace
- CAD/CAM
- Military Applications

How to Run Applications Faster?
There are 3 ways to improve performance, with a computer analogy for each:
- Work harder: use faster hardware
- Work smarter: use optimized algorithms and techniques to solve computational tasks
- Get help: use multiple computers to solve a particular task
Two Eras of Computing

Rapid technical advances:
- recent advances in VLSI technology
- software technology: OS, programming languages, development methodologies, & tools
- grand challenge applications have become the main driving force

Parallel computing:
- one of the best ways to overcome the speed bottleneck of a single processor
- good price/performance ratio of a small cluster-based parallel computer

[Figure: timeline from 1940 to 2030 showing the Sequential Era and the Parallel Era, each progressing through Architectures, System Software/Compilers, Applications, and Problem Solving Environments (P.S.Es), and maturing from R&D through Commercialization to Commodity.]
Scalable Parallel Computer Architectures

Taxonomy based on how processors, memory & interconnect are laid out and how resources are managed:
- Massively Parallel Processors (MPP)
- Symmetric Multiprocessors (SMP)
- Cache-Coherent Non-Uniform Memory Access (CC-NUMA)
- Clusters
- Distributed Systems - Grids/P2P

MPP
- A large parallel processing system with a shared-nothing architecture
- Consists of several hundred nodes connected by a high-speed interconnection network/switch
- Each node consists of a main memory & one or more processors, and runs a separate copy of the OS

SMP
- 2-64 processors today
- Shared-everything architecture: all processors share all the global resources available
- A single copy of the OS runs on the system
Key Characteristics of Scalable Parallel Computers
CC-NUMA
- A scalable multiprocessor system with a cache-coherent non-uniform memory access architecture
- Every processor has a global view of all of the memory

Clusters
- A collection of workstations/PCs interconnected by a high-speed network
- Work as an integrated collection of resources
- Have a single system image spanning all nodes

Distributed systems
- Conventional networks of independent computers
- Have multiple system images, as each node runs its own OS
- The individual machines could be combinations of MPPs, SMPs, clusters, & individual computers
In Summary
We need more computing power, but improving the operating speed of processors & other components is constrained by the speed of light, thermodynamic laws, & the high financial costs of processor fabrication. The alternative is to connect multiple processors together & coordinate their computational efforts: parallel computers allow a computational task to be shared among multiple processors.
Technology Trends...
The performance of PC/workstation components has almost reached that of components used in supercomputers:
- Microprocessors (50% to 100% performance improvement per year)
- Networks (Gigabit SANs)
- Operating systems (Linux, ...)
- Programming environments (MPI, ...)
- Applications (.edu, .com, .org, .net, .shop, .bank)
The performance of commodity systems improves much more rapidly than that of specialized systems.
Rise and Fall of Computer Architectures
- Vector Computers (VC) - proprietary systems: provided the breakthrough needed for the emergence of computational science, but they were only a partial answer.
- Massively Parallel Processors (MPP) - proprietary systems: high cost and a low performance/price ratio.
- Symmetric Multiprocessors (SMP): suffer from scalability limitations.
- Distributed Systems: difficult to use and hard to extract parallel performance from.
- Clusters - gaining popularity: High Performance Computing (commodity supercomputing) and High Availability Computing (mission-critical applications).
What is a Cluster?
A cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone computers working cooperatively as a single, integrated computing resource.
- A node: a single or multiprocessor system with memory, I/O facilities, & an OS
- A cluster generally consists of 2 or more computers (nodes) connected together, either in a single cabinet or physically separated & connected via a LAN
- Appears as a single system to users and applications
- Provides a cost-effective way to gain features and benefits
Why PC/WS Clustering Now?
- Individual PCs/workstations are becoming increasingly powerful
- Commodity network bandwidth is increasing and latency is decreasing
- PC/workstation clusters are easier to integrate into existing networks
- Typical user utilization of PCs/workstations is low
- Development tools for PCs/workstations are more mature
- PC/workstation clusters are cheap and readily available
- Clusters can be easily grown
Cluster Architecture
[Figure: layered cluster architecture. Sequential and parallel applications run on top of a parallel programming environment and cluster middleware (single system image and availability infrastructure). The middleware spans multiple PCs/workstations, each running its own communications software over network interface hardware, all connected by a cluster interconnection network/switch.]
Cluster Design Issues
- Enhanced Performance (performance @ low cost)
- Enhanced Availability (failure management)
- Single System Image (look-and-feel of one system)
- Size Scalability (physical & application)
- Fast Communication (networks & protocols)
- Load Balancing (CPU, Net, Memory, Disk)
- Security and Encryption (clusters of clusters)
- Distributed Environment (social issues)
- Manageability (admin. and control)
- Programmability (simple API if required)
- Applicability (cluster-aware and non-aware apps.)
Prominent Components of Cluster Computers
High Performance Networks/Switches:
- Ethernet (10 Mbps), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps)
- SCI (Scalable Coherent Interface; ~12 µs MPI latency)
- ATM (Asynchronous Transfer Mode)
- Myrinet (1.2 Gbps)
- QsNet (Quadrics Supercomputing World; 5 µs latency for MPI messages)
- Digital Memory Channel
- FDDI (Fiber Distributed Data Interface)
- InfiniBand
Prominent Components of Cluster Computers
Fast Communication Protocols and Services (User-Level Communication):
- Active Messages (Berkeley)
- Fast Messages (Illinois)
- U-Net (Cornell)
- XTP (Virginia)
- Virtual Interface Architecture (VIA)
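The core idea these user-level systems share is that each message names a handler to run on arrival, avoiding OS buffering. The C sketch below simulates that dispatch in a single process; it is not the Berkeley AM API, and the names (am_message_t, am_dispatch, handler_table) are invented for illustration.

/* A minimal sketch of the active-message idea, not a real AM library:
 * each message carries a handler index, and the receiver dispatches the
 * payload straight to that handler instead of buffering it for a later
 * receive call. All names here are hypothetical. */
#include <stdio.h>
#include <string.h>

typedef struct {
    int handler_id;     /* which handler to run on arrival */
    char payload[64];   /* small, fixed-size payload */
} am_message_t;

typedef void (*am_handler_t)(const char *payload);

static void print_handler(const char *payload) {
    printf("print_handler: %s\n", payload);
}

static void ack_handler(const char *payload) {
    printf("ack_handler: acknowledged \"%s\"\n", payload);
}

/* Handler table agreed on by sender and receiver in advance. */
static am_handler_t handler_table[] = { print_handler, ack_handler };

/* On a real network this dispatch would run in the NIC polling or
 * interrupt path; here we simply call it directly. */
static void am_dispatch(const am_message_t *msg) {
    handler_table[msg->handler_id](msg->payload);
}

int main(void) {
    am_message_t msg = { .handler_id = 0 };
    strncpy(msg.payload, "hello from node 0", sizeof msg.payload - 1);
    am_dispatch(&msg);   /* stands in for network delivery */
    return 0;
}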
Commodity Components for Clusters
Cluster Interconnects:
- Nodes communicate over high-speed networks using a standard networking protocol such as TCP/IP or a low-level protocol such as Active Messages
- Standard Ethernet (10 Mbps): a cheap, easy way to provide file and printer sharing, but its bandwidth & latency are not balanced with the nodes' computational power
- Fast Ethernet: 100 Mbps
- Gigabit Ethernet: preserves Ethernet's simplicity while delivering very high bandwidth, enough to aggregate multiple Fast Ethernet segments
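To make the "standard networking protocol" path concrete, here is a minimal sketch of one node sending a message to another over plain TCP/IP with POSIX sockets, simulated by two processes on one host; the port number and message text are arbitrary choices.

/* Minimal TCP/IP communication sketch: the parent listens, the forked
 * child connects and sends one message. Port and text are arbitrary. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define PORT 5000

int main(void) {
    if (fork() == 0) {                       /* child: client */
        sleep(1);                            /* crude wait for the server */
        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(PORT);
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
        connect(s, (struct sockaddr *)&addr, sizeof addr);
        const char *msg = "hello over TCP";
        send(s, msg, strlen(msg), 0);
        close(s);
        return 0;
    }
    /* parent: server */
    int ls = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(ls, (struct sockaddr *)&addr, sizeof addr);
    listen(ls, 1);
    int c = accept(ls, NULL, NULL);
    char buf[64] = {0};
    ssize_t n = recv(c, buf, sizeof buf - 1, 0);
    printf("server received %zd bytes: %s\n", n, buf);
    close(c);
    close(ls);
    wait(NULL);
    return 0;
}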
Advanced Network Services / Communication SW
- The communication infrastructure supports protocols for bulk-data transport, streaming data, and group communications
- The communication service provides the cluster with important QoS parameters: latency, bandwidth, reliability, fault tolerance, and jitter control
- Network services are designed as a hierarchical stack of protocols; on top of a relatively low-level communication API they provide the means to implement a wide range of communication methodologies: RPC, DSM, and stream-based and message-passing interfaces (e.g., MPI, PVM)
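Latency and bandwidth, the first two QoS parameters above, are commonly measured with a ping-pong microbenchmark. Below is a minimal MPI sketch of one (not taken from the book); the message size and repetition count are arbitrary choices.

/* Ping-pong sketch measuring point-to-point round-trip time and
 * bandwidth between two ranks. Run with exactly two processes,
 * e.g. "mpirun -np 2 ./pingpong". */
#include <mpi.h>
#include <stdio.h>

#define REPS 1000
#define BYTES (1 << 20)   /* 1 MB payload for the bandwidth estimate */

int main(int argc, char **argv) {
    int rank;
    static char buf[BYTES];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0) {
        /* Each iteration moves the payload twice (there and back). */
        double bw = 2.0 * REPS * BYTES / dt / 1e6;
        printf("round trip: %.1f us, bandwidth: %.1f MB/s\n",
               dt / REPS * 1e6, bw);
    }
    MPI_Finalize();
    return 0;
}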
Cluster Interconnects: Comparison (created in 2000)

                         Myrinet          QsNet            Giganet      ServerNet2        SCI           Gigabit Ethernet
Bandwidth (MBytes/s)     140 (33 MHz)     208              ~105         165               ~80           30-50
                         215 (66 MHz)
MPI latency (µs)         16.5 (33 MHz)    5                ~20-40       20.2              6             100-200
                         11 (66 MHz)
List price/port          $1.5K            $6.5K            ~$1.5K       $1.5K             ~$1.5K        ~$1.5K
Hardware availability    Now              Now              Now          Q2'00             Now           Now
Linux support            Now              Now              Now          Q2'00             Now           Now
Maximum #nodes           1000's           1000's           1000's       1000's            64K           1000's
Protocol implementation  Firmware on      Firmware on      Firmware on  Implemented       Firmware on   Implemented in hardware
                         adapter          adapter          adapter      in hardware       adapter       (software TCP/IP, VIA)
VIA support              Soon             None             NT/Linux     Done in hardware  Software      NT/Linux
MPI support              3rd party        Quadrics/Compaq  3rd party    Compaq/3rd party  3rd party     MPICH - TCP/IP
Commodity Components for Clusters
Cluster Interconnects: Myrinet
- 1.28 Gbps full-duplex interconnection network
- Uses low-latency cut-through routing switches, which offer fault tolerance through automatic mapping of the network configuration
- Supports both Linux & NT
- Advantages: very low latency (5 µs, one-way point-to-point); very high throughput; programmable on-board processor for greater flexibility
- Disadvantages: expensive ($1500 per host); complicated scaling (switches with more than 16 ports are unavailable)
Cluster Programming Environments
- Shared memory based: DSM; Threads/OpenMP (enabled for clusters); Java threads (IBM cJVM)
- Message passing based: PVM; MPI
- Parametric computations: Nimrod-G and Gridbus Data Grid Broker
- Automatic parallelising compilers
- Parallel libraries & computational kernels (e.g., NetSolve); a small OpenMP sketch of the shared-memory style follows
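As a minimal illustration of the threads/OpenMP entry above, the C sketch below parallelises a loop across the threads of one node, all of which share the arrays; the array size is an arbitrary choice.

/* Shared-memory (OpenMP) sketch: the directive splits the loop
 * iterations among threads that all see the same arrays.
 * Compile with, e.g., "gcc -fopenmp vecadd.c". */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* initialise shared data */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Fine-grain data parallelism: iterations run concurrently on
     * the threads of one SMP node. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f (computed with up to %d threads)\n",
           c[N - 1], omp_get_max_threads());
    return 0;
}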
Levels of Parallelism

Code granularity / code item / parallelised by:
- Very fine grain (multiple issue): instruction - hardware
- Fine grain (data level): loop - compiler
- Medium grain (control level): function - threads
- Large grain (task level): program - PVM/MPI

[Figure: tasks (task i-1, task i, task i+1) decompose into functions (func1(), func2(), func3()), functions into loop iterations (a(0)=.., b(0)=..; a(1)=.., b(1)=..; a(2)=.., b(2)=..), and iterations into instructions (+, x, load) executed by the CPU.]

Shared Memory MIMD
[Figure: processors A, B, and C attached via memory buses to a global memory system.]
- Communication: a source PE writes data to global memory & the destination PE retrieves it
- Easy to build; conventional OSes for SISD machines can easily be ported
- Limitations: reliability & expandability - a memory component or processor failure affects the whole system, and increasing the number of processors leads to memory contention
- Example: Silicon Graphics supercomputers

Distributed Memory MIMD
[Figure: processors A, B, and C, each with its own memory system, communicating over IPC channels.]
- Communication: IPC (Inter-Process Communication) via a high-speed network
- The network can be configured as a tree, mesh, cube, etc.
- Unlike shared-memory MIMD, easily/readily expandable
- Highly reliable (any CPU failure does not affect the whole system)

Programming Environments and Tools
Message Passing Systems (MPI and PVM)
- Allow efficient parallel programs to be written for distributed memory systems
- The 2 most popular high-level message-passing systems: PVM & MPI
- PVM: both an environment & a message-passing library
- MPI: a message-passing specification designed to be a standard for distributed memory parallel computing using explicit message passing; an attempt to establish a practical, portable, efficient, & flexible standard for message passing; application developers generally prefer MPI, as it is fast becoming the de facto standard for message passing
A minimal MPI example follows.
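Here is a minimal sketch of explicit message passing with MPI (not from the book): rank 0 sends a greeting to every other rank, which receives and prints it. The message text is arbitrary.

/* Minimal MPI message-passing sketch. Compile with mpicc and run
 * with, e.g., "mpirun -np 4 ./hello". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    char msg[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my id within the job */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes  */

    if (rank == 0) {
        for (int dest = 1; dest < size; dest++) {
            snprintf(msg, sizeof msg, "hello, rank %d", dest);
            MPI_Send(msg, sizeof msg, MPI_CHAR, dest, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank %d received: %s\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}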
Programming Environments and Tools
Distributed Shared Memory (DSM) Systems
- Message passing: the most efficient and most widely used programming paradigm on distributed memory systems, but complex & difficult to program
- Shared memory systems: offer a simple and general programming model but suffer from scalability limitations
- DSM on a distributed memory system: an alternative, cost-effective solution
- Software DSM: usually built as a separate layer on top of the communication interface; takes full advantage of application characteristics - virtual pages, objects, & language types are the units of sharing; examples: TreadMarks, Linda
- Hardware DSM: better performance, no burden on the user & software layers, fine granularity of sharing, extensions of the cache coherence scheme, & increased hardware complexity; examples: DASH, Merlin
Summary: Cluster Advantages
- Low price/performance ratio compared with a dedicated parallel supercomputer
- Incremental growth that often matches demand patterns
- A multipurpose system: scientific, commercial, and Internet applications
- Clusters have become mainstream enterprise computing systems: in the 2003 Top 500 supercomputer list, over 50% of the systems are cluster-based, and many of them are deployed in industry.