Sistemi Operativi Avanzati (Advanced Operating Systems)


High Performance Cluster Computing: Architectures and Systems
Book editor: Rajkumar Buyya
Slides: Hai Jin and Raj Buyya
Internet and Cluster Computing Center

Cluster Computing at a Glance
Chapter 1: by M. Baker and R. Buyya

Outline
- Introduction
- Scalable Parallel Computer Architecture
- Towards Low Cost Parallel Computing and Motivations
- Windows of Opportunity
- A Cluster Computer and its Architecture
- Clusters Classifications
- Commodity Components for Clusters
- Network Services/Communications SW
- Cluster Middleware and Single System Image
- Resource Management and Scheduling (RMS)
- Programming Environments and Tools
- Cluster Applications
- Representative Cluster Systems
- Clusters of SMPs (CLUMPS)
- Summary and Conclusions
- http://www.buyya.com/cluster/

Resource Hungry Applications
Solving grand challenge applications using computer modeling, simulation and analysis:
- Life Sciences
- Digital Biology
- Aerospace
- CAD/CAM
- Military Applications
- Internet & Ecommerce

How to Run Applications Faster?
There are 3 ways to improve performance:
- Work Harder
- Work Smarter
- Get Help

Computer Analogy
- Work Harder: use faster hardware
- Work Smarter: use optimized algorithms and techniques to solve computational tasks
- Get Help: use multiple computers to solve a particular task

Two Eras of Computing
[Figure: timeline from 1940 to 2030 showing the Sequential Era followed by the Parallel Era; within each era, Architectures, then System Software/Compilers, then Applications, then Problem Solving Environments (P.S.Es) progress from R&D through commercialization to commodity.]

Era of Computing
- Rapid technical advances
  - the recent advances in VLSI technology
  - software technology: OS, PL, development methodologies, & tools
- Grand challenge applications have become the main driving force
- Parallel computing
  - one of the best ways to overcome the speed bottleneck of a single processor
  - good price/performance ratio of a small cluster-based parallel computer



Scalable Parallel Computer Architectures
Taxonomy based on how processors, memory & interconnect are laid out and how resources are managed:
- Massively Parallel Processors (MPP)
- Symmetric Multiprocessors (SMP)
- Cache-Coherent Non-Uniform Memory Access (CC-NUMA)
- Clusters
- Distributed Systems – Grids/P2P

MPP
- A large parallel processing system with a shared-nothing architecture
- Consists of several hundred nodes with a high-speed interconnection network/switch
- Each node consists of a main memory & one or more processors
- Each node runs a separate copy of the OS

SMP
- 2–64 processors today
- Shared-everything architecture
- All processors share all the global resources available
- A single copy of the OS runs on these systems

CC-NUMA
- A scalable multiprocessor system having a cache-coherent non-uniform memory access architecture
- Every processor has a global view of all of the memory

Clusters
- A collection of workstations/PCs interconnected by a high-speed network
- Work as an integrated collection of resources
- Have a single system image spanning all nodes

Distributed Systems
- Conventional networks of independent computers
- Have multiple system images, as each node runs its own OS
- The individual machines could be combinations of MPPs, SMPs, clusters, & individual computers

[Slide: Key Characteristics of Scalable Parallel Computers — comparison table not preserved in the extraction]

In Summary
- Need more computing power
- Improving the operating speed of processors & other components is constrained by the speed of light, thermodynamic laws, & the high financial cost of processor fabrication
- The alternative: connect multiple processors together & coordinate their computational efforts
  - parallel computers allow the sharing of a computational task among multiple processors

Technology Trends
- Performance of PC/workstation components has almost reached the performance of those used in supercomputers
  - Microprocessors (50% to 100% per year)
  - Networks (Gigabit SANs)
  - Operating Systems (Linux, ...)
  - Programming environments (MPI, ...)
  - Applications (.edu, .com, .org, .net, .shop, .bank)
- The rate of performance improvement of commodity systems is much more rapid than that of specialized systems



Rise and Fall of Computer Architectures
- Vector Computers (VC) – proprietary systems: provided the breakthrough needed for the emergence of computational science, but they were only a partial answer
- Massively Parallel Processors (MPP) – proprietary systems: high cost and a low performance/price ratio
- Symmetric Multiprocessors (SMP): suffer from scalability limits
- Distributed Systems: difficult to use and hard to extract parallel performance from
- Clusters – gaining popularity:
  - High Performance Computing – commodity supercomputing
  - High Availability Computing – mission-critical applications

What is a Cluster?
A cluster is a type of parallel or distributed processing system consisting of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource.
- A node: a single- or multi-processor system with memory, I/O facilities, & an OS
- Generally 2 or more computers (nodes) connected together, either in a single cabinet or physically separated & connected via a LAN
- Appears as a single system to users and applications
- Provides a cost-effective way to gain features and benefits

Why PC/WS Clustering Now?
- Individual PCs/workstations are becoming increasingly powerful
- Commodity network bandwidth is increasing and latency is decreasing
- PC/workstation clusters are easier to integrate into existing networks
- Typical low user utilization of PCs/WSs
- Development tools for PCs/WSs are more mature
- PC/WS clusters are cheap and readily available
- Clusters can be easily grown

Cluster Architecture
[Figure: multiple PCs/workstations, each with its own network interface hardware and communications software and each able to run sequential applications, joined by a cluster interconnection network/switch. Cluster middleware (single system image and availability infrastructure) spans all nodes, and a parallel programming environment on top of it supports parallel applications.]

Cluster Design Issues
- Enhanced Performance (performance @ low cost)
- Enhanced Availability (failure management)
- Single System Image (look-and-feel of one system)
- Size Scalability (physical & application)
- Fast Communication (networks & protocols)
- Load Balancing (CPU, Net, Memory, Disk)
- Security and Encryption (clusters of clusters)
- Distributed Environment (social issues)
- Manageability (admin. and control)
- Programmability (simple API if required)
- Applicability (cluster-aware and non-aware applications)

Prominent Components of Cluster Computers
High Performance Networks/Switches:
- Ethernet (10 Mbps), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps)
- SCI (Scalable Coherent Interface; ~12 µs MPI latency)
- ATM (Asynchronous Transfer Mode)
- Myrinet (1.2 Gbps)
- QsNet (Quadrics Supercomputing World; 5 µs latency for MPI messages)
- Digital Memory Channel
- FDDI (Fiber Distributed Data Interface)
- InfiniBand



Prominent Components of Cluster Computers
Fast Communication Protocols and Services (user-level communication):
- Active Messages (Berkeley)
- Fast Messages (Illinois)
- U-net (Cornell)
- XTP (Virginia)
- Virtual Interface Architecture (VIA)

Commodity Components for Clusters
Cluster Interconnects
- Nodes communicate over high-speed networks using a standard networking protocol such as TCP/IP or a low-level protocol such as Active Messages (AM)
- Standard Ethernet (10 Mbps)
  - cheap, easy way to provide file and printer sharing
  - bandwidth & latency are not balanced with the computational power of today's nodes
- Ethernet, Fast Ethernet, and Gigabit Ethernet
  - Fast Ethernet: 100 Mbps
  - Gigabit Ethernet: preserves Ethernet's simplicity while delivering very high bandwidth; can aggregate multiple Fast Ethernet segments

Advanced Network Services/Communication SW
- The communication infrastructure supports protocols for:
  - bulk-data transport
  - streaming data
  - group communications
- Communication services provide the cluster with important QoS parameters:
  - latency
  - bandwidth
  - reliability
  - fault tolerance
  - jitter control
- Network services are designed as a hierarchical stack of protocols with a relatively low-level communication API, and provide the means to implement a wide range of communication methodologies:
  - RPC
  - DSM
  - stream-based and message-passing interfaces (e.g., MPI, PVM) — the stream style is illustrated below
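As a concrete illustration of the stream-based end of this spectrum, here is a minimal sketch in C of one node pushing a message to a peer over a TCP socket. The peer address, port, and message are hypothetical, not from the slides; real cluster middleware (RPC, DSM, MPI) is layered over primitives of this kind.

```c
/* Minimal sketch of stream-based communication between two cluster
 * nodes using TCP sockets. The peer address, port, and message are
 * illustrative only; a matching server would accept() and recv(). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    /* Create a TCP socket and connect to a peer node. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return EXIT_FAILURE; }

    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5000);                        /* hypothetical port */
    inet_pton(AF_INET, "192.168.1.2", &peer.sin_addr);  /* hypothetical node */

    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
        perror("connect");
        return EXIT_FAILURE;
    }

    /* Stream a message to the peer; reliability and in-order delivery
     * come from TCP, two of the QoS properties listed above. */
    const char msg[] = "hello from node A";
    if (send(fd, msg, sizeof msg, 0) < 0)
        perror("send");

    close(fd);
    return EXIT_SUCCESS;
}
```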

Cluster Interconnects: Comparison (created in 2000)

|                         | Myrinet                    | QsNet               | Giganet             | ServerNet2              | SCI                 | Gigabit Ethernet       |
|-------------------------|----------------------------|---------------------|---------------------|-------------------------|---------------------|------------------------|
| Bandwidth (MBytes/s)    | 140 @ 33 MHz, 215 @ 66 MHz | 208                 | ~105                | 165                     | ~80                 | 30–50                  |
| MPI latency (µs)        | 16.5 @ 33 MHz, 11 @ 66 MHz | 5                   | ~20–40              | 20.2                    | 6                   | 100–200                |
| List price/port         | $1.5K                      | $6.5K               | ~$1.5K              | $1.5K                   | ~$1.5K              | ~$1.5K                 |
| Hardware availability   | Now                        | Now                 | Now                 | Q2'00                   | Now                 | Now                    |
| Linux support           | Now                        | Now                 | Now                 | Q2'00                   | Now                 | Now                    |
| Maximum #nodes          | 1000's                     | 1000's              | 1000's              | 1000's                  | 64K                 | 1000's                 |
| Protocol implementation | Firmware on adapter        | Firmware on adapter | Firmware on adapter | Implemented in hardware | Firmware on adapter | Software (TCP/IP, VIA) |
| VIA support             | Soon                       | None                | NT/Linux            | Done in hardware        | Software            | NT/Linux               |
| MPI support             | 3rd party                  | Quadrics/Compaq     | 3rd party           | Compaq/3rd party        | 3rd party           | MPICH over TCP/IP      |

Commodity Components for Clusters
Cluster Interconnects: Myrinet
- 1.28 Gbps full-duplex interconnection network
- Uses low-latency cut-through routing switches, which offer fault tolerance by automatic re-mapping of the network configuration
- Supports both Linux & NT
- Advantages:
  - very low latency (5 µs, one-way point-to-point)
  - very high throughput
  - programmable on-board processor for greater flexibility
- Disadvantages:
  - expensive: $1500 per host
  - complicated scaling: switches with more than 16 ports are unavailable

Cluster Programming Environments
- Shared memory based:
  - DSM
  - Threads/OpenMP (enabled for clusters; see the sketch after this list)
  - Java threads (IBM cJVM)
- Message passing based:
  - PVM (Parallel Virtual Machine)
  - MPI (Message Passing Interface)
- Parametric computations:
  - Nimrod-G and Gridbus Data Grid Broker
- Automatic parallelising compilers
- Parallel libraries & computational kernels (e.g., NetSolve)
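As a small illustration of the shared-memory style listed above, here is a minimal OpenMP sketch in C; the array size and loop body are illustrative, not taken from the slides. Compile with a flag such as -fopenmp.

```c
/* Shared-memory-style parallelism with OpenMP: the runtime splits the
 * loop iterations across the threads of one node (cluster-enabled
 * OpenMP systems extend this across nodes). Illustrative only. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* Each thread works on a chunk of the iteration space; the
     * reduction clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %f (up to %d threads available)\n",
           sum, omp_get_max_threads());
    return 0;
}
```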



Levels of Parallelism
[Figure: code granularity, from whole tasks down to single instructions, and the mechanism that exploits each level]
- Large grain (task level): unit is the program; exploited via PVM/MPI
- Medium grain (control level): unit is the function; exploited via threads
- Fine grain (data level): unit is the loop; exploited by compilers
- Very fine grain (multiple instruction issue): unit is the instruction; exploited in hardware by the CPU

Shared Memory MIMD
[Figure: processors A, B, C attached through memory buses to a global memory system]
- Communication: the source PE writes data to global memory & the destination retrieves it (illustrated below)
- Easy to build; conventional OSes for SISD machines can easily be ported
- Limitation: reliability & expandability — a memory component or any processor failure affects the whole system, and increasing the number of processors leads to memory contention
- Example: Silicon Graphics supercomputers
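To make the shared-memory communication model concrete, here is a minimal sketch in C using POSIX threads: one thread stands in for the "source PE" writing to global memory, another for the "destination" retrieving it. The roles and the value are illustrative, not from the slides.

```c
/* Shared-memory MIMD communication pattern: the source thread writes
 * data to global memory and the destination thread retrieves it,
 * synchronized with a mutex and condition variable. */
#include <stdio.h>
#include <pthread.h>

static int global_mem;        /* stands in for the global memory system */
static int ready = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void *source_pe(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    global_mem = 42;          /* source PE writes data to GM */
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *dest_pe(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)            /* wait until the write is visible */
        pthread_cond_wait(&cond, &lock);
    printf("destination PE read %d from global memory\n", global_mem);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t src, dst;
    pthread_create(&dst, NULL, dest_pe, NULL);
    pthread_create(&src, NULL, source_pe, NULL);
    pthread_join(src, NULL);
    pthread_join(dst, NULL);
    return 0;
}
```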

Distributed Memory MIMD
[Figure: processors A, B, C, each with its own local memory system, connected by IPC channels]
- Communication: IPC (Inter-Process Communication) via a high-speed network
- The network can be configured as a tree, mesh, cube, etc.
- Unlike shared-memory MIMD, easily/readily expandable
- Highly reliable (any CPU failure does not affect the whole system)

Programming Environments and Tools
Message Passing Systems (MPI and PVM)
- Allow efficient parallel programs to be written for distributed memory systems
- The 2 most popular high-level message-passing systems: PVM & MPI
- PVM: both an environment & a message-passing library

Programming Environments and Tools
Distributed Shared Memory (DSM) Systems
- Message passing
  - the most efficient, widely used programming paradigm on distributed memory systems
  - but complex & difficult to program
- Shared memory systems
  - offer a simple and general programming model (a node-local sketch follows this list)
  - but suffer from scalability problems
- DSM on a distributed memory system: an alternative, cost-effective solution
  - Software DSM
    - usually built as a separate layer on top of the communication interface
    - takes full advantage of application characteristics: virtual pages, objects, & language types are the units of sharing
    - examples: TreadMarks, Linda
  - Hardware DSM
    - better performance, no burden on users & SW layers, fine granularity of sharing, extensions of the cache coherence scheme, but increased HW complexity
    - examples: DASH, Merlin
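As a node-local analogy of the shared-memory model that software DSM extends across machines (systems like TreadMarks ship virtual pages between nodes to preserve this illusion), here is a minimal sketch in C in which two processes on one node share a mapped region; the value and synchronization are illustrative only.

```c
/* Node-local analogy of the shared-memory programming model that DSM
 * extends across a cluster: parent and child share a mapped region and
 * communicate through ordinary loads and stores, with no explicit
 * messages written by hand. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Anonymous shared mapping: visible to both processes after fork. */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }
    *shared = 0;

    if (fork() == 0) {          /* the "remote" process */
        *shared = 42;           /* a plain store, not a send() */
        _exit(0);
    }
    wait(NULL);                 /* crude synchronization for the sketch */
    printf("read %d through the shared mapping\n", *shared);
    return 0;
}
```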

MPI
- a message-passing specification, designed to be a standard for distributed memory parallel computing using explicit message passing
- an attempt to establish a practical, portable, efficient, & flexible standard for message passing
- generally, application developers prefer MPI, as it is fast becoming the de facto standard for message passing
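Since the slides include no code, here is a minimal, self-contained MPI example in C of the explicit message passing described above; the value and tag are illustrative. Rank 0 sends an integer that rank 1 receives; build with mpicc and launch with, e.g., mpirun -np 2.

```c
/* Minimal MPI message-passing example: rank 0 sends an integer to
 * rank 1 over the default communicator. Illustrative values only. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);                 /* join the parallel job */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?   */

    if (rank == 0) {
        value = 123;
        /* explicit message passing: buffer, count, type, dest, tag */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```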

Summary: Cluster Advantages
- Low price/performance ratio compared with a dedicated parallel supercomputer
- Incremental growth that often matches demand patterns
- A multipurpose system: scientific, commercial, and Internet applications
- Clusters have become mainstream enterprise computing systems: in the 2003 Top 500 supercomputer list, over 50% of the systems are cluster-based, and many of them are deployed in industry.




